Building an IMDB Data Parser in Java

Converting messy list files to CSV

Posted by maybelence on 2021-03-31

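The IMDB database holds at least 20 million records that can be queried. Before this data can be used, a fair amount of preparatory work has to be done.

The database file

We use only a sample of roughly 120 lines to test our parser.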

Andre, Kent		The Mask: The Origin (1995) (VG)  (voice)

Andre, Kevin Come Fly with Us (1974) [Tony] <4>
Deep Throat Part II (1974) (as Carter Courtney Jr.) [Russian Agent] <8>
Fringe Benefits (1974) [Dr. Charles Cherrypopper] <1>
Invasion of the Love Drones (1977) [Pseudo Drone] <26>
Is There Sex After Death? (1971) (as Carter Courtney Jr.) [Sex Bowl Contestant] <82>
Lady on the Couch (1974) [Dr. Miller]
Naked Came the Stranger (1975) [Party Guy with Candle] <6>
Sex Fantasies (1975)
Stigma (1972) (as Carter Courtney) [Homosexual] <10>
Teenage Hitchhikers (1975) (as Carter Courtney Jr.) [Farquart (Bruce)] <8>
The $50,000 Climax Show (1975) (as Kevin Andre Darby)
The Big Con (1975) (as Carter Courtney) [Sam Browne] <7>
The Defiance of Good (1975) [Dr. Hirsch] <8>
The Love Bus (1974) (as Kevin André) [Ralph Kramden] <7>
The Mount of Venus (1975) (as Kevin André) [Bacchus] <5>
The New York City Woman (1977) (archive footage) (uncredited) [Dr. Chartop]
The Passions of Carol (1975) (as Kevin André) [The Spirits] <5>
The Private Afternoons of Pamela Mann (1974) [Frank's First Client] <6>
The Switch or How to Alter Your Ego (1974) [Seymour] <17>
Whatever Happened to Miss September? (1973) (as Knah-Knah) [Bentley]

Andre, Kurt My Politics, My Country (2012)
World Bender (2014) [Network Executive] <8>

Andre, L.J. "Disneyland" (1954) {Willie and the Yank: The Deserter (#13.15)} [Uncle Ferd] <10>
"Disneyland" (1954) {Willie and the Yank: The Mosby Raiders (#13.16)} [Uncle Ferd] <10>

Andre, LeClerc StreetSmart Guide (2014) (V) [Himself - Guest]
"Laughs" (2014) {(#1.22)}
"Laughs" (2014) {(#1.5)} [Himself]

Andre, Lorin-Pierre Cirque du Soleil: Worlds Away (2012) [Viva Elvis Cast Member] <198>
The Neighborhood Ball: An Inauguration Celebration (2009) (TV) [Antigravity Performer]

Andre, Lukka Armando Rene: Start Running (2015) (V)
Making of Salir Corriendo (2015) (archive footage)

Andre, Mandi Memoirs of a Lifeguard (2010)

Andre, Marcus The Grad Film (2015) [Rob]

Andre, Mario (I) Love Your Mama (1990) [Bartender]

Andre, Mario (II) Market Value (2016) [Triage Patient]
The Breakout Dream (2015) [Pastor Mark]
The Penny (2010) [Darrell Watts] <6>

Andre, Mario (III) Blactose Intolerance (2015) [Pa Pa]
Life Outside the Rhyme (2016) [The Muffin Man]
Preacher Man (2015) [Elder Jones] <4>
School for Gods (????) [St. Louis Cartel]
Under Pressure (2016) [Uncle Ray]
"The Natural" (2016) [The Manager]

Andre, Martin Vasilissa maimou (2000)

Andre, Mat (I) Lights Camera Blood! (2015) [Hank the Hobo]
"Todd and the Book of Pure Evil" (2010) {B.Y.O.B.O.P.E. (#2.11)} [Dancerman] <18>

Andre, Mathias (II) Guilty Pleasures (2016) (TV) [Groomer 3] <26>

Andre, Matthew Santa Rosa (2004) (V)

Andre, Michael (II) "Here 2 Help" (2011) {(#1.2)} [Himself] <3>
"Here 2 Help" (2011) {(#1.3)} [Himself] <2>
"Here 2 Help" (2011) {(#1.4)} [Himself] <2>
"Here 2 Help" (2011) {(#1.5)} [Himself] <2>
"Here 2 Help" (2011) {(#1.6)} [Himself] <2>
"Peter Andre: My Life" (2011) {(#1.2)} [Himself]
"Peter Andre: My Life" (2011) {(#1.3)} [Himself]
"Peter Andre: My Life" (2011) {(#1.4)} [Himself]
"Peter Andre: My Life" (2011) {(#1.5)} [Himself]
"Peter Andre: My Life" (2011) {(#3.4)} [Himself]
"Peter Andre: My Life" (2011) {(#3.5)} [Himself]
"Peter Andre: My Life" (2011) {He's Found His Happy Ever After (#5.4)} [Himself]
"Peter Andre: My Life" (2011) {I Can't Remember When I Was 15 (#4.4)} [Himself]
"Peter Andre: My Life" (2011) {It's a New Chapter, Isn't It? (#5.1)} [Himself]
"Peter Andre: My Life" (2011) {Some Days Are Better Than Others (#4.1)} [Himself]
"Peter Andre: My Life" (2011) {There's Only One Problem... She Doesn't Like Coffee (#3.1)} [Himself]
"Peter Andre: My Life" (2011) {We're Off to Zanzib-Andre (#5.3)} [Himself]
"Peter Andre: My Life" (2011) {Why Wait, Why Hesitate? (#5.5)} [Himself]

Andre, Miguel Journey Among Women (1977) [Soldier] <21>

Andre, Mikail 1 Lawan Satu (2013) [Radhi]
Dua Kalimah (2013)
Gangster Wars (2013) [Romeo]
Mukasurat Cinta (2014)
Sniper (2014) [Mat Jambang]
Tokak (2013) [Boy]

Andre, Mike A Mystery in Carmine (2012) (as Mikey Andre) [Toby]

Andre, Mohd Mikail Hantu dalam botol kicap (2012) [Azri] <4>

Andre, Mohd Pierre 3 Temujanji (2012) [Sein] <2>
3, 2, 1 cinta (2011) (as Pierre Andre) [Fariz] <3>
9 September (2007) (as Pierre Andre) [Kogi] <1>
Aku, Kau & Dia (2012) (as Pierre Andre) [Abang Harris]
Al-Hijab (2011) (as Pierre Andre) [Rafael] <1>
Chantek (2012) [Ad] <1>
Cinta (2006) [Taufiq]
Cinta fotokopi (2005) (as Pierre Andre) [Din]
Gol & Gincu (2005) [Reza] <3>
Jangan pandang belakang (2007) [Darma]
Jangan tegur (2009) [Kamal] <2>
Krazy crazy krezy... (2009) (as Pierre André) <3>
Pontianak harum sundal malam 2 (2005) [Purnama]
Potret mistik (2005) (as Pierre Andre) [Badrul]
Salon (2005) [Ezra Fernandez] <2>
Sepi (2008) [Khalil]
Seru (2011) [Bob]
Strawberi cinta (2012) [Hakimi]
X (2012/I) [Hafiz]
"Gol & Gincu: The Series" (2006) [Reza] <2>

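Analysing the database file with regular expressions

We use a regular expression to extract the right data. If you have not worked with regular expressions before, I recommend the RegExr website as a place to learn them.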

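We will use the following regex string, shown as it is written inside a Java string literal (hence the doubled backslashes):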

([A-Za-z,.'$& ]*)?([\t]*)(.+?)([ ]*)\\(([0-9,?]{4})(.+?\\n{2})?

This separates the rest of the data from the actor's name, so that each piece can be used on its own.
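To make the capture groups concrete, here is a minimal sketch that applies the regex to two lines and pulls out groups 1, 3 and 5 (actor, title, year). The class name `RegexGroupsDemo` and the helper `extract` are hypothetical names used only for illustration, and the sketch assumes that the name and title are separated by tabs and that continuation lines start with tabs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexGroupsDemo {
    // The article's regex, written as a Java string literal.
    static final Pattern LINE = Pattern.compile(
        "([A-Za-z,.'$& ]*)?([\t]*)(.+?)([ ]*)\\(([0-9,?]{4})(.+?\\n{2})?");

    // Returns {actor, title, year} for one line, or null if the line does not match.
    static String[] extract(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3), m.group(5) };
    }

    public static void main(String[] args) {
        // Assumption: name and title are separated by tabs.
        String[] first = extract("Andre, Kurt\t\tMy Politics, My Country (2012)");
        System.out.println(String.join(" | ", first));
        // prints: Andre, Kurt | My Politics, My Country | 2012

        // Assumption: a continuation line starts with tabs and carries no name.
        String[] cont = extract("\t\t\tWorld Bender (2014)  [Network Executive]  <8>");
        System.out.println(cont[1] + " | " + cont[2]);
        // prints: World Bender | 2014
    }
}
```

Groups 2, 4 and 6 only absorb separators and trailing text, which is why the parser ignores them.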

Creating the Java project

If you have not used Java before, I suggest working through an introductory Java course first.

We need only four classes:

  • Main
  • Reader
  • Parser
  • Writer

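Creating the Reader class

The first thing to do is create a Reader class that reads the database file (database.txt) containing all the actors. In Java, we can use the File and Scanner classes to read the file's contents into an ArrayList.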

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class Reader {

    // Lines collected from the database file.
    ArrayList<String> lines = new ArrayList<String>();

    // Reads the classpath resource dbFile line by line and returns its lines.
    public ArrayList<String> Read(String dbFile) {
        try {
            File readable = new File(Reader.class.getResource(dbFile).getFile());
            Scanner dbScanner = new Scanner(readable);
            while (dbScanner.hasNextLine()) {
                String data = dbScanner.nextLine();
                lines.add(data);
            }
            // No lines left: release the file handle.
            dbScanner.close();
        } catch (FileNotFoundException e) {
            System.out.println("File not found");
            e.printStackTrace();
        }
        return lines;
    }
}

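First, we create an ArrayList that holds Strings. The public method Read() takes the path of the file we want to read, and the Scanner then adds each line to the ArrayList. Once no lines remain, the Scanner is closed.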
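If the database file sits on the ordinary file system rather than on the classpath, the same job can be done with `java.nio.file.Files`. The following is only a sketch of that alternative, not the approach used in this post: the class name `PathReaderSketch` and the method `readLines` are hypothetical, and UTF-8 is an assumed encoding that may need changing for the real list files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class PathReaderSketch {
    // Reads every line of the file at `path`.
    // Returns an empty list if the file is missing or cannot be decoded.
    static List<String> readLines(Path path) {
        try {
            // Assumption: the file is UTF-8 encoded.
            return Files.readAllLines(path, StandardCharsets.UTF_8);
        } catch (IOException e) {
            System.err.println("Could not read " + path + ": " + e.getMessage());
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a two-line temporary file, then read it back.
        Path tmp = Files.createTempFile("actors", ".txt");
        Files.write(tmp, List.of(
            "Andre, Kurt\t\tMy Politics, My Country (2012)",
            "\t\t\tWorld Bender (2014)"));
        System.out.println(readLines(tmp).size()); // prints: 2
        Files.delete(tmp);
    }
}
```

Unlike the Scanner loop, this reads the whole file in one call, which is fine for a small sample but worth reconsidering for a file with millions of lines.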

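Creating the Parser class

To turn each line of data into a row of a structured CSV file, we need to create a pattern and check whether it matches the lines we have just read.

The ParseActors method

The ParseActors method takes two arguments: the list of lines read from the file, and a regular expression.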

public ArrayList<ArrayList<String>> ParseActors(ArrayList<String> input, String regex) {
...
}

The method returns an ArrayList of ArrayLists of Strings, so that every actor and film in the database is recorded as a row of its own.

Creating the pattern

Pattern pattern = Pattern.compile(regex);

Assembling the return value

To make sure our CSV file is structured, we create an ArrayList and add the CSV header (Actor, Title, Year).

ArrayList<ArrayList<String>> collections = new ArrayList<ArrayList<String> >();

collections.add(new ArrayList<String>());

collections.get(0).add("Actor");
collections.get(0).add("Title");
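collections.get(0).add("Year");

Finding matches and populating collections

Now that the list is ready and the pattern is compiled, we can search for matches. Looking at the regex string again, we see that it captures several groups. We only need the actor's name, the film they appeared in, and the year, which are groups 1, 3 and 5.

Let's create a for loop over all the lines and use the pattern to find the right strings.
The database has one significant quirk, however: an actor's name appears only once, followed by the complete list of films they appeared in.

Once an actor has been added to collections, we assume that every film found before the next actor appears belongs to that actor. For example: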

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {

    public ArrayList<ArrayList<String>> ParseActors(ArrayList<String> input, String regex) {
        Pattern pattern = Pattern.compile(regex);

        ArrayList<ArrayList<String>> collections = new ArrayList<ArrayList<String>>();

        // Row 0 is the CSV header.
        collections.add(new ArrayList<String>());
        collections.get(0).add("Actor");
        collections.get(0).add("Title");
        collections.get(0).add("Year");

        for (int i = 0; i < input.size(); i++) {
            Matcher matcher = pattern.matcher(input.get(i));

            // One row per input line; lines that do not match stay empty
            // and are skipped later by the Writer.
            collections.add(new ArrayList<String>());

            if (matcher.find()) {

                if (matcher.group(1) != null) {
                    String actor = matcher.group(1);

                    // Compare string contents with isEmpty(); == compares references.
                    if (actor.isEmpty()) {
                        // No name on this line: reuse the actor from the previous row.
                        collections.get(i + 1).add(collections.get(i).get(0));
                    } else {
                        collections.get(i + 1).add(actor.replaceAll("[,]", ""));
                    }
                    System.out.println(collections.get(i + 1).get(0));
                }

                if (matcher.group(3) != null) {
                    // Strip quotes and commas so the title cannot break the CSV.
                    collections.get(i + 1).add(matcher.group(3).replaceAll("[\",]", ""));
                }

                if (matcher.group(5) != null) {
                    collections.get(i + 1).add(matcher.group(5));
                }
            }
        }

        return collections;
    }
}
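The assumption that a film belongs to the most recently seen actor can also be looked at in isolation. The sketch below is a simplified illustration rather than part of the parser: `FillDownSketch` and `fillDown` are hypothetical names, and each input row is assumed to be an {actor, title} pair in which an empty actor means "same actor as the row above".

```java
import java.util.ArrayList;
import java.util.List;

class FillDownSketch {
    // Copies the last non-empty actor name down into rows that have none.
    static List<String[]> fillDown(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        String current = ""; // most recently seen actor
        for (String[] row : rows) {
            if (!row[0].isEmpty()) {
                current = row[0];
            }
            out.add(new String[] { current, row[1] });
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] { "Andre Kurt", "My Politics My Country" });
        rows.add(new String[] { "", "World Bender" });
        rows.add(new String[] { "Andre Mandi", "Memoirs of a Lifeguard" });

        for (String[] row : fillDown(rows)) {
            System.out.println(row[0] + "," + row[1]);
        }
        // prints:
        // Andre Kurt,My Politics My Country
        // Andre Kurt,World Bender
        // Andre Mandi,Memoirs of a Lifeguard
    }
}
```

Keeping the current actor in a variable avoids looking back at the previous row, which matters if the previous row happens to be empty.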

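Creating the Writer class

The Writer class writes the parsed output to a CSV file. The two methods we create in this class are much simpler than the one in the Parser class.
We use the File class again, this time to create a file named data.csv.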

// Requires: import java.io.File; import java.io.IOException;
public static void createFile() {
    try {
        File myObj = new File("data.csv");
        // createNewFile() returns false when the file already exists.
        if (myObj.createNewFile()) {
            System.out.println("File created: " + myObj.getName());
        } else {
            System.out.println("File already exists.");
        }
    } catch (IOException e) {
        System.out.println("An error occurred.");
        e.printStackTrace();
    }
}

We then create a writeToFile() method that writes all the data to the CSV file line by line.

// Requires: import java.io.FileWriter; import java.io.IOException; import java.util.ArrayList;
public static void writeToFile(ArrayList<ArrayList<String>> parserOutput) {
    // try-with-resources closes the writer even if a write fails.
    try (FileWriter myWriter = new FileWriter("data.csv")) {

        // Debug output: dump everything the parser produced.
        System.out.println(parserOutput);

        for (int i = 0; i < parserOutput.size(); i++) {
            ArrayList<String> row = parserOutput.get(i);
            // Skip the empty rows left by lines that did not match.
            if (row.isEmpty()) {
                continue;
            }
            myWriter.append(row.get(0) + "," + row.get(1) + "," + row.get(2) + "\n");
        }

        System.out.println("Successfully wrote to the file.");
    } catch (IOException e) {
        System.out.println("An error occurred.");
        e.printStackTrace();
    }
}
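Joining the fields with plain commas only works because the parser strips commas and double quotes from names and titles first. If you would rather keep those characters, one common alternative is to quote the fields that need it, in the style described by RFC 4180. The sketch below is an assumption about how that could look, not what this post's Writer does; `CsvLineSketch`, `escape` and `toCsvLine` are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

class CsvLineSketch {
    // Quotes a field only when it contains a comma, a double quote or a line break.
    // Double quotes inside a quoted field are doubled.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
            || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Builds one CSV line (without the trailing line break) from the given fields.
    static String toCsvLine(List<String> fields) {
        return fields.stream()
            .map(CsvLineSketch::escape)
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(List.of("Andre, Kurt", "My Politics, My Country", "2012")));
        // prints: "Andre, Kurt","My Politics, My Country",2012
    }
}
```

With this in place the original punctuation survives the round trip, at the cost of needing a CSV reader that understands quoted fields.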

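Putting it all together

To tie all the code together and test our parser, we need to:

  • Create a Main.java and add our regex string and the path of the database file
  • Create instances of Writer, Reader, and Parser
  • Create a list to hold the lines that were read
  • Create a list to hold the parsed data

Finally, we write all the data to the CSV file.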

import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        // Classpath location of the sample database file.
        String file = "/databases/actorstest.txt";
        String regex = "([A-Za-z,.'$& ]*)?([\t]*)(.+?)([ ]*)\\(([0-9,?]{4})(.+?\\n{2})?";

        Reader reader = new Reader();
        Parser parser = new Parser();
        Writer writer = new Writer();

        ArrayList<String> lijst = reader.Read(file);
        ArrayList<ArrayList<String>> parsed = parser.ParseActors(lijst, regex);

        // Both Writer methods are static, so Writer.createFile() would work as well.
        writer.createFile();
        writer.writeToFile(parsed);
    }
}

Running the code to test the parser

After running the code, a data.csv file is created automatically.

It stores all the data in three columns: actor, title, and year.

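Summary

We have built a parser for the actors list. To parse more IMDB lists, or any other database or list you come across, you can add a new Parse() method.
A parser like this helps you understand better what the data looks like.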


Copyright by @maybelence.
