
Processing wikipedia dump file

Tags: java

I want to process a Wikipedia dump file. In other words, I want to extract the title, categories, and text content of each article. Is there a Java API or tool that can help me do that? Thanks in advance.

user1212009 asked Dec 16 '22 03:12


1 Answer

The Wikipedia dump file is in XML format. Therefore, you can use any available XML tools for this purpose.

Note that because the dump file is very large, a SAX parser will generally be much more efficient than a DOM parser, since a DOM parser tries to load the entire document into an in-memory representation.
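As a rough illustration of the SAX approach, here is a minimal sketch using only the JDK's built-in parser. It assumes the usual dump layout, where each article is a `<page>` element containing a `<title>` and a `<revision>` with a `<text>` element, and it assumes categories appear in the wikitext as `[[Category:Name]]` links. The class name `WikiDumpSax`, the `Page` holder, and the `parse` helper are made up for this example and are not part of any library.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for this example, not an existing API.
class WikiDumpSax {
    // Assumption: category links look like [[Category:Name]] or [[Category:Name|sort key]].
    private static final Pattern CATEGORY = Pattern.compile("\\[\\[Category:([^\\]|]+)");

    // Simple holder for the three things the question asks for.
    static final class Page {
        final String title;
        final String text;
        final List<String> categories;

        Page(String title, String text, List<String> categories) {
            this.title = title;
            this.text = text;
            this.categories = categories;
        }
    }

    // Pulls category names out of raw wikitext with the regex above.
    static List<String> categoriesOf(String wikitext) {
        List<String> result = new ArrayList<>();
        if (wikitext == null) {
            return result;
        }
        Matcher m = CATEGORY.matcher(wikitext);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    // Streams the dump and hands each finished page to the sink,
    // so only one page is held in memory at a time.
    static void parse(InputStream in, Consumer<Page> sink) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private String title;
            private String text;
            private boolean capture;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("page")) {
                    title = null;
                    text = null;
                }
                // Only buffer character data for the elements we care about.
                capture = qName.equals("title") || qName.equals("text");
                buf.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append rather than assign.
                if (capture) {
                    buf.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("title")) {
                    title = buf.toString();
                } else if (qName.equals("text")) {
                    // With several revisions per page, the last one wins.
                    text = buf.toString();
                } else if (qName.equals("page")) {
                    sink.accept(new Page(title, text, categoriesOf(text)));
                }
                capture = false;
            }
        });
    }
}
```

To use it, something like `WikiDumpSax.parse(new FileInputStream(path), page -> System.out.println(page.title + " " + page.categories))` should work on an uncompressed dump, where `path` is wherever you saved the file. On very large dumps some JDK versions may stop with an entity size limit error; raising the `jdk.xml.totalEntitySizeLimit` system property is the usual workaround.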

Greg Hewgill answered Dec 18 '22 17:12