I want to process a Wikipedia dump file. In other words, I want to extract the title, categories, and text content of each article. Is there any Java API or tool that can help me do that? Thanks in advance.
The Wikipedia dump file is in XML format. Therefore, you can use any available XML tools for this purpose.
Note that due to the size of the dump file, a SAX parser will generally be much more efficient than a DOM parser (since a DOM parser will try to load the entire thing into a memory representation).
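As a rough illustration of the SAX approach, here is a minimal sketch using the JDK's built-in parser. It assumes the dump has already been decompressed, that each `<page>` element holds a `<title>` and a single `<revision>` with a `<text>` element, and that categories appear in the wikitext as `[[Category:Name]]` links. The class name `WikiDumpSketch`, the `Article` holder, and the callback style are made up for this example and are not part of any library.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class WikiDumpSketch {

    /** Hypothetical holder for one article: title, raw wikitext, category names. */
    static final class Article {
        final String title;
        final String text;
        final List<String> categories;

        Article(String title, String text, List<String> categories) {
            this.title = title;
            this.text = text;
            this.categories = categories;
        }
    }

    // Assumption: categories look like [[Category:Name]] or [[Category:Name|sort key]].
    private static final Pattern CATEGORY =
            Pattern.compile("\\[\\[Category:([^\\]|]+)");

    /** Collects the category names linked from a page's wikitext. */
    static List<String> extractCategories(String wikitext) {
        List<String> result = new ArrayList<>();
        Matcher m = CATEGORY.matcher(wikitext);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    /** Streams the dump and hands each article to the callback as soon as its page ends. */
    static void parse(InputStream in, Consumer<Article> onArticle) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private final StringBuilder buffer = new StringBuilder();
            private String title;
            private String text;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                buffer.setLength(0); // start collecting character data for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length); // may be called several times per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                switch (qName) {
                    case "title":
                        title = buffer.toString();
                        break;
                    case "text":
                        text = buffer.toString();
                        break;
                    case "page":
                        String body = (text == null) ? "" : text;
                        onArticle.accept(new Article(title, body, extractCategories(body)));
                        title = null;
                        text = null;
                        break;
                    default:
                        break;
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a decompressed dump file
        try (InputStream in = new FileInputStream(args[0])) {
            parse(in, a -> System.out.println(a.title + " -> " + a.categories));
        }
    }
}
```

Because each article is passed to the callback and then dropped, memory use stays roughly constant regardless of dump size. The `text` field is still raw wikitext; removing the markup would be a separate step.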