How do I use the information in a Wikipedia dump's index file?

Tags: wiki, wikipedia

I am trying to do some research on Chinese people using wiki data. Rather than using DBpedia (its information about Chinese people is somewhat limited compared to zh.wikipedia.org), I found that I can download the data directly from zhwiki: http://download.wikipedia.com/zhwiki/20150301/.

I see there is an index file, and in it I can see rows such as: 966576:291:人物

I assume this is a lookup key. Can someone tell me how to use it to find the corresponding entry in the main file or database?


asked Mar 12 '15 21:03

daxu

1 Answer

The dump consists of two files:

zhwiki-20150301-pages-articles-multistream.xml.bz2 (1.1 GB): the articles, stored as multiple concatenated bz2 streams with 100 pages per stream
zhwiki-20150301-pages-articles-multistream-index.txt.bz2 (18.8 MB): the index file

Each line of the index file has the form offset:pageId:title, and all pages in the same stream share the same offset:

offset1:pageId1:title1
offset1:pageId2:title2
..
offset2:pageId101:title101

and so on.

The offset is the byte position in the multistream file at which a bz2 stream starts. To get a page, read the bytes from offset1 up to (but not including) offset2 from the .xml.bz2 file and pass them to a bz2 decoder. The result is the XML for the 100 pages in that stream.
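As a rough illustration of those steps, here is a minimal Python sketch using only the standard library. The function names find_stream_range and read_stream are hypothetical helpers invented for this example, and it assumes the index has already been decompressed into an iterable of text lines. For the row in the question, 966576:291:人物, the offset would be 966576, the page ID 291 and the title 人物.

```python
import bz2


def find_stream_range(index_lines, title):
    """Return (start, end) byte offsets of the bz2 stream containing `title`.

    `end` is None when the title sits in the last stream listed in the index.
    Hypothetical helper: scans the whole index, which is simple but not fast.
    """
    offsets = []   # distinct stream offsets, in the order they appear
    start = None
    for line in index_lines:
        # Titles can contain ':' themselves, so split at most twice.
        offset_str, _page_id, page_title = line.rstrip("\n").split(":", 2)
        offset = int(offset_str)
        if not offsets or offsets[-1] != offset:
            offsets.append(offset)
        if start is None and page_title == title:
            start = offset
    if start is None:
        raise KeyError(title)
    later = [o for o in offsets if o > start]
    return start, (min(later) if later else None)


def read_stream(dump_path, start, end=None):
    """Decompress the single bz2 stream that begins at byte `start`."""
    with open(dump_path, "rb") as f:
        f.seek(start)
        data = f.read() if end is None else f.read(end - start)
    # BZ2Decompressor stops at the end of the first stream it sees, so any
    # trailing bytes that belong to later streams are simply left unused.
    return bz2.BZ2Decompressor().decompress(data).decode("utf-8")


# Possible usage with the two files named above (not executed here):
#
#   with bz2.open("zhwiki-20150301-pages-articles-multistream-index.txt.bz2",
#                 "rt", encoding="utf-8") as index:
#       start, end = find_stream_range(index, "人物")
#   xml_chunk = read_stream(
#       "zhwiki-20150301-pages-articles-multistream.xml.bz2", start, end)
```

The returned text is a fragment of page elements rather than a complete XML document, so it would typically need to be wrapped in a root element before being handed to an XML parser, and the wanted page then picked out by its title or ID.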


answered Oct 21 '22 03:10

Intracer
