Scraping and Parsing a Wikipedia Page

I'm wondering if there are any existing libraries in, or accessible from, Objective-C that would allow me to scrape pages formatted like this one. Specifically, I want all of the dates and all of the text next to each date. If not, what would be the best way to go about doing this? Regular expressions? I heard that NSString might already have built-in methods for this. Is this true?

I was looking around to see if there were any alternatives to scraping, such as an XML file or an API. I did find an API, but the only clients I see available are in other languages, and they seem to only be able to post content to pages, not retrieve it.

EDIT: So I found more information regarding the API at these links:

MediaWiki API
API:Query

And I was able to come up with this request, which returns some HTML-encoded text. (Well, the format is XML, but it includes the page's text, such as &lt;a href= etc.) I'll keep looking through the docs to see if I can make this come out a bit better. If not, though, are there any recommendations on parsing this?

EDIT 2: Alright, so thanks to this doc page, the simplest and cleanest way I've found to retrieve the data is this constructed link, which returns the raw data (in wiki markup) of the relevant section. I guess I would then need to parse that, but if that really is the case, it should be a lot easier than parsing the entire article.

Does anyone have any recommendations on parsing wiki markup such as the following in Objective-C?

==Events==
* [[710]] &ndash; [[Saracen]] invasion of [[Sardinia]].
*[[1275]] &ndash; Traditional founding of the city of [[Amsterdam]].
*[[1682]] &ndash; [[Philadelphia]], [[Pennsylvania]] is founded.

What I want to end up with is, I guess, an NSDictionary or similar collection that stores each date with its accompanying snippet of information. Thanks!

asked Oct 27 '09 19:10

Jorge Israel Peña

5 Answers

Add &format=fmt to the end of your query, as described at API:Data_formats. With &format=json, for example, your query returns JSON. You can specify XML, JSON, or many other formats.

You can easily parse out the overall sections and then display the HTML-formatted output in a web view.
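
As a rough sketch of the idea (in Python for brevity), the query URL can be assembled with the format appended as just another parameter. The endpoint, the page title and the helper name `build_query_url` are assumptions for illustration, since the original query link is not shown here; `action`, `prop`, `rvprop`, `titles` and `format` are standard MediaWiki API parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint; the original query link is not reproduced in this thread.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title, fmt="json"):
    """Build a MediaWiki API URL that asks for a page's wiki markup.

    `title` is a hypothetical page title; `fmt` is the value of the
    `format` parameter (for example "json" or "xml").
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": fmt,  # the parameter this answer is about
    }
    return API_ENDPOINT + "?" + urlencode(params)


# Only builds the URL; no network request is made here.
print(build_query_url("October 27"))
print(build_query_url("October 27", fmt="xml"))
```

Switching formats is then a one-argument change, so you can pick whichever one your parser on the Objective-C side handles best.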

answered Oct 02 '22 13:10

mbauman

Given that pages on Wikipedia are stored as plaintext, and input by users as plaintext, you're not going to get a structured data set from it.

answered Oct 02 '22 13:10

kprevas

I have scraped a lot of data from WP in various ways. The format depends on many things, including what type of subdomain the information is in and when it was entered. The main text is free-form and there is no simple way to scrape it. The infoboxes are in a special WP format that has changed over the years. It wasn't designed to be scraped.

There is a database backing WP which is somewhat more structured.

By far your best strategy is to contact the Wikipedians in the domain you wish to scrape. They will know about the database format and may well be able to help, and they will certainly want to, since they will want to see WP in semantic form (such as DBpedia: http://dbpedia.org/About).

answered Oct 02 '22 15:10

peter.murray.rust

Does Python count? ;) It is accessible from Objective-C, and there are great modules for scraping purposes: Beautiful Soup and/or mechanize. You can also consider lxml.
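
To give a flavour of what that looks like, here is a minimal sketch that pulls the text of each list item out of rendered HTML. It deliberately uses the standard library's `html.parser` rather than Beautiful Soup so it runs without installing anything. The sample HTML and the class name `ListItemText` are assumptions for illustration; the real page structure may differ.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the visible text of every top-level <li>, ignoring nested tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0    # greater than 0 while inside an <li>
        self._buffer = []  # text pieces of the <li> being read

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Hypothetical fragment shaped like a rendered "Events" list entry.
SAMPLE_HTML = (
    '<ul><li><a href="/wiki/710">710</a> &ndash; '
    '<a href="/wiki/Saracen">Saracen</a> invasion of '
    '<a href="/wiki/Sardinia">Sardinia</a>.</li></ul>'
)

parser = ListItemText()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.items)
```

Each collected string still contains the year and the description together, so you would split on the dash afterwards to get a date/text pair.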

answered Oct 02 '22 14:10

Piotr Byzia

I'm going to go with suggesting regex for targeted data extraction in a mixed HTML data stream.

There are already regex libraries on the phone, though they are somewhat hidden. You can expose them with a few simple calls using RegexKitLite (make sure to scroll down and get the Lite version). It ends up being a class with a few extensions on NSString that let you use regexes. You would then define a regex with two captured matches, one for the number and one for the content, along with a number of non-captured matches for the enclosing and intermediate tags. Even though it's a "lite" version of a standard regex library, it still supports just about any feature you would need.

The API approach is promising, but once you get the raw markup you're probably going to have to take a similar regex approach to parse data out of it. It still might make sense if it reduces regex complexity and data transfer time, though; there's no reason you can't combine both approaches.
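
To make the two-capture idea concrete for the raw markup in the question, here is a minimal sketch. It is written in Python for brevity, but the patterns use only basic syntax (`\d`, `\s`, non-capturing groups) that should carry over to RegexKitLite. The function name `parse_events` is hypothetical, and the sketch assumes every bullet follows the `*[[year]] &ndash; text` shape shown in the sample.

```python
import re

# One bullet line: "*", optional space, "[[year]]", the dash entity, then the text.
# Capture 1 is the year, capture 2 is the rest of the line.
EVENT_LINE = re.compile(r"^\*\s*\[\[(\d+)\]\]\s*&ndash;\s*(.+)$")

# A wiki link is [[Target]] or [[Target|label]]; keep only the visible part.
WIKI_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]")


def parse_events(markup):
    """Map each year (as a string) to its event text with link brackets removed."""
    events = {}
    for line in markup.splitlines():
        match = EVENT_LINE.match(line.strip())
        if match is None:
            continue  # skips headings such as ==Events== and blank lines
        year, text = match.groups()
        events[year] = WIKI_LINK.sub(r"\1", text).strip()
    return events


SAMPLE = """==Events==
* [[710]] &ndash; [[Saracen]] invasion of [[Sardinia]].
*[[1275]] &ndash; Traditional founding of the city of [[Amsterdam]].
*[[1682]] &ndash; [[Philadelphia]], [[Pennsylvania]] is founded."""

for year, text in parse_events(SAMPLE).items():
    print(year, "->", text)
```

Note that a plain dictionary keeps only one entry per year, so if a section lists several events for the same year you would want to store a list of snippets per key instead.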

answered Oct 02 '22 15:10

Kendall Helmstetter Gelner

Tags: parsing, objective-c, screen-scraping, wikipedia, wikipedia-api