There must be a very simple solution to this that I am overlooking. I have set import.io to extract from a Wikipedia page here, but I cannot get each entry within the alphabetised sections onto a separate row when extracted. When training, it selects everything onto the same row, so the result is not usable. Any ideas?
Wikipedia is a very hard website, not to extract data from (it is all HTML, with no JavaScript or AJAX), but to extract from automatically. This is because Wikipedia is free and open to edit, which results in millions of different page structures.
There are a few ways around it, although how easily they can be applied depends on the use case. Instead of using our point-and-click training, you can train it manually by specifying an XPath. For example, if the data is always structured in a table, you can use the XPath:
//table
This will scan the entire page for any tables and extract them. However, it will most likely also pick up unwanted tables, so you would have to specify which table you want. For example, the table on this page has the class "wikitable", so we specify it as:
//table[@class="wikitable"]
You would then of course need to make sure that the same applies to all the other pages for strikers. The data is easily recognizable to humans, but making a machine understand it is an arduous task: you have to find a common element shared by the data you are looking for, and tell the robot that anything with this common element is what it should extract.
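To check an XPath like the one above before training with it, the idea can be sketched in Python outside import.io. This is only an illustration under stated assumptions: the sample markup, the player and club names, and the extract_rows helper are all made up, and the standard-library ElementTree module supports only a subset of XPath and needs well-formed markup, which real Wikipedia pages may not be (a dedicated HTML parser would likely be needed there). It also matches the class attribute exactly, so a table carrying more than one class would be missed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a fetched page.
SAMPLE_HTML = """
<html><body>
  <table class="navbox"><tr><td>Navigation</td></tr></table>
  <table class="wikitable">
    <tr><th>Name</th><th>Club</th></tr>
    <tr><td>Player A</td><td>Club X</td></tr>
    <tr><td>Player B</td><td>Club Y</td></tr>
  </table>
</body></html>
"""


def extract_rows(markup, css_class="wikitable"):
    """Return one list of cell texts per table row (hypothetical helper)."""
    root = ET.fromstring(markup)
    rows = []
    # Same idea as //table[@class="wikitable"]: only tables whose class
    # attribute is exactly css_class are selected, so the navbox is skipped.
    for table in root.findall(f".//table[@class='{css_class}']"):
        # Walk each <tr> separately so every entry lands on its own row.
        for tr in table.iter("tr"):
            cells = [
                (cell.text or "").strip()
                for cell in tr
                if cell.tag in ("th", "td")
            ]
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Selecting the repeating element (here each table row) rather than the whole container is what puts each entry on its own row. For the alphabetised sections in the question, the equivalent would be an XPath that targets each individual entry, assuming the entries share a common tag or class.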
Thanks,
Meg