
Specifying separate rows when extracting on import.io

Tags:

import.io

There must be a very simple solution to this that I am overlooking. I have set import.io to extract from a Wikipedia page here, but I cannot get each entry within the alphabetised sections onto a separate row when extracted. When training, it selects everything onto the same row, so the output is not usable. Any ideas?

Thomas Sharp asked May 24 '26 00:05



1 Answer

Wikipedia is a hard website, not to extract data from (it's all HTML, no JavaScript or AJAX), but to extract from automatically. This is because Wikipedia is free and open to edit, which results in millions of different page structures.

There are a few ways around it, though how easily they can be applied depends on the use case. Instead of using our point-and-click training, you can train it manually by specifying an XPath. For example, if the data is always structured in a table, you can use the XPath: //table which will scan the entire page for any tables and extract them. However, this will most likely also pick up unwanted tables, so you'd have to specify which table you want. For example, the table on this page has the class "wikitable", so we specify it as: //table[@class="wikitable"]

And then you'd of course need to make sure that the same applies to all the other pages for strikers. The data is easily recognizable by humans, but making a machine understand it is an arduous task: you have to find a common element shared by the data you are looking for, and tell the robot that things with this common element are what it should extract.
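To make the difference between the two XPaths concrete, here is a minimal sketch in Python using only the standard library. It is not how import.io works internally, and the sample markup, the "navbox" class, the player and club names, and the `extract_rows` helper are all made up for illustration. It also assumes well-formed markup. Real Wikipedia pages generally need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page (NOT real Wikipedia markup).
SAMPLE_PAGE = """
<html>
  <body>
    <table class="navbox">
      <tr><td>Unwanted navigation table</td></tr>
    </table>
    <table class="wikitable">
      <tr><th>Name</th><th>Club</th></tr>
      <tr><td>Player A</td><td>Club X</td></tr>
      <tr><td>Player B</td><td>Club Y</td></tr>
    </table>
  </body>
</html>
""".strip()


def extract_rows(page, table_class="wikitable"):
    """Return one list of cell texts per <tr> in tables with the given class."""
    root = ET.fromstring(page)
    rows = []
    # ElementTree supports a subset of XPath; paths must be relative,
    # so //table[@class="wikitable"] is written with a leading dot.
    for tr in root.findall(f".//table[@class='{table_class}']/tr"):
        cells = [(cell.text or "").strip() for cell in tr if cell.tag in ("td", "th")]
        rows.append(cells)
    return rows


# The broad XPath matches every table, including the unwanted one.
all_tables = ET.fromstring(SAMPLE_PAGE).findall(".//table")

# The narrower XPath keeps only the "wikitable", one output row per <tr>.
rows = extract_rows(SAMPLE_PAGE)

if __name__ == "__main__":
    print(len(all_tables), "tables matched by //table")
    for row in rows:
        print(row)
```

One thing to watch for: @class="wikitable" is an exact string comparison, so a table whose class attribute holds several values (for example "wikitable sortable") would not match. In a full XPath 1.0 engine, contains(@class, "wikitable") is a common workaround, although ElementTree's XPath subset does not support it.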

Thanks,
Meg

Meg Ainsley answered Jun 01 '26 06:06
