Is there an easy way to extract data from specific HTML tables using Mathematica? Import
seems to be pretty powerful, and Mathematica appears to be capable of handling formats such as XML pretty well.
Here's an example: http://en.wikipedia.org/wiki/Unemployment_by_country
For general examples of this there are these How tos:
For this specific example just import it
tmp = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]
Cleaning it up is fairly straight forward with this import. The table is 3 columns so extract it from the rest of the stuff:
tmp1 = Cases[tmp, {_, _?NumberQ, _}, \[Infinity]]
You will presumably want to remove the square bracket references (??):
tmp1[[All, 3]] = Flatten[If[StringQ[#],
StringCases[#, x__ ~~ Whitespace ~~ "[" ~~ __ :> x], #] & /@ tmp1[[All, 3]]]
Grid[tmp1, Frame -> All]
Note also you can add the header back if you want it in your table, which you probably do
Grid[Join[{{"Country / Region", "Unemployment rate (%)",
"Source / date of information"}}, tmp1], Frame -> All]
purists might object to the last step but when you are scraping data generally you just want to get the job done and each site is a case by case prospect. So some manual inspection and flexibility gets you the fastest overall result.
Edit
if you wanted the flags you could also get them from CountryData
. Some further cleaning up is needed otherwise a lot of misses will occur. The cleanup involves removing the reference to the "sovereign country" in parenthesis. e.g. "Guam ( United States )" -> "Gaum".
tmp2 = Flatten[
If[StringMatchQ[#, __ ~~ "(" ~~ __],
StringCases[#,
z__ ~~ Shortest["(" ~~ __ ~~ ")" ~~ EndOfString] :>
StringTrim@z], StringTrim[#]] & /@ tmp1[[All, 1]]]
This will still produce some output that CountryData
does not recognize.
flags = CountryData[#, "Flag"] & /@ tmp2;
Cases[flags, _CountryData]
6 misses out of 190. Remove those misses from the output:
flags = If[Head[#] === CountryData, {""}, {#}] & /@ flags; (*much faster than rule replacement*)
tmp2 = Join[flags, tmp1, 2];
Grid[tmp2, Frame -> All]
Note that this takes a while to render.
You can obviously style the Grid
as desired using Grid
options and also resize the images if needed.
While the use of Import
is probably a better and more robust way, I found that, at least for this particular problem, my own HTML parser (published in this thread), works fine with a small amount of post-processing. If you take the code from there and execute it, augmenting it with this function:
Clear[findAndParseTables];
findAndParseTables[text_String] :=
Module[{parsed = postProcess@parseText[text]},
DeleteCases[
Cases[parsed, _tableContainer, Infinity],
_attribContainer | _spanContainer, Infinity
] //.
{(supContainer | tdContainer | trContainer | thContainer)[x___] :> {x},
iContainer[x___] :> x,
aContainer[x_] :> x,
"\n" :> Sequence[],
divContainer[] | ulContainer[] | liContainer[] | aContainer[] :> Sequence[]}];
Then you get, I think, a pretty much complete data by this code:
text = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Text"];
myData = First@findAndParseTables[text];
Here is how the result looks:
In[92]:= Short[myData,5]
Out[92]//Short=
tableContainer[{{Country / Region},{Unemployment rate (%)},{Source / date of information}},
{{Afghanistan},{35.0},{2008,{3}}},{{Albania},{13.49},{2010 (Q4),{4}}},
{{Algeria},{10.0},{2010 (September),{5}}},<<188>>,{{West Bank},{17.2},{2010,{43}}},
{{Yemen},{35.0},{2009 (June),{128}}},{{Zambia},{16.0},{2005,{129}}},{{Zimbabwe},{97.0},{2009}}]
What I like about this approach (as opposed to say, Import->XMLObject
) is that, since I convert the web page into Mathematica expression with minimal syntax (unlike e.g. XML objects), it is often very easy to establish a set of replacement rules which does the right post-processing in each given case. A final disclaimer is that my parser is not robust and does for sure contain a number of bugs, so be warned.
Not a direct answer to how to import HTML (which others have explained nicely), but getting data from HTML tables is precisely why I originally made my table paste palette.
If your aim is to just get the data, this is probably going to be easier and faster than trying to parse the page.
Instructions on using the palette
Evaluate the expression that creates the palette, go to Palettes -> Install Palette... and save it permanently for later use (if you wish).
Select a part of the table on the webpage. If you are working with Firefox, hold down CTRL to select any rectangular section of the table (very useful!) Copy it.
If you are using Firefox or Chrome, press the TSV
button on the palette to paste the data into the notebook at the current insertion point. I'm not sure if other browsers also separate items with tabs when copying.
The result will look like this:
{{"Afghanistan", 35.`, "2008[3]"}, {"Albania", 13.49`,
"2010 (Q4)[4]"}, {"Algeria", 10.`,
"2010 (September)[5]"}, {"American Samoa (United States)", 23.8`,
"2010[3]"}, {"Andorra", 2.9`, 2009}}
As you can see, some post-processing is needed to convert years to a proper format (string or integer?)
This is the old palette code. I realize it's in need of cleanup, but it works as it is, and I haven't had time to fix it up yet. Report any issues in comments below.
CreatePalette@Column@{Button["TSV",
Module[{data, strip},
data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
strip[s_String] :=
StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
strip[e_] := e;
If[Head[data] === String,
NotebookWrite[InputNotebook[],
ToBoxes@Map[strip, ImportString[data, "TSV"], {2}]]
]
]
],
Button["CSV",
Module[{data, strip},
data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
strip[s_String] :=
StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
strip[e_] := e;
If[Head[data] === String,
NotebookWrite[InputNotebook[],
ToBoxes@Map[strip, ImportString[data, "CSV"], {2}]]
]
]
],
Button["Table",
Module[{data},
data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
If[Head[data] === String,
NotebookWrite[InputNotebook[],
ToBoxes@ImportString[data, "Table"]]
]
]
]}
Import[
"http://en.wikipedia.org/wiki/Unemployment_by_country",
"Data"]
Of course, the result will frequently need further processing. How do you want to visualize it?
You can find all Import
types using
Import[
"http://en.wikipedia.org/wiki/Unemployment_by_country",
"Elements"]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With