Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract information from HTML using Mathematica

Is there an easy way to extract data from specific HTML tables using Mathematica? Import seems to be pretty powerful, and Mathematica appears to be capable of handling formats such as XML pretty well.

Here's an example: http://en.wikipedia.org/wiki/Unemployment_by_country

like image 763
C. E. Avatar asked Jan 10 '12 20:01

C. E.


4 Answers

For general examples of this there are these How tos:

  • How to | Clean Up Data Imported from a ZIP File
  • How to | Clean Up Data Imported from a Website

For this specific example just import it

tmp = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]

Cleaning it up is fairly straight forward with this import. The table is 3 columns so extract it from the rest of the stuff:

tmp1 = Cases[tmp, {_, _?NumberQ, _}, \[Infinity]]

You will presumably want to remove the square bracket references (??):

tmp1[[All, 3]] = Flatten[If[StringQ[#], 
StringCases[#, x__ ~~ Whitespace ~~ "[" ~~ __ :> x], #] & /@ tmp1[[All, 3]]]

Grid[tmp1, Frame -> All]

Note also you can add the header back if you want it in your table, which you probably do

Grid[Join[{{"Country / Region", "Unemployment rate (%)", 
   "Source / date of information"}}, tmp1], Frame -> All]

purists might object to the last step but when you are scraping data generally you just want to get the job done and each site is a case by case prospect. So some manual inspection and flexibility gets you the fastest overall result.

Edit

if you wanted the flags you could also get them from CountryData. Some further cleaning up is needed otherwise a lot of misses will occur. The cleanup involves removing the reference to the "sovereign country" in parenthesis. e.g. "Guam ( United States )" -> "Gaum".

tmp2 = Flatten[
  If[StringMatchQ[#, __ ~~ "(" ~~ __], 
     StringCases[#, 
      z__ ~~ Shortest["(" ~~ __ ~~ ")" ~~ EndOfString] :> 
       StringTrim@z], StringTrim[#]] & /@ tmp1[[All, 1]]]

This will still produce some output that CountryData does not recognize.

flags = CountryData[#, "Flag"] & /@ tmp2;
Cases[flags, _CountryData]

6 misses out of 190. Remove those misses from the output:

flags = If[Head[#] === CountryData, {""}, {#}] & /@ flags; (*much faster than rule replacement*)
tmp2 = Join[flags, tmp1, 2];
Grid[tmp2, Frame -> All]

Note that this takes a while to render.

enter image description here

You can obviously style the Grid as desired using Grid options and also resize the images if needed.

like image 70
Mike Honeychurch Avatar answered Nov 07 '22 14:11

Mike Honeychurch


While the use of Import is probably a better and more robust way, I found that, at least for this particular problem, my own HTML parser (published in this thread), works fine with a small amount of post-processing. If you take the code from there and execute it, augmenting it with this function:

Clear[findAndParseTables];
findAndParseTables[text_String] :=
  Module[{parsed = postProcess@parseText[text]},
    DeleteCases[
      Cases[parsed, _tableContainer, Infinity],
      _attribContainer | _spanContainer, Infinity
    ] //.
    {(supContainer | tdContainer | trContainer | thContainer)[x___] :> {x},
        iContainer[x___] :> x,
        aContainer[x_] :> x,
        "\n" :> Sequence[],
       divContainer[] | ulContainer[] | liContainer[] | aContainer[] :> Sequence[]}];

Then you get, I think, a pretty much complete data by this code:

text = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Text"];
myData = First@findAndParseTables[text];

Here is how the result looks:

In[92]:= Short[myData,5]
Out[92]//Short= 
tableContainer[{{Country / Region},{Unemployment rate (%)},{Source / date of information}},
{{Afghanistan},{35.0},{2008,{3}}},{{Albania},{13.49},{2010 (Q4),{4}}},
{{Algeria},{10.0},{2010 (September),{5}}},<<188>>,{{West Bank},{17.2},{2010,{43}}},
{{Yemen},{35.0},{2009 (June),{128}}},{{Zambia},{16.0},{2005,{129}}},{{Zimbabwe},{97.0},{2009}}]

What I like about this approach (as opposed to say, Import->XMLObject) is that, since I convert the web page into Mathematica expression with minimal syntax (unlike e.g. XML objects), it is often very easy to establish a set of replacement rules which does the right post-processing in each given case. A final disclaimer is that my parser is not robust and does for sure contain a number of bugs, so be warned.

like image 41
Leonid Shifrin Avatar answered Nov 07 '22 15:11

Leonid Shifrin


Not a direct answer to how to import HTML (which others have explained nicely), but getting data from HTML tables is precisely why I originally made my table paste palette.

If your aim is to just get the data, this is probably going to be easier and faster than trying to parse the page.

Instructions on using the palette

  1. Evaluate the expression that creates the palette, go to Palettes -> Install Palette... and save it permanently for later use (if you wish).

  2. Select a part of the table on the webpage. If you are working with Firefox, hold down CTRL to select any rectangular section of the table (very useful!) Copy it.

  3. If you are using Firefox or Chrome, press the TSV button on the palette to paste the data into the notebook at the current insertion point. I'm not sure if other browsers also separate items with tabs when copying.

The result will look like this:

{{"Afghanistan", 35.`, "2008[3]"}, {"Albania", 13.49`, 
  "2010 (Q4)[4]"}, {"Algeria", 10.`, 
  "2010 (September)[5]"}, {"American Samoa (United States)", 23.8`, 
  "2010[3]"}, {"Andorra", 2.9`, 2009}}

As you can see, some post-processing is needed to convert years to a proper format (string or integer?)


This is the old palette code. I realize it's in need of cleanup, but it works as it is, and I haven't had time to fix it up yet. Report any issues in comments below.

CreatePalette@Column@{Button["TSV",
    Module[{data, strip},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     strip[s_String] := 
      StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
     strip[e_] := e;
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@Map[strip, ImportString[data, "TSV"], {2}]]
      ]
     ]
    ],
   Button["CSV",
    Module[{data, strip},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     strip[s_String] := 
      StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
     strip[e_] := e;
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@Map[strip, ImportString[data, "CSV"], {2}]]
      ]
     ]
    ],
   Button["Table",
    Module[{data},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@ImportString[data, "Table"]]
      ]
     ]
    ]}
like image 24
Szabolcs Avatar answered Nov 07 '22 15:11

Szabolcs


Import[
  "http://en.wikipedia.org/wiki/Unemployment_by_country",
  "Data"]

Of course, the result will frequently need further processing. How do you want to visualize it?

You can find all Import types using

Import[
  "http://en.wikipedia.org/wiki/Unemployment_by_country",
  "Elements"]
like image 24
Mark McClure Avatar answered Nov 07 '22 13:11

Mark McClure