Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting and Constructing Tables from HTML Files using Julia

Here's a public link to an example html file. I would like to extract each set of CAN and yearly tax information (example highlighted in red in the image below) from the file and construct a dataframe that looks like the one below.

Target Fields

enter image description here

Example DataFrame

| Row |          CAN | Crtf_NoCrtf | Tax_Year | Land_Value | Improv_Value | Total_Value | Total_Tax |
|-----+--------------+-------------+----------+------------+--------------+-------------+-----------|
|   1 | 184750010210 | Yes         |     2016 |      16720 |       148330 |      165050 | 4432.24   |
|   2 | 184750010210 | Yes         |     2015 |      16720 |       128250 |      144970 | 3901.06   |
|   3 | 184750010210 | Yes         |     2014 |      16720 |       109740 |      126460 | 3412.63   |
|   4 | 184750010210 | Yes         |     2013 |      16720 |       111430 |      128150 | 3474.46   |
|   5 | 184750010210 | Yes         |     2012 |      16720 |        99340 |      116060 | 3146.17   |
|   6 | 184750010210 | Yes         |     2011 |      16720 |       102350 |      119070 | 3218.80   |
|   7 | 184750010210 | Yes         |     2010 |      16720 |       108440 |      125160 | 3369.97   |
|   8 | 184750010210 | Yes         |     2009 |      16720 |       113870 |      130590 | 3458.14   |
|   9 | 184750010210 | Yes         |     2008 |      16720 |       122390 |      139110 | 3629.85   |
|  10 | 184750010210 | Yes         |     2007 |      16720 |       112820 |      129540 | 3302.72   |
|  11 | 184750010210 | Yes         |     2006 |      12380 |       112760 |             | 3623.12   |
|  12 | 184750010210 | Yes         |     2005 |      19800 |       107400 |             | 3882.24   |

Additional Information

If it is not possible to insert the CAN to each row that is okay, I can export the CAN numbers separately and find a way to attach them to the dataframe containing the tax values. I have looked into using beautiful soup for python, but I am an absolute novice with python and the rest of the scripts I am writing are in Julia, so I would prefer to keep everything in one language.

Is there any way to achieve what I am trying to achieve? I have looked at Gumbo.jl but can not find any detailed documentation/tutorials.

like image 430
Alejandro Braun Avatar asked Mar 20 '17 23:03

Alejandro Braun


2 Answers

The current answer is a bit out of date since the readall() function no longer exists. I'll update his answer below.

Here's a general breakdown of the package ecosystem for Julia (as of the time of writing this answer):

  • Requests.jl is used to download the HTML file itself (note that in avik's answer, he reads the HTML file from his local machine)
  • Cascadia.jl is required to search for CSS tags (e.g. the tag that you would find if you were to use Selector Gadget).
  • Gumbo.jl is required to parse the resulting HTML

The key thing to remember is that Gumbo stores objects in tree format as HTMLNodes or HTMLElements. So most objects have "parents" and "children." To get the data you need, it's simply a matter of filtering with the right selector (using Cascadia) and then going to the correct point in the Gumbo tree.

An updated version of avik's answer:

using Requests, Cascadia, Gumbo

# r = get(url) # Normally, you'd put a url here, but I couldn't find a way to grab it without having to download it and read it locally
# h = parsehtml(String(r.data)) # Then normally you'd execute this

# Instead, I'm going to read in the html file as a string and give it to Gumbo
h = parsehtml(readstring("z1.html"))

# Exploring with the various structure of Gumbo objects:
println(fieldnames(h.root))
println(fieldnames(h.root.children))
println(size(h.root.children))

# aviks code:
c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root);
for x in c
    println( x.children[1].text )
end

This particular webpage is more difficult to scrape than most, since it doesn't have a great CSS structure.

There's some nice documentation on workflow on the Cascadia README, but I still had some questions after reading it. For anyone else (like me, yesterday) who comes to this page looking for guidance on web scraping in Julia, I've created a jupyter notebook with a simple example that will hopefully help you understand the workflow in greater detail.

like image 186
Tyler R. Avatar answered Oct 15 '22 17:10

Tyler R.


So Gumbo.jl will parse the HTML and give you a programatic representation of the structure of the HTML file (called a DOM - Document Object Model). This is typically a tree of html tags, which you can traverse and extract the data you need.

To make this easier, what you really want is a way to query the DOM, so that you can extract the data you need without having to traverse the entire tree yourself. The Cascadia.jl project does this for you. It is built on top of Gumbo, and uses CSS selectors as the query language.

So for your example, you could use something like the following to extract all the CAN fields:

julia> using Gumbo

julia> using Cascadia

julia> h=parsehtml(read("/Users/aviks/Download/z1.html", String))

julia> c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root)
  13-element Array{Gumbo.HTMLNode,1}:
  Gumbo.HTMLElement{:span}:
  <span class="value">184750010210</span> 
   ...
#print all the CAN values
julia> for x in c
               println( x.children[1].text )
            end

  184750010210
  186170040070
  175630130020
  172640020290
  168330020230
  156340030160
  118210000020
  190490040500
  173480080430
  161160010050
  153510060090
  050493000250
  050470630910

Hopefully this gives you an idea of how to extract all the data you need.

like image 30
aviks Avatar answered Oct 15 '22 16:10

aviks