
Julia 0.4.0-dev+7053 html parsing is extremely fast

Edit: ... well, it became fast after the gracious help of @Ismael VC. The solution was to wipe my Julia v0.4 first, reinstall it from the most recent nightly, and then do a certain amount of package juggling: Pkg.init(), then Pkg.add("Gumbo"). Adding Gumbo produces a build error at first:

INFO: Installing Gumbo v0.1.0
INFO: Building Gumbo

WARNING: deprecated syntax "[a=>b, ...]" at /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl:19.
Use "Dict(a=>b, ...)" instead.
INFO: Attempting to Create directory /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads
INFO: Downloading file http://jamesporter.me/static/julia/gumbo-1.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 404 Not Found
================================[ ERROR: Gumbo ]================================

LoadError: failed process: Process(`curl -f -o /Users/szalmaf/.julia/v0.4/Gumbo/deps/downloads/gumbo-1.0.tar.gz -L http://jamesporter.me/static/julia/gumbo-1.0.tar.gz`, ProcessExited(22)) [22]
while loading /Users/szalmaf/.julia/v0.4/Gumbo/deps/build.jl, in expression starting on line 19

================================================================================

================================[ BUILD ERRORS ]================================

WARNING: Gumbo had build errors.

 - packages with build errors remain installed in /Users/szalmaf/.julia/v0.4
 - build the package(s) and all dependencies with `Pkg.build("Gumbo")`
 - build a single package by running its `deps/build.jl` script

================================================================================
INFO: Package database updated
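
An aside on the deprecation warning in the build log: Julia 0.4 replaced the old bracket dictionary-literal syntax with the Dict constructor, which is what Gumbo v0.1.0's build script still used. For example:

# Julia 0.3 dict literal, deprecated in 0.4:
d = ["a" => 1, "b" => 2]
# Julia 0.4 replacement:
d = Dict("a" => 1, "b" => 2)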

Because of this 404, one needs to check out the latest Gumbo from the master branch (Pkg.checkout("Gumbo")) and then run Pkg.build("Gumbo"), which in turn produces a Gumbo whose parsehtml is blazing fast.
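
Collected in one place, the sequence that worked for me (the same commands named above, with my comments):

Pkg.init()               # fresh package directory after reinstalling the nightly
Pkg.add("Gumbo")         # v0.1.0 fails to build with the 404 above
Pkg.checkout("Gumbo")    # switch to Gumbo master
Pkg.build("Gumbo")       # master builds cleanly; parsehtml is now fast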

Note: the problem was not what one commenter claimed (without reading the earlier comments carefully), namely that the JIT compiler makes 'it' slow. If you read the back-and-forth discussion between me and @Ismael VC, you can see that I ran his exact test code just as he did and got the results in the first two of my comments, which with my original installation were indeed too slow. Anyway, the important thing is that, with Ismael's help in our private chat, parsehtml is now as fast as it gets. Thanks again!


Original post:

Julia 0.4.0-dev+7053 html parsing is extremely slow?

Though the Julia language is sold as fast at many things, it looks very slow at everyday tasks like parsing web pages.

Profiling a scrape and parse of the http://julialang.org web page (the very page that shows how fast Julia is compared to C, Fortran, R, Matlab, etc.):

using HTTPClient, Gumbo

julia_url = "http://julialang.org"
println("  scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get                                    # HTTP GET via HTTPClient
println("  scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml  # Gumbo parse of the response body
println("  parsed: ", Dates.unix2datetime(time()))

gives

scrape start: 2015-09-05T16:47:03.843
scrape end: 2015-09-05T16:47:04.044
parsed: 2015-09-05T16:47:04.41

which shows that getting this web page takes ~200ms, which is reasonable over my wifi connection; however, parsing this simple page takes ~400ms, which seems prohibitive by today's standards.
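
For reference, the quoted deltas can be recomputed from the printed timestamps. A small sketch using the same Dates functionality (note that 04.41 above is 04.410, with the trailing zero trimmed in printing):

t0 = DateTime("2015-09-05T16:47:03.843")   # scrape start
t1 = DateTime("2015-09-05T16:47:04.044")   # scrape end
t2 = DateTime("2015-09-05T16:47:04.410")   # parsed
t1 - t0   # 201 milliseconds for the GET
t2 - t1   # 366 milliseconds for parsehtml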

Doing the same test for a somewhat more complex webpage

julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
println("  scrape start: ", Dates.unix2datetime(time()))
julia_pageg = julia_url |> get
println("  scrape end: ", Dates.unix2datetime(time()))
julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml
println("  parsed: ", Dates.unix2datetime(time()))

gives

scrape start: 2015-09-05T16:57:52.054
scrape end: 2015-09-05T16:57:52.736
parsed: 2015-09-05T16:57:53.699

where parsing takes almost a full second.

I am probably missing something, but is there a better/faster way in Julia to parse a web page or to get an HTML element out of it? If so, how?

Asked by Ferenc


1 Answer

First of all, have you read the performance tips in the manual? Which Julia version are you using? (versioninfo())

  • http://julia.readthedocs.org/en/latest/manual/performance-tips/

You could start by reading that and putting your code inside a function, as the documentation suggests. There is also an @time macro that additionally reports memory allocation. Something like this:

Julia v0.3.11

Tested at: https://juliabox.org

using HTTPClient, Gumbo

function test(url::String)
    @show url

    print("Scraping: ")
    @time page = get(url)

    print("Parsing: ")
    @time page = parsehtml(bytestring(page.body))
end

let
    gc_disable()    # keep GC pauses out of the timings

    url =  "http://julialang.org"

    println("First run:")
    test(url)    # first run JITed

    println("\nSecond run:")
    test(url)

    url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"    

    println("\nThird run:")
    test(url)

    println("\nFourth run:")
    test(url)

    gc_enable()
end

First run:
url => "http://julialang.org"
Scraping: elapsed time: 0.248092469 seconds (3971912 bytes allocated)
Parsing: elapsed time: 0.850927483 seconds (27207516 bytes allocated)

Second run:
url => "http://julialang.org"
Scraping: elapsed time: 0.055722638 seconds (73952 bytes allocated)
Parsing: elapsed time: 0.005446998 seconds (821800 bytes allocated)

Third run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.282382774 seconds (619324 bytes allocated)
Parsing: elapsed time: 0.227427243 seconds (9728620 bytes allocated)

Fourth run:
url => "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping: elapsed time: 0.288903961 seconds (400272 bytes allocated)
Parsing: elapsed time: 0.017787089 seconds (1516560 bytes allocated)

These are the timings of your code with @time:

julia_url = "http://julialang.org"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml

First run:

elapsed time: 0.361194892 seconds (11108960 bytes allocated)
elapsed time: 0.996812988 seconds (34546156 bytes allocated, 4.04% gc time)

Second run:

elapsed time: 0.018920084 seconds (77952 bytes allocated)
elapsed time: 0.006632215 seconds (823256 bytes allocated)

julia_url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
@time julia_pageg = julia_url |> get
@time julia_page = julia_pageg |> x->x.body |> bytestring |> parsehtml

First run:

elapsed time: 0.33795947 seconds (535916 bytes allocated)
elapsed time: 0.224386491 seconds (9729852 bytes allocated)

Second run:

elapsed time: 0.276848452 seconds (584944 bytes allocated)
elapsed time: 0.018806686 seconds (1517856 bytes allocated)

Edit: v0.4.0-dev+7053

In version 0.4+, make sure to do a Pkg.checkout("Gumbo") first, in order to pull the latest commits. After doing that and then a Pkg.build("Gumbo") in JuliaBox, I get:

http://nbviewer.ipython.org/gist/Ismael-VC/4c241228f04ed54c70e2

First run:
url = "http://julialang.org"
Scraping:   0.227681 seconds (85.11 k allocations: 3.585 MB)
Parsing:   0.696063 seconds (799.12 k allocations: 29.450 MB)

Second run:
url = "http://julialang.org"
Scraping:   0.018953 seconds (571 allocations: 69.344 KB)
Parsing:   0.007132 seconds (15.91 k allocations: 916.313 KB)

Third run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping:   0.313128 seconds (4.86 k allocations: 608.850 KB)
Parsing:   0.196110 seconds (270.17 k allocations: 10.356 MB)

Fourth run:
url = "http://www.quora.com/How-powerful-and-faster-is-Julia-Language"
Scraping:  0.307949 seconds (1.41 k allocations: 470.953 KB)
Parsing:   0.019801 seconds (23.82 k allocations: 1.627 MB)
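
As for the other half of the question, getting an HTML element out of the parsed page: Gumbo represents the document as plain Julia types (an HTMLDocument with a root field, elements with children), so a small recursive search suffices. A minimal sketch, assuming the tree accessors from Gumbo's README (tag, .children, .text); the helper findfirst_tag is mine, not part of Gumbo:

using HTTPClient, Gumbo

# hypothetical helper, not part of Gumbo: depth-first search for the
# first element with the given tag
function findfirst_tag(elem, t::Symbol)
    isa(elem, HTMLElement) || return nothing   # skip text and comment nodes
    tag(elem) == t && return elem
    for child in elem.children
        found = findfirst_tag(child, t)
        found === nothing || return found
    end
    return nothing
end

doc = parsehtml(bytestring(get("http://julialang.org").body))
title_elem = findfirst_tag(doc.root, :title)
title_elem.children[1].text   # the page title as a string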
Answered by HarmonicaMuse


