How can I do web scraping in Julia?

Tags:

julia

I want to extract the names of universities and their websites from this site into lists.

In Python I did it with BeautifulSoup v4:

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
content = BeautifulSoup(page.text, 'html.parser')

college_name = []
college_link = []
college_name_list = content.find_all('h3',class_='college')
for college in college_name_list:
    if college.find('a'):
        college_name.append(college.find('a').text)
        college_link.append(college.find('a')['href'])

I really like programming in Julia and since it's very similar to Python, I wanted to know if I can do web scraping in Julia too. Any help would be appreciated.

asked Jan 20 '20 14:01

2 Answers

Your Python code doesn't quite work; I guess the website has been updated recently. Since they have removed the links, as far as I can tell, here is a similar example using Gumbo.jl and Cascadia.jl that collects the names and locations instead.

I am using the built-in download function to fetch the webpage. It writes the page to a temporary file on disk, which I then read into a String. It might be cleaner to use HTTP.jl, which could read it straight into a String, but for this simple example it's fine.
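
For reference, the HTTP.jl route mentioned above might look roughly like this. It is only a sketch, assuming HTTP.jl is installed; fetch_html is a hypothetical helper name, not part of HTTP.jl.

```julia
using HTTP

# Hypothetical helper (not part of HTTP.jl): fetch a page and return its HTML as a String.
function fetch_html(url::AbstractString)
    response = HTTP.get(url)       # by default this throws on non-2xx status codes
    return String(response.body)   # the body is a Vector{UInt8}; String() converts it
end
```

With that helper, parsehtml(fetch_html(url)) would stand in for parsehtml(read(download(url), String)), skipping the temporary file.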

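using Gumbo     # HTML parser
using Cascadia  # CSS selectors (provides the sel"..." string macro)

url = "https://thebestschools.org/features/best-computer-science-programs-in-the-world/"

# download() saves the page to a temporary file and returns its path
page = parsehtml(read(download(url), String))


college_name = String[]
college_location = String[]


sections = eachmatch(sel"section", page.root)
for section in sections
    # Skip sections that do not contain a college heading
    maybe_col_heading = eachmatch(sel"h3.college", section)
    if length(maybe_col_heading) == 0
        continue
    end
    col_heading = first(maybe_col_heading)

    # The name is taken from the last child node of the heading
    name = strip(text(last(col_heading.children)))
    push!(college_name, name)

    # The location is the first child of the .school-location element
    loc = first(eachmatch(sel".school-location", section))
    push!(college_location, text(loc[1]))
end


# Show names and locations side by side as a two-column matrix
[college_name college_location]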

Outputs

julia> [college_name college_location]
51×2 Array{String,2}:
 "Massachusetts Institute of Technology (MIT)"  "Cambridge, Massachusetts"
 "Massachusetts Institute of Technology (MIT)"  "Cambridge, Massachusetts"
 "Stanford University"                          "Stanford, California"
 "Carnegie Mellon University"                   "Pittsburgh, Pennsylvania"
 ⋮

 "Shanghai Jiao Tong University"                "Shanghai, China"
 "Lomonosov Moscow State University"            "Moscow, Russia"
 "City University of Hong Kong"                 "Hong Kong"

It seems MIT is listed twice; the filtering code in my demo probably isn't quite right. But :shrug: MIT is a great university, I hear. Julia was invented there :joy:

answered Oct 19 '22 07:10

Lyndon White

Yes.

For the purpose of web-scraping, Julia has three libraries:

HTTP.jl to download the frontend source code of the website (this is comparable to Python's requests library),
Gumbo.jl to parse the downloaded source code into a hierarchical structured object,
and Cascadia.jl to finally scrape it using a CSS selector API.
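
How the three fit together can be sketched on a small inline HTML string, so that it runs without network access. The markup below is invented for illustration and is not the real page; in practice the string would come from HTTP.jl.

```julia
using Gumbo, Cascadia

# Invented sample markup (an assumption, not the real site).
# In practice this string would come from String(HTTP.get(url).body).
html = """
<html><body>
  <h3 class="college"><a href="https://example.edu">Example University</a></h3>
  <h3 class="college">No Link College</h3>
</body></html>
"""

doc = parsehtml(html)   # Gumbo: HTML string -> tree of nodes

names = String[]
links = String[]
for h3 in eachmatch(sel"h3.college", doc.root)   # Cascadia: CSS selector query
    anchors = eachmatch(sel"a", h3)
    isempty(anchors) && continue                 # skip headings without a link
    a = first(anchors)
    push!(names, nodeText(a))                    # text content of the <a> element
    push!(links, attrs(a)["href"])               # attrs() returns the attribute Dict
end
```

Here names and links play the same role as college_name and college_link in the Python version.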

I saw from your profile that you're young (16), and your Python implementation is also correct.

Therefore, I'd suggest you try a web-scraping task with these three libraries to better understand how they work.

The task that you wish to do, unfortunately, cannot yet be accomplished with Cascadia, since the h3 is in a <span>, which is currently not an implemented SelectorType in Cascadia.jl.
Source

answered Oct 19 '22 06:10

PseudoCodeNerd
