 

Julia: website scraping?

I have been trying for days to make progress with this little piece of code for getting the headlines and links of the news items from a journal website.

```julia
using HTTP

function website_parser(website_url::AbstractString)
    r = HTTP.get(website_url)      # fetch the page
    body = String(r.body)          # response body as a String
    return split(body, "\n")       # split into lines
end

website_parser("https://www.nature.com/news/newsandviews")
```

The problem is that I could not figure out how to proceed once I got the text from the website. How can I retrieve specific elements (such as the header and link of each news item, in this case)?

Any help is very much appreciated, thank you

flavinsky asked Apr 28 '18

1 Answer

You need some kind of HTML parsing. For extracting only the headers, you can probably get away with regular expressions, which are built into Julia.
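A minimal sketch of the regex approach, using only Julia's built-in `Regex` support. The HTML snippet and its structure are made-up examples for illustration, not the actual markup of the Nature site:

```julia
# A toy snippet standing in for the fetched page body (illustrative only).
html = """
<h3 class="article-title"><a href="/news/1">First headline</a></h3>
<h3 class="article-title"><a href="/news/2">Second headline</a></h3>
"""

# Capture the href and the link text of each anchor.
pattern = r"<a href=\"([^\"]+)\">([^<]+)</a>"

for m in eachmatch(pattern, html)
    link, title = m.captures
    println(title, " => ", link)
end
```

In a real script you would run this over the body returned by `HTTP.get` instead of a literal string; note that regexes like this break as soon as the markup varies (attribute order, nesting, line breaks), which is exactly why a proper parser is recommended below a certain complexity.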

If it gets more complicated than that, regular expressions don't generalize well, and you should use a full-fledged HTML parser. Gumbo.jl seems to be the state of the art in Julia and has a rather simple interface.
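A sketch of the Gumbo.jl approach, assuming the package is installed (`Pkg.add("Gumbo")`). It parses a small literal document and walks the tree collecting anchors; the real page's markup will of course differ, and the assumption that each anchor's first child is its text node is a simplification:

```julia
using Gumbo  # assumes Gumbo.jl is installed

# A toy document standing in for the fetched page (illustrative only).
doc = parsehtml("""
<html><body>
  <h3><a href="/news/1">First headline</a></h3>
  <p>Some unrelated text.</p>
</body></html>
""")

# Recursively collect (text, href) pairs for every <a> element.
function collect_links(node, acc=Tuple{String,String}[])
    if node isa HTMLElement
        if tag(node) == :a
            # Simplifying assumption: the anchor's first child is its text.
            label = isempty(node.children) ? "" : node.children[1].text
            push!(acc, (label, getattr(node, "href", "")))
        end
        for child in node.children
            collect_links(child, acc)
        end
    end
    return acc
end

collect_links(doc.root)
```

For the live site you would parse `parsehtml(String(HTTP.get(url).body))` instead of the literal string.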

In the latter case, it's unnecessary to split the document; in the former, splitting at least makes things more complicated, since you then have to think about line breaks. So: parse first, then split.

Specific elements can be extracted using the Cascadia.jl library, for instance. Elements can be selected by their class attribute via qs = eachmatch(Selector(".classID"), h.root), so that all elements with that class, such as <div class="classID">, are collected in the returned query result (qs).
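A sketch of the Cascadia approach, assuming both Gumbo.jl and Cascadia.jl are installed. The class name `.article-item` and the document structure are made-up; inspect the real page to find a selector that matches its actual markup:

```julia
using Gumbo, Cascadia  # assumes both packages are installed

# A toy document standing in for the fetched page (illustrative only).
doc = parsehtml("""
<html><body>
  <div class="article-item"><a href="/news/1">First headline</a></div>
  <div class="article-item"><a href="/news/2">Second headline</a></div>
</body></html>
""")

# Select every element carrying the (hypothetical) class "article-item".
qs = eachmatch(Selector(".article-item"), doc.root)

for el in qs
    link = el.children[1]  # assumes the <a> is the div's first child
    println(nodeText(link), " => ", getattr(link, "href", ""))
end
```

To run this against the live site, fetch the page first and feed it to the parser, e.g. `doc = parsehtml(String(HTTP.get(website_url).body))`.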

phipsgabler answered Jan 03 '23