I have been trying for days to make progress with this small piece of code, which should get the headers and the links of the news items from a journal website.
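using HTTP
function website_parser(website_url::AbstractString)
    # Download the page and convert the response body to a String
    r = String(HTTP.get(website_url).body)
    # Split the HTML text into individual lines
    splitted = split(r, "\n")
end
website_parser("https://www.nature.com/news/newsandviews")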
The problem is that I cannot figure out how to proceed once I have the text from the website. How can I retrieve specific elements (such as the header and link of each news item in this case)?
Any help is very much appreciated, thank you
You need some kind of HTML parsing. If you only want to extract the headers, you can probably get away with regular expressions, which are built into Julia.
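As a rough sketch of the regular-expression route, the snippet below pulls the link target and link text out of a small HTML fragment. The fragment, the `<h3><a ...>` layout, and the names `pattern` and `news` are assumptions made for illustration; the real page will most likely use different markup, so the pattern would need adapting.

```julia
# Hypothetical HTML fragment standing in for the downloaded page text
html = """
<h3><a href="/articles/a1">First headline</a></h3>
<h3><a href="/articles/a2">Second headline</a></h3>
"""

# Capture the href value and the link text of every <a> wrapped in an <h3>
pattern = r"<h3>\s*<a href=\"([^\"]+)\">([^<]+)</a>\s*</h3>"

# Collect one named tuple per match
news = [(title = m.captures[2], link = m.captures[1]) for m in eachmatch(pattern, html)]

for item in news
    println(item.title, " -> ", item.link)
end
```

Because the pattern is matched against the whole string, there is no need to split the text into lines first.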
If it gets more complicated than that, regular expressions don't generalize, and you should use a full-fledged HTML parser. Gumbo.jl seems to be the state of the art in Julia and has a rather simple interface.
With an HTML parser, it's unnecessary to split the document; with regular expressions, splitting at least makes things more complicated, since you then have to think about line breaks. So it is better to parse first, then split.
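A minimal sketch of the parser route, assuming the Gumbo and AbstractTrees packages are installed. The HTML fragment and the variable names are made up for illustration; in practice `html` would be the string downloaded from the website.

```julia
using Gumbo
using AbstractTrees  # provides PreOrderDFS for walking the parsed tree

# Hypothetical HTML fragment standing in for the downloaded page text
html = """
<html><body>
<h3><a href="/articles/a1">First headline</a></h3>
<h3><a href="/articles/a2">Second headline</a></h3>
</body></html>
"""

doc = parsehtml(html)  # parse the whole document, no line splitting needed

# Walk the tree and record (link text, href) for every <a> element
links = Tuple{String,String}[]
for node in PreOrderDFS(doc.root)
    if node isa HTMLElement && tag(node) == :a
        # Join the text of the direct text children of the link
        label = join(c.text for c in children(node) if c isa HTMLText)
        push!(links, (label, getattr(node, "href")))
    end
end

for (label, href) in links
    println(label, " -> ", href)
end
```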
Specific elements can be extracted using the Cascadia library. For instance, elements with a given class in the HTML page can be selected via qs = eachmatch(Selector(".classID"), h.root), where h is the document returned by Gumbo's parsehtml, so that all elements of that class, such as <div class="classID">, are selected and returned in the query result (qs).
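Put together, this could look like the following sketch, assuming the Gumbo and Cascadia packages are installed. The HTML fragment, the class name `classID`, and the selector `.classID a` are placeholders; the real class names have to be looked up in the source of the page being scraped.

```julia
using Gumbo, Cascadia

# Hypothetical HTML fragment standing in for the downloaded page text
html = """
<html><body>
<div class="classID"><h3><a href="/articles/a1">First headline</a></h3></div>
<div class="other"><a href="/about">About</a></div>
<div class="classID"><h3><a href="/articles/a2">Second headline</a></h3></div>
</body></html>
"""

h = parsehtml(html)

# Select every <a> element nested inside an element with class "classID"
qs = eachmatch(Selector(".classID a"), h.root)

# Extract the link text and the href attribute from each matched node
news = [(title = nodeText(a), link = getattr(a, "href")) for a in qs]

for item in news
    println(item.title, " -> ", item.link)
end
```

The link inside the `other` div is skipped because it does not match the selector.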