Is there any API for Node.js to get and query html from URLs and static html?
I'd like to do something like this for web scraping:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
I had a look at this question and went through most of those APIs, but I haven't found (or perhaps couldn't identify) anything similar.
You can extract data by using CSS selectors, or by navigating and modifying the Document Object Model directly - just like a browser does, except you do it in Java code. You can also modify and write HTML out safely too. jsoup will not run JavaScript for you - if you need that in your app I'd recommend looking at JCEF.
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
clean: Creates a new, clean document from the original dirty document, containing only elements allowed by the safelist. The original document is not modified. Only elements from the dirty document's body are used.
jsdom is probably what you want: https://github.com/tmpvar/jsdom
You can use it in combination with jQuery to query the DOM. Here's an example of how I've been using it in one of my projects: https://github.com/gabesoft/seryth/blob/master/lib/sanitizer.js
You'll probably also need request to get the HTML from URLs: https://github.com/request/request