Jsoup like library for Node.js [closed]

Is there an API for Node.js to fetch and query HTML, both from URLs and from static HTML?

I would like to do something like this for web scraping:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

I had a look at this question and went through most of those APIs, but I haven't found (or perhaps couldn't identify) anything similar.

asked Mar 24 '16 15:03 by alexpfx



1 Answer

jsdom is probably what you want: https://github.com/tmpvar/jsdom

You can use it in combination with jQuery to query the DOM. Here's an example of how I've been using it in one of my projects: https://github.com/gabesoft/seryth/blob/master/lib/sanitizer.js

You'll probably also need request to fetch the HTML from URLs: https://github.com/request/request

answered Sep 21 '22 16:09 by gabesoft