
HTML-parser on Node.js [closed]

Is there something like Ruby's Nokogiri for Node.js? I mean a user-friendly HTML parser.

I've seen some parsers on the Node.js modules page, but I can't find anything polished and up to date.

asci asked Nov 02 '11 09:11





3 Answers

If you want to build a DOM, you can use jsdom.

There's also cheerio, which has a jQuery-like interface. It is a lot faster than older versions of jsdom, although these days the two are similar in performance.
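To give a feel for the jQuery-style interface, here is a minimal sketch that pulls links out of an HTML string. It assumes cheerio has been installed with `npm install cheerio`; the helper name `extractLinks` and the shape of the returned objects are made up for illustration.

```javascript
// Sketch only: assumes `npm install cheerio`. `extractLinks` is a hypothetical helper.
function extractLinks(html) {
  // Required inside the function so the dependency is only loaded when the helper runs.
  const cheerio = require('cheerio');

  // load() parses the markup and returns a jQuery-like query function.
  const $ = cheerio.load(html);

  // Select every anchor that has an href attribute and read its text and target.
  return $('a[href]')
    .toArray()
    .map((el) => ({
      text: $(el).text().trim(),
      href: $(el).attr('href'),
    }));
}
```

Calling `extractLinks(someHtmlString)` should give back one `{ text, href }` object per link, in document order. Anchors without an `href` are skipped by the selector.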

You might want to look at htmlparser2, a streaming parser that, according to its own benchmark, is faster than the alternatives. It does not build a DOM by default, but it is bundled with a handler that can produce one. This is the parser that cheerio uses.
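As a rough illustration of the streaming style, the sketch below feeds the parser in chunks and reacts to callbacks without building any DOM. It assumes `npm install htmlparser2`; the helper name `collectOpenTags` and the idea of passing the input as an array of chunks are mine, not something the library prescribes.

```javascript
// Sketch only: assumes `npm install htmlparser2`. `collectOpenTags` is a hypothetical helper.
function collectOpenTags(chunks) {
  // Required inside the function so the dependency is only loaded when the helper runs.
  const { Parser } = require('htmlparser2');

  const names = [];
  const parser = new Parser({
    // Called once for every opening tag, as soon as the parser has seen it.
    onopentag(name) {
      names.push(name);
    },
  });

  // Input can arrive piece by piece, e.g. from a network or file stream.
  for (const chunk of chunks) {
    parser.write(chunk);
  }
  parser.end(); // signal that no more input is coming

  return names;
}
```

In a real scraper the chunks would typically come from a stream's `data` events, and you would add `ontext` or `onclosetag` callbacks depending on what you need to extract.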

parse5 also looks like a good solution. It's fairly active (11 days since the last commit as of this update), WHATWG-compliant, and is used in jsdom, Angular, and Polymer.

If the website you're trying to scrape is dynamic, that is, its content is rendered by client-side JavaScript, then you should use a headless browser like phantomjs. If you're considering phantomjs, also have a look at casperjs. You can control casperjs from Node with SpookyJS.

Besides phantomjs, there's zombiejs. Unlike phantomjs, which cannot be embedded in Node.js, zombiejs is just a Node module.

There's a Nettuts+ tutorial on the latter solutions.

Farid Nouri Neshat answered Oct 05 '22 06:10



Try https://github.com/tmpvar/jsdom - you give it some HTML and it gives you a DOM.
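For example, something along these lines should work with current jsdom releases (older versions had a different API). It assumes `npm install jsdom`; `headingTexts` is a made-up helper name.

```javascript
// Sketch only: assumes `npm install jsdom`. `headingTexts` is a hypothetical helper.
function headingTexts(html) {
  // Required inside the function so the dependency is only loaded when the helper runs.
  const { JSDOM } = require('jsdom');

  // Parse the HTML string; the result exposes a browser-like window and document.
  const dom = new JSDOM(html);
  const { document } = dom.window;

  // From here on it is the standard DOM API, same as in a browser.
  const texts = Array.from(
    document.querySelectorAll('h1, h2'),
    (node) => node.textContent.trim()
  );

  dom.window.close(); // release the window's resources once we are done
  return texts;
}
```

The point is that once you have `document`, anything you know from browser-side DOM scripting (`querySelector`, `textContent`, `getAttribute`, and so on) carries over.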

thejh answered Oct 05 '22 05:10



You can also take a look at x-ray: https://github.com/lapwinglabs/x-ray

png answered Oct 05 '22 05:10
