How do I parse an HTML page with Node.js?

Tags: node.js, html-parsing, server-side

I need to parse a large number of HTML pages on the server side.
We all agree that regular expressions are not the way to go here.
It seems to me that JavaScript is the natural way to parse an HTML page, but that assumption relies on server-side code having all the DOM capabilities JavaScript has inside a browser.

Does Node.js have that ability built in?
Is there a better approach to parsing HTML on the server side?

346

asked Sep 10 '11 16:09

Itay Moav -Malimovka

2 Answers

You can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.
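For context, plain Node.js does not expose the browser's DOM globals, which is why a module is needed in the first place. A quick way to check this yourself, as a minimal sketch (the helper name `missingDomGlobals` is made up for illustration):

```javascript
// Hypothetical helper: reports which browser DOM globals are absent
// from the current JavaScript runtime.
function missingDomGlobals() {
  const names = ['document', 'DOMParser'];
  return names.filter((name) => typeof globalThis[name] === 'undefined');
}

// Under plain Node.js, with no DOM library loaded, both names are
// expected to show up in the result.
console.log(missingDomGlobals());
```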

Other options include:

BeautifulSoup for Python
You can convert your HTML to XHTML and use XSLT
HTMLAgilityPack for .NET
CsQuery for .NET (my new favorite)
The SpiderMonkey and Rhino JS engines have native E4X support. This may be useful, but only if you convert your HTML to XHTML.

Out of all these options, I prefer the Node.js option, because it uses the standard W3C DOM accessor methods and I can reuse code on both the client and the server. I wish BeautifulSoup's methods were more similar to the W3C DOM, and I think converting your HTML to XHTML just to write XSLT is plain sadistic.
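To illustrate the code-reuse point, here is a rough sketch rather than a definitive recipe. It assumes jsdom has been installed with `npm install jsdom`; the function names `collectLinks` and `parseLinks` and the sample markup are invented for this example. `collectLinks` touches only standard DOM methods, so the same function could be handed a browser's `document`.

```javascript
// Works on any standards-compliant Document: a browser's `document`
// or the one jsdom builds on the server. Uses only W3C DOM methods.
function collectLinks(document) {
  return Array.from(document.querySelectorAll('a[href]')).map((a) => ({
    text: a.textContent.trim(),
    href: a.getAttribute('href'),
  }));
}

// Server-side entry point (hypothetical name). jsdom is a third-party
// package, not part of Node.js; it is required inside the function so
// the package is only needed when this function is actually called.
function parseLinks(html) {
  const { JSDOM } = require('jsdom');
  const dom = new JSDOM(html);
  return collectLinks(dom.window.document);
}

// Made-up sample input with unclosed <p> tags, as real pages often have.
const samplePage = `
  <p>See the <a href="/docs">Docs</a>
  <p>or visit <a href="https://example.com">Example</a>`;
```

Calling `parseLinks(samplePage)` should give one `{ text, href }` object per link. In a browser, `collectLinks(document)` would do the same for the current page without jsdom.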

120

answered Oct 03 '22 10:10

kzh

Use Cheerio. It isn't as strict as jsdom and is optimized for scraping. As a bonus, it uses the jQuery selectors you already know.

❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

❁ Insanely flexible: Cheerio wraps around @FB55's forgiving htmlparser. Cheerio can parse nearly any HTML or XML document.
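A small scraping sketch to show what that looks like in practice. It assumes Cheerio has been installed with `npm install cheerio`; the function name `extractHeadlines`, the `ul.news` selector, and the sample markup are all made up for illustration.

```javascript
// Hypothetical helper: pulls link titles and URLs out of a news list.
// cheerio is a third-party package, not part of Node.js; it is required
// inside the function so it is only needed when the function is called.
function extractHeadlines(html) {
  const cheerio = require('cheerio');
  const $ = cheerio.load(html); // $ behaves like jQuery bound to this document

  return $('ul.news li a')
    .toArray() // plain array of matched elements
    .map((el) => ({
      title: $(el).text().trim(),
      url: $(el).attr('href'),
    }));
}

// Made-up sample input; the <li> tags are deliberately left unclosed.
const newsPage = `
  <ul class="news">
    <li><a href="/a">First story</a>
    <li><a href="/b">Second story</a>
  </ul>`;
```

Calling `extractHeadlines(newsPage)` should return one `{ title, url }` object per link in the list, using the same selector syntax you would write with jQuery in a browser.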

answered Oct 03 '22 11:10

Meekohi
