
Parsing HTML from a web page

Tags: java, html, android

I have to extract some information from a web page, and reformat it for the user.

Since the web page is fairly regular, I currently use HttpClient to retrieve the HTML as a string and extract the substrings at fixed locations that hold the relevant data.

Anyhow, I'm wondering if there is a better way, maybe an HTML-aware way. How would you do it?

Cheers

Mascarpone asked Jan 21 '11 16:01





1 Answer

Ideally, you should use a real HTML parser rather than cutting substrings out of the raw text. I've used Jsoup successfully in the past on Android:

http://jsoup.org/
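
A minimal sketch of what the HTML-aware approach could look like with Jsoup. The markup in `SAMPLE_HTML`, the `div.item > h2` selector, and the `JsoupSketch` and `extractTitles` names are all made up for illustration, since the question does not show the real page; you would adapt the selector to the structure of the page you are scraping.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupSketch {

    // Hypothetical markup standing in for the real page (an assumption, not from the question).
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<div class=\"item\"><h2>First</h2><p>ignored</p></div>"
            + "<div class=\"item\"><h2>Second</h2></div>"
            + "</body></html>";

    // Parses the HTML into a document tree and collects the text of every
    // <h2> that is a direct child of a <div class="item">.
    static List<String> extractTitles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element heading : doc.select("div.item > h2")) {
            titles.add(heading.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        for (String title : extractTitles(SAMPLE_HTML)) {
            System.out.println(title);
        }
    }
}
```

Because the lookup goes through a CSS selector instead of fixed string offsets, small changes in whitespace or attribute order on the page should not break it. Jsoup can also fetch the page itself with `Jsoup.connect(url).get()`; on Android that call has to run off the main thread.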

Computerish answered Sep 28 '22 07:09
