I am very new to Erlang and as part of my learning exercise, I would like to write an HTML parser in Erlang.
I want to extract certain values from a web page, perhaps using a pattern to describe what data I want to extract.
Can anybody offer me some high level advice as to how they would approach this problem in Erlang?
I think I need to turn the document into a stack of tokens, perhaps using a finite state machine to track how deeply nested I am and where I am within the current element.
HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser parses tokenized input into the document, building up the document tree.
I would suggest having a look at the one included in Mochiweb:
http://github.com/mochi/mochiweb/blob/master/src/mochiweb_html.erl
The parse/1 function is probably the entry point you're interested in.
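As a rough illustration of what you can do with the result, here is a small sketch that collects the text under a node. It assumes the tree is made of {Tag, Attrs, Children} tuples with binaries for text, which is the shape I understand mochiweb_html:parse/1 to return; the module name html_text is made up for this example.

```erlang
-module(html_text).
-export([text/1]).

%% Collect all text under a node shaped like {Tag, Attrs, Children}.
%% That shape is an assumption about what mochiweb_html:parse/1 returns.
text({_Tag, _Attrs, Children}) when is_list(Children) ->
    iolist_to_binary([text(Child) || Child <- Children]);
text(Bin) when is_binary(Bin) ->
    Bin;
text(_Other) ->
    %% Comments and any other node kinds are skipped.
    <<>>.
```

With Mochiweb on the code path you would call it as html_text:text(mochiweb_html:parse(Html)), where Html is a binary holding the page.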
This is a big job if you plan to be complete about it. You are best off using the one that Roberto suggests, but if you are determined to write your own as a project to get familiar with Erlang, here are some suggestions...
You should first decide whether you are going to hand-code your parser or use leex and yecc to generate your parser from a grammar. Hand coding might be a better learning experience if you want to learn how to write idiomatic Erlang. Writing a parser is an excellent way to introduce yourself to Erlang; functional programming languages excel at implementing parsers.
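To give a feel for the hand-coded route, here is a minimal tokenizer sketch built on binary pattern matching, where each function clause plays the role of one state. The module name html_tok and the token shapes {start_tag, Name}, {end_tag, Name} and {text, Bin} are my own choices for illustration. It ignores attributes, comments, entities and character encoding, so it is nowhere near a real HTML tokenizer.

```erlang
-module(html_tok).
-export([tokens/1]).

%% Turn a binary into a list of {start_tag, Name}, {end_tag, Name}
%% and {text, Bin} tokens. Attributes stay inside Name untouched.
tokens(Bin) when is_binary(Bin) ->
    tokens(Bin, []).

tokens(<<>>, Acc) ->
    lists:reverse(Acc);
tokens(<<"</", Rest/binary>>, Acc) ->
    {Name, Rest1} = until_gt(Rest),
    tokens(Rest1, [{end_tag, Name} | Acc]);
tokens(<<"<", Rest/binary>>, Acc) ->
    {Name, Rest1} = until_gt(Rest),
    tokens(Rest1, [{start_tag, Name} | Acc]);
tokens(Bin, Acc) ->
    {Text, Rest} = text(Bin),
    tokens(Rest, [{text, Text} | Acc]).

%% Split at the first ">" and drop it.
until_gt(Bin) ->
    case binary:split(Bin, <<">">>) of
        [Name, Rest] -> {Name, Rest};
        [Name] -> {Name, <<>>}  % unterminated tag: take the remainder
    end.

%% Take everything up to the next "<" as text.
text(Bin) ->
    case binary:match(Bin, <<"<">>) of
        {Pos, _Len} -> split_binary(Bin, Pos);
        nomatch -> {Bin, <<>>}
    end.
```

For example, html_tok:tokens(<<"<h1>Hi</h1>">>) gives [{start_tag,<<"h1">>},{text,<<"Hi">>},{end_tag,<<"h1">>}].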
Second, you should decide whether you want to generate a DOM-like structure or use a SAX-like callback model, which in Erlang is naturally expressed as a behaviour. If you do the latter, you could simply implement the behaviour to create a DOM.
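Here is a rough sketch of the callback idea. To keep it self-contained it passes a fun instead of declaring a real behaviour (which would use -callback attributes and a callback module), and it assumes events shaped like {start_tag, Name}, {end_tag, Name} and {text, Bin}. The module name html_sax and the {Name, Children} node shape are made up for this example, and it only copes with properly nested input.

```erlang
-module(html_sax).
-export([fold/3, dom/1]).

%% SAX style: feed each event to Handler(Event, State) in order.
fold(Handler, State0, Events) when is_function(Handler, 2) ->
    lists:foldl(Handler, State0, Events).

%% A DOM builder written as one such handler.
%% The state is a stack of open elements above a synthetic root.
%% Unbalanced tags make this crash, which is acceptable for a sketch.
dom(Events) ->
    [{root, Children}] = fold(fun build/2, [{root, []}], Events),
    lists:reverse(Children).

%% Children are accumulated in reverse and flipped when a tag closes.
build({start_tag, Name}, Stack) ->
    [{Name, []} | Stack];
build({end_tag, Name}, [{Name, Kids}, {Parent, Siblings} | Stack]) ->
    [{Parent, [{Name, lists:reverse(Kids)} | Siblings]} | Stack];
build({text, Text}, [{Name, Kids} | Stack]) ->
    [{Name, [Text | Kids]} | Stack].
```

The point of the design is that the event source knows nothing about the DOM: a different handler fun could count tags or pull out one value without ever building a tree.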
If you look at behaviours, you may also come across parameterized modules. This was an experimental feature that could complement behaviours by allowing immutable state to be stored within an "instance" of a module. It never became an officially supported part of the language and was removed in Erlang/OTP R16. (For some people it just looked too OO.)
Another excellent resource is the xmerl code. Pay close attention to how it determines the character encoding and parses accordingly. HTML (in its various standards) works slightly differently, but it's important that you take the proper character encoding into account when you read the file.
Also from xmerl, you can see how that library constructs a DOM using Erlang tuples. You might want to do something similar.
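To see that representation for yourself, here is a small sketch that parses a well-formed document with xmerl_scan:string/1 and pulls text out with xmerl_xpath:string/2. The records come from xmerl.hrl and are plain tuples underneath. The module name xmerl_demo and the function titles/1 are made up for this example, and since xmerl is an XML parser this only works on markup that is also well-formed XML.

```erlang
-module(xmerl_demo).
-export([titles/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Return the text of every <title> element in a well-formed document.
%% Xml is a string (a list of characters), not a binary.
titles(Xml) when is_list(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    Nodes = xmerl_xpath:string("//title/text()", Doc),
    %% Keep only text nodes and flatten each value to a plain string.
    [lists:flatten(Value) || #xmlText{value = Value} <- Nodes].
```

Printing Doc in the shell shows the nested #xmlElement{} and #xmlText{} records, which is a useful model if you want your own parser to produce something similar.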