C/CPP version of BeautifulSoup especially at handling malformed HTML

Are there any recommendations for a C/C++ library that can be used to easily (as far as that is possible) parse, iterate over, and manipulate HTML streams or files, assuming some of them might be malformed (e.g. tags that are never closed)?

BeautifulSoup

384

asked May 24 '12 15:05

Tzury Bar Yochay

1 Answer

The HTMLparser module from libxml2 is easy to use (simple tutorial below) and works well even on malformed HTML.

Edit: The original blog post is no longer accessible, so I've copied its content here.

Parsing (X)HTML in C is often seen as a difficult task. It's true that C isn't the easiest language in which to develop a parser. Fortunately, libxml2's HTMLParser module comes to the rescue. So, as promised, here's a small tutorial explaining how to use libxml2's HTMLParser to parse (X)HTML.

First, you need to create a parser context. There are several functions for doing that, depending on how you want to feed data to the parser. I'll use htmlCreatePushParserCtxt(), since it works with memory buffers.
htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);
Then, you can set various options on that parser context.
htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
We are now ready to parse an (X)HTML document.
// char * data : buffer containing part of the web page
// int len : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);
Once you've pushed all your data, call that function again with a NULL buffer and 1 as the last argument. This ensures that the parser has processed everything.

Finally, how to get the data you parsed? That's easier than it seems. You simply have to walk the XML tree created.
void walkTree(xmlNode * a_node)
{ 
    xmlNode *cur_node = NULL;
    xmlAttr *cur_attr = NULL;
    for (cur_node = a_node; cur_node; cur_node = cur_node->next)
    {
        // do something with that node information, like... printing the tag's name and attributes
        printf("Got tag : %s\n", cur_node->name);
        for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next)
        {
            printf("  -> with attribute : %s\n", cur_attr->name);
        }
        }
        walkTree(cur_node->children);
    }
}
walkTree(xmlDocGetRootElement(parser->myDoc));
And that's it! Isn't that simple enough? From there, you can do all kinds of things, such as finding every referenced image (by looking at img tags) and fetching it, or anything else you can think of.

Also, note that you can walk the XML tree at any time, even if you haven't parsed the whole (X)HTML document yet.

If you have to parse (X)HTML in C, you should use libxml2's HTMLParser. It will save you a lot of time.

172

answered Oct 09 '22 16:10

Laurent Parenteau


Tags: c++, c, html, html-parsing
