The HTML may be dirty, for example producing errors such as "premature end of data in tag". How can I parse it? Thanks.



How to use libxml2 to parse dirty HTML in C programming

2 Answers

I had a lot of trouble with this because I did not know the library well, so I wrote a complete demo program that parses HTML using the libxml2 library.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <libxml/HTMLparser.h>

/* Recursively walk a node, its following siblings, and all their children */
void traverse_dom_trees(xmlNode *a_node)
{
    xmlNode *cur_node = NULL;

    if (NULL == a_node)
    {
        return;
    }

    for (cur_node = a_node; cur_node; cur_node = cur_node->next)
    {
        if (cur_node->type == XML_ELEMENT_NODE)
        {
            /* Element node: decide here whether the node should be excluded */
            printf("Node type: Element, name: %s\n", (const char *)cur_node->name);
        }
        else if (cur_node->type == XML_TEXT_NODE && cur_node->content != NULL)
        {
            /* Text node: the text is available in cur_node->content */
            printf("Node type: Text, node content: %s, content length %zu\n",
                   (const char *)cur_node->content,
                   strlen((const char *)cur_node->content));
        }
        traverse_dom_trees(cur_node->children);
    }
}

int main(int argc, char **argv)
{
    htmlDocPtr doc;
    xmlNode *root_element = NULL;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <file.html>\n", argv[0]);
        return 1;
    }

    /* Macro to check that the headers match the libxml2 library we are linked against */
    LIBXML_TEST_VERSION

    /* The HTML parser tolerates malformed markup; errors and warnings are silenced here */
    doc = htmlReadFile(argv[1], NULL, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);
    if (doc == NULL)
    {
        fprintf(stderr, "Document not parsed successfully.\n");
        return 1;
    }

    root_element = xmlDocGetRootElement(doc);

    if (root_element == NULL)
    {
        fprintf(stderr, "empty document\n");
        xmlFreeDoc(doc);
        return 1;
    }

    printf("Root Node is %s\n", (const char *)root_element->name);
    traverse_dom_trees(root_element);

    xmlFreeDoc(doc);       // free document
    xmlCleanupParser();    // free globals
    return 0;
}


answered Oct 12 '22 07:10

Pankaj Vavadiya

The libxml2 HTML parser accepts "dirty" HTML and turns it into a normalized tree. See htmlDocPtr htmlParseFile(const char *filename, const char *encoding) in the documentation:

http://xmlsoft.org/html/libxml-HTMLparser.html

answered Oct 12 '22 07:10

Not_a_Golfer



Tags:

c

libxml2

bloody numen
