How do I scrape a web page using C?

So I've written a web site scraper program in C# using the HTML Agility Pack. This was fairly straightforward. Even accounting for inconsistencies in formatting on the web page, it only took me a couple of hours to get working.

Now I have to re-implement this program in C so it can run in a Linux environment. This is a major nightmare.

I'm able to pull back the page, but when it comes to walking through it to pull out the parts I'm interested in, I'm drawing a lot of blanks. Originally, I was dead set on implementing a solution similar to my HTML Agility Pack version in C#, except using Tidy and some other XML library, so I could keep my logic more or less the same.

This hasn't worked out so well. The XML library I have access to doesn't appear to support XPath and I'm not able to install one that does. So I've resorted to trying to figure out a way to read through the page using string matching to find the data I want. I can't help but feel that there has to be a better way to do this.

Here is what I have:

#define HTML_PAGE "codes.html"

int extract()
{

    FILE *html;

    int found = 0;
    char buffer[1000];
    char searchFor[80], *cp;

    html = fopen(HTML_PAGE, "r");

    if (html)
    {

        // this is too error prone, if the buffer cuts off half way through a section of the string we are looking for, it will fail!
        while(fgets(buffer, 999, html))
        {
            trim(buffer);

            if (!found)
            {
                sprintf(searchFor, "<strong>");
                cp = (char *)strstr(buffer, searchFor);
                if(!cp)continue;

                if (strncmp(cp + strlen(searchFor), "CO1", 3) == 0 || strncmp(cp + strlen(searchFor), "CO2", 3) == 0)
                {
                    got_code(cp + strlen(searchFor));
                }
            }
        }

        fclose(html); // only close the file if fopen() succeeded
    }

    return 0;
}

got_code(html)
    char    *html;
{
    char    code[8];
    char    *endTag;
    struct  _code_st    *currCode;
    int i;  

    endTag = (char *)strstr(html, "</strong>");
    if(!endTag)return;

    sprintf(code, "%.7s", html);

    for(i=0 ; i<Data.Codes ; i++)
        if(strcasecmp(Data.Code[i].Code, code)==0)
           return;

    ADD_TO_LIST(currCode, _code_st, Data.Code, Data.Codes);
    currCode->Code = (char *)strdup(code);

    printf("Code: %s\n", code);
}

The above doesn't work properly. I get a lot of the codes I'm interested in but, as I mentioned in the code comment, if the buffer cuts off in the wrong spot I miss some.

I did try just reading the entire chunk of html I'm interested in into a string but I wasn't able to figure out how to cycle through that - I couldn't get any codes displayed.

Does anyone know how I can solve this issue?

EDIT: I've been thinking about this some more. Is there any way I can look ahead in the file and search for the end of each 'block' of text I am parsing and set the buffer size to be that before I read it? Would I need another file pointer to the same file? This would (hopefully) prevent the problem of the buffer cutting off at inconvenient places.
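
One possible way to avoid the buffer boundary altogether, without a second file pointer, is to load the whole page into one heap buffer and search that. The sketch below assumes the page is a regular (seekable) file that fits in memory and contains no embedded NUL bytes; `read_whole_file` is a made-up helper name, not part of any library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the whole file at `path` into one
 * NUL-terminated heap buffer. Returns NULL on failure; the caller
 * must free() the result. If out_len is not NULL it receives the
 * number of bytes read. */
char *read_whole_file(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");
    char *buf = NULL;
    long size;
    size_t got;

    if (!fp)
        return NULL;

    /* Find the file size by seeking to the end, then rewind. */
    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0 ||
        fseek(fp, 0L, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }

    buf = malloc((size_t)size + 1);
    if (buf) {
        got = fread(buf, 1, (size_t)size, fp);
        buf[got] = '\0';   /* terminate so string functions can scan it */
        if (out_len)
            *out_len = got;
    }

    fclose(fp);
    return buf;
}
```

Because the returned buffer is NUL-terminated, it can be handed straight to strstr(), and no tag can be split across two reads.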

367

asked Oct 17 '14 01:10

Adam Jones

1 Answer

Okay, so after much banging of head against the wall trying to come up with a way to make my above code work, I decided to try a slightly different approach.

Since I knew that the data on the page I'm scraping is contained on one huge line, I changed my code to search through the file until it found that line. Then I progress down the line looking for the blocks I want. This worked surprisingly well, and once I had the code reading some of the blocks, it was easy to make minor modifications to account for inconsistencies in the HTML. The part that took the longest was figuring out how to bail out once I reached the end of the line, and I solved that by peeking ahead to make sure that there was another block to read.

Here is my code (which is ugly but functional):

// getline(), strdup() and strcasecmp() are POSIX rather than ISO C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>    // strcasecmp()
#include <sys/types.h>  // ssize_t

#define HTML_PAGE "codes.html"
#define START_BLOCK "<strong>"
#define END_BLOCK "</strong>"

// Data, struct _code_st and ADD_TO_LIST come from the rest of the program (not shown)

void got_code(char *html);

int extract()
{

    FILE *html;

    int found = 0;
    char *line = NULL, *startTag;
    size_t len = 0;
    ssize_t read;

    html = fopen(HTML_PAGE, "r");

    if (html)
    {
        while((read = getline(&line, &len, html)) != -1)
        {
            if (found) // found line with codes we are interested in
            {
                char *ptr = line;

                while (ptr != NULL)
                {
                    startTag = strstr(ptr, START_BLOCK);
                    if (!startTag) break; // no more blocks on this line

                    // step past the opening tag so the next search always moves forward
                    ptr = startTag + strlen(START_BLOCK);

                    if (strncmp(ptr, "CO1", 3) != 0 && strncmp(ptr, "CO2", 3) != 0)
                        continue; // not a code, keep scanning

                    got_code(ptr);

                    // skip past the closing tag of the block we just read
                    ptr = strstr(ptr, END_BLOCK);
                    if (!ptr) break;
                    ptr += strlen(END_BLOCK);

                    // look ahead to make sure we have more to pull out
                    if (!strstr(ptr, END_BLOCK)) break;
                }

                break; // everything we want is on this one line
            }

            // find the section of the downloaded page we care about
            // the next line we read will be a blob containing the html we want
            if (strstr(line, "wiki-content") != NULL)
            {
                found = 1;
            }
        }

        free(line); // getline() allocated this buffer
        fclose(html);
    }

    return 0;
}

void got_code(char *html)
{
    char    code[8];
    char    *endTag;
    struct  _code_st    *currCode;
    size_t  n;
    int i;

    endTag = strstr(html, END_BLOCK);
    if(!endTag)return;

    // copy at most 7 characters, and never run into the closing tag
    n = (size_t)(endTag - html);
    if (n > sizeof(code) - 1) n = sizeof(code) - 1;
    memcpy(code, html, n);
    code[n] = '\0';

    // skip codes we have already stored
    for(i=0 ; i<Data.Codes ; i++)
        if(strcasecmp(Data.Code[i].Code, code)==0)
            return;

    ADD_TO_LIST(currCode, _code_st, Data.Code, Data.Codes);
    currCode->Code = strdup(code);

    printf("Code: %s\n", code);
}

Not nearly as elegant or robust as my C# program but at least it pulls back all the information I want.
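
For anyone adapting this, the scan over one long string can also be condensed into a small cursor-based helper. This is only a sketch under a few assumptions: the HTML is already in memory as a NUL-terminated string, the tags appear exactly as `<strong>` and `</strong>` (no attributes, no different casing), and `next_block` and `print_codes` are names made up for illustration. It leaves out the de-duplication and list handling that got_code() does.

```c
#include <stdio.h>
#include <string.h>

#define START_BLOCK "<strong>"
#define END_BLOCK   "</strong>"

/* Hypothetical helper: find the next START_BLOCK ... END_BLOCK pair at or
 * after *cursor. On success, copy the text between the tags into out
 * (truncated to outsz - 1 characters), move *cursor past the closing tag
 * and return 1. Return 0 when no complete block remains. */
int next_block(const char **cursor, char *out, size_t outsz)
{
    const char *start, *end;
    size_t n;

    if (outsz == 0)
        return 0;

    start = strstr(*cursor, START_BLOCK);
    if (!start)
        return 0;
    start += strlen(START_BLOCK);

    end = strstr(start, END_BLOCK);
    if (!end)
        return 0;

    n = (size_t)(end - start);
    if (n >= outsz)
        n = outsz - 1;
    memcpy(out, start, n);
    out[n] = '\0';

    *cursor = end + strlen(END_BLOCK);
    return 1;
}

/* Print every block whose text starts with "CO1" or "CO2" and
 * return how many were printed. */
int print_codes(const char *html)
{
    const char *cursor = html;
    char text[64];
    int count = 0;

    while (next_block(&cursor, text, sizeof text)) {
        if (strncmp(text, "CO1", 3) == 0 || strncmp(text, "CO2", 3) == 0) {
            printf("Code: %s\n", text);
            count++;
        }
    }
    return count;
}
```

Since the loop ends as soon as strstr() finds no further opening or closing tag, no separate look-ahead is needed to detect the end of the line.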

144

answered Oct 05 '22 16:10

Adam Jones


Tags: c, web-scraping
