So I've written a website scraper in C# using the HTML Agility Pack. That was fairly straightforward: even accounting for inconsistencies in the page's formatting, it only took me a couple of hours to get working.
Now I have to re-implement this program in C so it can run in a Linux environment. This is a major nightmare.
I'm able to pull back the page, but when it comes to tracking through it to pull out the parts I'm interested in, I'm drawing a lot of blanks. Originally I was dead set on implementing something similar to my C# HTML Agility Pack solution, using Tidy and some other XML library, so I could keep my logic more or less the same.
That hasn't worked out so well. The XML library I have access to doesn't appear to support XPath, and I'm not able to install one that does. So I've resorted to reading through the page with string matching to find the data I want, and I can't help but feel there has to be a better way.
Here is what I have:
#define HTML_PAGE "codes.html"

int extract()
{
    FILE *html;
    int found = 0;
    char buffer[1000];
    char searchFor[80], *cp;

    html = fopen(HTML_PAGE, "r");
    if (html)
    {
        // this is too error prone: if the buffer cuts off halfway through the string we are looking for, it will fail!
        while (fgets(buffer, sizeof(buffer), html))
        {
            trim(buffer);
            if (!found)
            {
                strcpy(searchFor, "<strong>");
                cp = strstr(buffer, searchFor);
                if (!cp) continue;
                if (strncmp(cp + strlen(searchFor), "CO1", 3) == 0 ||
                    strncmp(cp + strlen(searchFor), "CO2", 3) == 0)
                {
                    got_code(cp + strlen(searchFor));
                }
            }
        }
        fclose(html);   /* only close the handle if fopen() succeeded */
    }
    return 0;
}
void got_code(char *html)
{
    char code[8];
    char *endTag;
    struct _code_st *currCode;
    int i;

    endTag = strstr(html, "</strong>");
    if (!endTag) return;
    snprintf(code, sizeof(code), "%.7s", html);
    for (i = 0; i < Data.Codes; i++)
        if (strcasecmp(Data.Code[i].Code, code) == 0)
            return;   /* already have this code */
    ADD_TO_LIST(currCode, _code_st, Data.Code, Data.Codes);
    currCode->Code = strdup(code);
    printf("Code: %s\n", code);
}
The above doesn't work properly. I get a lot of the codes I'm interested in, but as I mention in the comment, if the buffer cuts off at the wrong spot I miss some.
I did try reading the entire chunk of HTML I'm interested in into a string, but I wasn't able to figure out how to cycle through it; I couldn't get any codes displayed.
Does anyone know how I can solve this issue?
EDIT: I've been thinking about this some more. Is there any way I can look ahead in the file and search for the end of each 'block' of text I am parsing and set the buffer size to be that before I read it? Would I need another file pointer to the same file? This would (hopefully) prevent the problem of the buffer cutting off at inconvenient places.
Okay, so after much banging my head against the wall trying to make the code above work, I decided to try a slightly different approach.
Since I knew the data on the page I'm scraping is contained on one huge line, I changed my code to search through the file until it found that line, then walk along it looking for the blocks I wanted. This worked surprisingly well, and once I had the code reading some of the blocks, it was easy to make minor modifications to account for inconsistencies in the HTML. The part that took the longest was figuring out how to bail out once I reached the end of the line; I solved that by peeking ahead to make sure there was another block to read.
Here is my code (which is ugly but functional):
#define HTML_PAGE "codes.html"
#define START_BLOCK "<strong>"
#define END_BLOCK "</strong>"

int extract()
{
    FILE *html;
    int found = 0;
    char *line = NULL, *endTag, *startTag;
    size_t len = 0;
    ssize_t read;

    html = fopen(HTML_PAGE, "r");
    if (html)
    {
        while ((read = getline(&line, &len, html)) != -1)
        {
            if (found) // this is the line with the codes we are interested in
            {
                char *ptr = line;

                while (ptr != NULL)
                {
                    startTag = strstr(ptr, START_BLOCK);
                    if (!startTag)
                        break; // no more blocks on this line

                    if (strncmp(startTag + strlen(START_BLOCK), "CO1", 3) == 0 ||
                        strncmp(startTag + strlen(START_BLOCK), "CO2", 3) == 0)
                    {
                        got_code(startTag + strlen(START_BLOCK));
                    }

                    // move past the end of this block
                    ptr = strstr(startTag, END_BLOCK);
                    if (!ptr)
                        break;
                    ptr += strlen(END_BLOCK);

                    // look ahead to make sure we have more to pull out
                    endTag = strstr(ptr, END_BLOCK);
                    if (!endTag)
                        break;
                }
                found = 0;
                break;
            }

            // find the section of the downloaded page we care about;
            // the next line we read will be a blob containing the html we want
            if (strstr(line, "wiki-content") != NULL)
                found = 1;
        }
        free(line); // getline() allocated the buffer
        fclose(html);
    }
    return 0;
}
void got_code(char *html)
{
    char code[8];
    char *endTag;
    struct _code_st *currCode;
    int i;

    endTag = strstr(html, "</strong>");
    if (!endTag) return;
    snprintf(code, sizeof(code), "%.7s", html);
    for (i = 0; i < Data.Codes; i++)
        if (strcasecmp(Data.Code[i].Code, code) == 0)
            return;   /* already have this code */
    ADD_TO_LIST(currCode, _code_st, Data.Code, Data.Codes);
    currCode->Code = strdup(code);
    printf("Code: %s\n", code);
}
Not nearly as elegant or robust as my C# program but at least it pulls back all the information I want.