I'm trying to scrape a website. I fetch the page with Request and pass the resulting HTML string to Cheerio.
The confusing part is that some elements I'm trying to scrape do exist in the HTML string, but Cheerio can't find them.
For instance, there is a table with an ID inside the last row of another table. I would expect that selecting by that ID would give me all of its child TRs. Instead I get the first TR, then a second TR containing a single TD, then an abrupt close of that second TR, followed by the closing tag for the table.
When I console.log the HTML string before passing it to cheerio.load, I can see that the second TR contains more content and is followed by several other TRs before the table closes. After running it through Cheerio, that content is missing.
Looking at the rest of the markup (no idea whether this is significant), I noticed it contains href="javascript:void(0)". Could something like that be throwing Cheerio off?
Thank you for any help.
Upon further investigation, a selector such as 'td:contains("this text")' does find the cell. So far as I can tell, nothing else does.
Also, javascript:void(0) was not causing the problem: I used a regex to remove all instances of it, and the mystery remains.
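For reference, the removal I tried looked roughly like the sketch below. The sample string and the function name stripVoidHrefs are made up for illustration; the real page markup is not reproduced here.

```javascript
// Hypothetical sample; the real page markup is not shown in this post.
const sample = '<td><a href="javascript:void(0)">Details</a></td>';

// Remove every literal occurrence of javascript:void(0).
// The parentheses are escaped because they are regex metacharacters.
function stripVoidHrefs(html) {
  return html.replace(/javascript:void\(0\)/g, "");
}

console.log(stripVoidHrefs(sample));
// <td><a href="">Details</a></td>
```

This leaves the href attributes empty but otherwise keeps the markup intact, which is why it was a quick way to rule the attribute out as the cause.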
It turns out the page's HTML contains an error, perhaps in an effort to deter scraping. A <font> tag was closed with a </div>, and this broke Cheerio's parsing. I removed the malformed fragment from the string before parsing:
html = html.replace(/<font size="1">\d<\/div>/g, "");
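If you hit something similar and don't want to hunt for the bad tag by eye, a rough stack-based check can point at closing tags that don't match the most recently opened tag. This is only a sketch, not a real HTML parser: findMismatchedTags and the sample string are hypothetical, it assumes attribute values contain no ">" character, it ignores comments and script contents, and it will also flag valid HTML that omits optional end tags (such as </td> or </p>).

```javascript
// Elements that never take a closing tag, so they are not tracked.
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Report every closing tag that does not match the most recently opened tag.
function findMismatchedTags(html) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(\/?)>/g;
  const open = [];      // stack of currently open tag names
  const problems = [];  // mismatches found so far
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    const isSelfClosing = match[3] === "/";
    if (VOID_TAGS.has(name) || isSelfClosing) continue;
    if (!isClosing) {
      open.push(name);
      continue;
    }
    const expected = open.length ? open[open.length - 1] : null;
    if (expected === name) {
      open.pop();
      continue;
    }
    problems.push({ index: match.index, found: name, expected: expected });
    // Recover: if this tag is open further down the stack, unwind to it.
    const position = open.lastIndexOf(name);
    if (position !== -1) open.length = position;
  }
  return problems;
}

// Hypothetical fragment shaped like the broken markup described above.
const fragment = '<table><tr><td><font size="1">3</div></td></tr></table>';
console.log(findMismatchedTags(fragment)[0]);
// { index: 31, found: 'div', expected: 'font' }
```

The first reported entry is the stray </div>. A second entry follows for </td>, because the <font> above it was never closed.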