
Webcrawling script producing different results on two different machines

I created a web crawler in Python using BeautifulSoup. The crawler sends the same headers/user-agent when crawling certain sites. I noticed that when I run the exact same script on two different machines (my laptop and a server) to crawl a given site, they produce different results. By "different results," I mean that the script run on the server does not crawl all the links on the site.

For example, if I crawl Macys.com, the script on my laptop crawls each department (home, bedbath, womens, mens, etc.), while the script running on the server misses the bedbath department. This is really confusing, since both machines run the same script with the same headers/user-agent against the same site. I cannot think of any other setting that could be causing this.

Here is how I define my user-agent in Python and create the soup object:

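# Python 3 imports (on Python 2, Request and urlopen come from urllib2)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
hdr = {'User-Agent': user_agent}
response = urlopen(Request(current_url, headers=hdr))
html = response.read()
soup = BeautifulSoup(html, "lxml")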
lollerskates asked Mar 19 '26 09:03



1 Answer

If you don't specify the parser explicitly, BeautifulSoup picks the underlying parser automatically. From the documentation:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

The problem here is that it chooses different parsers locally and on the server, depending on which modules are installed in each Python environment. And since the parsers handle imperfect HTML differently, you see different results.

Explicitly specify the parser that fits your needs, for example:

soup = BeautifulSoup(html, "lxml")
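
To check whether the two machines really differ in which parsers are available, something like the following could be run on both the laptop and the server. This is only a minimal sketch: the function name installed_parsers is made up for illustration, and it assumes the usual import names lxml and html5lib for the third-party parsers (html.parser ships with Python). It only reports whether each module can be found; it does not compare versions.

```python
import importlib.util


def installed_parsers():
    """Report which Beautiful Soup parser backends can be imported here.

    Hypothetical helper: keys are the parser names passed to BeautifulSoup,
    values are True if the backing module is found in this environment.
    """
    # Parser name -> module that must be importable for it to work.
    candidates = {
        "lxml": "lxml",
        "html5lib": "html5lib",
        "html.parser": "html.parser",  # standard library, always present
    }
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in candidates.items()
    }


if __name__ == "__main__":
    for name, available in installed_parsers().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

If the output differs between the machines, an unspecified parser would resolve differently on each, which matches the behaviour described above.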
alecxe answered Mar 21 '26 22:03



