
Pros and Cons of Python Web Scraping using BeautifulSoup vs XPath [closed]

I've been learning about web scraping using BeautifulSoup in Python recently, but earlier today I was advised to consider using XPath expressions instead.

How do XPath and BeautifulSoup differ in the way they work?

DanielSon asked Oct 02 '15 16:10


2 Answers

I have used both BeautifulSoup and lxml and, based on experience, lean towards lxml. See performance comparison here. One thing to be wary of when using BeautifulSoup is choosing the parser explicitly. The default parser chosen for you may silently parse a page incorrectly, which can lead to nightmares (my experience here).
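To illustrate the parser point, here is a minimal sketch that names the parser explicitly instead of relying on whichever one bs4 picks. The markup and the `item_texts` helper are made up for this example. `"html.parser"` is the parser backed by the standard library; `"lxml"` or `"html5lib"` could be passed instead if those packages are installed, and they may build a different tree from invalid markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
HTML = "<ul><li>first</li><li>second</li></ul>"


def item_texts(markup, parser="html.parser"):
    """Return the text of every <li>, parsing with an explicitly named parser."""
    # Passing the parser name as the second argument avoids depending on
    # whichever parser happens to be installed on the machine.
    soup = BeautifulSoup(markup, parser)
    return [li.get_text(strip=True) for li in soup.find_all("li")]


if __name__ == "__main__":
    print(item_texts(HTML))
```

Pinning the parser this way means the same script should behave the same on another machine, rather than changing silently when a different parser is available there.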

Having said that, I often find it easier to write a bs4 snippet than the corresponding lxml one.
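For a rough feel of the difference, the sketch below pulls the same links from a small page twice: once by walking the tree with bs4 method calls, and once with a single XPath expression through lxml. The page content and both function names are invented for the example, and it assumes both `bs4` and `lxml` are installed.

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

# Hypothetical page, used only for illustration.
PAGE = """
<html><body>
  <div class="post"><a href="/a">First</a></div>
  <div class="post"><a href="/b">Second</a></div>
  <div class="ad"><a href="/x">Sponsored</a></div>
</body></html>
"""


def links_bs4(markup):
    """bs4 style: find the matching tags, then navigate with Python attributes."""
    soup = BeautifulSoup(markup, "html.parser")
    return [div.a["href"] for div in soup.find_all("div", class_="post")]


def links_xpath(markup):
    """XPath style: describe the path to the wanted nodes in one expression."""
    tree = lxml_html.fromstring(markup)
    # Select the href attribute of <a> children of <div class="post">.
    return list(tree.xpath('//div[@class="post"]/a/@href'))


if __name__ == "__main__":
    print(links_bs4(PAGE))
    print(links_xpath(PAGE))
```

In the bs4 version the selection logic lives in Python code, while in the XPath version it lives in a query string, which is roughly the trade-off between the two approaches.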

Spade answered Sep 24 '22 08:09


I would suggest bs4. Its usage and docs are friendlier, which will save you time and build your confidence, and that is very important when you are teaching yourself string manipulation.

However, in practice it requires a strong CPU. I once scraped with no more than 30 connections on my 1-core VPS, and the CPU usage of the Python process stayed at 100%. It could have been the result of a bad implementation, but I later changed everything to re.compile and the performance issue was gone.
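A minimal sketch of what a re.compile approach could look like is below. This is not the code from that scraper; the pattern, the `HREF_RE` name, and the `extract_hrefs` helper are assumptions made for illustration. It only handles double-quoted href attributes and can break on markup it was not written for.

```python
import re

# Compiled once at import time and reused for every page.
# Matches <a ... href="..."> and captures the double-quoted href value.
HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]+)"', re.IGNORECASE)


def extract_hrefs(markup):
    """Return every double-quoted href found on an <a> tag, in document order."""
    return HREF_RE.findall(markup)


if __name__ == "__main__":
    sample = '<p><a href="/a">A</a> <a class="x" href="/b">B</a></p>'
    print(extract_hrefs(sample))
```

No document tree is built here, which is where the CPU saving would come from, but the pattern has to be kept in sync with the target site's markup by hand.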

As for performance, regex > lxml >> bs4. As for getting things done, there is no difference.

SnoopyGuo answered Sep 22 '22 08:09