I've been learning about web scraping using BeautifulSoup in Python recently, but earlier today I was advised to consider using XPath expressions instead.
How do XPath and BeautifulSoup differ in the way they work?
I have used both BeautifulSoup and lxml, and based on experience I lean towards lxml (see the performance comparison linked here). One thing to be wary of when using BeautifulSoup is the explicit selection of a parser: the default parser chosen for you may parse results incorrectly without any warning, which can lead to nightmares (my experience here).
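To show what explicit parser selection looks like, here is a minimal sketch (the markup is made up for the example). Naming the parser pins the behavior, rather than letting bs4 pick whichever backend happens to be installed:

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='a'>first</p><p>second</p></body></html>"

# Explicitly naming the parser ("html.parser" ships with the standard
# library) makes the result reproducible across machines; leaving it out
# lets bs4 silently pick a different backend depending on what is installed.
soup = BeautifulSoup(html, "html.parser")
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)  # ['first', 'second']
```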
Having said that, I often find it easier to write a bs4 snippet than the corresponding lxml one.
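To illustrate the difference, here is the same extraction written both ways; the markup is invented for the example:

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html

doc = "<div><a href='/a'>one</a><a href='/b'>two</a></div>"

# bs4: method calls that read almost like prose
soup = BeautifulSoup(doc, "html.parser")
bs4_links = [a["href"] for a in soup.find_all("a")]

# lxml: a single XPath expression does the same job
tree = lxml_html.fromstring(doc)
lxml_links = tree.xpath("//a/@href")

print(bs4_links, lxml_links)  # both ['/a', '/b']
```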
I would suggest bs4: its usage and docs are friendlier, which will save you time and build confidence, and that matters when you are teaching yourself string manipulation.
However, in practice it demands a strong CPU. I once scraped with no more than 30 connections on my 1-core VPS, and the CPU usage of the Python process stayed at 100%. It could have been the result of a bad implementation, but after I changed everything to re.compile, the performance issue was gone.
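A hedged sketch of that regex fallback: for markup that is simple and regular, a precompiled pattern can pull out fields without building a parse tree at all. (This is fragile on real-world HTML and is shown only to illustrate the trade-off; the pattern and document are made up.)

```python
import re

# Compile once, reuse across every page scraped; re.compile avoids
# re-parsing the pattern on each call.
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

doc = '<a href="/a">one</a> <a href="/b">two</a>'
pairs = LINK_RE.findall(doc)
print(pairs)  # [('/a', 'one'), ('/b', 'two')]
```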
As for performance: regex > lxml >> bs4. As for getting things done: no difference.