Best way for a beginner to learn screen scraping by Python [closed]


This might be one of those questions that are difficult to answer, but here goes:

I don't consider myself a programmer - but I would like to :-) I've learned R, because I was sick and tired of SPSS, and because a friend introduced me to the language - so I am not a complete stranger to programming logic.

Now I would like to learn Python - primarily to do screen scraping and text analysis, but also for writing web apps with Pylons or Django.

So: How should I go about learning to screen scrape with Python? I started going through the Scrapy docs, but I feel too much "magic" is going on - after all, I am trying to learn, not just do.

On the other hand: There is no reason to reinvent the wheel, and if Scrapy is to screen scraping what Django is to web pages, then it might after all be worth jumping straight into Scrapy. What do you think?

Oh - BTW: The kind of screen scraping: I want to scrape newspaper sites (i.e. fairly complex and big) for mentions of politicians etc. That means I will need to scrape daily, incrementally and recursively - and I need to log the results into a database of sorts - which leads me to a bonus question: Everybody is talking about NoSQL databases. Should I learn to use e.g. MongoDB right away (I don't think I need strong consistency), or is that foolish for what I want to do?

Thank you for any thoughts - and I apologize if this is too general to be considered a programming question.


asked Dec 01 '10 19:12

Andreas

1 Answer

I agree that the Scrapy docs give off that impression. But I believe, as I found for myself, that if you are patient with Scrapy, go through the tutorials first, and then bury yourself in the rest of the documentation, you will not only start to understand the different parts of Scrapy better, but you will also appreciate why it does what it does the way it does it. It is a framework for writing spiders and screen scrapers in the real sense of a framework. You will still have to learn XPath, but I find that it is best to learn it regardless. After all, you do intend to scrape websites, and an understanding of what XPath is and how it works is only going to make things easier for you.
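
To give a feel for what an XPath expression looks like, here is a minimal sketch using only Python's standard library. The HTML snippet, the class names and the URLs are made up for illustration, and `xml.etree.ElementTree` supports only a limited subset of XPath and needs well-formed markup; real pages usually call for lxml or Scrapy's own selectors, which accept much richer expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a newspaper front page.
PAGE = """
<html>
  <body>
    <div class="article">
      <h2><a href="/politics/budget">Minister defends budget</a></h2>
    </div>
    <div class="article">
      <h2><a href="/sport/final">Cup final preview</a></h2>
    </div>
    <div class="ad">
      <a href="/buy-now">Buy now</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Read the path left to right: any <div> whose class is "article",
# then its <h2> child, then the <a> inside that heading.
links = root.findall(".//div[@class='article']/h2/a")

# Collect (headline text, link target) pairs; the ad link is not matched.
headlines = [(a.text, a.get("href")) for a in links]

for title, href in headlines:
    print(title, href)
```

The same path idea carries over to Scrapy, where the expression is handed to a selector instead of `findall`.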

Once you have, for example, understood the concept of pipelines in Scrapy, you will be able to appreciate how easy it is to do all sorts of things with scraped items, including storing them in a database.
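
As a rough sketch of that idea: an item pipeline is an ordinary Python class with a `process_item(item, spider)` method, plus optional `open_spider` and `close_spider` hooks. The class name, the table layout and the item fields below are assumptions made up for this example, and SQLite stands in for whatever database you end up choosing. Nothing here imports Scrapy, so the pipeline is driven by hand at the bottom.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores scraped mentions in SQLite."""

    def __init__(self, db_path=":memory:"):
        # An in-memory database keeps the sketch self-contained;
        # a real project would point this at a file.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Runs once when the spider starts: open the database.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS mentions ("
            "url TEXT, politician TEXT, headline TEXT, "
            "PRIMARY KEY (url, politician))"
        )

    def process_item(self, item, spider):
        # Runs for every scraped item. INSERT OR IGNORE skips rows that
        # were stored on an earlier run, so repeated scrapes stay incremental.
        self.conn.execute(
            "INSERT OR IGNORE INTO mentions VALUES (?, ?, ?)",
            (item["url"], item["politician"], item["headline"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes.
        self.conn.close()


# Drive the pipeline by hand with a made-up item, twice on purpose.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
item = {
    "url": "/politics/budget",
    "politician": "Jane Doe",
    "headline": "Minister defends budget",
}
pipeline.process_item(item, spider=None)
pipeline.process_item(item, spider=None)  # duplicate, ignored by the table
rows = pipeline.conn.execute(
    "SELECT url, politician, headline FROM mentions"
).fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a real project Scrapy calls these methods itself once the class is listed in the `ITEM_PIPELINES` setting, which is the part that makes storing items feel easy.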

BeautifulSoup is a wonderful Python library that can be used to scrape websites. But, in contrast to Scrapy, it is not a framework by any means. For smaller projects, where you don't need to invest time in writing a proper spider or deal with scraping a large amount of data, you can get by with BeautifulSoup. For anything bigger, you will begin to appreciate the sort of things Scrapy provides.

answered Oct 05 '22 01:10

ayaz

Tags: python, beautifulsoup, lxml, scrapy, screen-scraping
