How can I see all notes of a Tumblr post from Python?

Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.

I'd like to get all of these notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or the Tumblr API). Extensive Googling did not turn up anything about extracting notes from Tumblr.

Can anyone point me in the right direction on which tool will enable me to do that?

asked Jan 19 '13 14:01

user1850727

3 Answers

Unfortunately, it looks like the Tumblr API has some limitations (it lacks meta information about reblogs, and notes are limited to 50 per post), so you can't get all the notes through it.

Page scraping is also forbidden by the Terms of Service:

"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"

Source:

https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
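
For the up-to-50 notes that the API does return, a minimal sketch could look like the one below. It assumes the v2 `posts` endpoint accepts `id` and `notes_info=true` and returns the notes under `response.posts[0].notes`; the helper names and the `"KEY"` placeholder are made up for illustration, and you would need your own API key. The post ID is taken from the notes URL of the post in question.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://api.tumblr.com/v2/blog"


def build_posts_url(blog, post_id, api_key):
    """Build a /posts request for a single post, asking for its notes."""
    query = urlencode({"api_key": api_key, "id": post_id, "notes_info": "true"})
    return f"{API_ROOT}/{blog}/posts?{query}"


def extract_notes(payload):
    """Return the notes list of the first post in a decoded API response.

    Assumption: notes live under response -> posts[0] -> notes.
    Returns an empty list if that structure is missing.
    """
    response = payload.get("response")
    if not isinstance(response, dict):
        return []
    posts = response.get("posts", [])
    return posts[0].get("notes", []) if posts else []


def fetch_notes(blog, post_id, api_key):
    """Perform the request and return the (possibly truncated) notes list."""
    url = build_posts_url(blog, post_id, api_key)
    with urllib.request.urlopen(url) as response:
        return extract_notes(json.load(response))
```

Usage would be something like `fetch_notes("ronbarak.tumblr.com", 40692813320, "KEY")`; expect at most 50 entries, not all 292.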

answered Oct 28 '22 04:10

Fábio Hiroki

Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:

http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy

Following pages are linked at the bottom, e.g.:

http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…

(See my answer on how to find the next URL in the a element's onclick attribute.)

Now you could use various tools to download/parse the data.

The following wget command should download all notes pages for that post:

wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
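
The same crawl could be sketched in Python roughly as follows. This is only a sketch under assumptions: it guesses that the next page's path shows up somewhere in the HTML (for example inside the onclick attribute) in the form `/notes/<post id>/<key>?from_c=<timestamp>`, as in the URLs above. The function names and the regular expression are mine, not anything Tumblr documents.

```python
import re
import time
import urllib.request
from urllib.parse import urljoin

# Assumption: the next page is referenced as /notes/<id>/<key>?from_c=<number>
NEXT_PATH = re.compile(r"/notes/\d+/\w+\?from_c=\d+")


def find_next_url(html, current_url):
    """Return the absolute URL of the next notes page, or None if not found."""
    match = NEXT_PATH.search(html)
    return urljoin(current_url, match.group(0)) if match else None


def fetch_all_note_pages(first_url, delay=1.0):
    """Yield the HTML of each notes page, following from_c links in turn."""
    url, seen = first_url, set()
    while url and url not in seen:  # 'seen' guards against looping forever
        seen.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        yield html
        url = find_next_url(html, url)
        time.sleep(delay)  # pause between requests
```

Each yielded page can then be handed to whatever parser you prefer to pull out the individual notes.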

answered Oct 28 '22 04:10

unor

As Fábio implies, it is better to use the API.

If for whatever reason you cannot, the tools to use depend on what you want to do with the data in the posts.

for a data dump: urllib will return the page you want as a string
looking for a specific section in the HTML: lxml is pretty good
looking for something in unruly HTML: definitely BeautifulSoup
looking for a specific item in a section: BeautifulSoup, lxml, or plain text parsing will do
need to put the data in a database/file: use Scrapy

Tumblr's URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc., until you reach the end of the posts and the server simply stops returning data.

So if you are going to brute-force your way through scraping, you can have your script dump everything to your hard drive until, say, the content tag comes back empty.

One last word of advice: please remember to put a short pause in your script, e.g. time.sleep(1) (Python's sleep takes seconds, not milliseconds), so that you don't put unnecessary stress on Tumblr's servers.
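
The brute-force loop described above could be sketched like this. The `fetch_page` callable is a hypothetical stand-in for whatever you use to download and parse page number N; it is assumed to return something empty once you run past the last page.

```python
import time


def crawl_pages(fetch_page, delay_seconds=1.0, max_pages=1000):
    """Collect page contents for page 1, 2, 3, ... until one comes back empty.

    fetch_page: callable taking a page number and returning that page's
                content (empty string, None or empty list when there is none).
    """
    pages = []
    for number in range(1, max_pages + 1):
        content = fetch_page(number)
        if not content:  # end of the posts reached
            break
        pages.append(content)
        time.sleep(delay_seconds)  # be gentle with the server
    return pages
```

The `max_pages` cap is just a safety net so a misbehaving fetcher cannot keep the loop running indefinitely.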

answered Oct 28 '22 02:10

Lynx-Lab

Tags:

python

beautifulsoup

urllib2

tumblr