 

How can I see all notes of a Tumblr post from Python?

Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.

I'd like to fetch all of these notes with a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or the Tumblr API). Extensive Googling did not turn up anything about extracting notes from Tumblr.

Can anyone point me in the right direction on which tool will enable me to do that?

user1850727 asked Jan 19 '13 14:01


3 Answers

Unfortunately, it looks like the Tumblr API has some limitations (it lacks meta information about reblogs, and notes are limited to 50), so you can't get all the notes through it.
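For reference, here is a minimal Python 3 sketch of what asking the API for notes could look like. It assumes the v2 `/posts` endpoint with its `notes_info` parameter and a response shaped like `{"response": {"posts": [{"notes": [...]}]}}`; the function names and the `YOUR_API_KEY` placeholder are made up for illustration, and the post id is the one that appears in the notes URLs elsewhere in this thread. Whatever comes back is still subject to the limit described above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.tumblr.com/v2/blog"


def build_posts_url(blog, post_id, api_key):
    """Build a /posts request URL that asks for note data on one post."""
    query = urlencode({"api_key": api_key, "id": post_id, "notes_info": "true"})
    return "{}/{}/posts?{}".format(API_ROOT, blog, query)


def extract_notes(payload):
    """Return the list of note dicts from a decoded /posts response.

    The response layout is an assumption; an empty list is returned
    when no post or no notes are present.
    """
    posts = payload.get("response", {}).get("posts", [])
    return posts[0].get("notes", []) if posts else []


def fetch_notes(blog, post_id, api_key):
    """Fetch one post and return whatever notes the API includes."""
    with urlopen(build_posts_url(blog, post_id, api_key)) as resp:
        return extract_notes(json.loads(resp.read().decode("utf-8")))
```

Usage would be along the lines of `fetch_notes("ronbarak.tumblr.com", 40692813320, "YOUR_API_KEY")`, which needs a registered API key to work.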

Page scraping is also forbidden by the Terms of Service:

"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"

Source:

https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc

Fábio Hiroki answered Oct 28 '22 04:10


With JavaScript disabled, Tumblr serves the notes on separate pages that contain nothing but the notes. For the blog post in question, the first page would be:

http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy

Following pages are linked at the bottom, e.g.:

  • http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
  • http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
  • http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013

(See my answer on how to find the next URL in the a element's onclick attribute.)

Now you could use various tools to download/parse the data.

The following wget command should download all notes pages for that post:

wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
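
Since the question asks for Python, roughly the same crawl could be sketched as below (Python 3). This is only a sketch under assumptions: it guesses that each notes page embeds the next page's path in the form `/notes/<id>/<key>?from_c=<number>`, modelled on the URLs listed above, and the names `find_next_notes_path` and `crawl_notes` are hypothetical. The `fetch` argument is injectable so the loop can be tried without touching the network.

```python
import re
import time
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed shape of the "more notes" target, modelled on the URLs above.
NEXT_NOTES_RE = re.compile(r"/notes/\d+/\w+\?from_c=\d+")


def find_next_notes_path(html):
    """Return the first from_c notes path found in the page, or None."""
    match = NEXT_NOTES_RE.search(html)
    return match.group(0) if match else None


def fetch_page(url):
    """Download one page and decode it as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_notes(first_url, fetch=fetch_page, delay=1.0, max_pages=100):
    """Collect the HTML of each notes page by following from_c links."""
    pages, seen, url = [], set(), first_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        path = find_next_notes_path(html)
        url = urljoin(url, path) if path else None
        if url:
            time.sleep(delay)  # pause (in seconds) before the next request
    return pages
```

The returned pages are raw HTML; pulling the individual notes out of them is left to whichever parser you prefer.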

unor answered Oct 28 '22 04:10


As Fábio implies, it is better to use the API.

If for whatever reason you cannot, the right tool depends on what you want to do with the data in the posts.

  • for a data dump: urllib will return the page you want as a string
  • looking for a specific section in the HTML: lxml is pretty good
  • looking for something in unruly HTML: definitely BeautifulSoup
  • looking for a specific item in a section: BeautifulSoup, lxml, or plain text parsing is what you need
  • need to put the data in a database/file: use Scrapy

Tumblr's URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, and so on, until you reach the end of the posts and the server simply stops returning data.

So if you are going to brute-force your way through scraping, you can easily tell your script to dump all the data to your hard drive until, say, the content tag is empty.

One last word of advice: please remember to put a short pause in your script between requests, e.g. time.sleep(1) (Python's sleep takes seconds, so sleep(1000) would wait almost 17 minutes), because otherwise you could put some stress on Tumblr's servers.
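
The brute-force loop described above might look something like this in Python 3. It is a sketch, not a tested scraper: `dump_pages` and its parameters are hypothetical names, the `url/scheme/N` pattern is taken at face value from the text, and the "is this page empty?" check is passed in as a function because what an empty page looks like depends on the blog's theme. The pause is one second per request.

```python
import time
from urllib.request import urlopen


def fetch_page(url):
    """Download one page and decode it as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def dump_pages(base_url, is_empty, fetch=fetch_page, delay=1.0, max_pages=1000):
    """Fetch base_url/1, base_url/2, ... until is_empty(html) is true."""
    pages = []
    for number in range(1, max_pages + 1):
        html = fetch("{}/{}".format(base_url.rstrip("/"), number))
        if is_empty(html):
            break  # reached the end of the posts
        pages.append(html)
        time.sleep(delay)  # seconds, not milliseconds
    return pages
```

The `max_pages` cap is there so a wrong `is_empty` check cannot turn the loop into an endless stream of requests.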

Lynx-Lab answered Oct 28 '22 02:10