Print more than 20 posts from Tumblr API

Good afternoon,

I'm very new to Python, but I'm trying to write a script that will allow me to download all of the posts (including the "notes") from a specified Tumblr account to my computer.

Given my inexperience with coding, I was trying to find a pre-made script which would allow me to do this. I found several brilliant scripts on GitHub, but none of them actually return the notes from Tumblr posts (as far as I can see, although please do correct me if anyone knows of one that does!).

Therefore, I tried to write my own script. I've had some success with the code below. It prints the most recent 20 posts from the given Tumblr (albeit in a rather ugly format -- essentially hundreds of lines of text all printed onto a single line of a notepad file):

# This script prints all the posts (including tags, comments) and also the
# first 20 notes from all the Tumblr blogs.

import pytumblr

# Authenticate via API Key
client = pytumblr.TumblrRestClient('myapikey')

#offset = 0

# Make the request
client.posts('staff', limit=2000, offset=0, reblog_info=True, notes_info=True,
             filter='html')

# Print out into a .txt file
with open('out.txt', 'w') as f:
    print >> f, client.posts('staff', limit=2000, offset=0, reblog_info=True,
                             notes_info=True, filter='html')

However, I want the script to continuously print posts until it reaches the end of the specified blog.

I searched this site and found a very similar question (Getting only 20 posts returned through PyTumblr), which was answered by the Stack Overflow user poke. However, I can't seem to implement poke's solution so that it works for my data; indeed, when I run the following script, no output at all is produced.

import pytumblr

# Authenticate via API Key
client = pytumblr.TumblrRestClient('myapikey')
blog = ('staff')

def getAllPosts(client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return

        for post in posts:
            yield post

        offset += 20
I should note that there are several posts on this site (e.g. Getting more than 50 notes with Tumblr API) about Tumblr notes, most of them asking how to download more than 50 notes per post. I'm perfectly happy with just 50 notes per post; it is the number of posts that I would like to increase.

Also, I've tagged this post as Python; however, if there is a better way to get the data I need using another programming language, that would be more than okay.

Thank you very much in advance for your time!

Izzy asked Nov 15 '17

1 Answer

tl;dr If you'd like to just see the answer, it's at the bottom after the heading A Corrected Version

The second code snippet defines a generator that yields posts one by one, so you have to consume it as part of something like a loop and then do something with the output. Here's your code with some additional code that iterates over the generator and prints out the data it gets back.

import pytumblr

def getAllPosts (client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return

        for post in posts:
            yield post

        offset += 20

# Authenticate via API Key
client = pytumblr.TumblrRestClient('myapikey')
blog = ('staff')

# use the generator getAllPosts
for post in getAllPosts(client, blog):
    print(post)

However, that code has a couple of bugs in it. getAllPosts won't yield the individual posts; it will yield the top-level keys of the API response, because iterating over a dictionary produces its keys. You can see this in the following example I ran in my IPython shell.

In [7]: yielder = getAllPosts(client, 'staff')

In [8]: next(yielder)
Out[8]: 'blog'

In [9]: next(yielder)
Out[9]: 'posts'

In [10]: next(yielder)
Out[10]: 'total_posts'

In [11]: next(yielder)
Out[11]: 'supply_logging_positions'

In [12]: next(yielder)
Out[12]: 'blog'

In [13]: next(yielder)
Out[13]: 'posts'

In [14]: next(yielder)
Out[14]: 'total_posts'

This happens because the posts variable in getAllPosts holds the entire response dictionary, which contains much more than just the posts from the staff blog: it also has items like how many posts the blog contains, the blog's description, when it was last updated, and so on. The code as-is could also result in an infinite loop, because the following conditional:

if not posts:
    return

would never be true, given the response structure; an empty Tumblr API response from pytumblr looks like this:

{'blog': {'ask': False,
  'ask_anon': False,
  'ask_page_title': 'Ask me anything',
  'can_send_fan_mail': False,
  'can_subscribe': False,
  'description': '',
  'followed': False,
  'is_adult': False,
  'is_blocked_from_primary': False,
  'is_nsfw': False,
  'is_optout_ads': False,
  'name': 'asdfasdf',
  'posts': 0,
  'reply_conditions': '3',
  'share_likes': False,
  'subscribed': False,
  'title': 'Untitled',
  'total_posts': 0,
  'updated': 0,
  'url': 'https://asdfasdf.tumblr.com/'},
 'posts': [],
 'supply_logging_positions': [],
 'total_posts': 0}

Here, if not posts is checked against that whole structure rather than against its posts field (which is the empty list), so the condition never becomes true, because the response dictionary itself is never empty (see: Truth Value Testing in Python).
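To make the difference concrete, here is a minimal illustration using a trimmed-down, hypothetical version of the response above:

# The full response dict is truthy even when it contains no posts,
# but its 'posts' list is falsy when empty.
response = {'blog': {'name': 'asdfasdf'}, 'posts': [], 'total_posts': 0}

print(bool(response))           # True  -> "if not response" never fires
print(bool(response['posts']))  # False -> "if not response['posts']" fires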


A Corrected Version

Here's code (mostly tested/verified) that fixes the loop in your getAllPosts implementation, and then uses that function to retrieve posts and dump them to a file named (BLOG_NAME)-posts.txt.

import pytumblr


def get_all_posts(client, blog):
    offset = 0
    while True:
        response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)

        # Get the 'posts' field of the response
        posts = response['posts']

        if not posts:
            return

        for post in posts:
            yield post

        # move to the next offset
        offset += 20


client = pytumblr.TumblrRestClient('secrety-secret')
blog = 'staff'

# use our function
with open('{}-posts.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print >>out_file, post
        # if you're on Python 3.x, use the following instead:
        # print(post, file=out_file)

This will just be a straight text dump of the API's post responses, so if you need to make it look nicer or anything, that's up to you.
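If you do want more readable output, one option is to serialize each post with the standard json module instead of printing the raw dictionary. This is just a sketch, assuming each post is a plain dict (which is what pytumblr returns):

import json

# Variant of the loop above: write each post as indented JSON
# instead of the raw dict repr (Python 3 syntax).
with open('{}-posts.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        out_file.write(json.dumps(post, indent=2, sort_keys=True))
        out_file.write('\n')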

逆さま answered Nov 02 '22