Good afternoon,
I'm very new to Python, but I'm trying to write a script which will allow me to download all of the posts (including the "notes") from a specified Tumblr account to my computer.
Given my inexperience with coding, I was trying to find a pre-made script which would allow me to do this. I found several brilliant scripts on GitHub, but none of them actually return the notes from Tumblr posts (as far as I can see, although please do correct me if anyone knows of one that does!).
Therefore, I tried to write my own script. I've had some success with the code below. It prints the most recent 20 posts from the given Tumblr (albeit in a rather ugly format -- essentially hundreds of lines of text all printed onto one line of a notepad file):
# This script prints all the posts (including tags, comments) and also the
# first 20 notes from each post on the specified Tumblr blog.
import pytumblr

# Authenticate via API key
client = pytumblr.TumblrRestClient('myapikey')

# Make the request and print it out into a .txt file
with open('out.txt', 'w') as f:
    print >> f, client.posts('staff', limit=2000, offset=0, reblog_info=True,
                             notes_info=True, filter='html')
However, I want the script to continuously print posts until it reaches the end of the specified blog.
I searched this site and found a very similar question (Getting only 20 posts returned through PyTumblr), which has been answered by the stackoverflow user poke. However, I can't seem to actually implement poke's solution so that it works for my data. Indeed, when I run the following script, no output at all is produced.
import pytumblr

# Authenticate via API key
client = pytumblr.TumblrRestClient('myapikey')
blog = 'staff'

def getAllPosts(client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return
        for post in posts:
            yield post
        offset += 20
I should note that there are several posts on this site (e.g. Getting more than 50 notes with Tumblr API) about Tumblr notes, most of them asking how to download more than 50 notes per post. I'm perfectly happy with just 50 notes per post; it is the number of posts that I would like to increase.
Also, I've tagged this post as Python; however, if there is a better way to get the data I require using another programming language, that would be more than okay.
Thank you very much in advance for your time!
The second code snippet is a generator that yields posts one by one, so you have to use it as part of something like a loop and then do something with the output. Here's your code with some additional code that iterates over the generator and prints out the data it gets back.
import pytumblr

def getAllPosts(client, blog):
    offset = 0
    while True:
        posts = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        if not posts:
            return
        for post in posts:
            yield post
        offset += 20

# Authenticate via API key
client = pytumblr.TumblrRestClient('myapikey')
blog = 'staff'

# use the generator getAllPosts
for post in getAllPosts(client, blog):
    print(post)
However, that code has a couple of bugs in it. getAllPosts won't yield just each post; it will also return other things, because it iterates over the whole API response, as you can see from this example I ran in my ipython shell.
In [7]: yielder = getAllPosts(client, 'staff')
In [8]: next(yielder)
Out[8]: 'blog'
In [9]: next(yielder)
Out[9]: 'posts'
In [10]: next(yielder)
Out[10]: 'total_posts'
In [11]: next(yielder)
Out[11]: 'supply_logging_positions'
In [12]: next(yielder)
Out[12]: 'blog'
In [13]: next(yielder)
Out[13]: 'posts'
In [14]: next(yielder)
Out[14]: 'total_posts'
This happens because the posts object in getAllPosts is a dictionary that contains much more than just each post from the staff blog - it also has items like how many posts the blog contains, the blog's description, when it was last updated, etc. The code as-is could potentially result in an infinite loop, because the following conditional:

if not posts:
    return

would never be true because of the response structure: an empty Tumblr API response from pytumblr looks like this:
{'blog': {'ask': False,
'ask_anon': False,
'ask_page_title': 'Ask me anything',
'can_send_fan_mail': False,
'can_subscribe': False,
'description': '',
'followed': False,
'is_adult': False,
'is_blocked_from_primary': False,
'is_nsfw': False,
'is_optout_ads': False,
'name': 'asdfasdf',
'posts': 0,
'reply_conditions': '3',
'share_likes': False,
'subscribed': False,
'title': 'Untitled',
'total_posts': 0,
'updated': 0,
'url': 'https://asdfasdf.tumblr.com/'},
'posts': [],
'supply_logging_positions': [],
'total_posts': 0}
if not posts would be checked against that whole structure, rather than against the posts field (which is an empty list here), so the check would never trigger, because the response dictionary itself is never empty (see: Truth Value Testing in Python).
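To make the truth value testing concrete, here's a minimal sketch, using a stripped-down stand-in for the API response, showing why a check on the whole dictionary never fires while a check on its posts list does:

```python
# Stand-in for an empty pytumblr response: the dict itself still has keys,
# so it is truthy, even though the 'posts' list inside it is empty.
empty_response = {'posts': [], 'total_posts': 0}

print(bool(empty_response))           # True  -> `if not empty_response:` never triggers
print(bool(empty_response['posts']))  # False -> `if not response['posts']:` does
```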
Here's code (mostly tested/verified) that fixes the loop from your getAllPosts implementation, and then uses the function to retrieve posts and dump them to a file named (BLOG_NAME)-posts.txt.
import pytumblr

def get_all_posts(client, blog):
    offset = 0
    while True:
        response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        # Get the 'posts' field of the response
        posts = response['posts']
        if not posts:
            return
        for post in posts:
            yield post
        # move to the next offset
        offset += 20

client = pytumblr.TumblrRestClient('secrety-secret')
blog = 'staff'

# use our function
with open('{}-posts.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print >>out_file, post
        # if you're in Python 3.x, use the following instead
        # print(post, file=out_file)
This will just be a straight text dump of the API's post responses, so if you need to make it look nicer or anything, that's up to you.