Getting more than 100 search results with PRAW?

I'm using the following code to obtain reddit search results with PRAW 4.4.0:

params = {'sort':'new', 'time_filter':'year'}
return reddit.subreddit(subreddit).search('', **params)

I'd like to scrape an indefinite number of posts from the subreddit, covering a period of up to a year. Reddit's search functionality (and, correspondingly, its API) handles this with the 'after' parameter. However, the search function above doesn't accept 'after' as a parameter. Is there a way to use PRAW's .search() to obtain more than 100 search results?

Dreadnaught asked Mar 07 '17 21:03


1 Answer

Yes. Passing limit=None raises the cap to 1000 results, but it does not guarantee any particular timeframe, and there is no way to get more than 1000 results from a single query. However, you can use cloudsearch syntax. It is described in detail in the reddit wiki (https://www.reddit.com/wiki/search#wiki_cloudsearch_syntax) and is a powerful way to narrow a search.

For this case, it can be used like this:

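import datetime

# limit=None lets PRAW keep paging until Reddit's cap of about 1000 results
params = {'sort': 'new', 'limit': None, 'syntax': 'cloudsearch'}
time_now = datetime.datetime.now()
# Range bounds as epoch seconds: one year ago up to now
start = int((time_now - datetime.timedelta(days=365)).timestamp())
end = int(time_now.timestamp())
# As in the question, this line sits inside a function
return reddit.subreddit(subreddit).search(
    'timestamp:{0}..{1}'.format(start, end), **params)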

This is still limited to 1000 results per query, but because the timeframe is explicit you can issue several queries over different timeframes. For example, grab 1000 submissions, read created_utc from the oldest one, and use that time as the upper bound (the second value) of the timestamp range in the next query. Because the results are sorted newest first, that query picks up at the point in time where the previous one stopped.
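The windowing idea above can be sketched as a generator. This is a minimal sketch, not a tested scraper: the name iter_submissions is hypothetical, and it assumes the search endpoint still accepts the cloudsearch timestamp range and that results come back newest first. It takes any object with a PRAW-style .search() method, so it can be tried against a stub before touching the API.

```python
def iter_submissions(subreddit, start, end):
    """Yield submissions created between start and end (epoch seconds), newest first.

    subreddit is assumed to behave like a PRAW Subreddit: .search(query, **params)
    returns an iterable of objects with .id and .created_utc attributes.
    iter_submissions is a hypothetical helper name, not part of PRAW.
    """
    params = {'sort': 'new', 'limit': None, 'syntax': 'cloudsearch'}
    seen = set()   # ids already yielded; window edges overlap by one second
    upper = end
    while upper >= start:
        query = 'timestamp:{0}..{1}'.format(start, upper)
        batch = list(subreddit.search(query, **params))
        fresh = [s for s in batch if s.id not in seen]
        if not fresh:
            # Nothing new in this window: either we reached the start,
            # or the capped result set could not get past a single second.
            break
        for submission in fresh:
            seen.add(submission.id)
            yield submission
        # The oldest submission seen so far becomes the next upper bound.
        upper = min(int(s.created_utc) for s in fresh)
```

With a configured client this would be called as iter_submissions(reddit.subreddit(subreddit), start, end), using the same epoch bounds as in the snippet above. The upper bound is reused inclusively so that submissions sharing the boundary second are not skipped; the seen set filters out the resulting duplicates.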

Tomasz Plaskota answered Oct 25 '22 14:10