
How to scrape all subreddit posts in a given time period

I have a function to scrape all the posts in the Bitcoin subreddit between 2014-11-01 and 2015-10-31.

However, I can only extract about 990 posts, which go back no further than October 25. I don't understand what's happening. After reading https://github.com/reddit/reddit/wiki/API, I added a Sys.sleep of 15 seconds between requests, to no avail.

I also tried scraping another subreddit (fitness), but that returned only around 900 posts as well.

require(jsonlite)
require(dplyr)

getAllPosts <- function() {
    url <- "https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&limit=100"
    extract <- fromJSON(url)
    posts <- extract$data$children$data %>% dplyr::select(name, author, num_comments, created_utc,
                                             title, selftext)
    after <- posts[nrow(posts),1]
    url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
    extract.next <- fromJSON(url.next)
    posts.next <- extract.next$data$children$data

    # execute while loop as long as there are any rows in the data frame
    while (!is.null(nrow(posts.next))) {
        posts.next <- posts.next %>% dplyr::select(name, author, num_comments, created_utc, 
                                    title, selftext)
        posts <- rbind(posts, posts.next)
        after <- posts[nrow(posts),1]
        url.next <- paste0("https://www.reddit.com/r/bitcoin/search.json?q=timestamp%3A1414800000..1446335999&sort=new&restrict_sr=on&rank=title&syntax=cloudsearch&after=",after,"&limit=100")
        Sys.sleep(15)
        extract <- fromJSON(url.next)
        posts.next <- extract$data$children$data
    }
    # created_utc is epoch seconds in UTC; set tz so it is not shown in the local time zone
    posts$created_utc <- as.POSIXct(posts$created_utc, origin="1970-01-01", tz="UTC")
    return(posts)
}

posts <- getAllPosts()

Does reddit have some kind of limit that I'm hitting?

asked Nov 24 '15 at 19:11 by matsuo_basho


1 Answer

Yes, all reddit listings (posts, comments, etc.) are capped at 1000 items; they're essentially just cached lists, rather than queries, for performance reasons.

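To get around this, you'll need to do some clever searching based on timestamps: split the date range into smaller windows, each small enough to return fewer than 1000 results, and search them one at a time.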
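One way to do the timestamp-based searching is to cut the full range into windows and page through each window separately. The sketch below is a minimal illustration, not a tested scraper: the helper names (make_windows, window_url, fetch_window), the one-week window width, and the 2-second pause are all assumptions made for this example, and it assumes the search endpoint still accepts the timestamp query used in the question.

```r
# Hypothetical helpers for illustration; none of these names come from Reddit's API.

# Split [start, end] (epoch seconds, inclusive) into consecutive windows
# of at most `width` seconds each.
make_windows <- function(start, end, width) {
    from <- seq(start, end, by = width)
    to <- pmin(from + width - 1, end)
    data.frame(from = from, to = to)
}

# Build the search URL for one window, reusing the query from the question.
# format(..., scientific = FALSE) keeps large numbers from printing as 1.4e+09.
window_url <- function(from, to, subreddit = "bitcoin", after = NULL) {
    url <- paste0("https://www.reddit.com/r/", subreddit,
                  "/search.json?q=timestamp%3A",
                  format(from, scientific = FALSE), "..",
                  format(to, scientific = FALSE),
                  "&sort=new&restrict_sr=on&syntax=cloudsearch&limit=100")
    if (!is.null(after)) url <- paste0(url, "&after=", after)
    url
}

# Page through a single window until no results or no `after` token remain.
# Requires the jsonlite package; the pause length is an arbitrary assumption.
fetch_window <- function(from, to, pause = 2) {
    cols <- c("name", "author", "num_comments", "created_utc", "title", "selftext")
    posts <- NULL
    after <- NULL
    repeat {
        page <- jsonlite::fromJSON(window_url(from, to, after = after))
        children <- page$data$children$data
        if (is.null(nrow(children)) || nrow(children) == 0) break
        posts <- rbind(posts, children[, cols])
        after <- page$data$after
        if (is.null(after)) break
        Sys.sleep(pause)
    }
    posts
}

# Usage (makes network requests, so it is left commented out):
# windows <- make_windows(1414800000, 1446335999, width = 7 * 24 * 3600)
# posts <- do.call(rbind, Map(fetch_window, windows$from, windows$to))
```

The window width has to be chosen so that no single window holds more posts than the listing cap; for a busy subreddit a week may be too wide, in which case a day or a few hours is a safer starting point.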

answered Sep 24 '22 at 11:09 by Xiong Chiamiov