 

How to use tidyr (or similar data wrangling) on "big" data in a PostgreSQL (Redshift) database

I have 4 billion rows of data in a 12-node Redshift cluster. I can successfully connect to it with the RPostgreSQL package and use dplyr to do basic data wrangling.

However, I'd like to do some data reshaping that I'd normally do with reshape2 (dcast) or tidyr (spread). I am finding that neither package executes against my database object. I could run collect(), but the resulting data frame would be far too big to fit into memory (hence the reason I want to run it in the database). My overall goal is to use dcast/spread to make the data wider while creating 0/1 flags in the process. This works like a charm with small samples of data on my machine, but not against the database.

Below is the code I have working: connecting to the database and doing basic filtering with dplyr. When I try to use tidyr/reshape2 on the database object, R throws errors along the lines of "type not recognized".

    library(dplyr)
    library(RPostgreSQL)

    Redshift <- src_postgres('dev',
                             host = 'xxx.aws.com',
                             port = 5439,
                             user = "user",
                             password = "pwd")


### create table reference ###
df <- tbl(Redshift, "df_cj_allact")

# simple and default R commands analyzing data frames
dim(df)
colnames(df)
head(df)

df2 <- df %>% filter(id != '0') %>% arrange(id, timestamp, category)  # seems to work!
# 2157398, was 2306109 (6% loss)
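For reference, here is the kind of reshaping I mean, shown on a tiny in-memory sample. This is only a sketch: the column names (id, category) are assumed to mirror the real table, and the data are made up.

```r
library(dplyr)
library(tidyr)

# Toy sample mimicking the assumed schema; names are illustrative only
df_small <- data.frame(
  id = c(1, 1, 2),
  category = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Add a constant 1 flag, then spread category into wide 0/1 columns
wide <- df_small %>%
  mutate(flag = 1L) %>%
  spread(category, flag, fill = 0L)
# One row per id, one 0/1 column per category
```

This is exactly the pattern that works on a local data frame but fails on a remote tbl, because spread() has no SQL translation.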
asked Feb 10 '16 by leaRningR909


1 Answer

The tidyr package doesn't support database backends; it only operates on in-memory data. dplyr, by contrast, works with database tables as well as with in-memory objects. You could either use a machine with more memory (say, on AWS) and data.table, or think about splitting your data.
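One workaround that stays inside the database: build the 0/1 flag columns explicitly with mutate(), since dplyr can translate ifelse() into SQL CASE WHEN and aggregate with max() per id. The catch is that every category level must be written out by hand. A minimal sketch on a local data frame (the same pipeline applied to a remote tbl would be translated to SQL and run in Redshift; column and category names are assumptions):

```r
library(dplyr)

# Illustrative local data standing in for the remote table
df_small <- data.frame(
  id = c(1, 1, 2),
  category = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# One hand-written flag per known category level, collapsed to one row per id
flags <- df_small %>%
  mutate(
    cat_A = ifelse(category == "A", 1L, 0L),
    cat_B = ifelse(category == "B", 1L, 0L)
  ) %>%
  group_by(id) %>%
  summarise(cat_A = max(cat_A), cat_B = max(cat_B))
```

On a remote tbl you could inspect the generated SQL with show_query() before collecting the (now much wider and shorter) result.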

answered Nov 10 '22 by kismsu