 

How to use tidyr (or similar data wrangling) on "big" data in a PostgreSQL (Redshift) database

I have 4 billion rows of data in a 12-node Redshift cluster. I can successfully connect to it with the RPostgreSQL package and use dplyr to do basic data wrangling.

However, I'd like to do some data reshaping that I'd normally do with reshape2 (dcast) or tidyr (spread). I am finding that neither package executes against my database object. I could run collect(), but the resulting data frame would be far too big to fit into memory (hence the reason I want to run it in the database). My overall goal is to use dcast/spread to make the data wider while creating 0/1 flags in the process. This works like a charm with small samples of data on my machine, but not against the database.

Below is the code I have working: connecting to the database and doing basic filtering with dplyr. When I try to use tidyr/reshape2 on the database object, R throws errors along the lines of "type not recognized".

    library(dplyr)
    library(RPostgreSQL)

    Redshift <- src_postgres('dev',
                             host = 'xxx.aws.com',
                             port = 5439,
                             user = "user",
                             password = "pwd")


### create table reference ###
df <- tbl(Redshift, "df_cj_allact")

# simple and default R commands analyzing data frames
dim(df)
colnames(df)
head(df)

df2 <- df %>% filter(id != '0') %>% arrange(id, timestamp, category)  # seems to work!
# 2157398, was 2306109 (6% loss)
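For reference, here is the kind of reshaping I mean, shown on a tiny in-memory sample. This is only a sketch: the column names (id, category) are assumed to mirror the real table, and the data are made up.

```r
library(dplyr)
library(tidyr)

# Toy sample mimicking the assumed schema; names are illustrative only
df_small <- data.frame(
  id = c(1, 1, 2),
  category = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Add a constant 1 flag, then spread category into wide 0/1 columns
wide <- df_small %>%
  mutate(flag = 1L) %>%
  spread(category, flag, fill = 0L)
# One row per id, one 0/1 column per category
```

This is exactly the pattern that works on a local data frame but fails on a remote tbl, because spread() has no SQL translation.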
asked Feb 10 '16 by leaRningR909


1 Answer

The tidyr package doesn't support database backends; it only operates on in-memory data. dplyr, by contrast, works with database tables as well as with in-memory objects. You could either use a machine with more memory (say, on AWS) and data.table, or think about splitting your data.
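One workaround that stays inside the database: build the 0/1 flag columns explicitly with mutate(), since dplyr can translate ifelse() into SQL CASE WHEN and aggregate with max() per id. The catch is that every category level must be written out by hand. A minimal sketch on a local data frame (the same pipeline applied to a remote tbl would be translated to SQL and run in Redshift; column and category names are assumptions):

```r
library(dplyr)

# Illustrative local data standing in for the remote table
df_small <- data.frame(
  id = c(1, 1, 2),
  category = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# One hand-written flag per known category level, collapsed to one row per id
flags <- df_small %>%
  mutate(
    cat_A = ifelse(category == "A", 1L, 0L),
    cat_B = ifelse(category == "B", 1L, 0L)
  ) %>%
  group_by(id) %>%
  summarise(cat_A = max(cat_A), cat_B = max(cat_B))
```

On a remote tbl you could inspect the generated SQL with show_query() before collecting the (now much wider and shorter) result.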

answered Nov 10 '22 by kismsu