I have 4 billion rows of data in a 12-node Redshift cluster. I am able to connect to it with the RPostgreSQL package and use dplyr to do basic data wrangling.
However, I'd like to do some data reshaping that I would normally do with reshape2 (dcast) or tidyr (spread). I am finding that neither package executes against my database object. I could run collect(), but the resulting data frame would be far too big to fit into memory (which is exactly why I want to run this in-database). My overall goal is to use dcast/spread to make the data wider while creating 0/1 flags in the process. This works like a charm on small samples of data on my machine, but not against the database.
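For context, here is the kind of reshape that works fine on a small in-memory sample (toy data and column names, not my real schema):

library(dplyr)
library(tidyr)

sample_df <- data.frame(id       = c(1, 1, 2, 3),
                        category = c("email", "display", "email", "search"),
                        stringsAsFactors = FALSE)

# one row per id, one 0/1 flag column per category
sample_wide <- sample_df %>%
  mutate(flag = 1) %>%
  spread(key = category, value = flag, fill = 0)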
Below is the code I have working: connecting to the database and doing basic filtering with dplyr. When I try to use tidyr/reshape2 on the table reference, R throws errors along the lines of "type not recognized" (a sketch of the attempt follows the code below).
library(dplyr)         # provides src_postgres() and tbl()
library(RPostgreSQL)   # PostgreSQL driver used for the Redshift connection

Redshift <- src_postgres('dev',          # database name
                         host = 'xxx.aws.com',
                         port = 5439,
                         user = "user",
                         password = "pwd")
### create table reference ###
df <- tbl(Redshift, "df_cj_allact")
# simple and default R commands analyzing data frames
dim(df)
colnames(df)
head(df)
df2 <- df %>% filter(id != '0') %>% arrange(id, timestamp, category) # seems to work!
# 2157398, was 2306109 (6% loss)
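The reshaping step I then attempt on the database object looks roughly like this (category and flag stand in for my actual columns); this is where the errors appear:

df_wide <- df2 %>%
  mutate(flag = 1) %>%
  spread(key = category, value = flag, fill = 0)   # errors out against the database backend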
The tidyr package doesn't support database backends; it can only manipulate in-memory data. dplyr, by contrast, works with database tables as well as with in-memory objects. You could either use a machine with more memory (say, on AWS) and use data.table, or think about splitting your data.
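A rough sketch of the data.table route, assuming you can fit the filtered data on a large-memory machine and that id and category are the relevant columns (adjust to your schema):

library(data.table)

dt <- as.data.table(collect(df2))   # only feasible with enough RAM
dt[, flag := 1L]
wide <- dcast(dt, id ~ category, value.var = "flag",
              fun.aggregate = max, fill = 0L)   # one 0/1 flag column per category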