
Load large data to R data.table from Postgresql

I store my data in a PostgreSQL server, and I want to load a table with 15 million rows into a data.frame or data.table.

I use RPostgreSQL to load the data:

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, ...)

# Select data from a table
system.time(
df <- dbGetQuery(con, "SELECT * FROM 15mil_rows_table")
)
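
(Note that dbGetQuery returns a plain data.frame; if a data.table is wanted, converting it afterwards is cheap compared to the query itself, for example:)

library(data.table)
setDT(df)   # convert the data.frame to a data.table by reference, no copy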

It took 20 minutes to load the data from the database into df. I use a Google Cloud server with 60 GB of RAM and a 16-core CPU.

What should I do to reduce load time?

asked Mar 29 '15 by Minh Ha Pham

2 Answers

I'm not sure how much this will reduce the load time, but it should help, since both steps are quite efficient. You can leave a comment with the timing.

  1. Using bash, run psql to dump the table to CSV:

COPY 15mil_rows_table TO '/path/15mil_rows_table.csv' DELIMITER ',' CSV HEADER;
  2. In R, just fread it:

library(data.table)
DT <- fread("/path/15mil_rows_table.csv")
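
If you prefer to drive both steps from R, a rough sketch along these lines should work (host, user, database, table name and file path are placeholders, and it assumes the psql client is installed on the machine running R):

library(data.table)

# Dump the table to CSV with psql, then read it with fread.
csv_path <- "/path/15mil_rows_table.csv"
cmd <- paste0(
  "psql -h localhost -U user -d database ",
  "-c \"COPY 15mil_rows_table TO STDOUT DELIMITER ',' CSV HEADER\" > ",
  csv_path
)
system(cmd)
DT <- fread(csv_path)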
answered by jangorecki

I use the same method as @Jan Gorecki, but gzip the data to save space.

1- Dump the table to CSV:

psql -h localhost -U user -d 'database' -c "COPY 15mil_rows_table TO stdout DELIMITER ',' CSV HEADER" | gzip > 15mil_rows_table.csv.gz &

2- Load the data in R:

DT <- fread('zcat 15mil_rows_table.csv.gz')
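
On recent data.table versions, fread can also read the gzipped file directly (this needs the R.utils package installed), which avoids shelling out to zcat:

library(data.table)
DT <- fread("15mil_rows_table.csv.gz")  # decompressed transparently via R.utils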
answered by Minh Ha Pham