R: read.csv.sql from sqldf is able to successfully read one csv but not another

I have a dataset of around 20 GB, so I cannot read it into an R data frame without running out of memory. After reading some posts here, I decided to use read.csv.sql to load it into a database. The code I used is:

read.csv.sql(
  "jobs.csv", 
  sql = "CREATE TABLE Jobs2 AS SELECT * FROM file", 
  dbname = "Test1.sqlite"
)

When I run the following:

sqldf(
  "select * from Jobs2", 
  dbname = "Test1.sqlite"
)

I get the column headings but no values: <0 rows> (or 0-length row.names)

But when I try the same with a csv I created using the iris dataset, everything works fine.

What am I missing here?

Thanks in advance.

asked Apr 04 '15 07:04

user3259937

1 Answer

sqldf is primarily intended to process data frames so it creates databases and database tables transparently and removes them after completing the sql. Thus your first statement would not be expected to work since sqldf would remove the database after the statement completes.

If the SQL creates the database or table, rather than sqldf itself, then sqldf won't know about it and so won't delete it. Here we create the database using attach and the table using create table to fool sqldf. The last line won't remove the database or table because they already existed before that line started, and sqldf never removes objects it did not create:

library(sqldf)

read.csv.sql("jobs.csv", sql = c("attach 'test1.sqlite' as new", 
              "create table new.jobs2 as select * from file"))
sqldf("select * from jobs2", dbname = "test1.sqlite")

The other thing that might go wrong is line endings. Typically sqldf can figure them out, but if not you may have to specify the eol character. This can be necessary, for example, when reading a file created on one operating system on another operating system. See FAQ 11, "Why am I having difficulty reading a data file using SQLite", in the sqldf README.
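As a rough illustration of the line-ending point, the sketch below writes a tiny CSV with Windows-style CRLF endings and passes eol explicitly to read.csv.sql. The file contents and the names csv_path and crlf are made up for the example; for the real jobs.csv you would first need to check which line ending it actually uses.

```r
library(sqldf)

# Hypothetical sample file: three lines terminated with CRLF ("\r\n").
# Opening the connection in binary mode stops R from translating line endings.
csv_path <- tempfile(fileext = ".csv")
out <- file(csv_path, open = "wb")
writeLines(c("a,b", "1,x", "2,y"), out, sep = "\r\n")
close(out)

# Tell SQLite's importer explicitly which end-of-line sequence the file uses.
crlf <- read.csv.sql(csv_path, sql = "select * from file", eol = "\r\n")
nrow(crlf)
```

If the row count or the last column looks wrong without eol, a mismatched line ending is a likely suspect.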

Note: read.csv.sql is normally used to read in just a portion of the data. For example, this skips the first 100 rows and then reads columns a and b from the next 1000 rows, but the query can be arbitrarily complex since you have all of SQLite's SQL to use:

read.csv.sql("jobs.csv", sql = "select a, b from file limit 1000 offset 100")

The entire file is read into a temporary sqlite database but only the requested portion is ever read into R so the entire file could be larger than R can handle.

Typically if one is trying to achieve persistence one uses RSQLite directly rather than sqldf.
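A minimal sketch of the RSQLite route, assuming the DBI and RSQLite packages are installed. The small generated file stands in for jobs.csv, and csv_path, db_path, n_rows and first_rows are names invented for the example. It relies on RSQLite's dbWriteTable accepting a file name as the value, which imports the CSV into the database without first loading it into R; that import does not interpret quoted fields, so the sample file is written unquoted.

```r
library(DBI)

# Hypothetical stand-in for jobs.csv: a small, unquoted CSV in a temp location.
csv_path <- tempfile(fileext = ".csv")
db_path  <- tempfile(fileext = ".sqlite")
write.csv(data.frame(id = 1:5, salary = c(10, 20, 30, 40, 50)),
          csv_path, row.names = FALSE, quote = FALSE)

# Import the file straight into a persistent SQLite table.
# If the file uses CRLF line endings, eol = "\r\n" may need to be passed here.
con <- dbConnect(RSQLite::SQLite(), db_path)
dbWriteTable(con, "jobs2", csv_path, header = TRUE, sep = ",")
dbDisconnect(con)

# Reconnect later: the table is still there, and only the queried rows reach R.
con <- dbConnect(RSQLite::SQLite(), db_path)
n_rows     <- dbGetQuery(con, "select count(*) as n from jobs2")$n
first_rows <- dbGetQuery(con, "select * from jobs2 limit 3")
dbDisconnect(con)
```

Because the connection and the table are managed explicitly, nothing is cleaned up behind your back the way sqldf cleans up the objects it creates.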

answered Nov 15 '22 08:11

G. Grothendieck
