
R: read.csv.sql from sqldf is able to successfully read one csv but not another

Tags: r, csv

I have a dataset of around 20 GB, so I cannot read it into an R data frame without running out of memory. After reading some posts here, I decided to use read.csv.sql to load it into a database. The code I used is:

read.csv.sql(
  "jobs.csv", 
  sql = "CREATE TABLE Jobs2 AS SELECT * FROM file", 
  dbname = "Test1.sqlite"
)

When I run the following:

sqldf(
  "select * from Jobs2", 
  dbname = "Test1.sqlite"
)

I get the column headings but no values: <0 rows> (or 0-length row.names)

But when I try the same with a csv I created using the iris dataset, everything works fine.

What am I missing here?

Thanks in advance.

user3259937 asked Apr 04 '15 07:04




1 Answer

sqldf is primarily intended to process data frames, so it creates databases and database tables transparently and removes them once the SQL statement has completed. Your first statement would therefore not be expected to work, since sqldf removes the database after the statement completes.

If the SQL itself creates the database or table, rather than sqldf, then sqldf won't know about it and so won't delete it. Here we create the database using attach and the table using create table to fool sqldf. The last line won't remove the database or table because they already existed before that line started, and sqldf never removes objects it did not create:

library(sqldf)

read.csv.sql("jobs.csv", sql = c("attach 'test1.sqlite' as new", 
              "create table new.jobs2 as select * from file"))
sqldf("select * from jobs2", dbname = "test1.sqlite")

The other thing that might go wrong is line endings. Typically sqldf can figure them out, but if not you may have to specify the eol argument. This can happen, for example, if you are trying to read a file created on one operating system in another operating system. See FAQ 11, "Why am I having difficulty reading a data file using SQLite?", in the sqldf README.
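As a minimal sketch of the line-ending case, the snippet below writes a small file with Windows-style CRLF endings and passes eol explicitly. The temporary file, its contents, and the name crlf_df are made up for illustration; only the eol argument of read.csv.sql comes from sqldf.

```r
library(sqldf)

# Hypothetical demo file written with Windows-style CRLF line endings.
csv_path <- tempfile(fileext = ".csv")
out <- file(csv_path, open = "wb")  # binary mode so "\r\n" is written verbatim
writeLines(c("a,b", "1,2", "3,4"), out, sep = "\r\n")
close(out)

# Tell the SQLite import explicitly how lines end in this file.
crlf_df <- read.csv.sql(csv_path, sql = "select * from file", eol = "\r\n")
```

If a real file such as jobs.csv still yields zero rows, trying eol = "\n" and eol = "\r\n" in turn is a cheap way to rule this cause in or out.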

Note: read.csv.sql is normally used to read in just a portion of the data. For example, this skips the first 100 rows and then reads columns a and b from the next 1000 rows, but the query can be arbitrarily complex since you have all of SQLite's SQL to use:

read.csv.sql("jobs.csv", sql = "select a, b from file limit 1000 offset 100")

The entire file is read into a temporary SQLite database, but only the requested portion is ever read into R, so the entire file can be larger than R can handle.
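The same idea works with a filter instead of limit and offset. The following is only a sketch on a small made-up file standing in for jobs.csv; the columns a and b and the name subset_df are assumptions for illustration.

```r
library(sqldf)

# Hypothetical stand-in for jobs.csv. quote = FALSE is used on the assumption
# that the SQLite import behind read.csv.sql does not strip quotes from fields.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:10, b = letters[1:10]), csv_path,
          row.names = FALSE, quote = FALSE)

# Only the rows matching the where clause are returned to R.
subset_df <- read.csv.sql(csv_path, sql = "select a, b from file where a > 7")
```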

Typically if one is trying to achieve persistence one uses RSQLite directly rather than sqldf.
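One way this might look with RSQLite is sketched below. The helper csv_to_sqlite and its chunk size are hypothetical, not part of RSQLite or sqldf; it reads the CSV in chunks with base R so the whole file never has to fit in memory, and appends each chunk to a table in a persistent database file.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper: copy a CSV into a persistent SQLite table chunk by chunk.
csv_to_sqlite <- function(csv_path, db_path, table, chunk_rows = 100000L) {
  con <- dbConnect(SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  input <- file(csv_path, open = "r")
  on.exit(close(input), add = TRUE)

  # The first chunk consumes the header line and creates the table.
  chunk <- read.csv(input, nrows = chunk_rows, stringsAsFactors = FALSE)
  col_names <- names(chunk)
  dbWriteTable(con, table, chunk, overwrite = TRUE)
  total <- nrow(chunk)

  repeat {
    # read.csv signals an error once the connection is exhausted.
    chunk <- tryCatch(
      read.csv(input, header = FALSE, col.names = col_names,
               nrows = chunk_rows, stringsAsFactors = FALSE),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    dbWriteTable(con, table, chunk, append = TRUE)
    total <- total + nrow(chunk)
  }
  invisible(total)  # number of data rows written
}
```

A call such as csv_to_sqlite("jobs.csv", "test1.sqlite", "jobs2") would leave the table in test1.sqlite, where it can be queried afterwards with dbGetQuery or with sqldf via its dbname argument.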

G. Grothendieck answered Nov 15 '22 08:11