How to deal with a 50 GB CSV file in R?

Tags:

sql

r

csv

ff

I am relatively new to large-scale data processing in R and am looking for advice on how to deal with a 50 GB CSV file. The current problem is the following:

The table looks like this:

ID,Address,City,States,... (50 more fields of characteristics of a house)
1,1,1st street,Chicago,IL,...
# the first 1 comes from write.csv, which wrote the row names as an index column in the file

I would like to find all rows that belong to San Francisco, CA. It is supposed to be an easy problem, but the CSV is too large.

I know of two ways to do it in R, and a third way that uses a database:

(1) Using R's ff package (ffdf):

The file was last saved with write.csv, and it contains columns of all different types.

library(ff)

all <- read.csv.ffdf(
  file="<path of large file>",
  sep = ",",
  header=TRUE,
  VERBOSE=TRUE,
  first.rows=10000,
  next.rows=50000
  )

The console gives me this:

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,  
: vmode 'character' not implemented

Searching online, I found several answers that did not fit my case, and I can't really make sense of how to convert "character" columns into the "factor" type as they suggested.

Then I tried read.table.ffdf, which was even more of a disaster. I can't find a solid guide for that one.

(2) Using R's readLines:

I know this is another good option, but I can't find an efficient way to do it.

(3) Using SQL:

I am not sure how to load the file into a SQL database or how to work with it from there. If there is a good guide, I would like to try it. In general, though, I would like to stick with R.

Thanks for any replies and help!

asked Sep 24 '16 17:09

windsound

2 Answers

You can use R with SQLite behind the scenes via the sqldf package. Use the read.csv.sql function in sqldf to filter the file with a SQL query, so that only the smaller result is loaded as a data frame.

The example from the docs:

library(sqldf)

iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where Species = 'setosa' ")

I've used this library on VERY large CSV files with good results.
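Adapted to the question, the query would filter on the City and States columns. The following is only a sketch on a tiny stand-in file: the sample rows and the variable names (houses, csv_path, sf_rows) are made up for illustration, and the file is written without quotes or row names, which is an assumption the real 50 GB file may not satisfy (it was saved with write.csv defaults, so it may need extra handling).

```r
library(sqldf)

# Hypothetical miniature stand-in for the large file (made-up rows).
houses <- data.frame(
  ID      = 1:4,
  Address = c("1st street", "Market St", "Mission St", "Main St"),
  City    = c("Chicago", "San Francisco", "San Francisco", "San Francisco"),
  States  = c("IL", "CA", "CA", "TX"),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
# Assumption: no quoted fields and no row-name column in the file.
write.csv(houses, csv_path, row.names = FALSE, quote = FALSE)

# Only the rows matching the WHERE clause are returned to R.
sf_rows <- read.csv.sql(
  csv_path,
  sql = "select * from file where City = 'San Francisco' and States = 'CA'"
)
print(sf_rows)
```

Inside the SQL statement the table is always referred to as file, regardless of the actual file name.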

answered Nov 01 '22 08:11

Chris Townsend

R, in its basic configuration, loads data into memory. Memory is cheap, but 50 GB is still not a typical configuration (and you would need more than that to load the data and work with it). If you are really good with R, you might be able to figure out another mechanism. If you have access to a cluster, you could use some parallel version of R or Spark.

You could also load the data into a database. For the task at hand, a database is very well suited to the problem. R easily connects to almost any database. And, you might find a database very useful for what you want to do.
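As a rough sketch of the database route from within R, the DBI and RSQLite packages can load the file into an on-disk SQLite table chunk by chunk and then query it. SQLite, the table name houses, the index name, the chunk size, and the sample rows are all assumptions made for illustration; any database with a DBI driver should work along the same lines.

```r
library(DBI)

# Hypothetical miniature stand-in for the large file (made-up rows).
csv_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    ID      = 1:4,
    Address = c("1st street", "Market St", "Mission St", "Main St"),
    City    = c("Chicago", "San Francisco", "San Francisco", "San Francisco"),
    States  = c("IL", "CA", "CA", "TX")
  ),
  csv_file, row.names = FALSE
)

# On-disk SQLite database, so the data never has to fit in memory at once.
con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

input  <- file(csv_file, open = "r")
header <- readLines(input, n = 1L)
repeat {
  lines <- readLines(input, n = 2L)  # tiny chunk for the demo; use far more in practice
  if (length(lines) == 0L) break
  chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
  dbWriteTable(con, "houses", chunk, append = TRUE)  # creates the table on first use
}
close(input)

# An index makes repeated lookups by city and state faster.
dbExecute(con, "CREATE INDEX idx_city_state ON houses (City, States)")

sf_db <- dbGetQuery(
  con,
  "SELECT * FROM houses WHERE City = ? AND States = ?",
  params = list("San Francisco", "CA")
)
dbDisconnect(con)
print(sf_db)
```

The load is a one-time cost; after that, each new question about the data is just another query against the same database file.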

Or, you could just process the text file in situ. Command line tools such as awk, grep, and perl are very suitable for this task. I would recommend this approach for a one-time effort. I would recommend a database if you want to keep the data around for analytic purposes.
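If you would rather stay in R than use awk or grep, the same streaming idea can be sketched in base R with readLines on an open connection. The helper name filter_csv_chunks, its arguments, and the sample rows are hypothetical, and the sketch assumes that no quoted field contains an embedded line break, since such a record could be split across two chunks.

```r
# Stream a CSV in fixed-size chunks and keep only the rows for which
# keep(chunk) is TRUE. Only one chunk plus the matches are held in memory.
filter_csv_chunks <- function(path, keep, chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break
    # Re-attach the header so every chunk parses with the same column names.
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    pieces[[length(pieces) + 1L]] <- chunk[keep(chunk), , drop = FALSE]
  }
  do.call(rbind, pieces)
}

# Hypothetical miniature stand-in for the large file (made-up rows).
demo_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    ID      = 1:4,
    Address = c("1st street", "Market St", "Mission St", "Main St"),
    City    = c("Chicago", "San Francisco", "San Francisco", "San Francisco"),
    States  = c("IL", "CA", "CA", "TX")
  ),
  demo_path, row.names = FALSE
)

sf_stream <- filter_csv_chunks(
  demo_path,
  keep = function(d) d$City == "San Francisco" & d$States == "CA",
  chunk_lines = 2L
)
print(sf_stream)
```

This is likely to be slower than a command line tool on 50 GB, but it needs no extra packages and leaves the result in a regular data frame.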

answered Nov 01 '22 08:11

Gordon Linoff
