 

How to read a huge CSV file into R by row condition?

Tags:

r

I have a huge CSV file with about 15 million rows and a size of around 3 GB.

I would like to read this file into R in pieces, each time keeping only the rows that satisfy a certain condition.

For example, one of the columns is called product type, so I only need to read one type of product into R, process it, and output the result; after that I move on to another product type...

So far I have read about different approaches, such as loading the big file into a database, reading it column by column with colbycol, or reading a chunk of rows at a time with ff...

Is there a pure R solution that can solve my problem?
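
For illustration, here is a minimal pure-R chunking sketch of the kind I have in mind (the column name product_type, the value "A", and the chunk size are placeholders; it assumes a simple CSV without quoted commas):

con <- file("Your_Big_CSV_File.csv", open = "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]   # header row
result <- NULL
repeat {
  # read.csv() throws an error once the connection is exhausted
  chunk <- tryCatch(
    read.csv(con, nrows = 100000, header = FALSE,
             col.names = col_names, stringsAsFactors = FALSE),
    error = function(e) NULL)
  if (is.null(chunk)) break
  # keep only the rows for the product type of interest
  result <- rbind(result, chunk[chunk$product_type == "A", ])
}
close(con)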

asked Sep 13 '13 by linus

People also ask

How do I read a row in a CSV file in R?

To import a CSV file into the R environment, use the built-in read.csv() function, passing it the file name; individual rows can then be selected by indexing the resulting data frame.
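
A minimal illustration (the file name is a placeholder):

df <- read.csv("filename.csv")   # load the whole file as a data frame
row5 <- df[5, ]                  # rows are indexed before the comma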

How do I open a CSV file with too many rows?

So, how do you open large CSV files in Excel? Essentially, there are two options: split the CSV file into multiple smaller files that fit within the 1,048,576-row limit, or find an Excel add-in that supports CSV files with a higher number of rows.
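
If you prefer to do the splitting in R rather than with an add-in, a hedged sketch (the file name and chunk size are placeholders):

inp <- file("big.csv", open = "r")
header <- readLines(inp, n = 1)
i <- 0
repeat {
  lines <- readLines(inp, n = 1000000)   # stay under Excel's row limit
  if (length(lines) == 0) break
  i <- i + 1
  writeLines(c(header, lines), sprintf("big_part_%02d.csv", i))
}
close(inp)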

How do I read a large file in R?

Loading a large dataset: use data.table::fread() or the functions from readr instead of the base read.xxx() family. If you really need to read an entire CSV into memory, R users by default use read.table() or variations of it (such as read.csv()), but these are much slower on large files.
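
For example, with data.table (the file and column names are placeholders; note this still reads the whole file into memory, just much faster):

library(data.table)
dt <- fread("Your_Big_CSV_File.csv")    # fast multi-threaded CSV reader
sub <- dt[product_type == "A"]          # data.table filtering syntax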

What is csv row limit?

CSV files opened in Excel are limited to 32,767 characters per cell, and Excel has a limit of 1,048,576 rows and 16,384 columns per sheet; CSV files themselves can hold many more rows. You can read more about these and other limits in Microsoft's support documentation.


1 Answer

You can use the RSQLite package:

library(RSQLite)

# Create/connect to an SQLite database file (created if it does not exist)
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")

# Read the CSV file into the SQL database.
# Warning: this is going to take some time and disk space,
#   as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name = "sample_table", value = "Your_Big_CSV_File.csv",
    row.names = FALSE, header = TRUE, sep = ",")

# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")

dbDisconnect(con)

Next time you want to access your data, you can leave out the dbWriteTable step, since the SQLite table is stored on disk.
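
For the per-product-type workflow in the question, a later session could look like this sketch (the column name product_type and the value 'A' are assumptions about your data):

con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")
one_type <- dbGetQuery(con,
    "SELECT * FROM sample_table WHERE product_type = 'A'")
# ... process one_type, then query the next product type ...
dbDisconnect(con)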

Note: writing the CSV data to the SQLite file does not load all of the data into memory first, so the memory you use in the end is limited to the amount of data your query returns.

answered Sep 28 '22 by ROLO