Quicker way to read single column of CSV file

Tags: performance, io, optimization, r, csv

I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to cut the time it takes to get the column into RAM by a factor of 10 compared with the standard methods.

What is my motivation? I have two files: Main.csv, which has 300000 rows and 500 columns, and Second.csv, which has 300000 rows and 5 columns. If I wrap read.csv("Second.csv") in system.time(), it takes 2.2 seconds. But if I use either of the two methods below to read the first column of Main.csv (which holds 20% as much data as Second.csv, since it is 1 column instead of 5), it takes over 40 seconds. That is as long as it takes to read the whole 600-megabyte file, which is clearly unacceptable.

Method 1

colClasses <- rep('NULL', 500)
colClasses[1] <- NA
system.time(
  read.csv("Main.csv", colClasses = colClasses)
) # 40+ seconds, unacceptable

Method 2

read.table(pipe("cut -f1 Main.csv")) # 40+ seconds, unacceptable

How can I reduce this time? I am hoping for an R solution.

asked Nov 02 '13 15:11

user2763361

2 Answers

I would suggest

scan(pipe("cut -f1 -d, Main.csv"))

This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in two ways:

since the file is comma-separated and cut assumes tab-separation by default, you need to pass -d, to specify comma-separation
scan() is much faster than read.table() for simple/unstructured data reads.

According to the OP's comments, this takes about 4 seconds rather than 40+.
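
For anyone who wants to try this on a small file first, here is a minimal, self-contained sketch. It assumes a Unix-like system with cut on the PATH; the temporary file, its V1..V5 column names and its size are made up for illustration. It also assumes the first column is numeric (what = numeric()) and that the file has a header line (skip = 1). Note that cut splits on every comma, so this only works if the fields before the one you want contain no quoted commas.

```r
# Build a small stand-in for Main.csv (hypothetical data: 1000 rows x 5 columns).
tmp <- tempfile(fileext = ".csv")
demo <- as.data.frame(matrix(seq_len(5000), nrow = 1000))
write.csv(demo, tmp, row.names = FALSE)

# Let cut extract field 1 (comma-delimited) and have scan() parse the stream.
# shQuote() protects the path in case it contains spaces.
cmd <- paste("cut -f1 -d,", shQuote(tmp))

# skip = 1 drops the header line; what = numeric() assumes a numeric column.
first_col <- scan(pipe(cmd), what = numeric(), skip = 1, quiet = TRUE)

length(first_col)
```

For a character column, what = character() (or what = "") would be the corresponding choice.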

answered Sep 19 '22 14:09

Ben Bolker

There is a speed comparison of methods to read large CSV files in this blog. fread (from the data.table package) is the fastest by an order of magnitude.

As mentioned in the comments above, you can use the select parameter to select which columns to read - so:

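fread("Main.csv", sep = ",", select = c("f1"))

will work, assuming the column you want is named f1. select also accepts column numbers, so select = 1 picks the first column regardless of its name.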
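
A minimal sketch of selecting by position, on a made-up temporary file (the V1..V5 column names and the size are illustrative, not the OP's data). The requireNamespace() guard and the base R fallback are only there so the snippet still runs where data.table is not installed; they are not part of the suggestion itself.

```r
# Hypothetical stand-in for Main.csv: 1000 rows, columns named V1..V5.
tmp <- tempfile(fileext = ".csv")
demo <- as.data.frame(matrix(seq_len(5000), nrow = 1000))
write.csv(demo, tmp, row.names = FALSE)

if (requireNamespace("data.table", quietly = TRUE)) {
  # select accepts column names or positions; 1L means "the first column".
  # fread() returns a data.table, so [[1]] pulls the column out as a vector.
  first_col <- data.table::fread(tmp, sep = ",", select = 1L)[[1]]
} else {
  # Fallback without data.table: base R, dropping the other four columns.
  cc <- c(NA, rep("NULL", 4))
  first_col <- read.csv(tmp, colClasses = cc)[[1]]
}

head(first_col)
```

Wrapping either branch in system.time() on the real file is the easiest way to check whether it helps in your case.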

answered Sep 21 '22 14:09

martino
