Quicker way to read single column of CSV file

I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to cut the time the standard methods take to get the column into RAM by a factor of 10.

What is my motivation? I have two files: one called Main.csv, which is 300000 rows by 500 columns, and one called Second.csv, which is 300000 rows by 5 columns. If I wrap read.csv("Second.csv") in system.time(), it takes 2.2 seconds. Yet if I use either of the two methods below to read just the first column of Main.csv (which is 20% the size of Second.csv, being 1 column instead of 5), it takes over 40 seconds. That is the same amount of time it takes to read the whole 600 MB file -- clearly unacceptable.
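For anyone who wants to reproduce the timings, here is a minimal sketch that generates stand-in files with the same dimensions (random values; the real column contents don't matter for the benchmark):

    # Generate stand-in files matching the stated dimensions
    # (writing Main.csv needs a few GB of RAM and takes a while)
    write.csv(matrix(rnorm(300000 * 5),   ncol = 5),   "Second.csv", row.names = FALSE)
    write.csv(matrix(rnorm(300000 * 500), ncol = 500), "Main.csv",   row.names = FALSE)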

  • Method 1

    colClasses <- rep('NULL', 500)
    colClasses[1] <- NA
    system.time(
      read.csv("Main.csv", colClasses = colClasses)
    ) # 40+ seconds, unacceptable

  • Method 2

    read.table(pipe("cut -f1 Main.csv")) # 40+ seconds, unacceptable


How to reduce this time? I am hoping for an R solution.

asked Nov 02 '13 by user2763361


2 Answers

I would suggest

scan(pipe("cut -f1 -d, Main.csv"))

This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in two ways:

  • since the file is comma-separated and cut assumes tab separation by default, you need -d, to specify the comma delimiter
  • scan() is much faster than read.table() for simple/unstructured data reads

According to the OP's comments, this takes about 4 seconds rather than 40+.
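A minimal timing sketch of the above (assuming the first column of Main.csv is numeric; use skip = 1 to drop a header row if there is one):

    # Read only the first column via cut, parsing with scan()
    # (assumes column 1 is numeric; skip = 1 if the file has a header)
    system.time(
      x <- scan(pipe("cut -f1 -d, Main.csv"), what = double(), skip = 1)
    )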

answered Sep 19 '22 by Ben Bolker


There is a speed comparison of methods for reading large CSV files in this blog. fread, from the data.table package, is the fastest by an order of magnitude.

As mentioned in the comments above, you can use the select parameter to choose which columns to read, so:

fread("Main.csv", sep = ",", select = c("f1"))

will read only the column named f1.
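A minimal usage sketch (note that select also accepts column positions, so select = 1L reads the first column regardless of its name):

    # Read only the first column of Main.csv with data.table::fread
    library(data.table)
    first_col <- fread("Main.csv", select = 1L)  # or select = "f1" by name
    str(first_col)  # a one-column data.table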

answered Sep 21 '22 by martino