
Reading in multiple CSVs with different numbers of lines to skip at start of file

Tags: r, csv, read.table

I have to read in about 300 individual CSVs. I have managed to automate the process using a loop and structured CSV names. However, each CSV has 14-17 lines of rubbish at the start, and the exact count varies randomly, so hard-coding a skip parameter in the read.table command won't work. The column names and the number of columns are the same for each CSV.
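
For illustration, such a loop might look roughly like this; the file-name pattern is a guess based on the sample below, and 300 is just the approximate file count mentioned above:

files <- sprintf("Final_Comp_Zn_%d.csv", 1:300)  # hypothetical naming scheme
allData <- lapply(files, function(f) {
  read.csv(f, skip = 14)  # a fixed skip fails whenever the rubbish runs to 15-17 lines
})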

Here is an example of what I am up against:

QUICK STATISTICS:

      Directory: Data,,,,
           File: Final_Comp_Zn_1
      Selection: SEL{Ox*1000+Doma=1201}
         Weight: None,,,
     ,,Variable: AG,,,

Total Number of Samples: 450212  Number of Selected Samples: 277


Statistics

VARIABLE,Min slice Y(m),Max slice Y(m),Count,Minimum,Maximum,Mean,Std.Dev.,Variance,Total Samples in Domain,Active Samples in Domain
AG,   6780.00,   6840.00,         7,    3.0000,   52.5000,   23.4143,   16.8507,  283.9469,        10,        10
AG,   6840.00,   6900.00,         4,    4.0000,    5.5000,    4.9500,    0.5766,    0.3325,        13,        13
AG,   6900.00,   6960.00,        16,    1.0000,   37.0000,    8.7625,    9.0047,   81.0848,        29,        29
AG,   6960.00,   7020.00,        58,    3.0000,   73.5000,   10.6931,   11.9087,  141.8172,       132,       132
AG,   7020.00,   7080.00,        23,    3.0000,  104.5000,   15.3435,   23.2233,  539.3207,        23,        23
AG,   7080.00,   7140.00,        33,    1.0000,   15.4000,    3.8152,    2.8441,    8.0892,        35,        35

Basically I want to start reading from the line beginning VARIABLE,Min slice Y(m),Max slice Y(m),.... I can think of a few solutions, but I don't know how I would go about programming them. Is there any way I can:

  1. Read the CSV first and somehow work out how many lines of rubbish there are, then re-read it and specify the correct number of lines to skip? Or
  2. Tell read.table to start reading when it finds the column names (since these are the same for each CSV) and ignore everything prior to that?

I think solution (2) would be the most appropriate, but I am open to any suggestions!

Asked Mar 11 '13 by LoveMeow




2 Answers

Here's a minimal example of one approach that can be taken.

First, let's make up some csv files similar to the ones you describe:

cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

Second, identify where the data start:

# For each file, find the line number of the header (the line starting
# with "VARIABLE") and subtract 1 to get the number of lines to skip
linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x)) - 1)

Third, use that information to read your files into a single list:

lapply(names(linesToSkip), 
       function(x) read.csv(file=x, skip = linesToSkip[x]))
# [[1]]
#   VARIABLE X1 X2
# 1        A  1  2
# 
# [[2]]
#   VARIABLE A1 A2
# 1        A  1  2
# 
# [[3]]
#   VARIABLE Z1 Z2
# 1        A  1  2
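
Since the question notes that all of the files share the same columns, the list produced above can also be stacked into one data frame. This is a small extension for illustration, not part of the original answer:

dataList <- lapply(names(linesToSkip), 
                   function(x) read.csv(file = x, skip = linesToSkip[x]))
combined <- do.call(rbind, dataList)  # one data frame with rows from all files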

Edit #1

An alternative to reading the data twice is to read it once into a list, and then perform the same type of processing:

myRawData <- lapply(list.files(pattern = "myfile.*.csv"), readLines)
lapply(myRawData, function(x) {
  linesToSkip <- grep("^VARIABLE", x)-1
  read.csv(text = x, skip = linesToSkip)
})
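
The advantage of this version is that each file is read from disk only once; the trade-off is that a file's full text sits in memory as a character vector while it is parsed.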

Or, for that matter:

lapply(list.files(pattern = "myfile.*.csv"), function(x) {
  temp <- readLines(x)
  linesToSkip <- grep("^VARIABLE", temp)-1
  read.csv(text = temp, skip = linesToSkip)
})

Edit #2

As @PaulHiemstra notes, you can use the argument n to read only the first few lines of each file into memory rather than the whole file. So if you know for certain that each file never has more than 20 lines of "rubbish", and you are using the first approach described above, you can use:

linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x, n = 20))-1)
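
Reading only the first 20 lines means the header search no longer pulls each entire file into memory before the real read.csv pass, which adds up across roughly 300 files.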
Answered Oct 20 '22 by A5C1D2H2I1M1N2O1R2T1


The fread function from the data.table package automatically detects the number of rows to skip. (At the time this answer was written, the function was still in development.)

Here is some example code:

require(data.table)

cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

lapply(list.files(pattern = "myfile.*.csv"), fread)
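
If the automatic detection ever guesses wrong, fread's skip argument also accepts a character string, in which case reading starts at the first line containing that string. A minimal sketch using the same sample files and the known header prefix:

# skip = "VARIABLE" makes fread jump to the first line containing that
# string, i.e. the header row, regardless of how much rubbish precedes it
lapply(list.files(pattern = "myfile.*.csv"), fread, skip = "VARIABLE")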
Answered Oct 20 '22 by djhurio