Reading in multiple CSVs with different numbers of lines to skip at start of file

I have to read in about 300 individual CSVs. I have managed to automate the process using a loop and structured CSV names. However, each CSV has 14-17 lines of rubbish at the start, and the number varies randomly, so hard-coding a 'skip' parameter in the read.table command won't work. The column names and the number of columns are the same for each CSV.

Here is an example of what I am up against:

QUICK STATISTICS:

      Directory: Data,,,,
           File: Final_Comp_Zn_1
      Selection: SEL{Ox*1000+Doma=1201}
         Weight: None,,,
     ,,Variable: AG,,,

Total Number of Samples: 450212  Number of Selected Samples: 277


Statistics

VARIABLE,Min slice Y(m),Max slice Y(m),Count,Minimum,Maximum,Mean,Std.Dev.,Variance,Total Samples in Domain,Active Samples in Domain
AG,   6780.00,   6840.00,         7,    3.0000,   52.5000,   23.4143,   16.8507,  283.9469,        10,        10
AG,   6840.00,   6900.00,         4,    4.0000,    5.5000,    4.9500,    0.5766,    0.3325,        13,        13
AG,   6900.00,   6960.00,        16,    1.0000,   37.0000,    8.7625,    9.0047,   81.0848,        29,        29
AG,   6960.00,   7020.00,        58,    3.0000,   73.5000,   10.6931,   11.9087,  141.8172,       132,       132
AG,   7020.00,   7080.00,        23,    3.0000,  104.5000,   15.3435,   23.2233,  539.3207,        23,        23
AG,   7080.00,   7140.00,        33,    1.0000,   15.4000,    3.8152,    2.8441,    8.0892,        35,        35
AG,

Basically, I want to read from the line VARIABLE,Min slice Y(m),Max slice Y(m),.... I can think of a few solutions, but I don't know how I would go about programming them. Is there any way I can:

1. Read the CSV first and somehow work out how many lines of rubbish there are, and then re-read it and specify the correct number of lines to skip? Or
2. Tell read.table to start reading when it finds the column names (since these are the same for each CSV) and ignore everything prior to that?

I think solution (2) would be the most appropriate, but I am open to any suggestions!

asked Mar 11 '13 06:03

LoveMeow

2 Answers

Here's a minimal example of one approach that can be taken.

First, let's make up some csv files similar to the ones you describe:

cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

Second, identify where the data start:

linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x))-1)

Third, use that information to read your files into a single list:

lapply(names(linesToSkip), 
       function(x) read.csv(file=x, skip = linesToSkip[x]))
# [[1]]
#   VARIABLE X1 X2
# 1        A  1  2
# 
# [[2]]
#   VARIABLE A1 A2
# 1        A  1  2
# 
# [[3]]
#   VARIABLE Z1 Z2
# 1        A  1  2
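
Since the question says every CSV has the same columns, the resulting list can also be named by file and stacked into one data frame. The following is only a sketch under that assumption: the sample files, the temporary directory, and the variable names (`demo_dir`, `lines_to_skip`, `tables`, `combined`) are made up for illustration.

```r
# Work in a temporary directory so the demo does not touch real data.
demo_dir <- tempfile("skipdemo")
dir.create(demo_dir)

# Two made-up sample files with identical columns but different preambles.
cat("blah\nblah\nVARIABLE,X1,X2\nA,1,2\n",
    file = file.path(demo_dir, "myfile1.csv"))
cat("blah\nVARIABLE,X1,X2\nB,3,4\n",
    file = file.path(demo_dir, "myfile2.csv"))

files <- list.files(demo_dir, pattern = "^myfile.*\\.csv$", full.names = TRUE)

# Number of lines before the header row, one value per file.
lines_to_skip <- vapply(files,
                        function(f) grep("^VARIABLE", readLines(f))[1] - 1L,
                        integer(1))

# Read each file, then name the list elements after their source files.
tables <- Map(function(f, s) read.csv(f, skip = s), files, lines_to_skip)
names(tables) <- basename(files)

# Because the columns match, the pieces can be stacked into one data frame.
combined <- do.call(rbind, tables)
```

Naming the list elements keeps track of which file each table came from; `do.call(rbind, ...)` only works when the column names really are identical across files.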

Edit #1

An alternative to reading the data twice is to read it once into a list, and then perform the same type of processing:

myRawData <- lapply(list.files(pattern = "myfile.*.csv"), readLines)
lapply(myRawData, function(x) {
  linesToSkip <- grep("^VARIABLE", x)-1
  read.csv(text = x, skip = linesToSkip)
})

Or, for that matter:

lapply(list.files(pattern = "myfile.*.csv"), function(x) {
  temp <- readLines(x)
  linesToSkip <- grep("^VARIABLE", temp)-1
  read.csv(text = temp, skip = linesToSkip)
})

Edit #2

As @PaulHiemstra notes, you can use the n argument of readLines to read only the first few lines of each file into memory, rather than the whole file. Thus, if you know for certain that there are never more than 20 lines of "rubbish" in a file, and you are using the first approach described, you can use:

linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x, n = 20))-1)
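
One caveat with limiting the search to n lines: if a file has no header line within that window, grep() returns an empty result and the skip value is no longer a single number. A minimal sketch of a guard follows; the helper name `read_after_header` and its arguments are hypothetical, and the sample files are made up.

```r
# Hypothetical helper: find the header within the first `n` lines and read
# the file from there, stopping with a clear message if it is not found.
read_after_header <- function(path, header_pattern = "^VARIABLE", n = 20) {
  head_lines <- readLines(path, n = n)
  hit <- grep(header_pattern, head_lines)
  if (length(hit) == 0L) {
    stop("No header matching '", header_pattern, "' in the first ", n,
         " lines of ", path)
  }
  # Skip everything before the first matching line.
  read.csv(path, skip = hit[1] - 1L)
}

# Demo on temporary files with made-up contents.
ok  <- tempfile(fileext = ".csv")
bad <- tempfile(fileext = ".csv")
cat("blah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file = ok)
cat("blah\nblah\nno header here\n", file = bad)

good_result <- read_after_header(ok)
bad_result  <- tryCatch(read_after_header(bad),
                        error = function(e) conditionMessage(e))
```

With about 300 files, failing loudly on the one malformed file is usually easier to debug than a cryptic error from read.csv later on.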

answered Oct 20 '22 16:10

A5C1D2H2I1M1N2O1R2T1

The fread function from the data.table package automatically detects the number of rows to skip. (At the time this answer was written, the function was still under development.)

Here is an example:

library(data.table)

cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

lapply(list.files(pattern = "myfile.*.csv"), fread)
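
If the automatic detection ever picks the wrong starting line, fread also accepts a string for its skip argument, in which case it starts reading at the first line containing that string. A small sketch, assuming the data.table package is installed and using a made-up temporary file:

```r
library(data.table)

# Sample file (made-up contents) with two lines of preamble.
f <- tempfile(fileext = ".csv")
cat("blah\nblah\nVARIABLE,X1,X2\nA,1,2\nB,3,4\n", file = f)

# skip = "VARIABLE" makes fread start at the first line containing that string.
dt <- fread(f, skip = "VARIABLE")
```

The match is a plain substring search, so the string should be something that appears in the header row and not in the preamble.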

answered Oct 20 '22 16:10

djhurio
