I have to read in about 300 individual CSVs. I have managed to automate the process using a loop and structured CSV names. However, each CSV has 14-17 lines of rubbish at the start, and the number of lines varies randomly, so hard-coding a 'skip' parameter in the read.table command won't work. The column names and the number of columns are the same for each CSV.
Here is an example of what I am up against:
QUICK STATISTICS:
Directory: Data,,,,
File: Final_Comp_Zn_1
Selection: SEL{Ox*1000+Doma=1201}
Weight: None,,,
,,Variable: AG,,,
Total Number of Samples: 450212 Number of Selected Samples: 277
Statistics
VARIABLE,Min slice Y(m),Max slice Y(m),Count,Minimum,Maximum,Mean,Std.Dev.,Variance,Total Samples in Domain,Active Samples in Domain
AG, 6780.00, 6840.00, 7, 3.0000, 52.5000, 23.4143, 16.8507, 283.9469, 10, 10
AG, 6840.00, 6900.00, 4, 4.0000, 5.5000, 4.9500, 0.5766, 0.3325, 13, 13
AG, 6900.00, 6960.00, 16, 1.0000, 37.0000, 8.7625, 9.0047, 81.0848, 29, 29
AG, 6960.00, 7020.00, 58, 3.0000, 73.5000, 10.6931, 11.9087, 141.8172, 132, 132
AG, 7020.00, 7080.00, 23, 3.0000, 104.5000, 15.3435, 23.2233, 539.3207, 23, 23
AG, 7080.00, 7140.00, 33, 1.0000, 15.4000, 3.8152, 2.8441, 8.0892, 35, 35
Basically I want to read from the line VARIABLE,Min slice Y(m),Max slice Y(m),... onwards. I can think of a few solutions but I don't know how I would go about programming them. Is there any way I can get read.table to start reading when it finds the column names (since these are the same for each CSV) and ignore everything prior to that? I think this would be the most appropriate solution, but I am open to any suggestions!
Here's a minimal example of one approach that can be taken.
First, let's make up some csv files similar to the ones you describe:
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")
Second, identify where the data start:
# For each file, find the header line and count the lines above it
linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"),
                      function(x) grep("^VARIABLE", readLines(x)) - 1)
Third, use that information to read your files into a single list.
lapply(names(linesToSkip),
       function(x) read.csv(file = x, skip = linesToSkip[x]))
# [[1]]
# VARIABLE X1 X2
# 1 A 1 2
#
# [[2]]
# VARIABLE A1 A2
# 1 A 1 2
#
# [[3]]
# VARIABLE Z1 Z2
# 1 A 1 2
An alternative to reading the data twice is to read it once into a list, and then perform the same type of processing:
myRawData <- lapply(list.files(pattern = "myfile.*.csv"), readLines)
lapply(myRawData, function(x) {
  # x holds the lines of one file, already in memory
  linesToSkip <- grep("^VARIABLE", x) - 1
  read.csv(text = x, skip = linesToSkip)
})
Or, for that matter:
lapply(list.files(pattern = "myfile.*.csv"), function(x) {
  temp <- readLines(x)
  linesToSkip <- grep("^VARIABLE", temp) - 1
  read.csv(text = temp, skip = linesToSkip)
})
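One thing to watch with files like the one in the question is the space padding around the values. Here is a rough sketch of how that might be handled. It uses a shortened, made-up file in the question's style, and it assumes each data row begins with a space-padded variable name such as AG. The names sample_lines, path, raw, header_line and dat are just for illustration.

```r
# Hypothetical, shortened sample in the style of the question's files.
sample_lines <- c(
  "QUICK STATISTICS:",
  "Directory: Data,,,,",
  "Statistics",
  "VARIABLE,Min slice Y(m),Max slice Y(m),Count",
  "      AG,   6780.00,   6840.00,         7",
  "      AG,   6840.00,   6900.00,         4"
)
path <- tempfile(fileext = ".csv")
writeLines(sample_lines, path)

raw <- readLines(path)
header_line <- grep("^VARIABLE", raw)[1]   # first line starting with VARIABLE

dat <- read.csv(text = raw,
                skip = header_line - 1,    # drop everything above the header
                strip.white = TRUE,        # trim padding from character fields
                check.names = FALSE)       # keep names like "Min slice Y(m)" as-is
```

Without strip.white = TRUE the first column would keep its leading spaces, and without check.names = FALSE the column names would be rewritten into syntactically valid R names.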
As @PaulHiemstra notes, you can use the argument n of readLines to read only the first few lines of each file into memory, rather than the whole file. Thus, if you know for certain that there aren't more than 20 lines of "rubbish" in each file, and you are using the first approach described, you can use:
linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"),
                      function(x) grep("^VARIABLE", readLines(x, n = 20)) - 1)
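If you go the n = 20 route, it may be worth failing loudly when a file has no header line within that window, since a zero-length skip value would otherwise cause a confusing error later. A minimal sketch of that idea follows. The function name read_after_header and its arguments are made up for illustration and are not part of base R.

```r
# Hypothetical helper: read a CSV starting at the first line that matches
# header_pattern, looking only at the first max_lines lines of the file.
read_after_header <- function(path, header_pattern = "^VARIABLE", max_lines = 20) {
  head_lines <- readLines(path, n = max_lines)
  hit <- grep(header_pattern, head_lines)
  if (length(hit) == 0L) {
    stop("No header line matching '", header_pattern, "' in the first ",
         max_lines, " lines of ", path)
  }
  # Skip everything above the first matching line
  read.csv(path, skip = hit[1] - 1L)
}

# Small made-up file to try it on
demo <- tempfile(fileext = ".csv")
cat("blah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file = demo)
res <- read_after_header(demo)
```

You could then call it over all your files with something like lapply(list.files(pattern = "myfile.*.csv"), read_after_header).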
The function fread from the package data.table automatically detects the number of rows to skip. The function is currently in the development stage.
Here is some example code:
require(data.table)
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")
lapply(list.files(pattern = "myfile.*.csv"), fread)
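If you would rather not rely on the automatic detection, fread also accepts a character string for its skip argument, in which case it searches for that string and starts reading at the line containing it. A small sketch, assuming data.table is installed and that the string "VARIABLE" first appears in the header row (demo and dt are illustrative names):

```r
library(data.table)

# Made-up file with three junk lines above the header
demo <- tempfile(fileext = ".csv")
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file = demo)

# skip = "VARIABLE" starts reading at the first line containing that string;
# header = TRUE is stated explicitly rather than left to auto-detection.
dt <- fread(demo, skip = "VARIABLE", header = TRUE)
```

Since the real files all share the same columns, the per-file results could then be stacked with rbindlist from the same package.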