Read CSV with variable rows to skip, bulk

Tags: r

I am trying to write a loop that reads in multiple CSV files, all of which contain the same type of air temperature data. However, there are rows above the data that I want to skip. These are "alarms" in the dataset. Each file may have a different number of alarms, and thus a different number of rows to skip. See below:

-------------First CSV file---------------
Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2

-------------Second CSV file---------------
Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2

How can I find the row index of the Date, Temp header line so I can tell read.csv how many rows to skip?

for (i in 1:length(csv.list))  {
  df = read.csv(csv.list[i], header = T, skip=????????)
}
asked Apr 26 '14 by bevingtona

3 Answers

You could add a couple of lines prior to the read.table call in your loop.

Use readLines to read in the data:

r <- readLines(textConnection("Logger 001
               Alarm 1
               Alarm 2
               Alarm 3
               Alarm 4
               Date, Temp
               01/01/2011, -1.2
               01/02/2011, -1.3
               01/03/2011, -1.1
               01/04/2011, -1.2"))

(for your actual files you would drop the textConnection, i.e. r <- readLines("yourcsv"))

Find the row number where the actual headers begin, using grep:

dt <- grep("Date",r)

Read in your data, skipping the lines prior to the headers:

read.table(text = r, header = TRUE, sep = ",", skip = dt - 1)


So, to read in your multiple CSV files (the results will be stored in a list of data.frames):

df.lst <- lapply(csv.list, function(i) {
  r <- readLines(i)
  dt <- grep("Date", r)
  read.table(text = r, header = TRUE, sep = ",", skip = dt - 1)
})
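
If you then want a single combined data frame rather than a list, a minimal sketch (assuming every file shares the same Date/Temp columns; the df.all name is just for illustration):

df.all <- do.call(rbind, df.lst)  # stack the per-file data frames row-wise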
answered by user20650

fread from the "data.table" package might be useful for you since it is pretty good at auto-detecting "junk" header rows. Here's an example:

First, create two sample CSV files:

cat("Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2
", file = "socsv1.csv", sep = "\n")

cat("Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2", file = "socsv2.csv", sep = "\n")

## Check that they were created
list.files(pattern = "socsv")
# [1] "socsv1.csv" "socsv2.csv"

Now, just use fread and specify the sep instead of letting fread guess.

library(data.table)
lapply(list.files(pattern = "socsv"), fread, sep = ",")
# [[1]]
#          Date  Temp
# 1: 01/01/2011  -1.2
# 2: 01/02/2011  -1.3
# 3: 01/03/2011  -1.1
# 4: 01/04/2011  -1.2
# 
# [[2]]
#          Date  Temp
# 1: 01/01/2011  -1.2
# 2: 01/02/2011  -1.3
# 3: 01/03/2011  -1.1
# 4: 01/04/2011  -1.2
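
If you would rather end up with one data.table than a list, rbindlist (also from "data.table") can stack the results. A small sketch, where idcol = "file" is optional and simply records which file each row came from:

library(data.table)
DT <- rbindlist(lapply(list.files(pattern = "socsv"), fread, sep = ","),
                idcol = "file")  # "file" column holds the list index per row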
answered by A5C1D2H2I1M1N2O1R2T1


count.fields can identify the first line of each file that has two comma-separated fields. You can then use this to specify skip. For example:

writeLines('Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2', f <- tempfile())

writeLines('Logger 001
Alarm 1
Alarm 2
Alarm 3
Alarm 4
Alarm 5
Alarm 6
Alarm 7
Date, Temp
01/01/2011, -1.2
01/02/2011, -1.3
01/03/2011, -1.1
01/04/2011, -1.2', f2 <- tempfile())

for (x in c(f, f2))  {
  ind <- match(2, count.fields(x, ','))
  df <- read.csv(x, header = T, skip=ind - 1)
  print(df)
}

#         Date Temp
# 1 01/01/2011 -1.2
# 2 01/02/2011 -1.3
# 3 01/03/2011 -1.1
# 4 01/04/2011 -1.2
#         Date Temp
# 1 01/01/2011 -1.2
# 2 01/02/2011 -1.3
# 3 01/03/2011 -1.1
# 4 01/04/2011 -1.2
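
To see why this works, check what count.fields returns for the first sample file: one field count per line, so match(2, ...) finds the first line with two comma-separated fields, i.e. the header:

count.fields(f, ',')
# [1] 1 1 1 1 1 2 2 2 2 2
match(2, count.fields(f, ','))  # the "Date, Temp" header is line 6
# [1] 6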
answered by jbaums