 

Read data from a multi separated csv file in R

Tags: r, csv

I'm trying to read a csv file into R. The problem is that the file has two separators, and I don't know how to read it as a 3-column data frame, i.e. first, second, and year. This is a sample of what the file looks like:

[Alin Deutsch, Mary F. Fernandez, 1998],  
[Alin Deutsch, Daniela Florescu, 1998],

I've tried the fread() function with sep="[" and sep2=",", but it does not work: R just reads each row as a single column. Thanks.

sarashaker asked Jan 07 '16 12:01



2 Answers

1) read.table/sub Read it in using sep = "," and comment.char = "]". That splits the fields and discards the trailing ] and everything after it, so all that remains is to remove the [ from the first column (Name1) with sub:

Lines <- "[Alin Deutsch, Mary F. Fernandez, 1998],  
[Alin Deutsch, Daniela Florescu, 1998],"

DF <- read.table(text = Lines, sep = ",", comment.char = "]", as.is = TRUE,
          strip.white = TRUE, # might not need this one
          col.names = c("Name1", "Name2", "Year"))
DF <- transform(DF, Name1 = sub("[", "", Name1, fixed = TRUE))

giving:

> DF
         Name1             Name2 Year
1 Alin Deutsch Mary F. Fernandez 1998
2 Alin Deutsch  Daniela Florescu 1998

2) read.pattern Another possibility is to use read.pattern from the gsubfn package. This pattern assumes that each line begins with [ and contains three commas, the last of which is preceded by ]. That matches the sample in the question; if the real file differs, the regular expression would need to be changed.

library(gsubfn)

read.pattern(text = Lines, pattern = ".(.*?),(.*?),(.*?).,", as.is = TRUE,
        strip.white = TRUE, # might not need this one
        col.names = c("Name1", "Name2", "Year"))

giving the same.
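If installing gsubfn is not an option, the same capture-group idea can be sketched in base R with regexec and regmatches. This is only an illustration under the same assumptions about the line shape; the names x, pat, parts and DF3 are made up for the example, and the pattern uses negated character classes instead of the lazy .*? above.

```r
# The two sample lines from the question, one string per line.
x <- c("[Alin Deutsch, Mary F. Fernandez, 1998],  ",
       "[Alin Deutsch, Daniela Florescu, 1998],")

# One capture group per field: "[", field, ",", field, ",", field, "]".
pat <- "^\\[([^,]*),([^,]*),([^]]*)\\]"

# regexec() returns the full match plus the three groups for each line.
m <- regmatches(x, regexec(pat, x))

# Stack the matches into a matrix and drop the full-match column.
parts <- do.call(rbind, m)[, -1, drop = FALSE]

DF3 <- data.frame(Name1 = trimws(parts[, 1]),
                  Name2 = trimws(parts[, 2]),
                  Year  = as.integer(trimws(parts[, 3])),
                  stringsAsFactors = FALSE)
DF3
```

Lines that do not match the pattern come back as empty vectors and are silently dropped by rbind, so it may be worth comparing nrow(DF3) with length(x) on a real file.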

G. Grothendieck answered Nov 08 '22 01:11



You could read the file with sep="," and then remove the extra brackets:

df <- read.csv(file = textConnection("[Alin Deutsch, Mary F. Fernandez, 1998],  
[Alin Deutsch, Daniela Florescu, 1998],"), stringsAsFactors = FALSE, header = FALSE)

df <- df[,-4]

df$V1 <- gsub("\\[","",df$V1)
df$V3 <- gsub("\\]","",df$V3)

names(df) <- c("first","second","year")
df

output

         first             second  year
1 Alin Deutsch  Mary F. Fernandez  1998
2 Alin Deutsch   Daniela Florescu  1998
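The extra spaces in that output are leading blanks left over from the ", " separators, and year is still a character column. A possible follow-up, sketched here with a hypothetical data frame name authors, strips both brackets and the surrounding whitespace from every column in one pass and then converts year to integer:

```r
# Same sample as above; header = FALSE because the file has no header row.
authors <- read.csv(text = c("[Alin Deutsch, Mary F. Fernandez, 1998],  ",
                             "[Alin Deutsch, Daniela Florescu, 1998],"),
                    header = FALSE, stringsAsFactors = FALSE)

# Keep the three real fields; the 4th column comes from the trailing comma.
authors <- authors[, 1:3]

# Remove "[" and "]" everywhere, then trim leading/trailing whitespace.
authors[] <- lapply(authors, function(col) trimws(gsub("\\[|\\]", "", col)))

names(authors) <- c("first", "second", "year")
authors$year <- as.integer(authors$year)
str(authors)
```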
scoa answered Nov 08 '22 02:11
