How do I read a text file into R when each record is a paragraph and some records have 4 fields and others have 6?

Tags:

r

How can one read in a text file in which each record is a paragraph and each newline denotes a separate field? The complication is that some records have 4 lines and some have 6. @DWin nailed my question when the difference in the number of fields was 1, but it all fell apart when it was two. You can have a look at his answer here.

So here is my latest simulation of the starting text:

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:56
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:58
  blay blay blah who knows what

TheInstitute 5467
  telephone line 412552999 x 4999
  bump phone line 4125527777
  bump pony pony oops 4125527777
  datetime 2011110516 12:59
  blay blay blah who knows what

TheInstitute 5467
  telephone line 4125526987 x 4567
  bump phone line 4125527777
  bump pony pony oops 4125527777
  datetime 2011110516 13:51
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 14:56
  blay blay blah who knows what  

Here is what the output should look like. In fact, this is one step removed from what I need. I am placing an ASCII text representation of an R data.frame below. You will see that everything is in a data frame, but the field values are shifted by two columns because some records have two extra fields.

structure(list(institution = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "TheInstitute 5467", class = "factor"), 
    telephoneline = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c("telephone line 4125526987 x 4567", 
    "telephone line 412552999 x 4999"), class = "factor"), date.or.bump = structure(c(2L, 
    3L, 1L, 1L, 4L), .Label = c("bump phone line 4125527777", 
    "datetime 2011110516 12:56", "datetime 2011110516 12:58", 
    "datetime 2011110516 14:56"), class = "factor"), field4 = structure(c(2L, 
    1L, 3L, 3L, 1L), .Label = c("blay blay blah who knows what", 
    "blay blay blah who knows what, but anyway it may have a comma", 
    "bump pony pony oops 4125527777"), class = "factor"), field5 = structure(c(1L, 
    1L, 2L, 3L, 1L), .Label = c("", "datetime 2011110516 12:59", 
    "datetime 2011110516 13:51"), class = "factor"), field6 = structure(c(1L, 
    1L, 2L, 3L, 1L), .Label = c("", "blay blay blah who knows what", 
    "blay blay blah who knows what, but anyway it may have a comma"
    ), class = "factor")), .Names = c("institution", "telephoneline", 
"date.or.bump", "field4", "field5", "field6"), class = "data.frame", row.names = c(NA, 
-5L))

PS: Am I correct to believe that one posts a data frame by using dput, or can one save an .Rdata file directly here?

Farrel asked Dec 09 '11 22:12


3 Answers

There is probably a more elegant way, but this should get the job done.

x <- readLines("foo.txt")  # read data with readLines
nx <- !nchar(x)            # locate lines with only empty strings
# create a list (split by empty lines, with empty lines removed)
y <- split(x[!nx], cumsum(nx)[!nx])
# determine largest number of columns
maxLength <- max(sapply(y,length))
# pad each list element with empty strings
z <- lapply(y, function(x) c(x,rep("",maxLength-length(x))))
# create final matrix
out <- do.call(rbind, z)
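
To see why cumsum(nx) works as a grouping key, here is a minimal self-contained sketch of the same idea. The inline txt vector is made up for illustration and stands in for readLines("foo.txt"); the variable names are not from the answer above.

```r
# Hypothetical inline sample standing in for readLines("foo.txt")
txt <- c("TheInstitute 5467",
         "  telephone line 4125526987 x 4567",
         "  datetime 2011110516 12:56",
         "  blah",
         "",
         "TheInstitute 5467",
         "  telephone line 412552999 x 4999",
         "  bump phone line 4125527777",
         "  bump pony pony oops 4125527777",
         "  datetime 2011110516 12:59",
         "  blah")

blank <- !nzchar(txt)        # TRUE only on the empty separator line
record_id <- cumsum(blank)   # increases by one at every blank line
records <- split(txt[!blank], record_id[!blank])

# Pad every record to the length of the longest one, then bind the rows
width <- max(lengths(records))
padded <- lapply(records, function(r) c(r, rep("", width - length(r))))
out <- do.call(rbind, padded)
dim(out)  # one row per record, one column per field position
```

Because the counter only increases at blank lines, every line of a record shares one id, and split() can group on it directly.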

Update:

Here's another solution using plyr::rbind.fill:

library(plyr)              # provides rbind.fill
x <- readLines("foo.txt")  # read data with readLines
nx <- !nchar(x)            # locate lines with only empty strings
# create final data.frame
out <- rbind.fill(lapply(split(x[!nx], cumsum(nx)[!nx]),
                    function(x) data.frame(t(x))))

Joshua Ulrich answered Sep 17 '22 15:09


Another strategy is to use a string of your choosing -- call it EOL -- to mark the end of each line, and then paste all of the lines together.

You can then use two rounds of strsplit to first break out records, and then break out fields within records. (Records will be separated by two consecutive EOLs, while fields will be separated by a single EOL).

EOL <- "!@"  # (for instance)
x <- readLines("filename.R")
x <- paste(x, collapse=EOL)[[1]]

x <- strsplit(x, paste(EOL, EOL, sep=""))         # Split apart records
lapply(x, FUN=function(X) strsplit(X, EOL))[[1]]  # Split apart fields w/in records

This method appeals to me because it's close to what I'd like to do when I read in the file in the first place (i.e. use "\n\n" as the sep character), but am not able to do with either scan or readLines.
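
One caveat, offered as a side note rather than a correction: strsplit treats its split argument as a regular expression by default, so a marker containing regex metacharacters (such as "|") would not be matched literally. A minimal sketch of the same two-round split with fixed = TRUE, using a made-up marker and made-up inline lines:

```r
EOL <- "|#|"   # hypothetical marker that contains a regex metacharacter
lines <- c("rec1 field1", "  rec1 field2", "",
           "rec2 field1", "  rec2 field2", "  rec2 field3")

joined <- paste(lines, collapse = EOL)

# Two consecutive markers separate records; fixed = TRUE matches them literally
records <- strsplit(joined, paste0(EOL, EOL), fixed = TRUE)[[1]]

# A single marker separates the fields within each record
fields <- strsplit(records, EOL, fixed = TRUE)
lengths(fields)  # number of fields found in each record
```

With the "!@" marker used above the default regex matching happens to be harmless, since neither character is special.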

Josh O'Brien answered Sep 17 '22 15:09


Read data in.

dat <- readLines("filename.txt")

Split the data into records (inspired by Josh O'Brien's solution).

dat_rec <- lapply(strsplit(paste(dat,collapse="\n"),split="\n\n")[[1]],
                  function(x) strsplit(x,split="\n")[[1]])

Transform the data into named vectors (assuming the last field of each record is a comment and each value starts with a digit). Note that the comment field is dropped in this step.

dat_rec_vn <- lapply(dat_rec,function(x) {
                           vn <- gsub(" ","_",sub("  ","",
                                        gsub("^(\\D*) \\d.*$","\\1", x[-length(x)])))
                           y <- gsub("^(\\D*) (\\d.*)$","\\2",x[-length(x)])
                           names(y) <- vn
                           return(y)})

Get the unique field names in the data.

 vn <- unique(unlist(lapply(dat_rec_vn,names),use.names=FALSE))

Combine the fields into a matrix and name its columns.

 dat_mat <- do.call(rbind,lapply(dat_rec_vn,function(x) {
                     y <- vector(mode="character",length=length(vn))
                     y[match(names(x),vn)] <- x
                     return(y)}))

colnames(dat_mat) <- vn
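
The alignment step relies on match() placing each value at the position of its field name in vn, so records with extra fields no longer shift the later columns. Here is a small sketch of that step alone, using two made-up named vectors (recs, aligned and df are illustrative names, not from the code above), followed by a conversion to a data frame:

```r
# Hypothetical named records: the second one has an extra "bump" field
recs <- list(
  c(TheInstitute = "5467",
    telephone_line = "4125526987 x 4567",
    datetime = "2011110516 12:56"),
  c(TheInstitute = "5467",
    telephone_line = "412552999 x 4999",
    bump_phone_line = "4125527777",
    datetime = "2011110516 12:59")
)

# All field names, in order of first appearance
vn <- unique(unlist(lapply(recs, names), use.names = FALSE))

# Put each value in the slot that matches its name; missing fields stay ""
aligned <- do.call(rbind, lapply(recs, function(r) {
  row <- character(length(vn))
  row[match(names(r), vn)] <- r
  row
}))
colnames(aligned) <- vn

df <- as.data.frame(aligned, stringsAsFactors = FALSE)
df$datetime  # both datetime values now sit in the same column
```

Columns appear in order of first occurrence, so fields that exist only in longer records end up after the common ones.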

SECOND solution (using gawk)

gawk_cmd <- "gawk 'BEGIN{FS=\"\\n\";RS=\"\";OFS=\"\\t\";ORS=\"\\n\"} 
                        {$1=$1; print $0}' test_multi.txt"
dat <- strsplit(system(gawk_cmd,intern=TRUE),split="\t")
NF <- do.call(max,lapply(dat,length))
M <- do.call(rbind,lapply(dat,"[",seq(NF)))

Wojciech Sobala answered Sep 17 '22 15:09