How can one read in a text file in which each record is a paragraph and each newline denotes a separate field? The complication is that some records have 4 lines and some have 6. @DWin nailed my question when the difference in the number of fields was one, but it all fell apart when it was two. You can have a look at his answer here.
So here is my latest simulation of the starting text:

TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:56
blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:58
blay blay blah who knows what

TheInstitute 5467
telephone line 412552999 x 4999
bump phone line 4125527777
bump pony pony oops 4125527777
datetime 2011110516 12:59
blay blay blah who knows what

TheInstitute 5467
telephone line 4125526987 x 4567
bump phone line 4125527777
bump pony pony oops 4125527777
datetime 2011110516 13:51
blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 14:56
blay blay blah who knows what

Here is what the output should look like. In fact, this is one step removed from what I need. I am placing an ASCII text representation of an R data.frame below. You will see that everything is in a data frame, but in some rows the field values are shifted by two columns because those records have two extra fields.
structure(list(institution = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "TheInstitute 5467", class = "factor"),
telephoneline = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c("telephone line 4125526987 x 4567",
"telephone line 412552999 x 4999"), class = "factor"), date.or.bump = structure(c(2L,
3L, 1L, 1L, 4L), .Label = c("bump phone line 4125527777",
"datetime 2011110516 12:56", "datetime 2011110516 12:58",
"datetime 2011110516 14:56"), class = "factor"), field4 = structure(c(2L,
1L, 3L, 3L, 1L), .Label = c("blay blay blah who knows what",
"blay blay blah who knows what, but anyway it may have a comma",
"bump pony pony oops 4125527777"), class = "factor"), field5 = structure(c(1L,
1L, 2L, 3L, 1L), .Label = c("", "datetime 2011110516 12:59",
"datetime 2011110516 13:51"), class = "factor"), field6 = structure(c(1L,
1L, 2L, 3L, 1L), .Label = c("", "blay blay blah who knows what",
"blay blay blah who knows what, but anyway it may have a comma"
), class = "factor")), .Names = c("institution", "telephoneline",
"date.or.bump", "field4", "field5", "field6"), class = "data.frame", row.names = c(NA,
-5L))
PS: Am I correct to believe that one posts a data frame by using dput, or can one save an .Rdata file directly here?
There is probably a more elegant way, but this should get the job done.
x <- readLines("foo.txt") # read data with readLines
nx <- !nchar(x) # locate lines with only empty strings
# create a list (split by empty lines, with empty lines removed)
y <- split(x[!nx], cumsum(nx)[!nx])
# determine largest number of columns
maxLength <- max(sapply(y,length))
# pad each list element with empty strings
z <- lapply(y, function(x) c(x,rep("",maxLength-length(x))))
# create final matrix
out <- do.call(rbind, z)
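To see why the cumsum trick groups lines into records, here is a minimal sketch that runs the same steps on a small inline vector instead of a file. The sample lines are a shortened, made-up stand-in for what readLines("foo.txt") would return, and it assumes that separator lines are truly empty (a line containing only spaces would not count as blank).

```r
# Inline stand-in for readLines("foo.txt"); the empty string separates records
x <- c("TheInstitute 5467",
       "datetime 2011110516 12:56",
       "",
       "TheInstitute 5467",
       "bump phone line 4125527777",
       "datetime 2011110516 12:59")

nx  <- !nchar(x)            # TRUE only on the blank separator line
ids <- cumsum(nx)[!nx]      # record id for each non-blank line
y   <- split(x[!nx], ids)   # list of records, blank lines dropped

# pad shorter records with "" so every row has the same length
maxLength <- max(sapply(y, length))
z   <- lapply(y, function(r) c(r, rep("", maxLength - length(r))))
out <- do.call(rbind, z)    # character matrix, one row per record
```

Each blank line bumps the running count by one, so every line after it gets a new record id; the first record here has two fields and is padded with one empty string to match the three-field record.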
Update:
Here's another solution using plyr::rbind.fill:
library(plyr) # provides rbind.fill
x <- readLines("foo.txt") # read data with readLines
nx <- !nchar(x) # locate lines with only empty strings
# create final data.frame
out <- rbind.fill(lapply(split(x[!nx], cumsum(nx)[!nx]),
function(x) data.frame(t(x))))
Another strategy is to use a string of your choosing (call it EOL) to mark the end of each line, and then paste all of the lines together. You can then use two rounds of strsplit to first break out records, and then break out fields within records. (Records will be separated by two consecutive EOLs, while fields will be separated by a single EOL.)
EOL <- "!@" # (for instance)
x <- readLines("filename.R")
x <- paste(x, collapse=EOL)[[1]]
x <- strsplit(x, paste(EOL, EOL, sep="")) # Split apart records
lapply(x, FUN=function(X) strsplit(X, EOL))[[1]] # Split apart fields w/in records
This method appeals to me because it's close to what I'd like to do when I read in the file in the first place (i.e. use "\n\n" as the sep character), but am not able to do with either scan or readLines.
Read data in.
dat <- readLines("filename.txt")
Split data by records (inspired by Josh O'Brien's solution).
dat_rec <- lapply(strsplit(paste(dat,collapse="\n"),split="\n\n")[[1]],
function(x) strsplit(x,split="\n")[[1]])
Transform data to named vectors (this assumes the last field of each record is a comment and that each value starts with a digit).
dat_rec_vn <- lapply(dat_rec,function(x) {
vn <- gsub(" ","_",sub(" ","",
gsub("^(\\D*) \\d.*$","\\1", x[-length(x)])))
y <- gsub("^(\\D*) (\\d.*)$","\\2",x[-length(x)])
names(y) <- vn
return(y)})
Get the unique field names in the data.
vn <- unique(unlist(lapply(dat_rec_vn,names),use.names=FALSE))
Combine the fields into a matrix and give it column names.
dat_mat <- do.call(rbind,lapply(dat_rec_vn,function(x) {
y <- vector(mode="character",length=length(vn))
y[match(names(x),vn)] <- x
return(y)}))
colnames(dat_mat) <- vn
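The match() step is what keeps values in the right column when a record lacks some fields. A minimal sketch of just that step, using two made-up named vectors (the field names and values here are illustrative assumptions, not output of the code above):

```r
# Two hypothetical records; the second has an extra "bump" field
recs <- list(
  c(institute = "5467", datetime = "2011110516 12:56"),
  c(institute = "5467", bump = "4125527777", datetime = "2011110516 12:59")
)

# all field names, in order of first appearance
vn <- unique(unlist(lapply(recs, names), use.names = FALSE))

mat <- do.call(rbind, lapply(recs, function(r) {
  row <- character(length(vn))     # start with all empty strings
  row[match(names(r), vn)] <- r    # place each value in its named column
  row
}))
colnames(mat) <- vn
```

Because values are placed by name rather than by position, the datetime of both records ends up in the same column, and the first record simply has an empty string under bump.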
Second solution (using gawk):
gawk_cmd <- "gawk 'BEGIN{FS=\"\\n\";RS=\"\";OFS=\"\\t\";ORS=\"\\n\"}
{$1=$1; print $0}' test_multi.txt"
dat <- strsplit(system(gawk_cmd,intern=TRUE),split="\t")
NF <- do.call(max,lapply(dat,length))
M <- do.call(rbind,lapply(dat,"[",seq(NF)))