I got a very long telephone log as a text file and I have tried to read it into R, but it is not really working out. The text has a structure, but it is most certainly not a table. readLines or scan would have worked if one could have specified that records were separated by "\n\n" and that fields (or columns) were separated by "\n". Its structure is as in this example:
TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:56
blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:58
blay blay blah who knows what

TheInstitute 5467
telephone line 412552999 x 4999
bump phone line 4125527777
datetime 2011110516 12:59
blay blay blah who knows what

TheInstitute 5467
telephone line 4125526987 x 4567
bump phone line 4125527777
datetime 2011110516 13:51
blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 14:56
blay blay blah who knows what
How can I do this in R? I have tried tricks with scan, paste, and strsplit, but I am going in circles. I may have to get it into a list, since that can handle an unequal number of elements. I would like all the records to have the same number of fields, and records that lack a field (here the one called bump phone) should just have NA as the value in that field. I would appreciate help even just to get started. From there I can play and toy.
With multi.line = TRUE in the scan function, a record should end with two end-of-lines. I did this with textConnection around your text, but you would use a valid file name:
# txt holds the log text shown in the question; for a real file, pass its name instead
inp <- scan(textConnection(txt), multi.line=TRUE,
            what=list(place="character", tline1="character",
                      cline1="character", cline2="character", cline3="character"),
            sep="\n")
Read 5 records
> str(as.data.frame(inp))
'data.frame': 5 obs. of 5 variables:
$ place : Factor w/ 1 level "TheInstitute 5467": 1 1 1 1 1
$ tline1: Factor w/ 2 levels " telephone line 4125526987 x 4567",..: 1 1 2 1 1
$ cline1: Factor w/ 4 levels " bump phone line 4125527777",..: 2 3 1 1 4
$ cline2: Factor w/ 4 levels " blay blay blah who knows what",..: 2 1 3 4 1
$ cline3: Factor w/ 3 levels ""," blay blay blah who knows what",..: 1 1 2 3 1
> as.data.frame(inp)
place tline1
1 TheInstitute 5467 telephone line 4125526987 x 4567
2 TheInstitute 5467 telephone line 4125526987 x 4567
3 TheInstitute 5467 telephone line 412552999 x 4999
4 TheInstitute 5467 telephone line 4125526987 x 4567
5 TheInstitute 5467 telephone line 4125526987 x 4567
cline1
1 datetime 2011110516 12:56
2 datetime 2011110516 12:58
3 bump phone line 4125527777
4 bump phone line 4125527777
5 datetime 2011110516 14:56
cline2
1 blay blay blah who knows what, but anyway it may have a comma
2 blay blay blah who knows what
3 datetime 2011110516 12:59
4 datetime 2011110516 13:51
5 blay blay blah who knows what
cline3
1
2
3 blay blay blah who knows what
4 blay blay blah who knows what, but anyway it may have a comma
5
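In the scan result above, the optional bump phone line ends up in cline1 for some records while the datetime sits there for others, so the columns are not yet aligned. Here is a minimal sketch of one way to get a fixed set of fields with NA for a missing bump phone. It assumes that every record starts with the place line, that the other lines can be recognised by their leading keyword (telephone line, bump phone line, datetime), and that anything else is free text. The names parse_record, tline, bump and note are made up for illustration, and the sample text is shortened to two records from the question.

```r
# Two records from the question, separated by a blank line
txt <- "TheInstitute 5467
telephone line 4125526987 x 4567
datetime 2011110516 12:56
blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
telephone line 412552999 x 4999
bump phone line 4125527777
datetime 2011110516 12:59
blay blay blah who knows what"

# For a real file, readLines("yourfile.txt") would replace these three lines
con <- textConnection(txt)
lines <- trimws(readLines(con))
close(con)

# Each blank line bumps the running count, which serves as a record id
blank <- lines == ""
rec_id <- cumsum(blank)[!blank]
records <- split(lines[!blank], rec_id)

# Hypothetical helper: turn the lines of one record into a one-row data frame
parse_record <- function(rec) {
  # Return the first line matching the keyword pattern, or NA if none does
  pick <- function(pattern) {
    hit <- grep(pattern, rec, value = TRUE)
    if (length(hit) > 0) hit[1] else NA_character_
  }
  # Lines that are neither the place line nor a keyword line count as free text
  known <- grepl("^(telephone line|bump phone line|datetime)", rec)
  known[1] <- TRUE
  data.frame(
    place    = rec[1],
    tline    = sub("^telephone line\\s*", "", pick("^telephone line")),
    bump     = sub("^bump phone line\\s*", "", pick("^bump phone line")),
    datetime = sub("^datetime\\s*", "", pick("^datetime")),
    note     = paste(rec[!known], collapse = " "),
    stringsAsFactors = FALSE
  )
}

log_df <- do.call(rbind, lapply(records, parse_record))
rownames(log_df) <- NULL
print(log_df)
```

Keying each line on its leading keyword, rather than on its position within the record, is what lets a missing bump phone line come out as NA without shifting the other fields. A free-text line that happens to begin with one of the keywords would be misclassified, so the patterns may need tightening for the real log.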