
How do I read a text file into R when the data is not in a table

Tags: r

I have a very long telephone log as a text file and have tried to read it into R, but it is not working out. The text has a structure, but it is certainly not a table. Its structure is as follows:

  1. Each record is composed of multiple lines so readLines is not quite appropriate
  2. Each line of each record is a separate field
  3. Some records have an additional field after the second field
  4. Records are separated by a blank line. readLines or scan would have worked if one could specify that records are separated by "\n\n" and that fields (or columns) are separated by "\n"

Here is an example:

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:56
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:58
  blay blay blah who knows what

TheInstitute 5467
  telephone line 412552999 x 4999
  bump phone line 4125527777
  datetime 2011110516 12:59
  blay blay blah who knows what

TheInstitute 5467
  telephone line 4125526987 x 4567
  bump phone line 4125527777
  datetime 2011110516 13:51
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 14:56
  blay blay blah who knows what

How can I do this in R? I have tried tricks with scan, paste, and strsplit, but I am going around in circles. I may have to read it into a list, since a list can handle unequal numbers of elements. I would like all records to have the same number of fields; records that lack the optional field (here called bump phone) should simply have an NA in that field. I would appreciate help even just to get started. From there I can experiment.

Farrel asked Dec 07 '11 21:12


1 Answer

With multi.line = TRUE in the scan function, a record can span several lines, and each record should end with two end-of-lines (that is, a blank line). I did this with textConnection around your text, but you would use a valid file name instead:

> # txt holds the log text from the question as a single character string
> inp <- scan(textConnection(txt), multi.line=TRUE,
+             what=list(place="character", tline1="character",
+             cline1="character", cline2="character", cline3="character"), sep="\n")
Read 5 records
> str(as.data.frame(inp))
'data.frame':   5 obs. of  5 variables:
 $ place : Factor w/ 1 level "TheInstitute 5467": 1 1 1 1 1
 $ tline1: Factor w/ 2 levels "  telephone line 4125526987 x 4567",..: 1 1 2 1 1
 $ cline1: Factor w/ 4 levels "  bump phone line 4125527777",..: 2 3 1 1 4
 $ cline2: Factor w/ 4 levels "  blay blay blah who knows what",..: 2 1 3 4 1
 $ cline3: Factor w/ 3 levels "","  blay blay blah who knows what",..: 1 1 2 3 1
> as.data.frame(inp)
              place                             tline1
1 TheInstitute 5467   telephone line 4125526987 x 4567
2 TheInstitute 5467   telephone line 4125526987 x 4567
3 TheInstitute 5467    telephone line 412552999 x 4999
4 TheInstitute 5467   telephone line 4125526987 x 4567
5 TheInstitute 5467   telephone line 4125526987 x 4567
                        cline1
1    datetime 2011110516 12:56
2    datetime 2011110516 12:58
3   bump phone line 4125527777
4   bump phone line 4125527777
5    datetime 2011110516 14:56
                                                           cline2
1   blay blay blah who knows what, but anyway it may have a comma
2                                   blay blay blah who knows what
3                                       datetime 2011110516 12:59
4                                       datetime 2011110516 13:51
5                                   blay blay blah who knows what
                                                           cline3
1                                                                
2                                                                
3                                   blay blay blah who knows what
4   blay blay blah who knows what, but anyway it may have a comma
5                                                                
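In the output above, records without the optional line have their datetime in cline1, while records with it have the bump phone line there, so the columns are not yet aligned the way the question asks. The following is a minimal sketch of one way to realign them, not part of the original answer. It assumes the optional line always starts with "bump phone"; the names raw, has_bump and log_df are made up for illustration, and the sample text is shortened to two records.

```r
# Shortened sample: the first record has 4 lines, the second has the extra line.
txt <- "TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:56
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 412552999 x 4999
  bump phone line 4125527777
  datetime 2011110516 12:59
  blay blay blah who knows what
"

# Same scan call as in the answer, reading from a string via `text =`.
inp <- scan(text = txt, multi.line = TRUE, sep = "\n", quiet = TRUE,
            what = list(place = "", tline1 = "", cline1 = "",
                        cline2 = "", cline3 = ""))

# Keep plain character columns and drop the leading indentation.
raw <- as.data.frame(inp, stringsAsFactors = FALSE)
raw[] <- lapply(raw, trimws)

# Assumption: the optional field is recognisable by its "bump phone" prefix.
has_bump <- startsWith(raw$cline1, "bump phone")

# Shift the fields so every record has the same five columns,
# with NA where the bump phone line is absent.
log_df <- data.frame(
  place     = raw$place,
  telephone = raw$tline1,
  bump      = ifelse(has_bump, raw$cline1, NA_character_),
  datetime  = ifelse(has_bump, raw$cline2, raw$cline1),
  note      = ifelse(has_bump, raw$cline3, raw$cline2),
  stringsAsFactors = FALSE
)

print(log_df)
```

The Factor columns in the str output above reflect older R versions, where data frames converted strings to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE, and the sketch sets it explicitly so it behaves the same either way.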
IRTFM answered Nov 03 '22 01:11
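As an alternative to scan, the structure described in the question (blank line between records, one field per line, optional field after the second) can also be handled with readLines. This is only a sketch under those assumptions: it expects every record to have either four or five lines, and the names record_id, pad_record and log_df are hypothetical. A temporary file stands in for the real log file.

```r
# Shortened sample with three records; the second has the extra line.
txt <- "TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 12:56
  blay blay blah who knows what, but anyway it may have a comma

TheInstitute 5467
  telephone line 412552999 x 4999
  bump phone line 4125527777
  datetime 2011110516 12:59
  blay blay blah who knows what

TheInstitute 5467
  telephone line 4125526987 x 4567
  datetime 2011110516 14:56
  blay blay blah who knows what
"

# Write the sample to a temporary file; with real data, pass the log's path.
tmp <- tempfile(fileext = ".txt")
writeLines(txt, tmp)
lines <- trimws(readLines(tmp))

# A blank line starts a new record: number the records with a running count
# of blank lines, then drop the blank lines themselves.
blank <- lines == ""
record_id <- cumsum(blank)
records <- split(lines[!blank], record_id[!blank])

# Assumption: records have 4 or 5 lines, and a missing field is always the
# optional third one, so NA is inserted after the second field.
pad_record <- function(rec) {
  stopifnot(length(rec) %in% c(4, 5))
  if (length(rec) == 4) append(rec, NA_character_, after = 2) else rec
}

log_df <- as.data.frame(do.call(rbind, lapply(records, pad_record)),
                        stringsAsFactors = FALSE)
names(log_df) <- c("place", "telephone", "bump", "datetime", "note")
rownames(log_df) <- NULL

print(log_df)
```

Padding by position rather than by matching the text keeps the sketch independent of what the optional line says, at the cost of failing (via stopifnot) on any record with an unexpected number of lines.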