R read comma delimited txt file with comma inside one column

Tags:

url

r

csv

I have logs of some users' browsing behavior. The data collector apparently used commas to separate the variables, but some URLs contain commas themselves, so I can't read the txt file into R.

20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank

The URLs above should be:

http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1

https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp

How can I tell R that each line has exactly 10 variables, so that any extra commas are kept inside the URL? Thanks!

df <- read.table('2009.txt', sep = ',', quote = '', comment.char = '', stringsAsFactors = FALSE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 130 did not have 10 elements
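
One way to see which lines trip up `read.table()` is to count the comma-separated fields per line first. The sketch below is only an illustration and not part of the original post: it applies base R's `count.fields()` to two records copied from the sample above, and the names `sample_lines`, `n_fields`, and `bad_lines` are made up for the example.

```r
# Two records from the sample; the first has a comma inside its URL.
sample_lines <- c(
  "20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C",
  "20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank"
)

# Count fields per line with quoting and comments disabled,
# mirroring the arguments of the read.table() call.
con <- textConnection(sample_lines)
n_fields <- count.fields(con, sep = ",", quote = "", comment.char = "")
close(con)

# Indices of the lines that do not have the expected 10 fields.
bad_lines <- which(n_fields != 10)
```

On the real file, the same call would presumably take the file name directly, e.g. `count.fields('2009.txt', sep = ',', quote = '', comment.char = '')`.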
leoce asked Sep 30 '14 08:09


1 Answer

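You could try the following, which replaces every comma outside the URL with "*" and then reads the result using "*" as the separator (`Lines` is defined under "data" below):

  # "https?:.*(?=,www|,,)" consumes the URL up to the last ",www" or ",,";
  # (*SKIP)(*F) then discards that match, so the final "|," only matches
  # commas outside the URL. quote/comment.char are disabled so that quote
  # characters or "#" in a URL are read literally.
  dat <- read.table(text=gsub("https?:.*(?=,www|,,)(*SKIP)(*F)|,", "*",
           Lines, perl=TRUE), sep="*", header=FALSE, quote="",
           comment.char="", stringsAsFactors=FALSE)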


  dat
  #     V1                  V2 V3           V4                            V5
  #1 20091 2009-06-02 22:06:14 84   taobao.com            search1.taobao.com
  #2 20092 2009-06-16 12:25:35  8     sohu.com              www.wap.sohu.com
  #3 20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
  #   V6
  #1 http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
  #2 http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
  #3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
  #              V7       V8            V9        V10
  #1 www.taobao.com shopping    e-commerce        C2C
  #2   www.sohu.com   portal entertainment     mobile
  #3                  others     marketing enterprise
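
To make the `(*SKIP)(*F)` idiom easier to follow, here is a small sketch on a made-up line (the string `x` and the names `marked` and `fields` are assumptions for illustration, not from the answer). The first alternative matches the URL and is then forced to fail, which makes the regex engine jump past it; only the commas left over are matched by the second alternative and replaced.

```r
# Toy line (made up): the URL field "http://h.com/p,q" contains one comma.
x <- "a,http://h.com/p,q,www.h.com,z"

# Skip everything from "http:" up to the last ",www" or ",,",
# then replace each remaining comma with "*".
marked <- gsub("http:.*(?=,www|,,)(*SKIP)(*F)|,", "*", x, perl = TRUE)
# marked is "a*http://h.com/p,q*www.h.com*z"

# Splitting on the literal "*" recovers the four intended fields.
fields <- strsplit(marked, "*", fixed = TRUE)[[1]]
```

Note that this trick relies on "*" never occurring in the data; any character that is guaranteed to be absent would presumably work as well.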

data

 Lines <-  readLines(textConnection(txt)) #(`txt` from @Richard Scriven)

Update

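Using your new dataset, the pattern is built from the host labels found in the URLs themselves (e.g. "www", "search1") instead of hard-coding "www":

 # Lines that contain a URL
 indx <- grep("http", Lines)
 Lines1 <- Lines[indx]
 # Host label that follows the last "http://" or "https://" on each line
 pat1 <- paste(unique(gsub(".*http[s]?.{3}(\\w+)\\..*", "\\1", Lines1)), collapse="|")
 # Skip the URL up to the last comma that is followed by one of those labels
 # (or by another comma); the commas left over are the real separators
 pat1N <- paste0("https?:.*(?=,(", pat1, "|,))(*SKIP)(*F)|,")

 dat1 <- read.table(text=gsub(pat1N, "*", Lines, perl=TRUE),
                    sep="*", header=FALSE, quote="", comment.char="",
                    stringsAsFactors=FALSE)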

 dat1
 #       V1                  V2 V3           V4                            V5
 #1   20091 2009-06-02 22:06:14 84   taobao.com            search1.taobao.com
 #2   20092 2009-06-16 12:25:35  8     sohu.com              www.wap.sohu.com
 #3   20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
 #4   20096 2009-06-30 07:51:38  7   taobao.com            search1.taobao.com
 #5 2009184 2009-06-25 14:40:39  6  mktginc.com              surv.mktginc.com
 #6   20092 2009-06-07 15:13:06 32   ccb.com.cn          ibsbjstar.ccb.com.cn
 #   V6
 #1 http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
 #2 http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
 #3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
 #4 http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1
 #5
 #6 https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp
 #                  V7       V8            V9        V10
 #1     www.taobao.com shopping    e-commerce        C2C
 #2       www.sohu.com   portal entertainment     mobile
 #3                      others     marketing enterprise
 #4 search1.taobao.com shopping    e-commerce        C2C
 #5                     unknown       unknown    unknown
 #6                      e-bank       finance     e-bank
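
As an alternative that leans directly on the "exactly 10 variables" constraint from the question, one could split every line on all commas and then glue the surplus pieces back into the sixth field. This is a sketch under the assumption that only the URL (field 6) can contain extra commas; it is not the method used in this answer, and `split_fixed`, `sample_lines`, and `dat2` are hypothetical names.

```r
# Hypothetical helper (not from the answer): split one raw line into exactly
# `n_fields` fields, assuming only field `url_pos` can contain extra commas
# and that at least one field follows it.
split_fixed <- function(line, n_fields = 10, url_pos = 6) {
  # strsplit() drops a trailing empty field, so append a sentinel comma first.
  parts <- strsplit(paste0(line, ","), ",", fixed = TRUE)[[1]]
  n <- length(parts)
  if (n < n_fields) stop("line has fewer than ", n_fields, " fields")
  n_tail <- n_fields - url_pos                 # number of fields after the URL
  c(parts[seq_len(url_pos - 1)],               # fields before the URL
    paste(parts[url_pos:(n - n_tail)], collapse = ","),  # URL, commas restored
    parts[(n - n_tail + 1):n])                 # fields after the URL
}

# Two records from the sample data: one URL with commas, one empty URL.
sample_lines <- c(
  "20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise",
  "2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown"
)

# One row per line; every column is character at this point.
dat2 <- as.data.frame(do.call(rbind, lapply(sample_lines, split_fixed)),
                      stringsAsFactors = FALSE)
```

Because the split is purely positional, this variant does not depend on what the URL or the following host name looks like, but it breaks down if any other field can also contain a comma.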

data

 txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank'

  Lines <- readLines(textConnection(txt))
akrun answered Sep 21 '22 01:09