Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse Thorn "þ" delimited log file in r

Tags:

r

I'm trying to read csv files (originally log files with the extension changed) that are thorn delimited.

However simple readLines or read.delim don't except "þ" as a separator and the default functions result in one columned data.frame

when I'm trying to print the data frame I get:

temp <- readLines("sample.csv")
temp

[1] "12-31-2014-21:01:11þ0þþ4382037þ8048852þ286809041þ60534606þ1þ0x0þ1594842þ107607879þþ256þMDþ28þ0.0þ22þ13þ14915þ20646þþ1420088471"                     
[2] "12-31-2014-21:01:16þ0þþ4382037þ8048852þ286810825þ60533173þ1þ160x600þ1594842þ107606862þþ256þMDþ28þ0.0þ22þ13þ14915þ20646þþ1420088476"                 
[3] "12-31-2014-21:13:24þ0þþ4382037þ8048852þ286811019þ60532482þ1þ728x90þ1594842þ107607879þþ256þAZþ27þ0.0þ7þ189þ13541þ85747þþ1420089204"                  
[4] "12-31-2014-21:35:04þ2453419343309111450þþ4479850þ8209375þ283421530þ58729626þ1þ1x2þ1685249þ110242400þþ256þCOþ5þ9.0þ22þ171þ14885þ80120þþ1420090504"   
[5] "12-31-2014-21:07:22þ2453942989496582297þþ4382037þ8048852þ284408073þ59306057þ1þ728x90þ1594842þ107607879þþ256þGAþ26þ0.0þ22þ26þ14981þ30093þþ1420088842"

now I've tried to parse it with strsplit(temp,"þ") and recieved:

[[1]]
[1] "12-31-2014-21:01:11‏0‏‏4382037‏8048852‏286809041‏60534606‏1‏0x0‏1594842‏107607879‏‏256‏MD‏28‏0.0‏22‏13‏14915‏20646‏‏1420088471"

[[2]]
[1] "12-31-2014-21:01:16‏0‏‏4382037‏8048852‏286810825‏60533173‏1‏160x600‏1594842‏107606862‏‏256‏MD‏28‏0.0‏22‏13‏14915‏20646‏‏1420088476"

[[3]]
[1] "12-31-2014-21:13:24‏0‏‏4382037‏8048852‏286811019‏60532482‏1‏728x90‏1594842‏107607879‏‏256‏AZ‏27‏0.0‏7‏189‏13541‏85747‏‏1420089204"

[[4]]
[1] "12-31-2014-21:35:04‏2453419343309111450‏‏4479850‏8209375‏283421530‏58729626‏1‏1x2‏1685249‏110242400‏‏256‏CO‏5‏9.0‏22‏171‏14885‏80120‏‏1420090504"

[[5]]
[1] "12-31-2014-21:07:22‏2453942989496582297‏‏4382037‏8048852‏284408073‏59306057‏1‏728x90‏1594842‏107607879‏‏256‏GA‏26‏0.0‏22‏26‏14981‏30093‏‏1420088842"

alternatively gsub("þ"," ",temp) just returns the original print.

This gets even "weirder" because if I'll copy and paste a line from the console and run:

temp1 <- "12-31-2014-21:01:11þ0þþ4382037þ8048852þ286809041þ60534606þ1þ0x0þ1594842þ107607879þþ256þMDþ28þ0.0þ22þ13þ14915þ20646þþ1420088471"

and then run strsplit(temp1,split="þ") I'll get:

[[1]]
 [1] "12-31-2014-21:01:11" "0"                   ""                    "4382037"             "8048852"             "286809041"          
 [7] "60534606"            "1"                   "0x0"                 "1594842"             "107607879"           ""                   
 [13] "256"                 "MD"                  "28"                  "0.0"                 "22"                  "13"                 
 [19] "14915"               "20646"               ""                    "1420088471"

my end goal is to get this parsing: a data frame of 22 different columns.

I've loaded this "sample.cav" file to the following link:

https://drive.google.com/file/d/0B4Krk6hHv0GoUUp4NkhmT1l2ZDQ/view?usp=sharing

I add this link because I'm afraid using copy paste from the site will just produce the final result that I'm getting, So in order to reproduce the problem a download of the file will be necessary (it's just these 5 lines).

Additional notes:

  1. I'm using a windows machine.

  2. Excel tool "text to columns" is successful in parsing this, however I have multiple files that are too large for excel to handle and I wish to continue my analysis in r.

Any help will be highly appreciated! Thanks guys!

like image 738
Yevgeny Tkach Avatar asked Mar 17 '23 11:03

Yevgeny Tkach


1 Answers

You have a few options.

First, you can use the unicode representation of the delimiter with the built-in read.csv and it'll work fine:

read.csv("sample.csv", sep="\xfe", header=FALSE)

You can also use the built-in read.delim with the same parameters.

Hadley's new readr package also has a read_delim that works, but it's adding a \r to column V22 (what is X22 from read.csv):

library(readr)
read_delim("sample.csv", delim="\xfe", col_names=FALSE)

The speedy fread from data.table also works:

library(data.table)
fread("sample.csv", sep="\xfe")

Finally, you could use strsplit and rbind.data.frame:

do.call(rbind.data.frame, 
        strsplit(readLines("sample.csv"), "\xfe", useBytes=TRUE, fixed=TRUE))

But you'll need to set proper column names and convert to usable types after the fact.

like image 116
hrbrmstr Avatar answered Apr 01 '23 01:04

hrbrmstr