I'm trying to read CSV files (originally log files with the extension changed) that are thorn-delimited.
However, simple readLines or read.delim calls don't accept "þ" as a separator, and the default arguments result in a one-column data.frame.
When I try to print the result I get:
temp <- readLines("sample.csv")
temp
[1] "12-31-2014-21:01:11þ0þþ4382037þ8048852þ286809041þ60534606þ1þ0x0þ1594842þ107607879þþ256þMDþ28þ0.0þ22þ13þ14915þ20646þþ1420088471"
[2] "12-31-2014-21:01:16þ0þþ4382037þ8048852þ286810825þ60533173þ1þ160x600þ1594842þ107606862þþ256þMDþ28þ0.0þ22þ13þ14915þ20646þþ1420088476"
[3] "12-31-2014-21:13:24þ0þþ4382037þ8048852þ286811019þ60532482þ1þ728x90þ1594842þ107607879þþ256þAZþ27þ0.0þ7þ189þ13541þ85747þþ1420089204"
[4] "12-31-2014-21:35:04þ2453419343309111450þþ4479850þ8209375þ283421530þ58729626þ1þ1x2þ1685249þ110242400þþ256þCOþ5þ9.0þ22þ171þ14885þ80120þþ1420090504"
[5] "12-31-2014-21:07:22þ2453942989496582297þþ4382037þ8048852þ284408073þ59306057þ1þ728x90þ1594842þ107607879þþ256þGAþ26þ0.0þ22þ26þ14981þ30093þþ1420088842"
Now I've tried to parse it with strsplit(temp, "þ") and received:
[[1]]
[1] "12-31-2014-21:01:110438203780488522868090416053460610x01594842107607879256MD280.0221314915206461420088471"
[[2]]
[1] "12-31-2014-21:01:16043820378048852286810825605331731160x6001594842107606862256MD280.0221314915206461420088476"
[[3]]
[1] "12-31-2014-21:13:24043820378048852286811019605324821728x901594842107607879256AZ270.0718913541857471420089204"
[[4]]
[1] "12-31-2014-21:35:042453419343309111450447985082093752834215305872962611x21685249110242400256CO59.02217114885801201420090504"
[[5]]
[1] "12-31-2014-21:07:22245394298949658229743820378048852284408073593060571728x901594842107607879256GA260.0222614981300931420088842"
Alternatively, gsub("þ", " ", temp) just returns the original print.
This gets even weirder: if I copy and paste a line from the console and run:
temp1 <- "12-31-2014-21:01:11þ0þþ4382037þ8048852þ286809041þ60534606þ1þ0x0þ1594842þ107607879þþ256þMDþ28þ0.0þ22þ13þ14915þ20646þþ1420088471"
and then run strsplit(temp1, split="þ"), I get:
[[1]]
[1] "12-31-2014-21:01:11" "0" "" "4382037" "8048852" "286809041"
[7] "60534606" "1" "0x0" "1594842" "107607879" ""
[13] "256" "MD" "28" "0.0" "22" "13"
[19] "14915" "20646" "" "1420088471"
My end goal is to get this parsing right: a data frame with 22 columns.
I've uploaded this "sample.csv" file to the following link:
https://drive.google.com/file/d/0B4Krk6hHv0GoUUp4NkhmT1l2ZDQ/view?usp=sharing
I add this link because I'm afraid copy-pasting from the site will just produce the final result I'm getting, so in order to reproduce the problem you'll need to download the file (it's just these 5 lines).
Additional notes:
I'm using a Windows machine.
Excel's "Text to Columns" tool parses this successfully, but I have multiple files that are too large for Excel to handle, and I want to continue my analysis in R.
Any help will be highly appreciated! Thanks guys!
You have a few options.
First, you can use the escaped representation of the delimiter ("\xfe") with the built-in read.csv and it'll work fine:
read.csv("sample.csv", sep="\xfe", header=FALSE)
You can also use the built-in read.delim with the same parameters.
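For completeness, that call would look something like this (same sep and header arguments as above):
read.delim("sample.csv", sep="\xfe", header=FALSE)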
Hadley's new readr package also has a read_delim that works, but it adds a \r to column X22 (which read.csv calls V22):
library(readr)
read_delim("sample.csv", delim="\xfe", col_names=FALSE)
The speedy fread from data.table also works:
library(data.table)
fread("sample.csv", sep="\xfe")
Finally, you could use strsplit and rbind.data.frame:
do.call(rbind.data.frame,
strsplit(readLines("sample.csv"), "\xfe", useBytes=TRUE, fixed=TRUE))
But you'll need to set proper column names and convert to usable types after the fact.
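A minimal sketch of that cleanup step (the V1..V22 names are placeholders; type.convert guesses numeric vs. character from the strings):
df <- do.call(rbind.data.frame,
              strsplit(readLines("sample.csv"), "\xfe", useBytes=TRUE, fixed=TRUE))
names(df) <- paste0("V", seq_along(df))   # placeholder column names
df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))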