I have raw text output from a program that I want to convert into a data frame. The text file is unformatted, as shown below.
10037  149439Special Event  11538.00  13542.59  2004.59
10070  10071Weekday  8234.00  9244.87  1010.87
10216  13463Weekend  145.00  0  -145.00
I am able to read the data into R using readLines() from the base package. How can I convert this into data that looks like this (column names can be anything)?
A      B       C              D         E         F
10037  149439  Special Event  11538.00  13542.59  2004.59
10070  10071   Weekday        8234.00   9244.87   1010.87
10216  13463   Weekend        145.00    0         -145.00
What regular expression should I use to achieve this? I know that this is an ideal case for a combination of regexec() and regmatches(), but I am unable to come up with an expression that splits each line into the desired components.
Here's a simple solution:
raw <- readLines("filename.txt")
data.frame(do.call(rbind, strsplit(raw, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)))
#      X1     X2            X3       X4       X5      X6
# 1 10037 149439 Special Event 11538.00 13542.59 2004.59
# 2 10070  10071       Weekday  8234.00  9244.87 1010.87
# 3 10216  13463       Weekend   145.00        0 -145.00
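To see what each alternative in the pattern above contributes, the two halves can be tried on their own. This is only a small sketch: the sample line is taken from the question, and the two-space gaps between columns are an assumption about the raw file's layout.

```r
# One sample line from the question. The runs of two spaces between columns
# are an assumption about how the raw file is laid out.
line <- "10037  149439Special Event  11538.00  13542.59  2004.59"

# First alternative only: split on runs of two or more spaces.
# The second and third columns stay glued together.
by_spaces <- strsplit(line, " {2,}", perl = TRUE)[[1]]

# Second alternative only: split at the digit/uppercase-letter boundary.
# The line is cut once, between "149439" and "Special".
by_boundary <- strsplit(line, "(?<=\\d)(?=[A-Z])", perl = TRUE)[[1]]

# Both alternatives together yield the six fields.
both <- strsplit(line, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)[[1]]

print(by_spaces)
print(by_boundary)
print(both)
```

Note that perl = TRUE is required here, because lookbehind and lookahead are not available in R's default regular expression engine.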
The regular expression " {2,}|(?<=\\d)(?=[A-Z])" consists of two parts, combined with "|" (logical or).
" {2,}" means at least two spaces. This will split between the different columns only, since the text in the third column has a single space.
"(?<=\\d)(?=[A-Z])" denotes the positions that are preceded by a digit and followed by an uppercase letter. This is used to split between the second and the third column.

I created "txt.txt" from your data. Then a little work with a regular expression does the rest.
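> read <- readLines("txt.txt")
> ## `col` is not defined in the post; assumed here: keep only the letters of each line
> col <- gsub("[^A-Za-z]", "", read)
> S <- strsplit(read, "[A-Za-z]|\\s")
> W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
> D <- data.frame(W[,1:2], col, W[,3:5])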
> names(D) <- LETTERS[seq(D)]
> D
##       A      B            C        D        E       F
## 1 10037 149439 SpecialEvent 11538.00 13542.59 2004.59
## 2 10070  10071      Weekday  8234.00  9244.87 1010.87
## 3 10216  13463      Weekend   145.00        0 -145.00
Toss it all into some curly brackets and you've got yourself a function to parse your files.
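As a rough sketch of what that function might look like: parse_report() is a hypothetical name, and the way the text column is recovered (stripping everything except letters) is an assumption, chosen because it reproduces the "SpecialEvent" value in the output above.

```r
# parse_report() is a hypothetical helper name, not part of any package.
parse_report <- function(lines) {
  # Assumed step: keep only the letters of each line to recover the text column.
  col <- gsub("[^A-Za-z]", "", lines)
  # Split on every letter or whitespace character, then drop the empty pieces,
  # which leaves the five numeric fields of each line.
  S <- strsplit(lines, "[A-Za-z]|\\s")
  W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
  # drop = FALSE keeps the matrix shape even when there is a single line.
  D <- data.frame(W[, 1:2, drop = FALSE], col, W[, 3:5, drop = FALSE],
                  stringsAsFactors = FALSE)
  names(D) <- LETTERS[seq_along(D)]
  D
}

# Sample lines from the question, inlined so the sketch runs without a file.
raw <- c(
  "10037  149439Special Event  11538.00  13542.59  2004.59",
  "10070  10071Weekday  8234.00  9244.87  1010.87",
  "10216  13463Weekend  145.00  0  -145.00"
)
D <- parse_report(raw)
print(D)
```

With a real file, the call would be parse_report(readLines("txt.txt")). All columns come back as character strings.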
PS: If the space between "Special" and "Event" is important, please comment and I'll revise.
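If the space does matter, one possible route is the regexec()/regmatches() combination mentioned in the question. The following is a minimal sketch, not a tested general parser: the pattern is my own assumption about the layout (two integers, a text field of letters and single spaces, then three space-separated tokens), and the sample lines are inlined from the question.

```r
# Sample lines from the question, inlined so the sketch runs without a file.
raw <- c(
  "10037  149439Special Event  11538.00  13542.59  2004.59",
  "10070  10071Weekday  8234.00  9244.87  1010.87",
  "10216  13463Weekend  145.00  0  -145.00"
)

# One capture group per column. The text column starts right after the
# second integer and must end in a letter, so inner spaces are kept.
pattern <- "^([0-9]+) +([0-9]+)([A-Za-z ]*[A-Za-z]) +([^ ]+) +([^ ]+) +([^ ]+) *$"

# regmatches() returns the full match followed by the six groups per line;
# x[-1] drops the full match.
m <- regmatches(raw, regexec(pattern, raw))
fields <- do.call(rbind, lapply(m, function(x) x[-1]))

df <- data.frame(fields, stringsAsFactors = FALSE)
names(df) <- LETTERS[seq_along(df)]

# Optional: convert the character columns to integer/numeric where possible.
df[] <- lapply(df, type.convert, as.is = TRUE)
print(df)
```

Because the text column is delimited by the surrounding digits rather than by wide gaps, this pattern does not depend on the columns being separated by two or more spaces. A line that does not match the pattern yields an empty result from regmatches(), so malformed lines would need to be filtered out before the rbind step.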