
Regular expression to convert raw text into columns of data

Tags: regex, r

I have raw text output from a program that I want to convert into a data frame. The text is not cleanly delimited, as shown below.

 10037    149439Special Event       11538.00       13542.59   2004.59
 10070     10071Weekday        8234.00        9244.87   1010.87
 10216     13463Weekend        145.00              0   -145.00

I am able to read the data into R using readLines() from the base package. How can I convert it into data that looks like the following (column names can be anything)?

 A        B         C              D              E          F
 10037    149439    Special Event  11538.00       13542.59   2004.59
 10070     10071    Weekday        8234.00         9244.87   1010.87
 10216     13463    Weekend        145.00                0   -145.00

What regular expression should I use to achieve this? I suspect a combination of regexec() and regmatches() is the right tool, but I am unable to come up with an expression that splits each line into the desired components.

sriramn asked Feb 13 '23 03:02



2 Answers

Here's a simple solution:

raw <- readLines("filename.txt")
data.frame(do.call(rbind, strsplit(raw, " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)))

#       X1     X2            X3       X4       X5      X6
# 1  10037 149439 Special Event 11538.00 13542.59 2004.59
# 2  10070  10071       Weekday  8234.00  9244.87 1010.87
# 3  10216  13463       Weekend   145.00        0 -145.00

The regular expression " {2,}|(?<=\\d)(?=[A-Z])" consists of two alternatives, combined with "|" (alternation: a split happens wherever either one matches).

  1. " {2,}" means at least two spaces. This will split between the different columns only, since the text in the third column has a single space.
  2. "(?<=\\d)(?=[A-Z])" denotes the positions that are preceded by a digit and followed by an uppercase letter. This is used to split between the second and the third column.
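
A minimal sketch of the same split applied to the three sample lines held in a character vector, so it runs without a file. The trimws() call and the type.convert() step are additions that are not part of the solution above: they assume each line starts with a single leading space (as in the sample) and that every column except the third should end up numeric.

```r
# Sample lines from the question, kept in memory instead of read from a file
raw <- c(
  " 10037    149439Special Event       11538.00       13542.59   2004.59",
  " 10070     10071Weekday        8234.00        9244.87   1010.87",
  " 10216     13463Weekend        145.00              0   -145.00"
)

# Split on runs of two or more spaces, or at a digit-to-uppercase boundary.
# trimws() drops the single leading space, which " {2,}" would not split on.
parts <- strsplit(trimws(raw), " {2,}|(?<=\\d)(?=[A-Z])", perl = TRUE)

# rbind() only lines up if every line produced the same number of fields
stopifnot(all(lengths(parts) == 6))

df <- data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Convert columns that look numeric; as.is = TRUE keeps text as character
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

The length check is there because a line with a different layout would otherwise be recycled silently by rbind().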
Sven Hohenstein answered Feb 15 '23 15:02



I created "txt.txt" from your data. From there, a regular expression does most of the work.

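> read <- readLines("txt.txt")
> # Numeric fields: split on letters and whitespace, then drop the empty pieces
> S <- strsplit(read, "[A-Za-z]|\\s")
> W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
> # Text field: keep only the letters (assumed definition of `col`; it
> # matches the output below)
> col <- gsub("[^A-Za-z]", "", read)
> D <- data.frame(W[,1:2], col, W[,3:5])
> names(D) <- LETTERS[seq_along(D)]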
> D
##       A      B            C        D        E       F
## 1 10037 149439 SpecialEvent 11538.00 13542.59 2004.59
## 2 10070  10071      Weekday  8234.00  9244.87 1010.87
## 3 10216  13463      Weekend   145.00        0 -145.00

Toss it all into some curly brackets and you've got yourself a function to parse your files.
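
As a rough sketch, the wrapped-up version could look like this. The function name `parse_report` and its argument are hypothetical, and the letters-only definition of `col` is an assumption chosen because it reproduces the output shown above.

```r
# Hypothetical wrapper around the steps above; takes the lines as a
# character vector, e.g. the result of readLines()
parse_report <- function(lines) {
  # Numeric fields: split on letters and whitespace, drop empty pieces
  S <- strsplit(lines, "[A-Za-z]|\\s")
  W <- do.call(rbind, lapply(S, function(x) x[nzchar(x)]))
  # Text field: keep only the letters (this drops the space in "Special Event")
  col <- gsub("[^A-Za-z]", "", lines)
  # drop = FALSE keeps the matrix shape even when there is a single line
  D <- data.frame(W[, 1:2, drop = FALSE], col, W[, 3:5, drop = FALSE],
                  stringsAsFactors = FALSE)
  names(D) <- LETTERS[seq_along(D)]
  D
}

# Usage on the sample lines; with a file: parse_report(readLines("txt.txt"))
sample_lines <- c(
  " 10037    149439Special Event       11538.00       13542.59   2004.59",
  " 10070     10071Weekday        8234.00        9244.87   1010.87",
  " 10216     13463Weekend        145.00              0   -145.00"
)
out <- parse_report(sample_lines)
out
```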

PS: If the space between "Special" and "Event" is important, please comment and I'll revise.
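
If the space does matter, one possible revision is to capture the fields with regexec() and regmatches() instead of splitting. This is only a sketch: the pattern `pat` is an assumption based on the three sample lines (two integers, a label made of letters and single spaces, then three numbers), and lines that do not match it would need separate handling.

```r
# Sample lines from the question; with a file, use readLines("txt.txt")
read <- c(
  " 10037    149439Special Event       11538.00       13542.59   2004.59",
  " 10070     10071Weekday        8234.00        9244.87   1010.87",
  " 10216     13463Weekend        145.00              0   -145.00"
)

# One capture group per column. The label group must start and end with a
# letter (so it needs at least two letters) and may contain spaces in between.
pat <- paste0(
  "^ *([0-9]+) +([0-9]+)",           # columns A and B
  "([A-Za-z][A-Za-z ]*[A-Za-z])",    # column C, e.g. "Special Event"
  " +(-?[0-9.]+) +(-?[0-9.]+) +(-?[0-9.]+) *$"  # columns D, E and F
)

# Each list element holds the full match followed by the six groups;
# a line that fails to match yields character(0)
m <- regmatches(read, regexec(pat, read))
stopifnot(all(lengths(m) == 7))

# Drop the full-match column and keep the six captured fields as text
D <- data.frame(do.call(rbind, m)[, -1, drop = FALSE], stringsAsFactors = FALSE)
names(D) <- LETTERS[seq_along(D)]
D
```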

Rich Scriven answered Feb 15 '23 17:02
