 

Importing poorly structured data in R

Tags:

r

I was given data in a text file which looks like:

Measurement: mc
Loop: 
var1=0, var2=-5, var3=1.8
values:
iteration     data
0             1.203
1             1.206
2             2.206
3             1.201
4             1.204
5             1.204
6             1.204
statistics:
max           1.206
min           1.201
mean          1.204
stddev        0.001
avgdev        0.001
failedtimes   0

Measurement: mc
Loop: 
var1=10, var2=-5, var3=1.8
values:
iteration     data
0             1.203
1             1.206
2             2.206
3             1.201
statistics:
max           1.206
min           1.201
mean          1.204
stddev        0.001
avgdev        0.001
failedtimes   0

I'm looking to get the data in a more normal format like:

var1, var2, var3, iteration,  data,
   0,   -5,  1.8,         0, 1.203, 
   0,   -5,  1.8,         1, 1.206,
   ...
  10,   -5,  1.8,         0, 1.203,

I'm having problems trying to parse data like this. Any help would be appreciated.

asked Jan 28 '17 by verigolfer




2 Answers

One way is to use a wee bit of simple regex and readLines to pull out the relevant rows.

Your data

txt <- 
  "Measurement: mc
Loop: 
var1=0, var2=-5, var3=1.8
values:
iteration     data
0             1.203
1             1.206
2             2.206
3             1.201
4             1.204
5             1.204
6             1.204
statistics:
max           1.206
min           1.201
mean          1.204
stddev        0.001
avgdev        0.001
failedtimes   0

Measurement: mc
Loop: 
var1=10, var2=-5, var3=1.8
values:
iteration     data
0             1.203
1             1.206
2             2.206
3             1.201
statistics:
max           1.206
min           1.201
mean          1.204
stddev        0.001
avgdev        0.001"


# Read in: you can pass the file path instead of textConnection
r = readLines(textConnection(txt))

# Find indices of relevant parts of string that you want to keep
id1 = grep("var", r)
id2 = grep("iteration", r)
id3 = grep("statistics", r)

# indices for iteration data
m = mapply(seq, id2, id3 - 1)

# Use read.table to parse the relevant rows
lst <- lapply(seq_along(m), function(x) 
                     cbind(read.table(text=r[id1][x], sep=","), #var data
                           read.table(text=r[m[[x]]], header=TRUE))) # iteration data

dat <- do.call(rbind, lst)

# Remove the var= text and convert to numeric
dat[] <- lapply(dat, function(x) as.numeric(gsub("var\\d+=", "", x)))
dat
#    V1 V2  V3 iteration  data
# 1   0 -5 1.8         0 1.203
# 2   0 -5 1.8         1 1.206
# 3   0 -5 1.8         2 2.206
# 4   0 -5 1.8         3 1.201
# 5   0 -5 1.8         4 1.204
# 6   0 -5 1.8         5 1.204
# 7   0 -5 1.8         6 1.204
# 8  10 -5 1.8         0 1.203
# 9  10 -5 1.8         1 1.206
# 10 10 -5 1.8         2 2.206
# 11 10 -5 1.8         3 1.201
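
Not essential, but if you want the column names to match the layout asked for in the question, you could rename the first three columns of dat afterwards:

names(dat)[1:3] <- c("var1", "var2", "var3")   # V1..V3 -> var1..var3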

It might actually be a bit clearer to split the data into sections and then apply a function, i.e.

sp <- split(r, cumsum(grepl("measure", r, ignore.case = TRUE)))

# Function to parse
fun <- function(x){
    id1 = grep("var", x)
    id2 = grep("iteration", x)
    id3 = grep("statistics", x)
    m = seq(id2, id3-1)

    cbind(read.table(text=x[id1], sep=","),
          read.table(text=x[m], header=TRUE))
}

lst <- lapply(sp, fun)

Then continue as before
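
That is, roughly the same rbind and clean-up steps as in the first version, applied to the new lst:

dat <- do.call(rbind, lst)   # row names pick up the group prefix; wrap lst in unname() if you prefer 1..n
dat[] <- lapply(dat, function(x) as.numeric(gsub("var\\d+=", "", x)))
dat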

answered Oct 18 '22 by user20650


Here is a pipeline that reads the data in and processes it. Assume the data is in L as per the Note at the end; in practice you would create L with something like L <- readLines("myfile.dat").

Trim leading and trailing whitespace using trimws (this step may not be needed, but it can't hurt in case the data does have whitespace at the beginning of lines). Then grep out the lines that begin with a digit or contain var, replace each of the characters v, a, r and = with a space, and replace each comma with a newline. That puts the text in a form that read.table can read into a two-column data frame: the first column is 1, 2, 3 followed by the iteration numbers, and the second column holds the values of var1, var2, var3 followed by the data values, repeated for each group. We form a grouping variable by identifying sequential runs using the expression cumsum(...) %/% 2. This assumes that there are at least 2 iterations (0 and 1) per group. (From the data shown it appears that this is the case, but if not it could be addressed with additional code as shown later.) Finally, split by the grouping expression and rework each split-out group into the required data frame.

library(purrr)

L %>%
  trimws %>%
  grep(pattern = "^\\d|var", value = TRUE) %>%
  chartr(old = "var=,", new = "    \n") %>%
  read.table(text = .) %>%
  split(cumsum(c(FALSE, diff(.$V1) != 1)) %/% 2) %>%
  map_df(function(x) data.frame(var1 = x[1, 2], var2 = x[2, 2], 
      var3 = x[3, 2],iteration = x[-(1:3), 1], data = x[-(1:3), 2])) 

giving:

   var1 var2 var3 iteration  data
1     0   -5  1.8         0 1.203
2     0   -5  1.8         1 1.206
3     0   -5  1.8         2 2.206
4     0   -5  1.8         3 1.201
5     0   -5  1.8         4 1.204
6     0   -5  1.8         5 1.204
7     0   -5  1.8         6 1.204
8    10   -5  1.8         0 1.203
9    10   -5  1.8         1 1.206
10   10   -5  1.8         2 2.206
11   10   -5  1.8         3 1.201

Variation: This variation of the code also handles the case where there is only one iteration, i.e. only iteration 0, and simplifies the grouping calculation at the expense of a few more lines of code. Here the two instances of -9999 can be any number that does not appear in the data.

L %>%
  grep(pattern = "^\\s*\\d|var", value = TRUE) %>%
  sub(pattern = "var", replacement = "-9999 var") %>%
  gsub(pattern = "[^0-9.,-]", replacement = " ") %>%
  gsub(pattern = ",", replacement = "\n") %>%
  strsplit("\\s+") %>%
  unlist %>%
  as.numeric %>%
  split(cumsum(. == -9999)) %>%
  map_df(function(x) {
    x <- t(matrix(x[-1], 2))
    data.frame(var1 = x[1, 2], var2 = x[2, 2], var3 = x[3, 2],
       iteration = x[-(1:3), 1], data = x[-(1:3), 2])
  })

dplyr/tidyr: We could alternatively use the dplyr and tidyr packages. vars has 3 columns (var1, var2 and var3) and one row per group. values has one row per group and a single column containing a nested two-column data frame of iteration and data, so each such row holds a data frame of many rows.

library(tidyr)
library(dplyr)

vars <- L %>%
    grep(pattern = "var", value = TRUE) %>%
    gsub(pattern = "[=,]", replacement = " ") %>%
    read.table(text = ., col.names = c(NA, "var1", NA, "var2", NA, "var3")) %>%
    select(var1, var2, var3)

values <- L %>%
    trimws %>%
    grep(pattern = "^\\d", value = TRUE) %>%
    read.table(text = ., col.names = c("iteration", "data")) %>%
    mutate(g = cumsum(iteration == 0)) %>%
    nest(-g) %>%
    select(-g)


cbind(vars, values) %>% unnest
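
Whichever variant you use, once the result is an ordinary data frame it can be written back out in the comma-separated layout shown in the question, e.g. (the object name and file name below are just placeholders):

out <- cbind(vars, values) %>% unnest
write.csv(out, "parsed.csv", row.names = FALSE)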

Note:

Lines <- "Measurement: mc
Loop: 
var1=0, var2=-5, var3=1.8
values:
iteration     data
0             1.203
1             1.206
2             2.206
3             1.201
4             1.204
5             1.204
6             1.204
statistics:
max           1.206
min           1.201
mean          1.204
stddev        0.001
avgdev        0.001
failedtimes   0

Measurement: mc
Loop: 
var1=10, var2=-5, var3=1.8
values:
iteration     data
0             1.203
1             1.206
2             2.206
3             1.201
statistics:
max           1.206
min           1.201
mean          1.204
stddev        0.001
avgdev        0.001
failedtimes   0"
L <- readLines(textConnection(Lines))
answered Oct 18 '22 by G. Grothendieck