Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

From Python to R - DataFrame from string

Tags:

python

regex

r

I have the following working example in Python which takes a string, uses a dict comprehension and a regular expression on it and finally generates a dataframe from it:

import re, pandas as pd

junk = """total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;"""

rx = re.compile(r'(?P<key>\w+)=(?P<value>[^;]+)')
records = [{m.group('key'): m.group('value') 
            for m in rx.finditer(line)} 
            for line in junk.split("\n")]
df = pd.DataFrame(records)
print(df)

This yields

  buffers  cached    free  shared   total    used
0   304MB  1059MB  5711MB     0MB  7871MB  2159MB
1    30MB  1059MB    71MB  3159MB  5751MB     5MB
2    30MB  1059MB   109MB  3159MB  5751MB     5MB


Now how the ... can I do the same thing in R ?
I messed around with lapply and regmatches but to no avail. Additionally, how would I do this with missing values?
like image 981
Jan Avatar asked Mar 06 '23 15:03

Jan


1 Answers

A purrr option:

library(purrr)

'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;' %>% 
    strsplit('\n') %>% .[[1]] %>%    # separate lines into character vector
    strsplit(';') %>%     # separate each line into a list of key-value pairs
    map(strsplit, '=') %>%    # split key-value pairs into length-2 sublists
    map(transpose) %>%    # flip list of key-value pairs to list of keys and values
    map_dfr(~set_names(.x[[2]], .x[[1]]))    # set names of values to keys and simplify to data frame
#> # A tibble: 3 x 6
#>   total  free   used   shared buffers cached
#>   <chr>  <chr>  <chr>  <chr>  <chr>   <chr> 
#> 1 7871MB 5711MB 2159MB 0MB    304MB   1059MB
#> 2 5751MB 71MB   5MB    3159MB 30MB    1059MB
#> 3 5751MB 109MB  5MB    3159MB 30MB    1059MB

or a more data-frame-centric option:

library(tidyverse)

# put text in data frame
data_frame(text = 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;') %>% 
    separate_rows(text, sep = '\n') %>%    # separate lines into separate rows
    rowid_to_column('line') %>%    # add index for each line to help spreading later
    separate_rows(text, sep = ';') %>%    # separate each line into key-value pairs
    filter(text != '') %>%    # drop extra entries from superfluous semicolons
    separate(text, c('key', 'value')) %>%    # separate keys and values into columns
    spread(key, value) %>%    # reshape to wide form
    select(-line)    # drop line index column
#> # A tibble: 3 x 6
#>   buffers cached free   shared total  used  
#>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 304MB   1059MB 5711MB 0MB    7871MB 2159MB
#> 2 30MB    1059MB 71MB   3159MB 5751MB 5MB   
#> 3 30MB    1059MB 109MB  3159MB 5751MB 5MB

If you want to avoid packages, you can hack it through read.dcf, which reads Debian Control Format (like R package DESCRIPTION files), which is just key-value pairs. DCF uses : instead of = and \n instead of ;, though, so you'll need to do a little gsubing first:

junk <- 'total=7871MB;free=5711MB;used=2159MB;shared=0MB;buffers=304MB;cached=1059MB;
free=71MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;cached=1059MB;
cached=1059MB;total=5751MB;shared=3159MB;used=5MB;buffers=30MB;free=109MB;'

junk <- gsub('=', ':', junk) 
junk <- gsub(';', '\n', junk)
mat <- read.dcf(textConnection(junk))
mat
#>      total    free     used     shared   buffers cached  
#> [1,] "7871MB" "5711MB" "2159MB" "0MB"    "304MB" "1059MB"
#> [2,] "5751MB" "71MB"   "5MB"    "3159MB" "30MB"  "1059MB"
#> [3,] "5751MB" "109MB"  "5MB"    "3159MB" "30MB"  "1059MB"

It returns a matrix, but it's well-formed and easy to convert to a proper data.frame:

df <- as.data.frame(mat, stringsAsFactors = FALSE)
df
#>    total   free   used shared buffers cached
#> 1 7871MB 5711MB 2159MB    0MB   304MB 1059MB
#> 2 5751MB   71MB    5MB 3159MB    30MB 1059MB
#> 3 5751MB  109MB    5MB 3159MB    30MB 1059MB
like image 150
alistaire Avatar answered Mar 10 '23 11:03

alistaire