
Extracting last word from many data frame columns (R)

Tags:

r

I have a data frame that contains 3 columns. The data looks like this:

V1                V2               V3
Auto = Chevy      Engine = V6      Trans = Auto
Auto = Chevy      Engine = V8      Trans = Manual
Auto = Chevy      Engine = V10     Trans = Manual

I want the dataframe to look like this:

Auto       Engine  Trans
Chevy      V6      Auto
Chevy      V8      Manual
Chevy      V10     Manual

In other words, retrieve the string after the "=" in each cell and use the text before the "=" in the first row of each column as the column header. Alternatively, just extract the last word after the "=" and overwrite each column in place, without adding new columns.

Can this be done in R? Many thanks!

Fishing101 asked Jan 21 '17 03:01



2 Answers

Well, if you don't mind just using old-style (pre-Hadley) R, here's a solution:

> x <- as.data.frame(list(c('Auto = Chevy', 'Auto = Chevy', 'Auto = Chevy'),
+ c('Engine = V6', 'Engine = V8', 'Engine = V10'),
+ c('Trans = Auto', 'Trans = Manual', 'Trans = Manual')),
+ stringsAsFactors=FALSE)
> values <- lapply(x, gsub, pattern='.*= ', replacement='')
> new.names <- lapply(x, gsub, pattern=' =.*', replacement='')
> new.names <- lapply(new.names, unique)
> names(values) <- new.names
> new.frame <- as.data.frame(values, stringsAsFactors = FALSE)
> new.frame
   Auto Engine  Trans
1 Chevy     V6   Auto
2 Chevy     V8 Manual
3 Chevy    V10 Manual

It won't work for a data frame with many columns, but it will work for a narrow data frame with many rows.
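
For the in-place variant the question mentions (overwrite each column with the text after the "=" without adding columns), a minimal base-R sketch might look like the following. It assumes every cell has the form `key = value`, and the name `car_df` is made up for illustration:

```r
# Illustrative data; `car_df` is a hypothetical name, not from the question.
car_df <- data.frame(
  V1 = c("Auto = Chevy", "Auto = Chevy", "Auto = Chevy"),
  V2 = c("Engine = V6", "Engine = V8", "Engine = V10"),
  V3 = c("Trans = Auto", "Trans = Manual", "Trans = Manual"),
  stringsAsFactors = FALSE
)

# Take each header from the key (text before "=") in the first row.
keys <- vapply(car_df, function(col) sub("\\s*=.*$", "", col[1]),
               character(1), USE.NAMES = FALSE)

# car_df[] <- ... keeps the data frame and overwrites its columns in place.
car_df[] <- lapply(car_df, function(col) sub("^.*=\\s*", "", col))
names(car_df) <- keys

print(car_df)
```

Assigning through `car_df[]` keeps the object a data frame, so no intermediate list or second data frame is needed.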

JWLM answered Oct 21 '22 08:10


Or we could skip the stringr wrapper and call stringi directly, which has a highly optimized function for exactly this use case (most stringr functions wrap stringi functions):

library(stringi)
library(dplyr)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

mutate_all(df, stri_extract_last_words)
##      V1  V2     V3
## 1 Chevy  V6   Auto
## 2 Chevy  V8 Manual
## 3 Chevy V10 Manual
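
One thing worth keeping in mind: "last word" and "everything after the =" only coincide when each value is a single word. The base-R sketch below illustrates the difference on a made-up value ("Chevy Silverado" is not in the question's data), and it approximates the last word as the last whitespace-separated token:

```r
# Second element is a hypothetical multi-word value, for illustration only.
x <- c("Auto = Chevy", "Auto = Chevy Silverado")

# Everything after the "=" keeps multi-word values whole.
after_equals <- sub("^.*=\\s*", "", x)

# Last whitespace-separated token: an approximation of "last word".
last_word <- vapply(strsplit(x, "\\s+"), function(w) tail(w, 1), character(1))

print(after_equals)  # "Chevy" "Chevy Silverado"
print(last_word)     # "Chevy" "Silverado"
```

If values can contain spaces, splitting on the "=" is probably the safer choice.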

A more representative tidyverse version that also meets the "column name" requirement, which could actually break your R script if the columns aren't as you imagine:

library(stringi)
library(dplyr)
library(purrr)

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

mutate_all(df, stri_extract_last_words) %>%
  setNames(mutate_all(df, stri_extract_first_words) %>%
             distinct() %>%
             flatten_chr())
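
To make the "columns aren't as you imagine" risk concrete, here is a hedged base-R sketch of a check one could run before renaming: it verifies that each column holds exactly one distinct key. The data frame `check_df` and its mixed second column are invented for illustration:

```r
# Hypothetical data: V2 deliberately mixes two different keys.
check_df <- data.frame(
  V1 = c("Auto = Chevy", "Auto = Chevy"),
  V2 = c("Engine = V6", "Trans = Manual"),
  stringsAsFactors = FALSE
)

# Distinct keys (text before the "=") found in each column.
keys <- lapply(check_df, function(col) unique(sub("\\s*=.*$", "", col)))

# A column is safe to rename only if it holds exactly one distinct key.
ok <- lengths(keys) == 1
print(ok)

if (!all(ok)) {
  message("Columns with mixed keys: ",
          paste(names(check_df)[!ok], collapse = ", "))
}
```

Here `V1` passes and `V2` is flagged, so renaming could be skipped or stopped with an error instead of silently producing wrong headers.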

More tidyverse and stringi, this time parsing each cell into a name and a value. It relies on heavily assumed requirements that could actually break your R script if the columns aren't as you imagine:

library(stringi)
library(tidyverse)
library(purrrlyr) # by_row() lives here in current releases; it was moved out of purrr

read.table(text='V1,V2,V3
"Auto = Chevy","Engine = V6","Trans = Auto"
"Auto = Chevy","Engine = V8","Trans = Manual"
"Auto = Chevy","Engine = V10","Trans = Manual"',
sep=",", header=TRUE, stringsAsFactors=FALSE) -> df

by_row(df, function(x) {
  map(x, stri_match_all_regex, "(.*) = (.*)") %>%
    map(1) %>%
    map(~setNames(.[,3], .[,2])) %>%
    flatten_df()
}) %>%
  select(.out) %>%
  unnest()
## # A tibble: 3 × 3
##    Auto Engine  Trans
##   <chr>  <chr>  <chr>
## 1 Chevy     V6   Auto
## 2 Chevy     V8 Manual
## 3 Chevy    V10 Manual
hrbrmstr answered Oct 21 '22 08:10