
How to extract a number from a string in a dataframe and place it in a new column?

Tags:

dataframe

r

I have a simple dataframe:

df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"), value = c(0.51, 0.52, 0.56))

          test   value
1 test_A_1_1.txt  0.51
2 test_A_2_1.txt  0.52
3 test_A_3_1.txt  0.56

Expected output

I would like to copy the numbers at the end of each string in column 1 and place them in columns three and four, respectively, like this:

            test value  new new
1 test_A_1_1.txt  0.51   1  1
2 test_A_2_1.txt  0.52   2  1
3 test_A_3_1.txt  0.56   3  1

Attempt

Using the following code, I am able to extract the numbers from the string:

library(stringr)
as.numeric(str_extract_all("test_A_3_1.txt", "[0-9]+")[[1]])[1] # Extracts the first number (3)
as.numeric(str_extract_all("test_A_3_1.txt", "[0-9]+")[[1]])[2] # Extracts the second number (1)

I would like to apply this code on all the values of the first column:

library(tidyverse)
df %>% mutate(new = as.numeric(str_extract_all(df$test, "[0-9]+")[[1]])[1])

However, this leads to a column new that contains only the number 1. What am I doing wrong?

user213544 asked Nov 27 '22 04:11


2 Answers

We can use parse_number from readr, which returns the first number found in each string:

library(dplyr)
library(purrr)
library(stringr)
df %>%
    mutate(new = readr::parse_number(as.character(test)))

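Regarding the OP's issue: str_extract_all returns a list with one element per input string, and [[1]] selects only the first element, i.e. the matches from the first row. The trailing [1] then reduces this to a single number, which mutate recycles across the whole column. Because only the first run of one or more digits ([0-9]+, equivalently \\d+) is needed from each string, it is better to use str_extract, which is vectorised: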

df %>%
    mutate(new = as.numeric(str_extract(test, "[0-9]+")))
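The difference can be checked step by step outside of mutate. The following is a minimal sketch that uses the file names from the question; the helper variables all_matches, first_only and per_row are names chosen here for illustration only.

```r
library(stringr)

test <- c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt")

# str_extract_all() returns a list with one character vector per input string
all_matches <- str_extract_all(test, "[0-9]+")
length(all_matches)

# [[1]] keeps only the matches from the first string, and [1] keeps only
# the first of those, so the result is a single value that would be
# recycled down the whole column inside mutate()
first_only <- as.numeric(all_matches[[1]])[1]
first_only

# str_extract() is vectorised: it returns the first match of every string
per_row <- as.numeric(str_extract(test, "[0-9]+"))
per_row
```

Here first_only has length one, while per_row has one value per input string, which is the shape mutate needs for a new column.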

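If the output of str_extract_all is needed anyway, unlist the list into a vector and apply as.numeric to it (this works only when every string contains exactly one number, because otherwise the unlisted vector no longer matches the number of rows):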

df %>%
     mutate(new = as.numeric(unlist(str_extract_all(test, "[0-9]+"))))

If there are multiple instances per string, keep the result as a list column after converting each list element to numeric with map:

df %>% 
     mutate(new = map(str_extract_all(test, "[0-9]+"), as.numeric))
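Once the numbers sit in a list column, individual positions can be pulled out into ordinary numeric columns. This is a hedged sketch rather than part of the original answer: the column names nums, new1 and new2 are assumptions, and it relies on purrr's map_dbl accepting an integer position as its extractor.

```r
library(dplyr)
library(purrr)
library(stringr)

df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"),
                 value = c(0.51, 0.52, 0.56),
                 stringsAsFactors = FALSE)

res <- df %>%
  mutate(nums = map(str_extract_all(test, "[0-9]+"), as.numeric), # list column
         new1 = map_dbl(nums, 1),   # first number found in each row
         new2 = map_dbl(nums, 2))   # second number found in each row

res$new1
res$new2
```

If some strings may contain fewer than two numbers, map_dbl(nums, 2, .default = NA_real_) should fill those rows with NA instead of raising an error.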

NOTE: The str_extract based solution was first posted here.


In base R, we can use regexpr together with regmatches:

df$new <- as.numeric(regmatches(df$test, regexpr("\\d+", df$test)))
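regexpr only reports the first match per string. If every number is needed without loading extra packages, a similar sketch with gregexpr may help; the column names new1 and new2 are assumptions made for this illustration.

```r
df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"),
                 value = c(0.51, 0.52, 0.56),
                 stringsAsFactors = FALSE)

# gregexpr() locates every match; regmatches() then returns a list with
# one character vector of matches per input string
nums <- regmatches(df$test, gregexpr("[0-9]+", df$test))

# Pick the first and second match of each string (NA if it is missing)
df$new1 <- as.numeric(sapply(nums, `[`, 1))
df$new2 <- as.numeric(sapply(nums, `[`, 2))
```

Indexing with `[` rather than `[[` means a string with fewer matches yields NA instead of an error.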

Update

With the updated example, two numbers need to be extracted from each string. The first can be extracted with str_extract as before. The last can be extracted by adding a regex lookahead that matches only digits followed by a literal . and 'txt' (stri_extract_last from stringi can be used as well):

df %>% 
  mutate(new1 = as.numeric(str_extract(test, "\\d+")),
         new2 = as.numeric(str_extract(test, "\\d+(?=\\.txt)")))
#            test value new1 new2
#1 test_A_1_1.txt  0.51    1    1
#2 test_A_2_1.txt  0.52    2    1
#3 test_A_3_1.txt  0.56    3    1
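As an alternative to two separate str_extract calls, both numbers can be captured in one pass with regex groups. This is only a sketch under the assumption that every file name ends in _<number>_<number>.txt; the matrix name m and the column names are chosen here for illustration.

```r
library(stringr)

df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"),
                 value = c(0.51, 0.52, 0.56),
                 stringsAsFactors = FALSE)

# str_match() returns a character matrix: column 1 holds the whole match,
# columns 2 and 3 hold the two capture groups
m <- str_match(df$test, "_(\\d+)_(\\d+)\\.txt$")

df$new1 <- as.numeric(m[, 2])
df$new2 <- as.numeric(m[, 3])
```

Rows that do not fit the assumed pattern get NA in both new columns, which makes malformed file names easy to spot.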
akrun answered Nov 29 '22 17:11


Slightly modifying your existing code:

df %>% 
  mutate(new = as.integer(str_extract(test, "[0-9]+")))

Or simply, without dplyr:

df$new <- as.integer(str_extract(df$test, "[0-9]+"))
sindri_baldur answered Nov 29 '22 17:11