Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove part of a string in dataframe column (R)

Tags:

r

I have a dataframe (df) with a column (Col2) like this:

Col1                 Col2                   Col3
  1   C607989_booboobear_Nation               A
  2   C607989_booboobear_Nation               B
  3   C607989_booboobear_Nation               C
  4   C607989_booboobear_Nation               D
  5   C607989_booboobear_Nation               E
  6   C607989_booboobear_Nation               F

I want to extract just the number in Col2

Col1              Col2                    Col3
  1              607989                     A
  2              607989                     B
  3              607989                     C
  4              607989                     D
  5              607989                     E
  6              607989                     F

I have tried things like:

gsub("^.*?_","_",df$Col2)

but it's not working.

like image 387
Cybernetic Avatar asked Aug 13 '14 02:08

Cybernetic


People also ask

How do I remove part of a string in a column in R?

Remove Specific Character from StringUse gsub() function to remove a character from a string or text in R. This is an R base function that takes 3 arguments, first, the character to look for, second, the value to replace with, in our case we use blank string, and the third input string were to replace.

How do I remove part of a column name in R?

To remove a character in an R data frame column, we can use gsub function which will replace the character with blank. For example, if we have a data frame called df that contains a character column say x which has a character ID in each value then it can be removed by using the command gsub("ID","",as.


2 Answers

If your string is not too fancy/complex, it might be easiest to do something like:

gsub("C([0-9]+)_.*", "\\1", df$Col2)
# [1] "607989" "607989" "607989" "607989" "607989" "607989"

Start with a "C", followed by digits, followed by an underscore and then anything else. Digits are captured with (), and the replacement is set to that capture group (\\1).

like image 174
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 21 '22 06:10

A5C1D2H2I1M1N2O1R2T1


An alternate approach using qdap::genXtract that grabs strings between a left and right boundary. Here I use C and _ for the left and right bounds:

## Your data in a better form for sharing
dat <- structure(list(Col1 = c("1", "2", "3", "4", "5", "6"), Col2 = c("C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation", "C607989_booboobear_Nation", 
    "C607989_booboobear_Nation", "C607989_booboobear_Nation"), Col3 = c("A", 
    "B", "C", "D", "E", "F")), .Names = c("Col1", "Col2", "Col3"), row.names = c(NA, 
    -6L), class = "data.frame")

library(qdap)
dat[[2]] <- unlist(genXtract(dat[[2]], "C", "_"))
dat

##   Col1   Col2 Col3
## 1    1 607989    A
## 2    2 607989    B
## 3    3 607989    C
## 4    4 607989    D
## 5    5 607989    E
## 6    6 607989    F
like image 22
Tyler Rinker Avatar answered Oct 21 '22 08:10

Tyler Rinker