How to extract number from character string?




I have a dataframe like this:

    > dns1
               variant_id         gene_id pval_nominal
21821  chr1_165656237_T_C_b38 ENSG00000143149  1.24119e-05
21822 chr1_165659346_C_CA_b38 ENSG00000143149  1.24119e-05
21823  chr1_165659350_A_G_b38 ENSG00000143149  1.24119e-05
21824  chr1_165659415_A_G_b38 ENSG00000143149  1.24119e-05
21825  chr1_165660430_T_C_b38 ENSG00000143149  1.24119e-05
21826  chr1_165661135_T_G_b38 ENSG00000143149  1.24119e-05
21827  chr1_165661238_C_T_b38 ENSG00000143149  1.24119e-05

I would like to strip everything from the variant_id column except the second number, so that, for example, chr1_165656237_T_C_b38 becomes 165656237.


I tried this:

    dns1$variant_id <- gsub('[^0-9.]', '', dns1$variant_id)

but with the above command I am getting this:

    > dns1
                     variant_id         gene_id pval_nominal
    21821    116565623738 ENSG00000143149  1.24119e-05
    21822    116565934638 ENSG00000143149  1.24119e-05
    21823    116565935038 ENSG00000143149  1.24119e-05
    21824    116565941538 ENSG00000143149  1.24119e-05

This keeps every digit in the variant_id column, so I get 116565623738 when what I need is 165656237. How can I match just the second number in this column?

2 Answers

You may use

    dns1$variant_id <- sub('^[^_]*_(\\d+).*', '\\1', dns1$variant_id)

Pattern details:


  • ^ - start of string
  • [^_]* - zero or more characters other than _
  • _ - an underscore
  • (\\d+) - Group 1: one or more digits (the backslash is doubled because this is an R string literal)
  • .* - the rest of the string.

The sub function replaces only the first match in each string. Because the pattern is anchored at the start and ends with .*, that match spans the whole string, so the \1 backreference in the replacement (written '\\1' in R) leaves only the contents of Group 1.
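
One behavior worth checking before overwriting the column: when the pattern does not match, sub returns the input string unchanged rather than NA. The following is a minimal sketch of that, not part of the original answer. Only the first input value comes from the question's data; the other two are made up for illustration, and the conversion to numeric is an optional extra the question did not ask for.

```r
# Illustrative inputs: only the first value comes from the question's data
ids <- c("chr1_165656237_T_C_b38", "chrX_1234_A_G_b38", "no_digits_here")

# Same pattern as above: keep only the digits after the first underscore
pos <- sub('^[^_]*_(\\d+).*', '\\1', ids)
print(pos)

# Strings that did not match come back unchanged, so flag them explicitly
matched <- grepl('^[^_]*_\\d+', ids)

# Convert only the matched entries to numeric; leave the rest as NA
pos_num <- rep(NA_real_, length(ids))
pos_num[matched] <- as.numeric(pos[matched])
print(pos_num)
```

Checking with grepl first avoids silently carrying an unparsed identifier forward as if it were a position.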

R demo:

    variant_id <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38")
    dns1 <- data.frame(variant_id)
    dns1$variant_id <- sub('^[^_]*_(\\d+).*', '\\1', dns1$variant_id)
    ##=> variant_id
    ## 1  165656237
    ## 2  165659346

I believe you can capture the digits as follows:

    gsub(".*?_([[:digit:]]+)_.*", "\\1", dns1$variant_id)
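
As a quick check, the sketch below runs both answers' patterns on the same input. The second value is a hypothetical identifier with nothing after the position, added only to show where the two patterns differ: this pattern requires an underscore after the digits, whereas the first answer's pattern does not.

```r
# First value is from the question; the second is hypothetical
ids <- c("chr1_165656237_T_C_b38", "chr1_165656237")

# Pattern from the first answer: digits after the first underscore
first <- sub('^[^_]*_(\\d+).*', '\\1', ids)

# Pattern from this answer: digits enclosed by underscores on both sides
second <- gsub(".*?_([[:digit:]]+)_.*", "\\1", ids)

print(first)
print(second)
```

For identifiers shaped like those in the question, both give the same result. On the hypothetical second value, the pattern with the trailing underscore does not match and the string is returned unchanged.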
