
How to extract number from character string?

Tags:

regex

r

I have a dataframe like this:

    > dns1
               variant_id         gene_id pval_nominal
21821  chr1_165656237_T_C_b38 ENSG00000143149  1.24119e-05
21822 chr1_165659346_C_CA_b38 ENSG00000143149  1.24119e-05
21823  chr1_165659350_A_G_b38 ENSG00000143149  1.24119e-05
21824  chr1_165659415_A_G_b38 ENSG00000143149  1.24119e-05
21825  chr1_165660430_T_C_b38 ENSG00000143149  1.24119e-05
21826  chr1_165661135_T_G_b38 ENSG00000143149  1.24119e-05
21827  chr1_165661238_C_T_b38 ENSG00000143149  1.24119e-05
...

I would like to strip everything from the variant_id column except the second number (the one after the first underscore), so that it looks like this:

165656237
165659346
165659350
165659415
165660430
165661135
165661238
...

I tried this:

dns1$variant_id <- gsub('[^0-9.]', '', dns1$variant_id)

but with the above command I am getting this:

> dns1
      variant_id         gene_id pval_nominal
21821    116565623738 ENSG00000143149  1.24119e-05
21822    116565934638 ENSG00000143149  1.24119e-05
21823    116565935038 ENSG00000143149  1.24119e-05
21824    116565941538 ENSG00000143149  1.24119e-05
...

This keeps every digit in the variant_id column, including the 1 from chr1 and the 38 from b38, so I get 116565623738 where I need 165656237. So the question is: how do I match only the second number in the variant_id column?

anikaM asked Jan 08 '19 20:01


2 Answers

You may use

dns1$variant_id <- sub('^[^_]*_(\\d+).*', '\\1', dns1$variant_id)

Details

  • ^ - start of string
  • [^_]* - zero or more characters other than _
  • _ - an underscore
  • (\\d+) - Group 1: one or more digits (the backslash is doubled because the pattern is an R string literal)
  • .* - the rest of the string.

The sub function replaces only the first match in each string. Because the pattern here matches the whole string, the \\1 backreference in the replacement swaps the entire string for the contents of Group 1.
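To make the difference from the question's attempt concrete, here is a minimal sketch on a single value copied from the question's data. The variable names x, all_digits and second_number are made up for illustration.

```r
# One sample value taken from the question's data
x <- "chr1_165656237_T_C_b38"

# The question's attempt: deletes every character that is not a digit or a dot,
# so the digits from "chr1" and "b38" are glued onto the position
all_digits <- gsub("[^0-9.]", "", x)

# The capture-group approach: the pattern consumes the whole string
# and the replacement keeps only Group 1
second_number <- sub("^[^_]*_(\\d+).*", "\\1", x)

print(all_digits)     # "116565623738"
print(second_number)  # "165656237"
```

The first result is the unwanted value shown in the question; the second is the value the question asks for.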

Online R demo:

variant_id <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38")
dns1 <- data.frame(variant_id)
dns1$variant_id <- sub('^[^_]*_(\\d+).*', '\\1', dns1$variant_id)
dns1
##=> variant_id
## 1  165656237
## 2  165659346
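Note that sub returns character strings, not numbers. If a numeric column is what you are after, a conversion step along these lines might help; this is only a sketch, and pos_chr and pos_num are hypothetical names, not part of the original answer.

```r
# Two sample values taken from the question's data
variant_id <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38")

# sub() yields a character vector holding only the captured digits
pos_chr <- sub("^[^_]*_(\\d+).*", "\\1", variant_id)

# Convert to numeric if the positions are needed for arithmetic or sorting
pos_num <- as.numeric(pos_chr)

str(pos_chr)
str(pos_num)
```

Keep in mind that sub leaves a string unchanged when the pattern does not match, so as.numeric would then return NA (with a warning) for any value that does not follow the chr_position_... layout.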

Wiktor Stribiżew answered Nov 14 '22 14:11


I believe you can capture the digits as follows (assign the result back to dns1$variant_id if you want to update the column):

gsub(".*?_([[:digit:]]+)_.*", "\\1", dns1$variant_id)
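As a quick sanity check, the sketch below runs this pattern and the sub pattern from the other answer on two values from the question and compares the results. The names from_sub and from_gsub are illustrative only.

```r
# Two sample values taken from the question's data
variant_id <- c("chr1_165656237_T_C_b38", "chr1_165659346_C_CA_b38")

# Pattern from the other answer: anchored at the start, first underscore only
from_sub <- sub("^[^_]*_(\\d+).*", "\\1", variant_id)

# Pattern from this answer: lazy prefix, digits enclosed by underscores
from_gsub <- gsub(".*?_([[:digit:]]+)_.*", "\\1", variant_id)

print(from_gsub)
print(identical(from_sub, from_gsub))
```

On this sample both patterns give the same character vector. They can differ on other inputs: this pattern requires an underscore after the digits, while the other one only requires digits directly after the first underscore.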

Russ Hyde answered Nov 14 '22 13:11