
Grab data from strings in R using regular expression

Tags: regex, r

I have a string that looks like this:

"Interest.USD,Vol=[Integrated,(0,0.101),(0.2,0.108),(1,0.110),(2,0.106),
(3,0.102),(4,0.09),(5,0.091),(6,0.09128272)],Drift=[Integrated,(0.002,0.09),
(0.24,0.0007),(0.4,0.007),(1,-0.033),(2,-0.005),(3,-0.0041),
(4,-0.3505),(5,-0.65),(7,-0.08346),(8,-0.049),(9,-0.0613),(10,-0.019)],
Risk_Neutral=YES,Lambda=0.09,FX_Volatility=0.01,FX_Correlation=0.9"

I want to grab the data following the "Vol" and "Drift" in a matrix format like:

Vol matrix:

0,0.101
0.2,0.108
1,0.110
2,0.106
3,0.102
4,0.09
5,0.091
6,0.09128272

and also a single value, like 0.09 for Lambda. I guess I should use a regular expression, but I'm not that familiar with them. Any suggestions? :)

P.S. I tried using:

str_extract_all(text,'[ .+? ]')

to try to get the data between [ and ], but it returns ".".

Louisyan asked Mar 20 '23 07:03



1 Answer

Here's a way to extract those values in R. Let's assume the string you posted is stored in a variable named a. To make things easier, I'm going to use a helper function, regcapturedmatches(), which is not part of base R and has to be defined before running the code. Then you can do:

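# first pass: capture the section name and the list inside the brackets
expr <- "(Vol|Drift)=\\[Integrated,([^\\]]*)\\]"
mm <- regcapturedmatches(a, gregexpr(expr, a, perl = TRUE))[[1]]
# second pass: capture the two numbers inside each "(x,y)" pair
expr <- "\\(([^,]+),([^,]+)\\)"
vv <- regcapturedmatches(mm[,2], gregexpr(expr, mm[,2], perl = TRUE))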

First we do a pass to extract the Vol and Drift sections into mm, and then we split the comma-delimited pairs in each section into vv. Now we can combine the data into one large data.frame:

tt <- Map(data.frame, col=mm[,1], val=lapply(vv, 
    function(x) {class(x)<-"numeric"; x}))
dd<-do.call(rbind, unname(tt))

In the end dd will look like

     col  val.1       val.2
1    Vol  0.000  0.10100000
2    Vol  0.200  0.10800000
3    Vol  1.000  0.11000000
4    Vol  2.000  0.10600000
5    Vol  3.000  0.10200000
6    Vol  4.000  0.09000000
7    Vol  5.000  0.09100000
8    Vol  6.000  0.09128272
9  Drift  0.002  0.09000000
10 Drift  0.240  0.00070000
11 Drift  0.400  0.00700000
12 Drift  1.000 -0.03300000
13 Drift  2.000 -0.00500000
14 Drift  3.000 -0.00410000
15 Drift  4.000 -0.35050000
16 Drift  5.000 -0.65000000
17 Drift  7.000 -0.08346000
18 Drift  8.000 -0.04900000
19 Drift  9.000 -0.06130000
20 Drift 10.000 -0.01900000

This method allows for any number of repeated values in each of those sections.
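
If the helper is not available, the same two-pass idea can be sketched with base R only. This is a minimal sketch, not the code used above: it assumes a shortened version of the string, and the names sec_expr, pair_expr, to_matrix and mats are made up for illustration. It relies on regmatches() together with gregexpr() (to find every match) and regexec() (to get the capture groups of each match).

```r
# Shortened version of the string from the question (assumption for the demo)
a <- paste0("Interest.USD,Vol=[Integrated,(0,0.101),(0.2,0.108)],",
            "Drift=[Integrated,(0.002,0.09),(1,-0.033)],",
            "Risk_Neutral=YES,Lambda=0.09")

# Pass 1: find each "Name=[Integrated,...]" section, then split it into
# the section name and the text between "[Integrated," and "]"
sec_expr  <- "(Vol|Drift)=\\[Integrated,([^]]*)\\]"
sec_hits  <- regmatches(a, gregexpr(sec_expr, a))[[1]]
sec_parts <- regmatches(sec_hits, regexec(sec_expr, sec_hits))
sec_names <- vapply(sec_parts, `[`, character(1), 2)
sec_body  <- vapply(sec_parts, `[`, character(1), 3)

# Pass 2: pull every "(x,y)" pair out of a section body
pair_expr <- "\\(([^,]+),([^)]+)\\)"
to_matrix <- function(body) {
  pairs <- regmatches(body, gregexpr(pair_expr, body))[[1]]
  nums  <- regmatches(pairs, regexec(pair_expr, pairs))
  # each element is c(full match, x, y); keep x and y as numbers,
  # then transpose so that every pair becomes one row
  t(vapply(nums, function(p) as.numeric(p[2:3]), numeric(2)))
}

# Named list with one two-column numeric matrix per section
mats <- setNames(lapply(sec_body, to_matrix), sec_names)
mats$Vol
```

Here mats$Vol and mats$Drift are plain numeric matrices with one row per pair, so the number of pairs in each section can still vary freely.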

If you just want simple matrices, then

# name each numeric matrix after its section; only the values need converting
Map(function(name, m) m, mm[,1], 
    lapply(vv, function(x) {class(x)<-"numeric"; x}))

will give you a named list of the matrices.
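
The question also asks for single values such as Lambda, which the code above does not cover. One possible sketch in base R follows. The function name get_scalar is hypothetical, the string is shortened, and it assumes the key contains no regex metacharacters and that the value is a plain decimal number.

```r
# Shortened version of the string from the question (assumption for the demo)
a <- "Drift=[Integrated,(1,-0.033)],Risk_Neutral=YES,Lambda=0.09,FX_Volatility=0.01"

# Hypothetical helper: return the number that follows "<key>=" in s,
# or NA if the key is missing or its value is not numeric
get_scalar <- function(key, s) {
  # the key must start the string or follow a comma, so that "Lambda"
  # cannot match inside a longer name
  expr <- paste0("(^|,)", key, "=(-?[0-9.]+)")
  hit  <- regmatches(s, regexec(expr, s))[[1]]
  if (length(hit) == 0) return(NA_real_)
  as.numeric(hit[3])   # hit is c(full match, separator, value)
}

get_scalar("Lambda", a)         # 0.09
get_scalar("FX_Volatility", a)  # 0.01
```

A key whose value is not numeric, such as Risk_Neutral=YES, yields NA with this pattern.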

MrFlick answered Apr 27 '23 21:04
