How to extract text within a string of text

I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.

I have a file with multiple rows (all one column). Each row has data that looks something like this:

1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom

2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate

etc...

What I want is some way to extract every string of digits that follows "gi|" and ends at the next "|". Some rows might contain as many as 5 gi numbers; others will have just one.

I would like the output to look something like this:

297848936,297338191

168010496

etc...

Brandon asked Aug 16 '11 23:08


2 Answers

Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts of the pattern inside parentheses) and joins them with whatever separator string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/

You would call it like this, assuming that first string is in A1:

=RegexExtract(A1,"gi[|](\d+)[|]") 

Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", the first line in your question gives this result:

297848936, 297338191 

Just run this down the column and you're all done!

Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String

    Dim allMatches As Object
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    Dim i As Long, j As Long
    Dim result As String

    RE.pattern = extract_what
    RE.Global = True
    Set allMatches = RE.Execute(text)

    For i = 0 To allMatches.count - 1
        For j = 0 To allMatches.Item(i).submatches.count - 1
            result = result & (separator & allMatches.Item(i).submatches.Item(j))
        Next
    Next

    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If

    RegexExtract = result

End Function
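If you would rather fill the results in with a macro than copy the formula down the column, something like the following might do. It is only a sketch: the macro name FillGiNumbers is made up, it assumes the data starts in A1 of the active sheet, that column B is free for output, and that RegexExtract sits in the same standard module. It passes "," as the separator to match the output format in the question.

```vba
' Hypothetical helper macro, not part of the original answer.
' Assumes RegexExtract (above) is in the same standard module,
' the data is in column A of the active sheet, and column B is free.
Sub FillGiNumbers()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long

    Set ws = ActiveSheet
    ' Last used row in column A
    lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

    For r = 1 To lastRow
        ' Format the target cell as text first, so that a result such as
        ' "297848936,297338191" is stored as typed rather than as a number
        ws.Cells(r, "B").NumberFormat = "@"
        ws.Cells(r, "B").Value = _
            RegexExtract(CStr(ws.Cells(r, "A").Value), "gi[|](\d+)[|]", ",")
    Next r
End Sub
```

Cells in column A that contain an error value are not handled here; CStr would raise a type mismatch on them.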
aevanko answered Sep 24 '22 12:09


Here it is (assuming data is in column A)

=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2), FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 )) 

Not the nicest formula, but it will work to extract the number.

I just noticed that you have two values per row, with the output separated by commas. You will need to check whether there is a second match, a third match, etc. to make it work for multiple numbers per cell.

For your exact sample (assuming a maximum of 2 values per cell), the following formula will work:

=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ),  ", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1)  - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2))  -1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2), FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 )) 

How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.

To go up to 5 numbers, well, study the pattern and repeat the nesting manually in the formula. It will get long!
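If the nested formula gets too long, the same find-"gi|"-then-find-"|" idea can be written as a loop in VBA without the regex object. This is a rough sketch rather than a tested drop-in: the function name ExtractGi is hypothetical, and it assumes you only want tokens made up entirely of digits.

```vba
' Hypothetical UDF: collects every run of digits that sits between
' "gi|" and the next "|", however many times it occurs in the text.
Function ExtractGi(ByVal text As String, _
                   Optional ByVal separator As String = ", ") As String
    Dim startPos As Long, endPos As Long
    Dim token As String, result As String

    startPos = InStr(1, text, "gi|")
    Do While startPos > 0
        startPos = startPos + 3                  ' skip past "gi|"
        endPos = InStr(startPos, text, "|")      ' position of the closing pipe
        If endPos = 0 Then Exit Do               ' no closing pipe: stop looking

        token = Mid$(text, startPos, endPos - startPos)
        ' Keep the token only if it is non-empty and contains no non-digit
        If Len(token) > 0 And Not token Like "*[!0-9]*" Then
            If Len(result) > 0 Then result = result & separator
            result = result & token
        End If

        ' Look for the next "gi|" from the closing pipe onwards
        startPos = InStr(endPos, text, "gi|")
    Loop

    ExtractGi = result
End Function
```

You would call it as =ExtractGi(A1), or =ExtractGi(A1,",") for the comma-only separator shown in the question. For the first sample row this should return 297848936, 297338191.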

Zelgada answered Sep 24 '22 12:09
