
Extract some part of text separated by a delimiter using a regex

Tags:

regex

aql

I have a sample input file as follows, with columns Id, Name, start date, end date, Age, Description, and Location:

220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad
221;Paul;23/11/2008;22/12/2008;30;He is a software engineer at MNC;Bangalore
222;Emma;23/11/2008;22/12/2008;25;Working as a mechanical engineer;Chennai

It contains 30 lines of data. My requirement is to only extract descriptions from the above text file.

My output should contain

Working as a professor in University

He is a software engineer at MNC

Working as a mechanical engineer

I need a regular expression that extracts the Description. I have tried several patterns, but none of them worked. How can I do it?

mahodaya asked Feb 19 '13 04:02


2 Answers

You can use this regex:

[^;]+(?=;[^;]*$)

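[^;] matches any single character except ;

+ is a quantifier that matches the preceding character or group one or more times

* is a quantifier that matches the preceding character or group zero or more times

$ matches the end of the string

(?=pattern) is a lookahead: it checks that pattern matches immediately after the current position, without consuming any characters

Put together, the regex matches the field that is followed by exactly one more semicolon before the end of the line, i.e. the second-to-last field, which is the Description in your data.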
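
As a rough illustration in Perl, the pattern can be applied one line at a time. This is only a sketch: the description_of helper name and the second, shorter sample line are assumptions made for the example, not part of the question. Because the lookahead anchors on the last semicolon, the match does not depend on how many fields come before the Description.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the second-to-last ';'-separated field,
# or undef when the line contains no semicolon at all.
sub description_of {
    my ($line) = @_;
    chomp $line;
    # One or more non-';' characters, followed by the final ';' and the last field
    return $line =~ /([^;]+)(?=;[^;]*$)/ ? $1 : undef;
}

my @lines = (
    # First sample line from the question
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad\n",
    # Hypothetical line with fewer fields, to show the field count does not matter
    "1;x;Some description;Place\n",
);

print description_of($_), "\n" for @lines;
```

The capture group is used here only so the helper can return the matched text through $1; the pattern itself is unchanged.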

Anirudha answered Oct 08 '22 13:10


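/^(?:[^;]+;){5}([^;]+)/ will grab the sixth field between semicolons, which is the Description in your sample (assuming every line has all seven fields).

That said, as I noted in my comment, you should just split each line on the semicolon and take the sixth element of the result. That's the whole point of a delimited file: you don't need complex pattern matching.

Example implementation in Perl using your input example:

use strict;
use warnings;

open(my $IN, '<', 'input.txt') or die "Cannot open input.txt: $!";

while (my $line = <$IN>) {
    chomp $line;
    # Skip the first five fields, then capture the sixth (Description)
    my ($desc) = $line =~ /^(?:[^;]+;){5}([^;]+)/;
    next unless defined $desc;
    print "'$desc'\n";
}
close $IN;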

yields:

'Working as a professor in University'
'He is a software engineer at MNC'
'Working as a mechanical engineer'
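
For comparison, the split-based approach recommended above might look like the following minimal sketch. It assumes, as the column list in the question implies, that Description is always the sixth of seven fields. The description_by_split name and the in-memory sample line are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: split on ';' and return the sixth field (index 5),
# or undef when the line has fewer than six fields.
sub description_by_split {
    my ($line) = @_;
    chomp $line;
    # The -1 limit keeps trailing empty fields instead of dropping them
    my @fields = split /;/, $line, -1;
    return $fields[5];
}

# First sample line from the question
my $sample = "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad\n";
print "'", description_by_split($sample), "'\n";
```

Unlike a pattern built on [^;]+, which requires every field to contain at least one character, split also copes with empty fields.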

Lone Shepherd answered Oct 08 '22 12:10