SED: multiple patterns on the same line, how to match/parse first one

I have a file that holds phone number data along with some useless stuff. I'm trying to parse the numbers out, and when there is only one phone number per line, it's no problem. But when a line has multiple numbers, sed matches the last one (even though everywhere says it should match only the first occurrence of the pattern?), and I can't get the other numbers out.

My data.txt:

bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla

My idea was to first remove all the leading "bla bla bla" in front of the first phone number (so I search for the first occurrence of 'NUM:'), then remove everything after the phone number, leaving just the number. After that I want to parse the next occurrence from the leftover string.

So now when I try to sed it, I always get the last number on the line:

$ sed 's/.*NUM://' data.txt
08022222222 bla bla bla

Primarily I would like to understand what's wrong with my understanding of sed. Of course, more efficient suggestions are welcome! Doesn't my sed command say "replace everything before 'NUM:' with '' (empty)"? Why does it always match the last occurrence?

Thanks!

julumme asked Mar 13 '12 09:03



2 Answers

This might work for you:

echo "bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla" |
sed 's/NUM:/\n&/g;s/[^\n]*\n\(NUM:[0-9]*\)[^\n]*/\1 /g;s/.$//'
NUM:09011111111 NUM:08022222222

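The problem is that .* is greedy: the regex engine takes the leftmost possible starting point and then makes the match as long as it can, so .*NUM: runs up to the last NUM: on the line rather than the first. The fix is to place a unique character in front of each string we're interested in (NUM:...). A newline (\n) works because sed uses it as the line delimiter, so it cannot already exist in the line. Deleting everything that is not that unique character ([^\n]*) followed by the unique character (\n) effectively splits the string into manageable pieces. Note that using \n in the replacement and inside a bracket expression relies on GNU sed behaviour; other sed implementations may treat it differently.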
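To see the greediness in isolation, here is a minimal sketch, assuming GNU sed and the sample line from the question. The variable names (line, greedy, first, number) are made up for illustration and are not part of the original answer.

```shell
line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'

# Greedy: .* extends as far as it can while still letting NUM: match,
# so everything up to the LAST NUM: is removed.
greedy=$(printf '%s\n' "$line" | sed 's/.*NUM://')

# Marker trick: without the g flag, s/NUM:/\n/ replaces only the FIRST NUM:
# with a newline; the second command then deletes everything up to it.
first=$(printf '%s\n' "$line" | sed 's/NUM:/\n/; s/^[^\n]*\n//')

# Same as above, plus a third command that cuts from the first non-digit onward,
# leaving only the first number.
number=$(printf '%s\n' "$line" | sed 's/NUM:/\n/; s/^[^\n]*\n//; s/[^0-9].*//')

printf '%s\n' "$greedy" "$first" "$number"
```

The second result is the "leftover string" the question asks for, so the same command could be applied to it again to reach the next number.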

potong answered Oct 27 '22 23:10



As you know by now, sed regexes are greedy and, as far as I can tell, can't be made non-greedy.

Two alternatives that haven't been brought up yet both use other tools for this kind of matching/extraction.

You can use perl as a drop-in replacement for sed with the -pe parameters. It supports the ? non-greedy modifier:

$ perl -pe 's/.*?NUM://' data.txt
09011111111 bla bla bla bla NUM:08022222222 bla bla bla

You can use the -o option to GNU grep to get only the bits of your data that match the regex:

$ egrep -o 'NUM:[0-9]*' data.txt 
NUM:09011111111
NUM:08022222222
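
If only the digits are wanted, the grep output can be piped through a small sed command that strips the NUM: prefix. This is a sketch under the same assumptions (a grep that supports -o, and the sample line from the question); the variable names line and numbers are illustrative only.

```shell
line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'

# grep -o prints each match on its own line;
# sed then removes the leading NUM: from every line.
numbers=$(printf '%s\n' "$line" | grep -o 'NUM:[0-9]*' | sed 's/^NUM://')

printf '%s\n' "$numbers"
```

Because each match ends up on its own line, the anchored ^NUM: pattern is unambiguous and greediness no longer matters.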
Eduardo Ivanec answered Oct 27 '22 23:10
