Extracting multiple occurrences in line without known separator using sed

Tags: regex, sed

I have a large text file that contains probabilities embedded in sentences. I want to extract only those probabilities and the text before them. For example:

Input:

not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting

Wanted output:

foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10

What I have so far:

$ sed -nr ':a s|(.*) 1 in ([0-9.,]+)|\1 1/\2\n|;tx;by; :x h;ba; :y g;/^$/d; p' input

foo is 1/1,200
 and test is 1/3.4
 not interesting
something else is 1/2.5,
 things are 1/10

something else is 1/2.5,
 things are 1/10

This beautiful code repeatedly splits a line at each match and tries to print it only if it contained matches. The problem with my code seems to be that the hold space isn't cleared after a line is done, so a later non-matching line prints the previous result again.
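One way to sidestep the stale hold space is to not use it at all. The sketch below is a variation on the command above, not a complete solution: it loops with t for as long as a substitution succeeds, then prints the pattern space only if a newline was inserted. It assumes GNU sed, and the function name split_probs is hypothetical.

```shell
# Hypothetical helper: reads text on stdin (GNU sed assumed).
split_probs() {
  # :a      loop label
  # s|||    rewrite the last remaining "1 in N" as "1/N" plus a newline
  #         (last, because the leading .* is greedy)
  # ta      repeat while the substitution succeeded
  # /\n/p   print only lines in which at least one match was rewritten
  sed -nr ':a; s|(.*) 1 in ([0-9.,]+)|\1 1/\2\n|; ta; /\n/p'
}

split_probs <<'EOF'
not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting
EOF
```

This removes the repeated block after "also not interesting", but the leading spaces, the trailing " not interesting" fragment and the empty line after a match at the end of a line are still there.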

The general problem is that sed can't do non-greedy matching and my separator can be anything.

I guess a solution in a different language would be okay, but now I'm kind of intrigued whether this is possible in sed.

asked Jul 19 '15 12:07 by phiresky


1 Answer

This might work for you (GNU sed):

sed -r 's/([0-9]) in ([0-9]\S*\s*)/\1\/\2\n/;/[0-9]\/[0-9]/P;D' file

The substitution replaces the first occurrence of a digit, a space, the word in, a space and a token beginning with a digit (plus any whitespace after it) with the first digit, a /, the second token and a newline. If the pattern space then contains a digit followed by a / followed by a digit, P prints its first line. D then deletes that first line; if anything is left in the pattern space, the script is repeated on the remainder without reading new input, otherwise the next input line is read.
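As an illustration, the sketch below runs a slight variation of the command on the sample input from the question. The variation is an assumption and not part of the original answer: it moves \s* outside the second capture group, so the printed lines carry no trailing whitespace. The function name extract_probs is hypothetical, and GNU sed is assumed (-r, \S, \s, and \n in the replacement).

```shell
# Hypothetical helper: reads text on stdin, prints one "text N/M" fragment per line.
extract_probs() {
  # s///  turn the first "N in M..." into "N/M..." and break the line after it,
  #       dropping any whitespace that follows the token
  # P     print the first line of the pattern space if it holds "digit/digit"
  # D     delete that first line and rerun the script on what is left;
  #       with nothing left, move on to the next input line
  sed -r 's/([0-9]) in ([0-9]\S*)\s*/\1\/\2\n/;/[0-9]\/[0-9]/P;D'
}

extract_probs <<'EOF'
not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting
EOF
```

With GNU sed this should print the four lines of the wanted output:

foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10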

answered Nov 16 '22 04:11 by potong