I have a large text file that contains probabilities embedded in sentences. I want to extract only those probabilities and the text before them. Example:
Input:
not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting
Wanted output:
foo is 1/1,200
and test is 1/3.4
something else is 1/2.5,
things are 1/10
What I have so far:
$ sed -nr ':a s|(.*) 1 in ([0-9.,]+)|\1 1/\2\n|;tx;by; :x h;ba; :y g;/^$/d; p' input
foo is 1/1,200
and test is 1/3.4
not interesting
something else is 1/2.5,
things are 1/10
something else is 1/2.5,
things are 1/10
This beautiful code repeatedly splits a line wherever it matches, and tries to print the line only if it contained a match. The problem with my code seems to be that the hold space isn't cleared after a line is done.
The general problem is that sed can't do non-greedy matching and my separator can be anything.
I guess a solution in a different language would be okay, but now I'm curious whether this is possible in sed.
This might work for you (GNU sed):
sed -r 's/([0-9]) in ([0-9]\S*\s*)/\1\/\2\n/;/[0-9]\/[0-9]/P;D' file
The substitution finds a digit, followed by a space, the word in, another space, and a token that starts with a digit (plus any whitespace after it). It rewrites that as the first digit, a /, and the token, followed by a newline. If the pattern space then contains a digit, a /, and a digit, P prints it up to the first newline. D deletes that same part and, if anything is left in the pattern space, restarts the cycle on the remainder without reading a new input line.
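To see the print-and-restart cycle in action, here is a small sketch that runs a slight variation of the command on the sample input from the question. It assumes GNU sed (for \n in the replacement and for -E). The file name input.txt and the function name extract_probabilities are only illustrative. The one change from the command above is that the trailing whitespace is matched outside the second group, so the printed lines do not end in a space.

```shell
# Recreate the sample input from the question (the file name is arbitrary).
cat > input.txt <<'EOF'
not interesting
foo is 1 in 1,200 and test is 1 in 3.4 not interesting
something else is 1 in 2.5, things are 1 in 10
also not interesting
EOF

# Hypothetical helper wrapping the sed call (GNU sed assumed).
#   s/.../  turn the first "N in TOKEN" into "N/TOKEN" plus a newline,
#           swallowing any whitespace after the token
#   P       print up to the first newline, but only if a digit/digit is present
#   D       drop what was printed and rerun the script on the remainder
extract_probabilities() {
  sed -E 's/([0-9]) in ([0-9][^[:space:]]*)[[:space:]]*/\1\/\2\n/;/[0-9]\/[0-9]/P;D' "$1"
}

extract_probabilities input.txt
```

On the sample input this should print the four lines from the wanted output. Note that any line that already contains a digit, a /, and a digit (a date such as 1/2, say) would also be printed, so the final test may need tightening for real data.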