Regular Expression over multiple lines

I've been stuck on this for several hours now and have cycled through a wealth of different tools to get the job done, without success. It would be fantastic if someone could help me out.

Here is the problem:

I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this:

This is a long abstract describing something. What follows is the tile for this sentence."   
,Title1  
This is another sentence that is running on one line. On the next line you can find the title.   
,Title2

As you can probably see, the titles ",Title1" and ",Title2" should actually be on the same line as the preceding sentence. Then it would look something like this:

This is a long abstract describing something. What follows is the tile for this sentence.",Title1  
This is another sentence that is running on one line. On the next line you can find the title.,Title2

Please note that the sentence may or may not end with a quote. Ideally, any such trailing quote should be removed as well.

Here is what I came up with so far:

sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv

This should match the expression across multiple lines, but unfortunately it doesn't :)

The expression looks for the period at the end of the sentence, followed by an optional quote and a newline character, which I'm trying to match with .*.

Any help is much appreciated. It doesn't really matter which tool gets the job done (awk, perl, sed, tr, etc.).

552

asked Dec 22 '10 15:12

herrherr

1 Answer

Multiline matching in sed isn't necessarily tricky per se; it's just that it relies on commands most people aren't familiar with, and those commands have certain side effects. For example, when you use 'N' to append the next line to the pattern space, the current line and the next line are separated by a '\n'.

Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:

sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
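For readers who haven't met N, P and D before, here is the same one-liner spread over several lines with a comment on each step. This is only an annotated sketch: the function name join_titles is made up for illustration, and it assumes GNU sed, since \? (an optional match in a basic regular expression) is a GNU extension.

```shell
#!/bin/sh
# join_titles is a hypothetical helper name wrapping the one-liner above.
# Assumes GNU sed: \? is a GNU extension to basic regular expressions.
join_titles() {
  sed '
    # N: append the next input line, so the pattern space is "line1\nline2"
    N
    # If line2 starts with a comma, delete an optional quote, any spaces,
    # and the newline itself, which joins the two lines.
    /\n,/ s/"\? *\n//
    # P: print the pattern space up to the first newline (all of it if none)
    P
    # D: delete up to the first newline and restart without reading input,
    # so a leftover line2 becomes line1 of the next pair
    D
  ' "$@"
}

# Small demonstration on inline data instead of a real file.
printf '%s\n' 'An abstract."   ' ',Title1' 'plain line' | join_titles
```

Because P and D only ever deal with the first line of the pair, every input line gets a chance to be "line1", so it doesn't matter whether the title lines fall on odd or even line numbers.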

Input

$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line

Output

$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
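Since the question says any tool is fine, the same idea (decide based on whether a line starts with a comma) can also be sketched in awk, which avoids the GNU-specific \? in the sed version. This is an alternative under my own assumptions, not something tested on the real 400 MB file; the function name join_titles_awk is made up for illustration.

```shell
#!/bin/sh
# join_titles_awk is a hypothetical name, not part of the sed answer above.
join_titles_awk() {
  awk '
    # Hold the first line back until we know whether a title follows it.
    NR == 1 { prev = $0; next }
    # A line starting with a comma is a title: strip an optional trailing
    # quote plus spaces from the held line, then glue the title onto it.
    /^,/    { sub(/"? *$/, "", prev); prev = prev $0; next }
    # Any other line: the held line is complete, so emit it and hold this one.
    { print prev; prev = $0 }
    # Flush the last held line (skipped for empty input).
    END     { if (NR > 0) print prev }
  ' "$@"
}

# Small demonstration on inline data instead of a real file.
printf '%s\n' 'An abstract."   ' ',Title1' 'no quote.' ',Title2' | join_titles_awk
```

With the file names from the question it would be called as join_titles_awk out.csv > out1.csv. Like the sed version, it only keeps one line in memory at a time, so the file size shouldn't be a problem.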

128

answered Sep 28 '22 21:09

SiegeX


Tags: regex, bash, csv, sed