I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence." ,Title1 This is another sentence that is running on one line. On the next line you can find the title. ,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1 This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline option, or the m inline option, enables the regular expression engine to handle an input string that consists of multiple lines. It changes the interpretation of the ^ and $ language elements so that they match the beginning and end of a line, instead of the beginning and end of the input string.
The dot matches all except newlines (\r\n). So use \s\S, which will match ALL characters.
[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9. (a-z0-9) -- Explicit capture of a-z0-9 .
The m flag indicates that a multiline input string should be treated as multiple lines. For example, if m is used, ^ and $ change from matching at only the start or end of the entire string to the start or end of any line within the string. The set accessor of multiline is undefined .
Multiline in sed
isn't necessarily tricky per se, it's just that it uses commands most people aren't familiar with and have certain side effects, like delimiting the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With