
Awk/sed replace newlines

Intro:

I have been given a CSV file in which the field delimiter is the pipe character (i.e., |). This file has a pre-defined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.

Problem:

Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).

What I need is an sh script (not bash) that fixes those lines.

Attempted solution:

I wrote the following script to try to fix the file:

if [ $# -ne 1 ]
then
    echo "Usage: $0 <filename>"
    exit
fi

# get first line
first_line=$(head -n 1 $1)

# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')

cat $1  | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
    totRecords = NF/numFields
    # loop over lines
    for (record=0; record < totRecords; record++) {
        output = ""
        # loop over fields
        for (i=0; i<numFields; i++) {
            j = (numFields*record)+i+1 
            # replace newline with question mark
            sub("\n", "?", $j)
            output = output (i > 0 ? "|" : "") $j 
        }
        print output
    }
}
'

However, the newline character is still present. How can I fix that problem?

Example of the CSV:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz

Expected output:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz

* I don't care about the replacement; it could be a space, a question mark, or anything other than a newline or a pipe (which would create a new field).

user2340612 asked Jun 27 '16 16:06



2 Answers

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }

$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz

If that's not what you want then edit your question to provide more truly representative sample input and associated output.
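Some background on why the script in the question leaves the newlines in place: awk's default record separator is the newline, so each record is a single physical line and no field ever contains "\n" for sub() to match. The following is a minimal sketch, not a definitive implementation, of how the tst.awk logic above could be wrapped in a plain sh function. The name fix_csv is hypothetical, and the sketch keeps the same assumption as tst.awk: any line with fewer fields than the header is a continuation of the previous line.

```shell
#!/bin/sh
# fix_csv [FILE...]: print the input with continuation lines folded into
# the preceding record. Reads stdin when no FILE is given.
# The function name is hypothetical; the awk logic mirrors tst.awk.
fix_csv() {
    awk '
        BEGIN   { FS = OFS = "|" }
        # The header defines the required number of fields.
        NR == 1 { reqdNF = NF; printf "%s", $0; next }
        # A short line continues the previous record: join it with a space.
        # A full line starts a new record: end the previous one with ORS first.
                { printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
        # Terminate the last record.
        END     { print "" }
    ' "$@"
}

# Demo on the sample data from the question.
printf '%s\n' \
    'FIRST_NAME|LAST_NAME|NOTES' \
    'John|Smith|This is a field with a' \
    'newline' \
    'Foo|Bar|Baz' | fix_csv
```

In a standalone script the function body could be called as fix_csv "$1" after the usage check from the question.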

Ed Morton answered Sep 23 '22 13:09


This is based on the assumption that the last field may contain at most one newline, and it uses tac and sed:

tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac 

Output:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz

How it works: read the file backwards with tac, because sed is easier to use without forward references. If a line has no '|' separator (/|/!), run the block of code in curly braces {}; otherwise just print the line with p. The block of code:

  1. h; stores the delimiter-less line in sed's hold buffer.
  2. n; fetches another line. Since we're reading backwards, this is the line that the stored one should be appended to.
  3. x; exchanges the hold buffer and the pattern buffer.
  4. H; appends the pattern buffer to the hold buffer.
  5. x; exchanges the buffers again, so the pattern buffer now holds both lines separated by a linefeed.
  6. s/\n/ * /p; replaces the middle linefeed with " * ", leaving one longer line, and prints it.
  7. b; branches to the end of the script, leaving the code block and starting again with the next line.

Finally, re-reverse the file with tac.
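The steps above can also be written as a multi-line sed script with one command per line, which may be easier to follow. This is only a sketch: it assumes GNU tac and GNU sed are available, the function name join_tail is hypothetical, and it keeps the same limitation as the one-liner (a continuation line contains no '|' and each record has at most one of them).

```shell
#!/bin/sh
# join_tail [FILE...]: fold a delimiter-less continuation line into the
# record before it, marking the join with " * ".
# Hypothetical helper name; assumes GNU tac and GNU sed.
join_tail() {
    tac "$@" | sed -n '
        /|/!{
            # No delimiter: save this continuation line in the hold buffer.
            h
            # Fetch the next line, i.e. the record the continuation belongs to.
            n
            # Swap, append, swap back: pattern buffer = record, linefeed, continuation.
            x
            H
            x
            # Replace the linefeed between the two parts and print the result.
            s/\n/ * /p
            # Skip the final p and start over with the next input line.
            b
        }
        # Lines that contain a delimiter are printed unchanged.
        p
    ' | tac
}

# Demo on the sample data from the question.
printf '%s\n' \
    'FIRST_NAME|LAST_NAME|NOTES' \
    'John|Smith|This is a field with a' \
    'newline' \
    'Foo|Bar|Baz' | join_tail
```

The sed comments sit on their own lines inside the single-quoted script, so they must not contain a single quote themselves.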

agc answered Sep 21 '22 13:09