I have been given a CSV file in which the field delimiter is the pipe character (i.e., |).
This file has a predefined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.
Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).
What I need to create is a sh script (not bash) to fix those lines.
I tried the following script to fix the file:
if [ $# -ne 1 ]
then
    echo "Usage: $0 <filename>"
    exit 1
fi
# get first line
first_line=$(head -n 1 "$1")
# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')
cat "$1" | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
    totRecords = NF/numFields
    # loop over lines
    for (record=0; record < totRecords; record++) {
        output = ""
        # loop over fields
        for (i=0; i<numFields; i++) {
            j = (numFields*record)+i+1
            # replace newline with question mark
            sub("\n", "?", $j)
            output = output (i > 0 ? "|" : "") $j
        }
        print output
    }
}
'
However, the newline character is still present. How can I fix that problem?
Sample input:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz
Expected output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
* I don't care about the replacement; it could be a space, a question mark, or anything else except a newline or a pipe (which would create a new field).
By default, every line of a text file ends with \n. sed reads its input one line at a time and strips that trailing \n before the line reaches the pattern space, so a plain s/\n/.../ matches nothing until lines have been joined (for example with N). A separator other than \n can be used, but only with GNU sed (its -z option splits on NUL instead). When the last line of the file has no trailing \n, GNU sed does not add one on output.
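As a small illustration of the point about sed and \n, here is a sketch using a two-line input made up for the demo. It only relies on the standard N command and on \n in the pattern matching an embedded newline.

```shell
# Without N the pattern space holds a single line and no newline,
# so the substitution finds nothing to replace and both lines pass through.
printf 'a\nb\n' | sed 's/\n/?/'

# N appends the next input line to the pattern space, separated by a newline,
# which the substitution can now match and replace.
printf 'a\nb\n' | sed 'N;s/\n/?/'
```

The first command prints the two lines unchanged; the second prints them joined as a?b.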
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }
$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz
If that's not what you want then edit your question to provide more truly representative sample input and associated output.
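The awk script above joins a short line onto the preceding record, which fits the case where the stray newline is in the last field. If the newline may fall in any field, one possible sketch is to keep a pending record and join the next line onto it whenever the combined field count still fits the header. The function name fix_csv and the variable names are made up for this example. It also assumes that a line without any pipe always continues the previous record, since that case is ambiguous.

```shell
#!/bin/sh
# fix_csv FILE: hypothetical helper, a sketch rather than a definitive fix.
# Joins physical lines into records of at most N fields, N taken from the header.
fix_csv() {
    awk -F'|' '
        NR == 1 { want = NF; print; next }   # the header defines the field count
        {
            n = (NF > 0 ? NF : 1)            # an empty line still counts as one field
            # Joining merges the last pending field with the first field of this
            # line, so the joined record would have held + n - 1 fields.
            if (have && held + n - 1 <= want) {
                buf = buf " " $0             # a space replaces the stray newline
                held += n - 1
            } else {
                if (have) print buf          # pending record cannot grow: emit it
                buf = $0; held = n; have = 1
            }
        }
        END { if (have) print buf }          # emit the final pending record
    ' "$1"
}
```

Calling fix_csv file.csv on the sample input from the question should print the header, then John|Smith|This is a field with a newline, then Foo|Bar|Baz. Only POSIX awk features are used, so it should run under plain sh.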
This is based on the assumption that only the last field may contain a newline, and at most one per record. Using tac and sed:
tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac
Output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
How it works. Read the file backwards, sed is easier without forward references. If a line has no '|' separator, /|/!, run the block of code in curly braces {}, otherwise just p (print) the line. The block of code:
h; stores the delimiter-less line in sed's hold buffer.
n; fetches another line; since we're reading backwards, this is the line that should be appended to.
x; exchange hold buffer and pattern buffer.
H; append pattern buffer to hold buffer.
x; exchange newly appended lines to pattern buffer, now there's two lines in one buffer.
s/\n/ * /p; replace the middle linefeed with a " * ", now there's only one longer line; and print.
b start again, leave the code block.
Re-reverse the file with tac; done.
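The steps above can also be written as a multi-line sed script with one comment per command, which may be easier to follow. This is the same program as the one-liner, wrapped in a function whose name (join_last_field) is made up for the example. It assumes tac is available, which is a GNU coreutils tool and not part of POSIX.

```shell
#!/bin/sh
# join_last_field FILE: hypothetical wrapper around the tac/sed pipeline.
join_last_field() {
    tac "$1" | sed -n '
        # A line without any pipe is the tail of a broken record
        /|/!{
            # save the tail in the hold buffer
            h
            # read the next line, which (reading backwards) is the record head
            n
            # swap: pattern space = tail, hold buffer = head
            x
            # append the tail to the head inside the hold buffer
            H
            # swap back: pattern space = head, newline, tail
            x
            # replace the embedded newline and print the joined record
            s/\n/ * /p
            # skip the rest of the script for this cycle
            b
        }
        # every other line is printed unchanged
        p
    ' | tac
}
```

Calling join_last_field file.csv on the sample input should give the same three lines as the one-liner.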