I've written a script that cleans up .csv files, removing some bad commas and bad quotes (bad, means they break an in house program we use to transform these files) using sed:
# remove all commas, and re-insert the good commas using clean.sed
sed -f clean.sed $1 > $1.1st
# remove all quotes
sed 's/\"//g' $1.1st > $1.tmp
# add the good quotes around good commas
sed 's/\,/\"\,\"/g' $1.tmp > $1.tmp1
# add leading quotes
sed 's/^/\"/' $1.tmp1 > $1.tmp2
# add trailing quotes
sed 's/$/\"/' $1.tmp2 > $1.tmp3
# remove utf characters
sed 's/<feff>//' $1.tmp3 > $1.tmp4
# replace original file with new stripped version and delete .tmp files
cp -rf $1.tmp4 quotes_$1
Here is clean.sed:
s/\",\"/XXX/g;
:a
s/,//g
ta
s/XXX/\",\"/g;
Then it removes the temp files and viola we have a new file that starts with the word "quotes" that we can use for our other processes.
My question is:
Why do I have to make a sed statement to remove the feff tag in that temp file? The original file doesn't have it, but it always appears in the replacement. At first I thought cp was causing this but if I put in the sed statement to remove before the cp, it isn't there.
Maybe I'm just missing something...
FEFF is an automated program for ab initio multiple scattering calculations of X-ray Absorption Fine Structure (XAFS), X-ray Absorption Near-Edge Structure (XANES) and various other spectra for clusters of atoms.
Our friend FEFF means different things, but it's basically a signal for a program on how to read the text. It can be UTF-8 (more common), UTF-16 , or even UTF-32 . FEFF itself is for UTF-16 — in UTF-8 it is more commonly known as 0xEF,0xBB, or 0xBF .
From man bash : -s If the -s option is present, or if no arguments remain after option processing, then commands are read from the standard input. This option allows the positional parameters to be set when invoking an interactive shell.
U+FEFF is the code point for a byte order mark. Your files most likely contain data saved in UTF-16 and the BOM has been corrupted by your 'cleaning process' which is most likely expecting ASCII. It's probably not a good idea to remove the BOM, but instead to fix your scripts to not corrupt it in the first place.
To get rid of these in GNU emacs:
There is also a way to convert files with DOS line termination convention to Unix line termination convention.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With