I have a file with a lot of text, and mixed in there are special space characters (Unicode spaces).
I need to replace all of them with the normal "space" character.
Easy using perl:
perl -CSDA -plE 's/\s/ /g' file
but as @mklement0 correctly said in a comment, it will match \t (TAB) too. If this is a problem, you could use
perl -CSDA -plE 's/[^\S\t]/ /g' file
Demo:
X             X
the above containing:
U+00058 X LATIN CAPITAL LETTER X
U+01680   OGHAM SPACE MARK
U+02002   EN SPACE
U+02003   EM SPACE
U+02004   THREE-PER-EM SPACE
U+02005   FOUR-PER-EM SPACE
U+02006   SIX-PER-EM SPACE
U+02007   FIGURE SPACE
U+02008   PUNCTUATION SPACE
U+02009   THIN SPACE
U+0200A   HAIR SPACE
U+0202F   NARROW NO-BREAK SPACE
U+0205F   MEDIUM MATHEMATICAL SPACE
U+03000   IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X
using:
perl -CSDA -plE 's/\s/_/g'  <<<"X             X"
which (replacing with an underscore instead of a space, for the demo) prints
X_____________X
also, doable using pure bash
LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
# IFS= keeps leading/trailing whitespace of each line intact
while IFS= read -r line; do
    printf '%s\n' "${line//[$spaces]/ }"
done < file
The LC_ALL=en_US.UTF-8 is necessary only if your default locale isn't UTF-8 (which it should be, if you are working with UTF-8 texts) :)
demo:
str="X             X"
echo "${str//[$spaces]/_}"
prints again:
X_____________X
same using sed - prepare the variable $spaces as above and use:
sed "s/[$spaces]/ /g" file
Edit - because of some strange copy/paste (or locale) problems, here is what $spaces should contain:
xxd -ps <<<"$spaces"
shows
c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2
8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a
the md5 digest (two different programs)
md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"
prints the same md5
35cf5e1d7a5f512031d18f3d2ec6612f  -
35cf5e1d7a5f512031d18f3d2ec6612f
It is possible to identify the characters by their Unicode code points; unfortunately, sed 's/[[:space:]]\+/\ /g' won't do the trick.
By reworking another SO answer, we list all the code points, save them in a variable, then use sed for the replacement (note that with -i.bak we also keep a copy of the original file):
 CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
 sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt 
If you're faced with this task repeatedly, consider installing nws (normalize whitespace), a utility (of mine) that simplifies the task:
nws --ascii file # convert non-ASCII whitespace and punctuation to ASCII
nws --ascii -i file  # update file in place
The --ascii mode of nws:
transliterates (non-ASCII) Unicode whitespace (such as a no-break space ( )) and punctuation (such as curly quotes (“”), en dash (–), ... ) to their closest ASCII equivalent
while leaving any other Unicode characters alone.
This mode is helpful for source-code samples that have been formatted for display with typographic quotes, em dashes, and the like, which usually makes the code indigestible to compilers/interpreters.
nws from the npm registry (Linux and macOS)
Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash
With Node.js installed, install as follows:
[sudo] npm install nws-cli -g
Note:
Whether you need sudo depends on how you installed Node.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
-g ensures global installation and is needed to put nws-cli in your system's $PATH.
Manual installation (bash)
Download the bash script as nws.
Make it executable with chmod +x nws.
Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).
[:space:] and [:blank:] and non-ASCII Unicode whitespace
In UTF-8-based locales, POSIX-compatible utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.
This relies on the locale charmap's correct classification of Unicode characters based on the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:], available in patterns and regular expressions.
There are two pitfalls:
Unicode is an evolving standard (version 9 as of this writing); your platform's UTF-8 charmap may not be current.
On Ubuntu 16.04, for example, some characters are not properly classified and are therefore not matched by [:space:] / [:blank:].
The utilities should use the active locale's charmap, but there are regrettable exceptions; the following utilities are NOT Unicode-aware (there may be more):
Among GNU utilities (as of coreutils v8.27):
cut, tr
Mawk, the awk implementation that is the default on Ubuntu, for instance.
Among BSD/macOS utilities (as of macOS 10.12):
awk
Therefore, on a platform that has a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tab characters and therefore replaces them with a single space too:
sed 's/[[:space:]]/ /g' file
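Given the pitfalls above, it may be worth checking what your own sed does before relying on it. A rough sketch (the test string and variable names are my own; the outcome depends on your platform and active locale, so no particular result is promised):

```shell
# Test input: X, U+2003 EM SPACE (UTF-8 bytes e2 80 83), X.
sample=$(printf 'X\342\200\203X')

# Run the sed command under the currently active locale.
result=$(printf '%s\n' "$sample" | sed 's/[[:space:]]/_/g')

# If the EM SPACE was recognized as [:space:], exactly one underscore appears.
if [ "$result" = "X_X" ]; then
  verdict="matched"
else
  verdict="not matched"
fi
printf '[:space:] and U+2003 in this locale: %s\n' "$verdict"
```

If it reports "not matched", check the active locale first (for instance with the locale command) before blaming sed.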