
Removing binary control characters from a text file

I have a text file that contains binary control characters, such as "^@" and "^M". When I try to perform string operations directly on the text file, the control characters crash the script.

Through trial and error, I discovered that the more command will strip the control characters so that I can process the file properly.

more file_with_control_characters.not_txt > file_without_control_characters.txt

Is this considered a good method, or is there a better way to remove control characters from a text file? Does more have this behavior in OSes earlier than Windows 8?

svengineer99 asked Jun 27 '26 00:06


2 Answers

Certainly you do not want to simply remove all control characters. Newline and Tab characters are control characters as well, and you don't want to remove those.

I'm assuming your ^M is a carriage return and your ^@ is a NUL byte (byte value 0). The carriage returns are not causing you problems, and MORE does not remove them. But NUL bytes can cause problems if your utility is expecting ASCII text files.
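As a rough illustration of filtering selectively rather than wholesale, the Python sketch below drops control characters but keeps tab, line feed, and carriage return. The function name strip_control_chars and the set of characters to keep are assumptions made for this example; they are not part of the question or of MORE's behavior.

```python
# Control characters that text tools normally expect: tab, line feed, carriage return.
# (Hypothetical choice for this sketch; adjust to what your script needs.)
KEEP = {"\t", "\n", "\r"}


def strip_control_chars(text: str) -> str:
    """Drop C0 control characters and DEL, except tab, LF and CR."""
    return "".join(
        ch for ch in text
        if ch in KEEP or not (ord(ch) < 32 or ord(ch) == 127)
    )


if __name__ == "__main__":
    sample = "col1\tcol2\x00\r\nline\x07 two\n"
    print(repr(strip_control_chars(sample)))  # 'col1\tcol2\r\nline two\n'
```

This works on already-decoded text. If the file is really UTF-16, it is better to decode it properly first than to filter out the zero bytes.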

Your input file is most likely UTF-16. In UTF-16, every ASCII character is stored as two bytes, one of which is zero, which is why the file appears to be full of ^@ characters. MORE converts the UTF-16 into ANSI (extended ASCII) format, which effectively removes the NUL bytes. It also converts non-ASCII values into extended ASCII characters in the decimal 128-255 byte value range. I believe it uses your active code page (CHCP) value to decide which characters map where, but I'm not positive.
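The same kind of conversion can be sketched in Python, which may help if you would rather not depend on MORE. This is only an approximation of what is described above: the function name utf16_to_codepage is hypothetical, and cp1252 is a stand-in for whatever your active code page actually is.

```python
def utf16_to_codepage(data: bytes, codepage: str = "cp1252") -> bytes:
    """Decode UTF-16 bytes and re-encode them in a single-byte code page.

    Characters the code page cannot represent become '?'.
    cp1252 is an assumed default, not necessarily your active code page.
    """
    # With the "utf-16" codec, a leading BOM selects the byte order and is consumed.
    text = data.decode("utf-16")
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    raw = "Hi\r\n".encode("utf-16")  # BOM followed by two bytes per character
    print(b"\x00" in raw)             # True
    print(utf16_to_codepage(raw))     # b'Hi\r\n'
```

To convert a file, read it with open(path, "rb").read() and write the returned bytes to a new file opened in "wb" mode.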

You should be aware of some additional issues.

  • MORE will convert all Tab characters into a series of spaces, and you cannot control how many spaces (it varies depending on the current position in the line).

  • MORE will always terminate each line with \r\n (carriage return and line feed).

  • MORE also removes the two-byte BOM (byte order mark) at the beginning of the file, if it exists. The BOM indicates the UTF-16 format, but MORE does not require it; it will convert the UTF-16 to ANSI regardless.

  • Lastly, MORE can hang indefinitely if your file exceeds 64K lines.
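To see why the number of spaces substituted for a Tab depends on its position in the line, here is a small Python sketch using str.expandtabs. It assumes tab stops every 8 columns, which is an assumption for illustration and not a confirmed description of MORE's behavior; the wrapper name expand_tabs is hypothetical.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each tab with enough spaces to reach the next tab stop."""
    return line.expandtabs(tab_width)


if __name__ == "__main__":
    # The same single tab becomes a different number of spaces
    # depending on how many characters precede it.
    for line in ("ab\tc", "abcdefg\th"):
        print(repr(expand_tabs(line)))
```

In the first line the tab turns into six spaces and in the second into one, so the original tabs cannot be reliably recovered afterwards.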

If MORE works for you, then by all means use it.

One other option is to use TYPE, which will also convert UTF-16 to ANSI:

type "yourFile.txt" >"newFile.txt"

TYPE definitely maps non-ASCII codes based on the active code page.

There are some differences between how TYPE and MORE convert:

  • One advantage of TYPE is it does not convert Tab characters to spaces.

  • Another advantage is it will not hang with large files.

  • Another difference (maybe good, maybe bad) is it will not add a line terminator to a line that does not already have one.

  • A potential disadvantage of TYPE is it will not convert UTF-16 to ANSI if the input is missing the BOM.
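Because the BOM decides whether TYPE converts the file, it can be worth checking for one before choosing a tool. A minimal Python sketch, assuming you can read the first bytes of the file yourself (the function name utf16_bom is hypothetical):

```python
import codecs
from typing import Optional


def utf16_bom(data: bytes) -> Optional[str]:
    """Return the codec name implied by a leading UTF-16 BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None


if __name__ == "__main__":
    print(utf16_bom(b"\xff\xfeH\x00i\x00"))  # utf-16-le
    print(utf16_bom(b"H\x00i\x00"))          # None
```

For a real file, open(path, "rb").read(2) is enough input for this check.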

dbenham answered Jun 29 '26 14:06


Sorry for replying to an old thread, but I have seen this question asked in many places, including several times here, so this may help other people. I tried the type command as suggested by @dbenham, but it did not work for me.

If you have Unix tools available, cat -v writes non-printing characters in visible caret notation, so each NUL byte becomes the two literal characters ^@ (and each carriage return becomes a literal ^M):

cat -v file > newfile

Credit to Roel Van de Paar from YouTube.

You can then remove the literal ^@ sequences from the new file with sed:

sed 's/\^@//g' newfile > newfile.out
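For comparison, the same effect as the cat -v and sed pipeline can be sketched in Python by deleting the NUL bytes directly, without the intermediate caret-notation file. The function name strip_nul_bytes is hypothetical, and the approach carries the same caveat as the pipeline: it only gives sensible text when the content is plain ASCII stored as UTF-16, and it leaves any BOM bytes in place.

```python
def strip_nul_bytes(data: bytes) -> bytes:
    """Delete every NUL (0x00) byte from the data."""
    return data.replace(b"\x00", b"")


if __name__ == "__main__":
    raw = "Hi\r\n".encode("utf-16-le")  # b'H\x00i\x00\r\x00\n\x00'
    print(strip_nul_bytes(raw))         # b'Hi\r\n'
```

Unlike cat -v, this leaves carriage returns as real carriage returns rather than turning them into visible ^M text.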

Kenneth Vargas answered Jun 29 '26 13:06


