How to filter only printable characters in a file on Bash (linux) or Python?

I have a file that contains non-printable characters, and I want to reduce it to only its printable characters. I think this problem is related to ASCII control characters, but I could not find a solution, and I also do not understand the meaning of .[16D (an ASCII control sequence?) in the following file.

HEXDUMP OF INPUT FILE:

00000000: 4845 4c4c 4f20 5448 4953 2049 5320 5448 HELLO THIS IS TH
00000010: 4520 5445 5354 1b5b 3136 4420 2020 2020 E TEST.[16D
00000020: 2020 2020 2020 2020 2020 201b 5b31 3644            .[16D
00000030: 2020

When I cat that file in Bash, I just get: "HELLO ". I think this is because cat, by default, interprets the two .[16D control sequences.

Why do the two .[16D strings make cat FILE print just "HELLO "? And how can I make the file include only printable characters, i.e., "HELLO "?

asked Sep 27 '22 18:09 by freddy


2 Answers

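The hexdump shows that the dot in .[16D is actually an escape character, \x1b.
Esc[nD is the ANSI escape code for Cursor Back: it moves the cursor n columns to the left without deleting anything. So the first Esc[16D moves the cursor back to the start of "THIS IS THE TEST", and the 16 spaces that follow overwrite those 16 characters. The second Esc[16D moves the cursor back again and two more spaces are written, so only "HELLO" remains visible. Note that it is the terminal, not cat, that interprets the sequence; cat just copies the bytes through.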
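
To make the overwrite behaviour concrete, here is a minimal sketch that simulates a one-line terminal. It assumes the only control sequence present is Esc[nD; render_line is a hypothetical helper written for this illustration, and a real terminal handles far more than this.

```python
import re

# Matches ESC [ <count> D, the Cursor Back sequence (count is optional).
CUB_RE = re.compile(r'\x1b\[(\d*)D')

def render_line(data):
    """Simulate a one-line terminal that understands only ESC[nD."""
    cells = []  # characters currently shown on the line
    col = 0     # cursor column
    pos = 0     # read position in the input
    while pos < len(data):
        m = CUB_RE.match(data, pos)
        if m:
            n = int(m.group(1) or 1)  # a missing count means 1
            col = max(0, col - n)     # move left, never past column 0
            pos = m.end()
            continue
        # Ordinary character: overwrite the cell under the cursor or append.
        if col < len(cells):
            cells[col] = data[pos]
        else:
            cells.append(data[pos])
        col += 1
        pos += 1
    return ''.join(cells)

data = 'HELLO THIS IS THE TEST\x1b[16D' + ' ' * 16 + '\x1b[16D  '
print(repr(render_line(data).rstrip()))  # 'HELLO'
```

The line still holds 22 cells after rendering; everything after "HELLO" has simply been overwritten with spaces.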

There are various ways to remove ANSI escape codes from a file, either using Bash commands (e.g. using sed, as in Anubhava's answer) or using Python.
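
As a rough Python illustration of plain removal (this is not the sed command from the other answer), the sketch below strips CSI sequences, i.e. Esc[ followed by parameter bytes and a final byte, with a regular expression. CSI_RE and strip_csi are names made up for this example, and other kinds of escape sequences are assumed not to occur.

```python
import re

# CSI sequence: ESC [, then parameter bytes (0x30-0x3f),
# intermediate bytes (0x20-0x2f), and one final byte (0x40-0x7e).
CSI_RE = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]')

def strip_csi(text):
    """Delete CSI escape sequences without interpreting them."""
    return CSI_RE.sub('', text)

data = 'HELLO THIS IS THE TEST\x1b[16D' + ' ' * 16 + '\x1b[16D  '
print(repr(strip_csi(data).rstrip()))  # 'HELLO THIS IS THE TEST'
```

Because the sequences are deleted rather than interpreted, the text that the terminal would have overwritten is still present in the result.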

However, in cases like this, it may be better to run the file through a terminal emulator to interpret any existing editing control sequences in the file, so you get the result the file's author intended after they applied those editing sequences.

One way to do that in Python is to use pyte, a Python module that implements a simple VTXXX compatible terminal emulator. You can install it using pip, and its documentation is hosted on Read the Docs.

Here's a simple demo program that interprets the data given in the question. It's written for Python 2, but it's easy to adapt to Python 3. pyte is Unicode-aware, and its standard Stream class expects Unicode strings, but this example uses a ByteStream, so I can pass it a plain byte string.

#!/usr/bin/env python

''' pyte VTxxx terminal emulator demo

    Interpret a byte string containing text and ANSI / VTxxx control sequences

    Code adapted from the demo script in the pyte tutorial at
    http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial

    Posted to http://stackoverflow.com/a/30571342/4014959 

    Written by PM 2Ring 2015.06.02
'''

import pyte


#hex dump of data
#00000000  48 45 4c 4c 4f 20 54 48  49 53 20 49 53 20 54 48  |HELLO THIS IS TH|
#00000010  45 20 54 45 53 54 1b 5b  31 36 44 20 20 20 20 20  |E TEST.[16D     |
#00000020  20 20 20 20 20 20 20 20  20 20 20 1b 5b 31 36 44  |           .[16D|
#00000030  20 20                                             |  |

data = 'HELLO THIS IS THE TEST\x1b[16D                \x1b[16D  '

#Create a default sized screen that tracks changed lines
screen = pyte.DiffScreen(80, 24)
screen.dirty.clear()
stream = pyte.ByteStream()
stream.attach(screen)
stream.feed(data)

#Get index of last line containing text
last = max(screen.dirty)

#Gather lines, stripping trailing whitespace
lines = [screen.display[i].rstrip() for i in range(last + 1)]

print '\n'.join(lines)

Output:

HELLO

Hex dump of output:

00000000  48 45 4c 4c 4f 0a                                 |HELLO.|

answered Oct 14 '22 10:10 by PM 2Ring


For me, the following command works well, using strings out of the box:

head /dev/random | strings -ws ''

Detailed explanation:

head /dev/random: this part does not really matter; it just produces some lines of random bytes, including non-printing characters that might mess up your screen.

The -w and -s options of strings (partial output of man strings):

-w --include-all-whitespace By default tab and space characters are included in the strings that are displayed, but other whitespace characters, such as newlines and carriage returns, are not. The -w option changes this so that all whitespace characters are considered to be part of a string.

-s --output-separator By default, output strings are delimited by a new-line. This option allows you to supply any string to be used as the output record separator. Useful with --include-all-whitespace where strings may contain new-lines internally.

With -w, strings keeps all whitespace as part of each string, and with -s '' it emits the strings with no separator between them, so strings -ws '' prints just the sequences of printable characters from its input.
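
For comparison, a similar byte filter can be sketched in Python. This is only an approximation of the strings command above, not a reimplementation: keep_printable is a hypothetical helper, it assumes ASCII input, and unlike strings (which by default drops runs shorter than 4 characters) it keeps every printable byte.

```python
import string

# Byte values of printable ASCII, including whitespace such as \t, \n, \r.
KEEP = frozenset(string.printable.encode('ascii'))

def keep_printable(raw):
    """Return only the printable ASCII bytes of raw, in order."""
    return bytes(b for b in raw if b in KEEP)

raw = b'HELLO\x00\x07 WORLD\x1b[16D\n'
print(keep_printable(raw))  # b'HELLO WORLD[16D\n'
```

Filtering this way drops the escape byte itself but keeps the printable remainder of an escape sequence, such as [16D.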

answered Oct 14 '22 09:10 by pusen luo