
How can I remove the ANSI escape sequences from a string in Python?

This is my string:

'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m' 

I am using code to retrieve the output of an SSH command, and I want my string to contain only 'examplefile.zip'.

What can I use to remove the extra escape sequences?

SpartaSixZero asked Feb 04 '13 19:02




2 Answers

Delete them with a regular expression:

    import re

    # 7-bit C1 ANSI sequences
    ansi_escape = re.compile(r'''
        \x1B  # ESC
        (?:   # 7-bit C1 Fe (except CSI)
            [@-Z\\-_]
        |     # or [ for CSI, followed by a control sequence
            \[
            [0-?]*  # Parameter bytes
            [ -/]*  # Intermediate bytes
            [@-~]   # Final byte
        )
    ''', re.VERBOSE)
    result = ansi_escape.sub('', sometext)

or, without the VERBOSE flag, in condensed form:

    ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
    result = ansi_escape.sub('', sometext)

Demo:

    >>> import re
    >>> ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
    >>> sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
    >>> ansi_escape.sub('', sometext)
    'ls\r\nexamplefile.zip\r\n'
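The demo result still contains the echoed command and the line endings. One possible way to get down to just the file name the question asks for is sketched below; the helper name `visible_lines` and the assumption that the first line is always the echoed command are illustrative and not part of the original answer.

```python
import re

ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')

def visible_lines(text):
    """Strip ANSI sequences and return the non-empty lines (hypothetical helper)."""
    cleaned = ansi_escape.sub('', text)
    return [line for line in cleaned.splitlines() if line.strip()]

sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'
lines = visible_lines(sometext)

# Assumption: the first line is the echoed command ('ls'), so skip it.
filenames = lines[1:]
print(filenames)  # ['examplefile.zip']
```

If the remote shell does not echo the command, the `lines[1:]` slice would drop a real file name, so that step depends on how the SSH output was captured.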

The above regular expression covers all 7-bit ANSI C1 escape sequences, but not the 8-bit C1 escape sequence openers. The latter are never used in today's UTF-8 world, where the same range of bytes has a different meaning.
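As a small illustration of that overlap (a sketch, using an arbitrarily chosen character), the UTF-8 encoding of U+0100 contains a byte that falls inside the 8-bit C1 range 0x80-0x9F:

```python
# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) encodes to two bytes in UTF-8.
encoded = '\u0100'.encode('utf-8')
print(encoded)  # b'\xc4\x80'

# The second byte, 0x80, is a UTF-8 continuation byte, but it is also
# a value in the 8-bit C1 range 0x80-0x9F.
c1_bytes = [b for b in encoded if 0x80 <= b <= 0x9F]
print([hex(b) for b in c1_bytes])  # ['0x80']
```

Treating such bytes as control codes in UTF-8 encoded data would therefore damage ordinary text.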

If you do need to cover the 8-bit codes too (and are then, presumably, working with bytes values) then the regular expression becomes a bytes pattern like this:

    # 7-bit and 8-bit C1 ANSI sequences
    ansi_escape_8bit = re.compile(br'''
        (?: # either 7-bit C1, two bytes, ESC Fe (omitting CSI)
            \x1B
            [@-Z\\-_]
        |   # or a single 8-bit byte Fe (omitting CSI)
            [\x80-\x9A\x9C-\x9F]
        |   # or CSI + control codes
            (?: # 7-bit CSI, ESC [
                \x1B\[
            |   # 8-bit CSI, 9B
                \x9B
            )
            [0-?]*  # Parameter bytes
            [ -/]*  # Intermediate bytes
            [@-~]   # Final byte
        )
    ''', re.VERBOSE)
    result = ansi_escape_8bit.sub(b'', somebytesvalue)

which can be condensed down to

    # 7-bit and 8-bit C1 ANSI sequences
    ansi_escape_8bit = re.compile(
        br'(?:\x1B[@-Z\\-_]|[\x80-\x9A\x9C-\x9F]|(?:\x1B\[|\x9B)[0-?]*[ -/]*[@-~])'
    )
    result = ansi_escape_8bit.sub(b'', somebytesvalue)
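A short usage sketch of the bytes pattern follows. The input value is made up for illustration: it uses the 7-bit CSI opener (ESC [) for the first colour code and the single-byte 8-bit CSI opener (0x9B) for the reset.

```python
import re

ansi_escape_8bit = re.compile(
    br'(?:\x1B[@-Z\\-_]|[\x80-\x9A\x9C-\x9F]|(?:\x1B\[|\x9B)[0-?]*[ -/]*[@-~])'
)

# Hypothetical input mixing a 7-bit CSI (ESC [) and an 8-bit CSI (0x9B).
somebytesvalue = b'\x1b[01;31mexamplefile.zip\x9b0m\r\n'
result = ansi_escape_8bit.sub(b'', somebytesvalue)
print(result)  # b'examplefile.zip\r\n'
```

This pattern is only appropriate for data that really uses 8-bit C1 codes; applied to UTF-8 encoded bytes it would also delete continuation bytes in the 0x80-0x9F range.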

For more information, see:

  • the ANSI escape codes overview on Wikipedia
  • the ECMA-48 standard, 5th edition (especially sections 5.3 and 5.4)

The example you gave contains four CSI (Control Sequence Introducer) codes, as marked by the \x1B[ or ESC [ opening bytes, and each contains an SGR (Select Graphic Rendition) code, because they each end in m. The parameters in between (separated by ; semicolons) tell your terminal which graphic rendition attributes to use. Across the \x1B[....m sequences in the example, the three codes used are:

  • 0 (or 00 in this example): reset, disable all attributes
  • 1 (or 01 in the example): bold
  • 31: red (foreground)
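To make the breakdown above concrete, here is a minimal sketch that pulls the SGR parameters out of the example string. The pattern `sgr` and the lookup table `names` are illustrative names introduced here; the table covers only the three codes listed above, and treating an empty parameter as 0 is an assumption based on the SGR default.

```python
import re

# Matches only CSI ... m (SGR) sequences; captures digits and semicolons.
sgr = re.compile(r'\x1B\[([0-9;]*)m')

sometext = 'ls\r\n\x1b[00m\x1b[01;31mexamplefile.zip\x1b[00m\r\n\x1b[01;31m'

# Illustrative lookup table covering only the three codes in this example.
names = {0: 'reset', 1: 'bold', 31: 'red foreground'}

parsed = []
for match in sgr.finditer(sometext):
    # Assumption: an empty parameter is treated as 0 (reset).
    codes = [int(p) if p else 0 for p in match.group(1).split(';')]
    parsed.append(codes)
    print(codes, [names.get(c, 'unknown') for c in codes])

# parsed is [[0], [1, 31], [0], [1, 31]]
```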

However, there is more to ANSI than just CSI SGR codes. With CSI alone you can also control the cursor, clear lines or the whole display, or scroll (provided the terminal supports this, of course). Beyond CSI, there are codes to select alternative character sets (SS2 and SS3), to send 'private messages' (PM; think passwords), to communicate with the terminal (DCS), the OS (OSC), or the application itself (APC, a way for applications to piggy-back custom control codes onto the communication stream), and further codes to help define strings (SOS, Start of String; ST, String Terminator) or to reset everything back to a base state (RIS). The above regexes match the C1 codes among these; RIS is sent as ESC c, which lies outside the C1 Fe range, so it is not matched.

Note, however, that the above regexes only remove the ANSI C1 codes themselves, not any additional data that those codes may be marking up (such as the strings sent between an OSC opener and the terminating ST code). Removing those would require additional work outside the scope of this answer.
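As a rough sketch of what that additional work might look like for OSC strings specifically, the payload can be removed before the remaining codes are stripped. This is not part of the original answer: the names `osc_string` and `strip_ansi` are hypothetical, and the pattern assumes that an OSC string starts with ESC ] and ends with either BEL (\x07, a common terminal convention) or ST (ESC \).

```python
import re

# Assumption: OSC strings start with ESC ] and end with BEL or ST (ESC \).
osc_string = re.compile(r'\x1B\].*?(?:\x07|\x1B\\)', re.DOTALL)
ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')

def strip_ansi(text):
    """Remove whole OSC strings first, then the remaining escape sequences."""
    return ansi_escape.sub('', osc_string.sub('', text))

# Hypothetical input: an OSC sequence setting a window title, then coloured text.
sample = '\x1b]0;window title\x07\x1b[01;31mexamplefile.zip\x1b[00m'

print(repr(ansi_escape.sub('', sample)))  # '0;window title\x07examplefile.zip'
print(repr(strip_ansi(sample)))           # 'examplefile.zip'
```

An unterminated OSC string is left for the second pattern, which then removes only the opener, so the payload would still remain in that case.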

Martijn Pieters answered Sep 22 '22 21:09


The accepted answer to this question only considers color and font effects. There are a lot of sequences that do not end in 'm', such as cursor positioning, erasing, and scroll regions.

The complete regexp for Control Sequences (aka ANSI Escape Sequences) is

/(\x9B|\x1B\[)[0-?]*[ -\/]*[@-~]/ 

Refer to ECMA-48, Section 5.4, and the Wikipedia article on ANSI escape codes.
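The pattern above is written as a JavaScript-style regex literal. A minimal Python rendering, applied to a made-up string containing sequences that do not end in 'm', might look like this (the name `csi` and the sample input are illustrative):

```python
import re

# Python form of the Control Sequence pattern: CSI opener, parameter bytes,
# intermediate bytes, final byte.
csi = re.compile(r'(?:\x9B|\x1B\[)[0-?]*[ -/]*[@-~]')

# Hypothetical input: erase display (2J), cursor to home (H), then text.
sample = '\x1b[2J\x1b[Hready'
print(csi.sub('', sample))  # ready
```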

Jeff answered Sep 23 '22 21:09