Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove escape sequence like '\xe2' or '\x0c' in python

I am working on a project (content based search), for that I am using 'pdftotext' command line utility in Ubuntu which writes all the text from pdf to some text file. But it also writes bullets, now when I'm reading the file to index each word, it also gets some escape sequence indexed(like '\x01').I know its because of bullets(•).

I want only text, so is there any way to remove this escape sequence.I have done something like this

escape_char = re.compile('\+x[0123456789abcdef]*')
re.sub(escape_char, " ", string)

But this do not remove escape sequence

Thanks in advance.

like image 554
vaibhav1312 Avatar asked Feb 18 '13 22:02

vaibhav1312


1 Answers

Your only real problem is that backslashes are tricky. In a string, a backslash might be treated specially; for example \t would turn into a tab. Since \+ isn't special in strings, the string was actually what you expected. So then the regular expression compiler looked at it, and \+ in a regular expression would just be a plain + character. Normally the + has a special meaning ("1 or more instances of the preceding pattern") and the backslash escapes it.

The solution is just to double the backslash, which makes a pattern that matches a single backslash.

I put the pattern into r'', to make it a "raw string" where Python leaves backslashes alone. If you don't do that, Python's string parser will turn the two backslashes into a single backslash; just as \t turns into a tab, \\ turns into a single backslash. So, use a raw string and put exactly what you want the regular expression compiler to see.

Also, a better pattern would be: backslash, then an x, then 1 or more instances of the character class matching a hex character. I rewrote the pattern to this.

import re

s = r'+\x01+'
escape_char = re.compile(r'\\x[0123456789abcdef]+')
s = re.sub(escape_char, " ", s)

Instead of using a raw string, you could use a normal string and just be very careful with backslashes. In this case we would have to put four backslashes! The string parser would turn each doubled backslash into a single backslash, and we want the regular expression compiler to see two backslashes. It's easier to just use the raw string!

Also, your original pattern would remove zero or more hex digits. My pattern removes one or more. But I think it is likely that there will always be exactly two hex digits, or perhaps with Unicode maybe there will be four. You should figure out how many there can be and put a pattern that ensures this. Here's a pattern that matches 2, 3, or 4 hex digits:

escape_char = re.compile(r'\\x[0123456789abcdef]{2,4}')

And here is one that matches exactly two or exactly four. We have to use a vertical bar to make two alternatives, and we need to make a group with parentheses. I'm using a non-matching group here, with (?:pattern) instead of just (pattern) (where pattern means a pattern, not literally the word pattern).

escape_char = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')

Here is example code. The bullet sequence is immediately followed by a 1 character, and this pattern leaves it alone.

import re

s = r'+\x011+'
pat = re.compile(r'\\x(?:[0123456789abcdef]{2,2}|[0123456789abcdef]{4,4})')
s = pat.sub("@", s)
print("Result: '%s'" % s)

This prints: Result: '+@1+'

NOTE: all of this is assuming that you actually are trying to match a backslash character followed by hex chars. If you are actually trying to match character byte values that might or might not be "printable" chars, then use the answer by @nneonneo instead of this one.

like image 52
steveha Avatar answered Oct 06 '22 00:10

steveha