
How to remove '\xe2' from a list

Tags:

python

regex

I am new to Python and am using it to work with nltk in my project. After word-tokenizing the raw data obtained from a webpage, I got a list containing '\xe2', '\xe3', '\x98', etc. I do not need these and want to delete them.

I simply tried

if '\x' in a

and

if a.startswith('\xe')

and both give me an error saying: invalid \x escape

But when I try a regular expression

re.search('^\\x',a)

I get

Traceback (most recent call last):
  File "<pyshell#83>", line 1, in <module>
    print re.search('^\\x',a)
  File "C:\Python26\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python26\lib\re.py", line 245, in _compile
    raise error, v # invalid expression
error: bogus escape: '\\x'

Even re.search('^\\x',a) does not identify it.

I am confused by this; even googling didn't help (I might be missing something). Please suggest a simple way to remove such strings from the list, and explain what was wrong with the above.

Thanks in advance!

silentNinJa asked Jul 25 '10 11:07




1 Answer
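
On the "what was wrong" part of the question, a short sketch may help. It assumes Python 3 and models the tokens as bytes, since the str values in the question's Python 2.6 session are byte strings; the token list is made up for illustration. In a string literal, \x must be followed by exactly two hex digits, so '\x' and '\xe' are rejected before the code ever runs, and the pattern '^\\x' reaches the regex engine as an incomplete escape for the same reason. More importantly, '\xe2' is a single byte with value 0xE2, not a backslash followed by "xe2", so there is no backslash in the token to search for.

```python
import re

# Hypothetical tokens; bytes stand in for Python 2 byte strings.
tokens = [b"snow", b"\xe2", b"\x98", b"\x83", b"man"]

# '\xe2' is one byte (value 0xE2) and contains no backslash.
assert len(b"\xe2") == 1
assert b"\\" not in b"\xe2"

# The regex engine also expects two hex digits after \x,
# so this pattern is rejected at compile time.
try:
    re.compile("^\\x")
    pattern_ok = True
except re.error:
    pattern_ok = False

# A byte-range character class matches any non-ASCII byte instead.
non_ascii = re.compile(rb"[\x80-\xff]")
clean = [tok for tok in tokens if not non_ascii.search(tok)]
print(pattern_ok, clean)  # False [b'snow', b'man']
```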

You can use unicode(a, 'ascii', 'ignore') to remove all non-ASCII characters from the string at once.
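
As a rough illustration of that suggestion: unicode(a, 'ascii', 'ignore') is a Python 2 call, so the sketch below assumes Python 3, where the closest equivalent is bytes.decode('ascii', 'ignore'). The sample tokens and the helper name strip_non_ascii are hypothetical. Tokens made up entirely of non-ASCII bytes decode to an empty string, so they are filtered out afterwards.

```python
def strip_non_ascii(token):
    """Decode a bytes token as ASCII, silently dropping bytes >= 0x80."""
    return token.decode("ascii", "ignore")

# Hypothetical tokens, as bytes to mirror Python 2 byte strings.
tokens = [b"caf\xc3\xa9", b"\xe2", b"\x98", b"tea"]

stripped = [strip_non_ascii(tok) for tok in tokens]
# Drop tokens that became empty after stripping.
cleaned = [tok for tok in stripped if tok]
print(cleaned)  # ['caf', 'tea']
```

Note that 'ignore' also drops accented letters (the trailing character of the first sample token), so decoding the page with its actual encoding before tokenizing may be preferable when those characters matter.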

cypheon answered Oct 02 '22 06:10