Python: Removing Non-Latin Characters

How can I delete all the non-Latin characters from a string? More specifically, is there a way to identify non-Latin characters using Unicode data?

sgp asked May 15 '14 14:05



3 Answers

If treating only ASCII as Latin is acceptable, you can use the following regex to remove all non-ASCII characters from the string:

import re
result = re.sub(r'[^\x00-\x7f]',r'', text)
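One caveat worth checking before adopting the ASCII-only pattern: accented Latin letters such as é lie outside U+0000-U+007F, so they are removed as well. The sketch below is only illustrative; the helper name `strip_non_ascii` and the sample string are assumptions, not part of the answer above.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range U+0000-U+007F."""
    return re.sub(r'[^\x00-\x7f]', '', text)


# Hypothetical sample: accented Latin letters mixed with Arabic text.
sample = 'caf\u00e9 \u0645\u0631\u062d\u0628\u0627 na\u00efve'

stripped = strip_non_ascii(sample)
print(repr(stripped))  # the accented letters vanish along with the Arabic

# Encoding to ASCII with errors='ignore' gives the same result without a regex.
via_encode = sample.encode('ascii', 'ignore').decode('ascii')
print(stripped == via_encode)
```

If accented letters need to survive, a pattern that covers the wider Latin blocks is the safer choice.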
mounirboulwafa answered Oct 12 '22 15:10


Using the third-party regex module, you could remove all non-Latin characters with

import regex
result = regex.sub(r'[^\p{Latin}]', '', text)

If you don't want to use the regex module, this page lists the Latin Unicode blocks:

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
\p{InLatin_Extended_Additional}: U+1E00–U+1EFF 

So you could use these to form a character class using Python's builtin re module:

import re
result = re.sub(r'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', '', text)

Demo:

In [24]: import re
In [25]: import regex

In [35]: text = u'aweerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440'

In [36]: print(text)
aweerwqمرحباмир

In [37]: regex.sub(r'[^\p{Latin}]', '', text)
Out[37]: 'aweerwq'

In [38]: re.sub(r'[^\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]', '', text)
Out[38]: 'aweerwq'
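The question also asks whether Unicode data itself can identify non-Latin characters. A possible standard-library sketch, not taken from this answer, is to look at each character's Unicode name via `unicodedata.name` and treat a letter as Latin when that name contains the word "LATIN". This is a heuristic under that assumption, and the function names `is_latin_letter` and `keep_latin` are hypothetical.

```python
import unicodedata


def is_latin_letter(ch: str) -> bool:
    """Heuristic: names of Latin-script letters contain the word 'LATIN'."""
    return 'LATIN' in unicodedata.name(ch, '')


def keep_latin(text: str, keep_non_letters: bool = True) -> str:
    """Remove letters that are not Latin; optionally keep digits, spaces, punctuation."""
    kept = []
    for ch in text:
        if ch.isalpha():
            # Letters survive only when their Unicode name marks them as Latin.
            if is_latin_letter(ch):
                kept.append(ch)
        elif keep_non_letters:
            # Digits, whitespace, and punctuation are script-neutral.
            kept.append(ch)
    return ''.join(kept)


# Hypothetical sample: Latin (with an accent), digits, Arabic, Cyrillic, punctuation.
sample = 'caf\u00e9 42 \u0645\u0631\u062d\u0628\u0627 \u043c\u0438\u0440!'
print(repr(keep_latin(sample)))
print(repr(keep_latin(sample, keep_non_letters=False)))
```

Unlike a negated `[^\p{Latin}]` class, which also strips spaces and digits, this variant lets the caller decide what happens to characters that are not letters.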
unutbu answered Oct 12 '22 15:10


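I had a similar problem (Python 3). You could try something like this, which keeps only the characters that encode to a single byte in UTF-8 (that is, ASCII characters):

text = 'aweerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440'
# A character is ASCII exactly when its UTF-8 encoding is one byte long.
stripped_text = ''.join(c for c in text if len(c.encode('utf-8')) == 1)
print(stripped_text)
aweerwq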
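The one-byte test works because UTF-8 uses a single byte only for U+0000-U+007F and two to four bytes for everything else. The short sketch below illustrates this with a few sample characters chosen here as assumptions; it also compares the test against `str.isascii()`, which is available from Python 3.7.

```python
# Sample characters: ASCII, accented Latin, Arabic, the euro sign, an emoji.
samples = ['a', '\u00e9', '\u0645', '\u20ac', '\U0001F600']

# Map each character to the length of its UTF-8 encoding in bytes.
utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f'U+{ord(ch):04X} -> {n} byte(s), isascii={ch.isascii()}')

# The length check and str.isascii() agree for every sample.
agree = all((n == 1) == ch.isascii() for ch, n in utf8_lengths.items())
print(agree)
```

Note that accented Latin letters such as é take two bytes, so this approach removes them too.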
SteverT answered Oct 12 '22 15:10