Replace non-ASCII characters with a single space

People also ask

Is space a non ASCII character?

No, space is an ASCII character. ASCII, pronounced "ask-ee", stands for the American Standard Code for Information Interchange. It was originally based on the English alphabet and consists of 128 characters: the letters A-Z and a-z, the digits 0-9, punctuation, the space character, and control codes.

Is there an ASCII code for space?

The ASCII code for a blank space is decimal 32, or binary 0010 0000.
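
A quick way to check this yourself, as a minimal sketch using only built-ins (the variable names are just for illustration):

```python
# Space is code point 32, which lies inside the ASCII range 0-127.
space = ' '
code = ord(space)

print(code)                 # 32
print(format(code, '08b'))  # 00100000
print(code < 128)           # True
```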

How do I ignore non ASCII characters in Python?

In Python, to remove non-ASCII characters, call encode() on the string with the encoding set to 'ascii' and the error handler set to 'ignore', then call decode() on the result to get back a string without the non-ASCII characters.
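
A minimal sketch of that encode/decode round trip, assuming Python 3 (the helper name strip_non_ascii is made up for this example). Note that it removes the characters outright instead of replacing them with a space:

```python
def strip_non_ascii(text):
    """Drop every character that has no ASCII encoding."""
    # encode() skips unencodable characters; decode() turns bytes back into str.
    return text.encode('ascii', errors='ignore').decode('ascii')


# 'Ce\u00f1\u00eda' is 'Ceñía' written with escapes so the source stays ASCII.
print(strip_non_ascii('Ce\u00f1\u00eda'))  # Cea
```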

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

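Note the + there: it matches a run of one or more non-ASCII characters, so the whole run is replaced by a single space.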
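
To see the difference between the two approaches side by side, here is a small sketch (the function names replace_each and replace_runs are illustrative, not from the question):

```python
import re


def replace_each(text):
    """One space per non-ASCII character."""
    return ''.join(c if ord(c) < 128 else ' ' for c in text)


def replace_runs(text):
    """One space per run of consecutive non-ASCII characters."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)


sample = 'ABC\u9a6c\u514bdef'  # 'ABC' + two CJK characters + 'def'
print(repr(replace_each(sample)))  # 'ABC  def'
print(repr(replace_runs(sample)))  # 'ABC def'
```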

To get the closest ASCII representation of your original string, I recommend the unidecode module:

# Python 2.x (byte string input):
from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding="utf-8"))

# Python 3.x: str is already Unicode, so call unidecode(text) directly.

Then you can call it on a string:

remove_non_ascii("Ceñía")
Cenia
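
If installing a third-party module is not an option, a rough standard-library approximation is to decompose the text and drop what does not encode. This is a different technique from unidecode, not a replacement for it: it only strips accents, and characters with no decomposition (CJK, for example) simply disappear. The name ascii_fold is made up for this sketch:

```python
import unicodedata


def ascii_fold(text):
    """Strip accents by decomposing, then dropping the non-ASCII remainder."""
    # NFKD splits 'ñ' into 'n' followed by a combining tilde.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode('ascii')


print(ascii_fold('Ce\u00f1\u00eda'))  # Cenia
```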

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
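
One way to reduce this problem, sketched here under the assumption that the accented letters have a precomposed form, is to normalize to NFC before substituting. Sequences with no precomposed equivalent would still leave the base letter behind. The function name is illustrative:

```python
import re
import unicodedata


def replace_non_ascii(text):
    """Compose first so base letter + combining mark counts as one character."""
    composed = unicodedata.normalize('NFC', text)
    return re.sub(r'[^\x00-\x7f]', ' ', composed)


decomposed = 'man\u0303ana'  # 'n' followed by a combining tilde
print(repr(replace_non_ascii(decomposed)))  # 'ma ana'
```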

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode(). It was roughly 70 times faster than the per-character join in this benchmark:

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results (seconds for 1000 runs; the join first, then encode/decode):

0.7208260721400134
0.009975979187503592

What about this one?

def replace_trash(unicode_string):
    chars = []
    for ch in unicode_string:
        try:
            ch.encode("ascii")
            chars.append(ch)
        except UnicodeEncodeError:
            # ch is non-ASCII: replace it with a single space
            chars.append(" ")
    return "".join(chars)

As a native and efficient approach, you don't need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ascii characters:

new_string = old_string.encode('ascii',errors='ignore')

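Now if you want to make up for the deleted characters, you can pad the result with one space per removed character. Note that new_string is a bytes object, and that the spaces are appended at the end, not inserted where the characters were removed: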

final_string = new_string + b' ' * (len(old_string) - len(new_string))
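
If the spaces should land where the removed characters were, rather than at the end of the string, one option that still avoids a Python-level loop over the characters is a custom codec error handler. This is only a sketch: the handler name 'space_replace' and the function names are invented for this example, while codecs.register_error is the standard-library hook being used.

```python
import codecs


def _space_for_each(error):
    """Error handler: one space for every character that failed to encode."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # error.start and error.end delimit the unencodable slice of the input.
    return (' ' * (error.end - error.start), error.end)


# 'space_replace' is a hypothetical handler name registered for this sketch.
codecs.register_error('space_replace', _space_for_each)


def replace_non_ascii(text):
    return text.encode('ascii', errors='space_replace').decode('ascii')


print(repr(replace_non_ascii('ABC\u9a6c\u514bdef')))  # 'ABC  def'
```

The final decode('ascii') turns the bytes back into a str, which the padding approach above leaves out.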

Tags: python, encoding, unicode, ascii
