How can I remove non-ASCII characters but leave periods and spaces?

You can filter out every character that is not printable by testing membership in string.printable, like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator (a filter object) rather than a string. The correct way to get a string back is:

''.join(filter(lambda x: x in printable, s))
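As a minimal Python 3 sketch of the same idea, the filter can be wrapped in a small helper. The names `keep_printable` and `PRINTABLE` are illustrative and not part of the original answer; the sketch assumes "printable" means exactly the characters in string.printable, so periods and spaces are kept.

```python
import string

# Assumption: "printable" means the characters in string.printable,
# i.e. ASCII digits, letters, punctuation and whitespace.
PRINTABLE = frozenset(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

sample = "some\x00string. with\x15 funny characters, caf\xe9"
cleaned = keep_printable(sample)
print(cleaned)  # somestring. with funny characters, caf
```

Note that this also drops ASCII control characters such as \x00, which a pure "non-ASCII" filter would keep.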

An easy way to convert text to a different codec is to use encode() or decode(). In your case, you want to convert to ASCII and ignore all characters that ASCII does not support. For example, the Swedish letter å is not an ASCII character:

    >>> s = u'Good bye in Swedish is Hej d\xe5'
    >>> s = s.encode('ascii', errors='ignore')
    >>> print s  # Python 2 print statement
    Good bye in Swedish is Hej d

Edit:

Python3: str -> bytes -> str

>>> "Hej då".encode("ascii", errors="ignore").decode()
'Hej d'

Python2: unicode -> str -> unicode

>>> u"hej då".encode("ascii", errors="ignore").decode()
u'hej d'

Python2: str -> unicode -> str (decode and encode in reverse order)

>>> "hej d\xe5".decode("ascii", errors="ignore").encode()
'hej d'
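The `errors` argument controls what happens to characters that ASCII cannot represent. The Python 3 sketch below contrasts "ignore" with "replace", another standard error handler; the variable names are illustrative.

```python
text = "Hej då. Hej!"

# errors="ignore" silently drops characters that ASCII cannot represent.
ignored = text.encode("ascii", errors="ignore").decode("ascii")

# errors="replace" substitutes "?" for each unencodable character instead.
replaced = text.encode("ascii", errors="replace").decode("ascii")

print(ignored)   # Hej d. Hej!
print(replaced)  # Hej d?. Hej!
```

In both cases periods and spaces survive, because they are ordinary ASCII characters.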

According to @artfulrobot, this should be faster than filter and lambda:

import re
re.sub(r'[^\x00-\x7f]', r'', your_non_ascii_string)

See more examples in the related question "Replace non-ASCII characters with a single space".
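The speed claim is easy to check on your own data. The following is a rough timing sketch using timeit; the sample string and the function names `with_filter` and `with_regex` are hypothetical, and results will vary by machine and input. The two approaches are not strictly equivalent: the regex keeps ASCII control characters, while the string.printable filter removes them.

```python
import re
import string
import timeit

PRINTABLE = set(string.printable)
NON_ASCII = re.compile(r'[^\x00-\x7f]')

# Hypothetical sample input; substitute data representative of your own.
sample = "Good bye in Swedish is Hej d\xe5. " * 1000

def with_filter(text):
    """Keep only characters found in string.printable."""
    return ''.join(filter(lambda ch: ch in PRINTABLE, text))

def with_regex(text):
    """Delete every character outside the ASCII range."""
    return NON_ASCII.sub('', text)

# Both approaches agree on this sample (it has no control characters).
assert with_filter(sample) == with_regex(sample)

for func in (with_filter, with_regex):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```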

You may use the following code to remove non-ASCII characters (here, Chinese characters):

import re
text = "123456790 ABC#%? .(朱惠英)"
result = re.sub(r'[^\x00-\x7f]', r'', text)
print(result)

This will return

123456790 ABC#%? .()
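If the same substitution is applied repeatedly, the pattern can be compiled once and wrapped in a helper. This is only a sketch: the name `strip_non_ascii` and its `replacement` parameter are illustrative, and the pattern matches runs of non-ASCII characters so that a whole run can be replaced by a single space if desired.

```python
import re

# Matches one or more consecutive characters outside the ASCII range.
NON_ASCII_RUN = re.compile(r'[^\x00-\x7f]+')

def strip_non_ascii(text, replacement=''):
    """Replace each run of non-ASCII characters with `replacement`."""
    return NON_ASCII_RUN.sub(replacement, text)

sample = "123456790 ABC#%? .(朱惠英)"
print(strip_non_ascii(sample))       # 123456790 ABC#%? .()
print(strip_non_ascii(sample, ' '))  # 123456790 ABC#%? .( )
```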

Your question is ambiguous; the first two sentences taken together imply that you believe that space and "period" are non-ASCII characters. This is incorrect. All chars such that ord(char) <= 127 are ASCII characters. For example, your function excludes these characters !"#$%&'()*+,-./ but includes several others e.g. []{}.

Please step back, think a bit, and edit your question to tell us what you are trying to do, without mentioning the word ASCII, and why you think that chars such that ord(char) >= 128 are ignorable. Also: which version of Python? What is the encoding of your input data?

Please note that your code reads the whole input file as a single string, and your comment ("great solution") to another answer implies that you don't care about newlines in your data. If your file contains two lines like this:

this is line 1
this is line 2

the result would be 'this is line 1this is line 2' ... is that what you really want?

A better solution would include:

a better name for the filter function than onlyascii

recognition that a filter function merely needs to return a truthy value if the argument is to be retained:

def filter_func(char):
    return char == '\n' or 32 <= ord(char) <= 126
# and later:
filtered_data = ''.join(filter(filter_func, data)).lower()
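A self-contained Python 3 sketch of this approach follows. The name `is_wanted` and the sample `data` string are illustrative; the predicate keeps newlines and the printable ASCII range from space (32) through tilde (126), so line breaks survive the filtering.

```python
def is_wanted(char):
    """Keep newlines and printable ASCII (space through tilde)."""
    return char == '\n' or 32 <= ord(char) <= 126

# Hypothetical input: two lines with a non-ASCII letter and a NUL byte.
data = "This is line 1\nThis is line 2: caf\xe9\x00\n"

# filter() yields characters in Python 3, so join them before lowercasing.
filtered_data = ''.join(filter(is_wanted, data)).lower()
print(repr(filtered_data))  # 'this is line 1\nthis is line 2: caf\n'
```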

Working my way through Fluent Python (Ramalho) - highly recommended. List comprehension one-ish-liners inspired by Chapter 2:

onlyascii = ''.join([s for s in data if ord(s) < 128])
onlymatch = ''.join([s for s in data if s in
              'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'])
