Remove non-ASCII characters from a string using python / django

I have a string of HTML stored in a database. Unfortunately, it contains characters such as ®. I want to replace these characters with their HTML equivalents, either in the DB itself or with a find-and-replace in my Python / Django code.

Any suggestions on how I can do this?

asked Apr 30 '10 07:04

Gaurav Sharma

3 Answers

ASCII characters are the first 128 code points (0 to 127), so you can get the code point of each character with ord and strip the character if it is out of range:

# -*- coding: utf-8 -*-

def strip_non_ascii(string):
    ''' Returns the string without non-ASCII characters'''
    # ASCII covers code points 0 through 127 inclusive
    stripped = (c for c in string if ord(c) < 128)
    return ''.join(stripped)


test = u'éáé123456tgreáé@€'
print(test)
print(strip_non_ascii(test))

Result

éáé123456tgreáé@€
123456tgre@

Please note that @ is kept because, after all, it is an ASCII character. If you want to keep only a particular subset (such as digits and uppercase and lowercase letters), you can narrow the range by consulting an ASCII table.
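As a rough sketch of the subset idea, the check against a numeric range can be swapped for a membership test against an explicit set of allowed characters. The names ALLOWED and keep_alphanumeric are made up for this illustration, and "allowed" is assumed here to mean ASCII letters and digits only.

```python
import string

# Assumption for this sketch: only ASCII letters and digits are allowed.
ALLOWED = frozenset(string.ascii_letters + string.digits)


def keep_alphanumeric(text):
    """Return text with every character outside ALLOWED removed."""
    return ''.join(c for c in text if c in ALLOWED)


print(keep_alphanumeric(u'éáé123456tgreáé@€'))  # prints 123456tgre
```

Unlike the range check, this version also drops @, spaces, and punctuation, so adjust ALLOWED to whatever subset you actually need.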

EDITED: After reading your question again, maybe you need to escape your HTML code so that all those characters appear correctly once rendered. You can use the escape filter in your templates.

answered Oct 21 '22 18:10

Khelben

There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481

To remove non-ASCII characters from a string, s, use:

s = s.encode('ascii',errors='ignore')

Then convert it from bytes back to a string using:

s = s.decode()

This was tested with Python 3.6.
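Put together, the two steps fit in a small helper. The sketch below also shows a variation that is not part of this answer: the standard xmlcharrefreplace error handler, which replaces each non-ASCII character with a numeric character reference instead of dropping it, and so may be closer to what the question asks for. The function names ascii_only and to_char_refs are made up for the example.

```python
def ascii_only(s):
    """Drop every character that cannot be encoded as ASCII."""
    return s.encode('ascii', errors='ignore').decode('ascii')


def to_char_refs(s):
    """Replace each non-ASCII character with a numeric reference like &#174;."""
    return s.encode('ascii', errors='xmlcharrefreplace').decode('ascii')


sample = 'Acme® café'
print(ascii_only(sample))    # Acme caf
print(to_char_refs(sample))  # Acme&#174; caf&#233;
```

Both helpers leave ASCII markup such as < and & untouched, since those characters already encode cleanly.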

answered Oct 21 '22 19:10

somedude

I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.

def unicode_escape(unistr):
    """
    Replaces characters that have a named HTML entity with
    numeric character references

    Takes a unicode string as an argument

    Returns a unicode string
    """
    try:
        from html.entities import codepoint2name  # Python 3
    except ImportError:
        from htmlentitydefs import codepoint2name  # Python 2
    escaped = ""

    for char in unistr:
        if ord(char) in codepoint2name:
            # Numeric reference, e.g. &#174; for ® (the trailing ; is required)
            escaped += "&#" + str(ord(char)) + ";"
        else:
            escaped += char

    return escaped

Use it like this

>>> from zack.utilities import unicode_escape
>>> print(unicode_escape(u'such as ® I want'))
such as &#174; I want
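One thing to watch for: the entity table also contains &, <, > and ", so running this function over a string that already holds HTML markup escapes the tags too. A possible variation, sketched below for Python 3 under the assumption that only non-ASCII characters should be touched, emits named entities where the table has one and numeric references otherwise. The name escape_non_ascii is made up for this example.

```python
from html.entities import codepoint2name  # Python 2 called this htmlentitydefs


def escape_non_ascii(text):
    """Replace non-ASCII characters with HTML entities, leaving ASCII as-is."""
    parts = []
    for char in text:
        code = ord(char)
        if code < 128:
            # Plain ASCII, including markup such as < and &, passes through
            parts.append(char)
        elif code in codepoint2name:
            # Named entity, e.g. &reg; for ®
            parts.append('&' + codepoint2name[code] + ';')
        else:
            # No named entity in the table: fall back to a numeric reference
            parts.append('&#%d;' % code)
    return ''.join(parts)


print(escape_non_ascii('<p>Acme® costs 5€</p>'))
# <p>Acme&reg; costs 5&euro;</p>
```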

answered Oct 21 '22 19:10

Zack

Tags:

python

regex

replace

unicode

django