How to "normalize" a Python 3 Unicode string

Tags:

python-3.x

python-unicode

utf-8

I need to compare two strings. aa is extracted from a PDF file (using pdfminer/chardet) and bb is keyboard input. How can I normalize the first string so that the two compare equal?

>>> aa = "ā"
>>> bb = "ā"
>>> aa == bb
False
>>> 
>>> aa.encode('utf-8')
b'\xc4\x81'
>>> bb.encode('utf-8')
b'a\xcc\x84'


asked Nov 03 '17 10:11

rudensm

1 Answer

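The two strings are different Unicode representations of the same character: aa is the composed form (the single code point U+0101) and bb is the decomposed form ('a' followed by the combining macron U+0304). Normalize both to the same form with unicodedata.normalize before comparing: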

>>> aa = b'\xc4\x81'.decode('utf8')   # composed form
>>> bb = b'a\xcc\x84'.decode('utf8')  # decomposed form
>>> aa
'ā'
>>> bb
'ā'
>>> aa == bb
False
>>> import unicodedata as ud
>>> aa == ud.normalize('NFC', bb)  # compare composed
True
>>> ud.normalize('NFD', aa) == bb  # compare decomposed
True
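
If you do this comparison in more than one place, it may be easier to wrap it in a small helper that normalizes both sides, so it does not matter which form either string arrives in. This is only a sketch: same_text, pdf_text and typed_text are made-up names for illustration, and NFC as the default form is an assumption (NFD works just as well, as long as both sides use the same form).

```python
import unicodedata


def same_text(left: str, right: str, form: str = "NFC") -> bool:
    """Return True if both strings are equal after Unicode normalization.

    Hypothetical helper: both arguments are normalized to the same form,
    so neither needs to be composed or decomposed in advance.
    """
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# Stand-ins for the strings in the question (names are illustrative only).
pdf_text = "\u0101"      # 'ā' as one composed code point, like aa
typed_text = "a\u0304"   # 'a' + COMBINING MACRON, like bb

print(pdf_text == typed_text)           # False: the raw code points differ
print(same_text(pdf_text, typed_text))  # True: equal once both are NFC
```

Normalizing both arguments rather than only the PDF string means the check still holds if the keyboard input happens to arrive already composed.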

answered Sep 21 '22 16:09

Mark Tolonen
