
Which form of Unicode normalization is appropriate for text mining?

I've been reading a lot on the subject of Unicode, but I remain very confused about normalization and its different forms. In short, I am working on a project that involves extracting text from PDF files and performing some semantic text analysis.

I've managed to satisfactorily extract the text using a simple Python script, but now I need to make sure that all equivalent orthographic strings have one (and only one) representation. For example, the 'fi' typographic ligature should be decomposed into 'f' and 'i'.

I see that Python's unicodedata.normalize function offers several algorithms for normalizing Unicode code points. Could someone please explain the difference between the following (a quick demonstration of each follows the list):

  • NFC
  • NFKC
  • NFD
  • NFKD
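
For reference, here is a small script showing what each form does to a sample string; the sample (containing the 'fi' ligature and a precomposed 'é') is my own illustration:

    import unicodedata

    s = "ﬁancé"  # 'fi' ligature (U+FB01) followed by a precomposed 'é' (U+00E9)
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, s)
        # Print the code points so the differences between forms are visible.
        print(form, [hex(ord(c)) for c in normalized])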

I read the relevant Wikipedia article, but it was far too opaque for my feeble brain to understand. Could someone kindly explain this to me in plain English?

Also, could you please make a recommendation for the normalization method best adapted to a natural language processing project?

asked Jun 27 '12 by Louis Thibault

People also ask

What is normalization in text mining?

Normalization is the process of converting a token into its base form. In the normalization process, inflectional and derivational affixes are stripped so that the base form can be obtained. For example, the base form of 'antinationalist' is 'national'.
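
A minimal sketch of this kind of token normalization, assuming NLTK is installed (the word list is illustrative):

    # Reduce tokens to a base form with NLTK's Porter stemmer.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ("running", "flies", "studies"):
        print(word, "->", stemmer.stem(word))
    # e.g. "running" -> "run", "flies" -> "fli", "studies" -> "studi"

Note that stemming is a crude, rule-based reduction; a lemmatizer would return dictionary forms instead.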

What does Unicode normalize do?

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
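
To see this in Python (the two spellings of 'é' below are my own example):

    import unicodedata

    composed = "\u00e9"     # 'é' as a single precomposed code point
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)  # False: the raw code-point sequences differ
    print(unicodedata.normalize("NFC", composed)
          == unicodedata.normalize("NFC", decomposed))  # True after normalization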

What is text normalization? What is the first step of text normalization?

Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it.

What is the need of text normalization in NLP?

Why do we need text normalization? When we normalize text, we attempt to reduce its randomness, bringing it closer to a predefined “standard”. This helps us to reduce the amount of different information that the computer has to deal with, and therefore improves efficiency.
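
A minimal sketch of such a pre-processing step (normalize_text is a hypothetical helper, not a standard API):

    import unicodedata

    def normalize_text(text):
        # NFKC folds compatibility variants (ligatures, width variants), and
        # casefold() applies aggressive, locale-independent lowercasing.
        return unicodedata.normalize("NFKC", text).casefold()

    print(normalize_text("Eﬃcient STRASSE"))  # -> "efficient strasse"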


1 Answer

Characters like é can be written either as a single code point or as a sequence of two: a regular e plus a combining accent (a diacritic). Normalization chooses consistently among such alternatives, and orders multiple diacritics in a consistent way.

Since you need to deal with ligatures, you should use "compatibility (de)composition", NFKD or NFKC, which normalize ligatures away. It's probably OK to use either composed or decomposed forms, but if you also want to do lossy matching (e.g., match é even if the user types a plain e), you could use the compatibility decomposition NFKD and discard the diacritics for loose matching.
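
Here's a minimal sketch of that loose-matching approach (the function name loose_normalize is illustrative):

    import unicodedata

    def loose_normalize(s):
        # Compatibility decomposition: splits ligatures ('ﬁ' -> 'f' + 'i') and
        # separates base characters from combining marks ('é' -> 'e' + U+0301).
        decomposed = unicodedata.normalize("NFKD", s)
        # Drop the combining marks for lossy/loose matching.
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(loose_normalize("ﬁancé"))  # -> "fiance"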

answered Sep 19 '22 by alexis