
Which form of Unicode normalization is appropriate for text mining?

I've been reading a lot on the subject of Unicode, but I remain very confused about normalization and its different forms. In short, I am working on a project that involves extracting text from PDF files and performing some semantic text analysis.

I've managed to satisfactorily extract the text using a simple Python script, but now I need to make sure that all equivalent orthographic strings have one (and only one) representation. For example, the 'fi' typographic ligature should be decomposed into 'f' and 'i'.

I see that Python's unicodedata.normalize function offers several algorithms for normalizing Unicode code points. Could someone please explain the difference between the following (a quick demonstration of each follows the list):

  • NFC
  • NFKC
  • NFD
  • NFKD
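
For reference, here is a small script showing what each form does to a sample string; the sample (containing the 'fi' ligature and a precomposed 'é') is my own illustration:

    import unicodedata

    s = "ﬁancé"  # 'fi' ligature (U+FB01) followed by a precomposed 'é' (U+00E9)
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, s)
        # Print the code points so the differences between forms are visible.
        print(form, [hex(ord(c)) for c in normalized])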

I read the relevant Wikipedia article, but it was far too opaque for my feeble brain to understand. Could someone kindly explain this to me in plain English?

Also, could you please make a recommendation for the normalization method best adapted to a natural language processing project?

asked Jun 27 '12 by Louis Thibault

People also ask

What is normalization in text mining?

Normalization is the process of converting a token into its base form. In the normalization process, inflectional and derivational affixes are stripped so that the base form can be obtained. For example, the base form of 'antinationalist' is 'national'.
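
A minimal sketch of this kind of token normalization, assuming NLTK is installed (the word list is illustrative):

    # Reduce tokens to a base form with NLTK's Porter stemmer.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ("running", "flies", "studies"):
        print(word, "->", stemmer.stem(word))
    # e.g. "running" -> "run", "flies" -> "fli", "studies" -> "studi"

Note that stemming is a crude, rule-based reduction; a lemmatizer would return dictionary forms instead.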

What does Unicode normalize do?

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
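
To see this in Python (the two spellings of 'é' below are my own example):

    import unicodedata

    composed = "\u00e9"     # 'é' as a single precomposed code point
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)  # False: the raw code-point sequences differ
    print(unicodedata.normalize("NFC", composed)
          == unicodedata.normalize("NFC", decomposed))  # True after normalization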

What is text normalization? What is the first step of text normalization?

Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it.

What is the need of text normalization in NLP?

Why do we need text normalization? When we normalize text, we attempt to reduce its randomness, bringing it closer to a predefined “standard”. This helps us to reduce the amount of different information that the computer has to deal with, and therefore improves efficiency.
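
A minimal sketch of such a pre-processing step (normalize_text is a hypothetical helper, not a standard API):

    import unicodedata

    def normalize_text(text):
        # NFKC folds compatibility variants (ligatures, width variants), and
        # casefold() applies aggressive, locale-independent lowercasing.
        return unicodedata.normalize("NFKC", text).casefold()

    print(normalize_text("Eﬃcient STRASSE"))  # -> "efficient strasse"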


1 Answer

Characters like é can be written either as a single code point or as a sequence of two: a regular e plus a combining accent (a diacritic). Normalization chooses consistently among such alternatives, and orders multiple diacritics in a consistent way.

Since you need to deal with ligatures, you should use "compatibility (de)composition", NFKD or NFKC, which normalize ligatures away. It's probably OK to use either composed or decomposed forms, but if you also want to do lossy matching (e.g., match é even if the user types a plain e), you could use the compatibility decomposition NFKD and discard the diacritics for loose matching.
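
Here's a minimal sketch of that loose-matching approach (the function name loose_normalize is illustrative):

    import unicodedata

    def loose_normalize(s):
        # Compatibility decomposition: splits ligatures ('ﬁ' -> 'f' + 'i') and
        # separates base characters from combining marks ('é' -> 'e' + U+0301).
        decomposed = unicodedata.normalize("NFKD", s)
        # Drop the combining marks for lossy/loose matching.
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(loose_normalize("ﬁancé"))  # -> "fiance"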

answered Sep 19 '22 by alexis