Detecting Unicode text ligatures in Clojure/Java

Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari त्र is a ligature which consists of the code points त + ् + र.

When viewed in a simple text editor like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is shown as a proper ligature.

So my question is: how can I detect such ligatures programmatically while reading the file in my code? Since Firefox does it, there must be a way to do it programmatically. Are there any Unicode properties which contain this information, or do I need a map of all such ligatures?

The SVG/CSS property text-rendering, when set to optimizeLegibility, does the same thing (it combines code points into a proper ligature).

PS: I am using Java.

EDIT

The purpose of my code is to count the characters in Unicode text, treating a ligature as a single character. So I need a way to collapse multiple code points into a single ligature.

Abhinav Sarkar asked Aug 12 '10 10:08


1 Answer

The Computer Typesetting Wikipedia page says:

The Computer Modern Roman typeface provided with TeX includes the five common ligatures ff, fi, fl, ffi, and ffl. When TeX finds these combinations in a text it substitutes the appropriate ligature, unless overridden by the typesetter.

This indicates that it is the program displaying the text that does the substitution. Moreover,

Unicode maintains that ligaturing is a presentation issue rather than a character definition issue, and that, for example, "if a modern font is asked to display 'h' followed by 'r', and the font has an 'hr' ligature in it, it can display the ligature."

As far as I can see (I got interested in this topic and have only just read a few articles), the instructions for ligature substitution are embedded in the font. Digging further, I found these for you in the OpenType file format specification: GSUB - The Glyph Substitution Table, and the Ligature Substitution Subtable.

Next, you need to find a library that lets you peek inside OpenType font files, i.e. a file parser for quick access. Reading the following two discussions may give you some direction on how to do these substitutions:

  1. Chromium bug http://code.google.com/p/chromium/issues/detail?id=22240
  2. Firefox bug https://bugs.launchpad.net/firefox/+bug/37828
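
If all you need is the count described in your edit, a rough approximation that does not look at any font may be enough. The sketch below is only that: it assumes Devanagari text, treats the virama (U+094D) as the thing that joins two consonants, and attaches other combining marks to the preceding base character. The names ConjunctCounter, countClusters and DEVANAGARI_VIRAMA are made up for this example, and the result will not necessarily match what a particular font actually renders as a ligature.

```java
final class ConjunctCounter {
    // Assumption: only the Devanagari virama is treated as a joiner.
    private static final int DEVANAGARI_VIRAMA = 0x094D;

    /** Counts user-visible "characters", merging consonant + virama + consonant. */
    static int countClusters(String text) {
        int count = 0;
        boolean joinNext = false; // true right after a virama
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;

            if (cp == DEVANAGARI_VIRAMA) {
                // The virama belongs to the current cluster and may pull in the next letter.
                joinNext = true;
            } else if (isMark) {
                // Vowel signs and other marks stay attached to the preceding base.
                joinNext = false;
            } else {
                // A letter directly after a virama continues the cluster; anything else starts a new one.
                boolean continuesCluster = joinNext && count > 0 && Character.isLetter(cp);
                if (!continuesCluster) {
                    count++;
                }
                joinNext = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // त + ् + र, the example from the question
        System.out.println(countClusters("\u0924\u094D\u0930")); // prints 1
    }
}
```

This ignores ZWJ/ZWNJ, which can force or suppress a conjunct, and viramas of other scripts, so treat it as a starting point rather than a complete solution. For font-accurate results you would still have to go through the GSUB data as described above.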
ankitjaininfo answered Sep 22 '22 18:09