
Generate character images with a font whose name cannot be correctly decoded

I am creating images of Chinese seal script. I have three TrueType fonts for this task (Jin_Wen_Da_Zhuan_Ti.7z, Zhong_Guo_Long_Jin_Shi_Zhuan.7z, Zhong_Yan_Yuan_Jin_Wen.7z, for testing purposes only). Below is how the Chinese character "我" (I/me) appears in each font in Microsoft Word:

[image: appearance in Word]

Here is my Python script:

import numpy as np
from PIL import Image, ImageFont, ImageDraw, ImageChops
import itertools
import os


def grey2binary(grey, white_value=1):
    grey[np.where(grey <= 127)] = 0
    grey[np.where(grey > 127)] = white_value
    return grey


def create_testing_images(characters,
                          font_path,
                          save_to_folder,
                          sub_folder=None,
                          image_size=64):
    font_size = image_size * 2
    if sub_folder is None:
        sub_folder = os.path.split(font_path)[-1]
        sub_folder = os.path.splitext(sub_folder)[0]
    sub_folder_full = os.path.join(save_to_folder, sub_folder)
    if not os.path.exists(sub_folder_full):
        os.mkdir(sub_folder_full)
    font = ImageFont.truetype(font_path,font_size)
    bg = Image.new('L',(font_size,font_size),'white')

    for char in characters:
        img = Image.new('L',(font_size,font_size),'white')
        draw = ImageDraw.Draw(img)
        draw.text((0,0), text=char, font=font)
        diff = ImageChops.difference(img, bg)
        bbox = diff.getbbox()
        if bbox:
            img = img.crop(bbox)
            img = img.resize((image_size, image_size), resample=Image.BILINEAR)

            img_array = np.array(img)
            img_array = grey2binary(img_array, white_value=255)

            edge_top = img_array[0, range(image_size)]
            edge_left = img_array[range(image_size), 0]
            edge_bottom = img_array[image_size - 1, range(image_size)]
            edge_right = img_array[range(image_size), image_size - 1]

            criterion = sum(itertools.chain(edge_top, edge_left, 
                                           edge_bottom, edge_right))

            # Keep the image only if more than half of the border pixels are white
            if criterion > 255 * image_size * 2:
                img = Image.fromarray(np.uint8(img_array))
                img.save(os.path.join(sub_folder_full, char) + '.gif')

where the core snippet is

        font = ImageFont.truetype(font_path,font_size)
        img = Image.new('L',(font_size,font_size),'white')
        draw = ImageDraw.Draw(img)
        draw.text((0,0), text=char, font=font)

For example, if you put those fonts in the folder ./fonts, and call it with

create_testing_images(['我'], 'fonts/金文大篆体.ttf', save_to_folder='test')

the script will create ./test/金文大篆体/我.gif in your file system.

Now the problem is that although the script works well with the first font, 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z), it does not work with the other two fonts, even though Microsoft Word renders them correctly. For 中國龍金石篆.ttf (in Zhong_Guo_Long_Jin_Shi_Zhuan.7z), it draws nothing, so bbox is None. For 中研院金文.ttf (in Zhong_Yan_Yuan_Jin_Wen.7z), it draws a black frame with no character in the picture

[image]

and thus fails the criterion test, which is meant to reject an all-black output. I used FontForge to check the properties of the fonts, and found that the first font, 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z), uses UnicodeBmp

[image: UnicodeBmp]

while the other two use Big5hkscs

[image: Big5hkscs, 中國龍金石篆 and 中研院金文]

which is not the encoding scheme of my system. This may be the reason that the font names are unrecognizable in my system:

[image: font viewer]

I also tried to solve this by selecting the font through its garbled name. I tried pycairo after installing those fonts:

import cairo

# adapted from
# http://heuristically.wordpress.com/2011/01/31/pycairo-hello-world/

# setup a place to draw
surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 100, 100)
ctx = cairo.Context (surface)

# paint background
ctx.set_source_rgb(1, 1, 1)
ctx.rectangle(0, 0, 100, 100)
ctx.fill()

# draw text
ctx.select_font_face('金文大篆体')
ctx.set_font_size(80)
ctx.move_to(12,80)
ctx.set_source_rgb(0, 0, 0)
ctx.show_text('我')

# finish up
ctx.stroke() # commit to surface
surface.write_to_png('我.png')  # write_to_png always produces PNG data

This works well again with 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z):

[image]

but still not with others. For example: neither ctx.select_font_face('中國龍金石篆') (which reports _cairo_win32_scaled_font_ucs4_to_index:GetGlyphIndicesW) nor ctx.select_font_face('¤¤°êÀsª÷¥Û½f') (which draws with the default font) works. (The latter name is the messy code displayed in the font viewer as shown above, obtained by a line of Mathematica code ToCharacterCode["中國龍金石篆", "CP950"] // FromCharacterCode where CP950 is the code page of Big5.)
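
The garbled names can also be reproduced in Python instead of Mathematica. This is only a minimal sketch for Python 3. It assumes the font viewer shows the font's bytes misread as Latin-1, and the helper names garble and ungarble are made up for illustration:

```python
def garble(name, actual_encoding, assumed_encoding='latin-1'):
    """Encode `name` with the font's real encoding, then misread the bytes."""
    return name.encode(actual_encoding).decode(assumed_encoding)


def ungarble(garbled, actual_encoding, assumed_encoding='latin-1'):
    """Reverse garble(): recover the bytes, then decode them correctly."""
    return garbled.encode(assumed_encoding).decode(actual_encoding)


font_name = '中國龍金石篆'
print(garble(font_name, 'cp950'))   # the name as Big5 bytes read as Latin-1
print(garble(font_name, 'cp936'))   # the name as GBK bytes read as Latin-1
```

Since Latin-1 maps every byte to exactly one character, the round trip through ungarble() loses nothing. That makes it usable for recovering a readable name from a garbled one when the original encoding is known.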

So I think I've tried my best to tackle this issue, but I still cannot solve it. I've also come up with other ways, like renaming the font with FontForge or changing the system encoding to Big5, but I would still prefer a solution that involves Python only and thus requires fewer additional actions from the user. Any hints will be greatly appreciated. Thank you.

To the moderators of stackoverflow: this problem may seem "too localized" at first glance, but it could happen in other languages / other encodings / other fonts, and the solution can be generalized to other cases, so please don't close it for this reason. Thank you.

UPDATE: Weirdly, Mathematica can recognize the font name in CP936 (GBK, which can be thought of as my system encoding). Take 中國龍金石篆.ttf (in Zhong_Guo_Long_Jin_Shi_Zhuan.7z) as an example:

[image: Mathematica]

But setting ctx.select_font_face('ÖÐøý½ðʯ*­') does not work either, which will create the character image with the default font.

ziyuang asked Jun 02 '13 19:06


2 Answers

Silvia's comment on the OP...

You might want to consider specifying the encoding parameter like ImageFont.truetype(font_path,font_size,encoding="big5")

...gets you halfway there, but it looks like you also have to manually translate the Unicode characters if you're not using a Unicode font.

For the fonts which use "big5hkscs" encoding, I had to do this...

>>> u = u'\u6211'      # Unicode for 我
>>> u.encode('big5hkscs')
'\xa7\xda'

...then use u'\ua7da' to get the right glyph, which is a bit weird, but it looks to be the only way to pass a multi-byte character to PIL.

The following code works for me on both Python 2.7.4 and Python 3.3.1, with PIL 1.1.7...

from PIL import Image, ImageDraw, ImageFont


# Declare font files and encodings
FONT1 = ('Jin_Wen_Da_Zhuan_Ti.ttf',          'unicode')
FONT2 = ('Zhong_Guo_Long_Jin_Shi_Zhuan.ttf', 'big5hkscs')
FONT3 = ('Zhong_Yan_Yuan_Jin_Wen.ttf',       'big5hkscs')


# Declare a mapping from encodings used by str.encode() to encodings used by
# the FreeType library
ENCODING_MAP = {'unicode':   'unic',
                'big5':      'big5',
                'big5hkscs': 'big5',
                'shift-jis': 'sjis'}


# The glyphs we want to draw
GLYPHS = ((FONT1, u'\u6211'),
          (FONT2, u'\u6211'),
          (FONT3, u'\u6211'),
          (FONT3, u'\u66ce'),
          (FONT2, u'\u4e36'))


# Returns PIL Image object
def draw_glyph(font_file, font_encoding, unicode_char, glyph_size=128):

    # Translate unicode string if necessary
    if font_encoding != 'unicode':
        mb_string = unicode_char.encode(font_encoding)
        try:
            # Try using Python 2.x's unichr
            unicode_char = unichr(ord(mb_string[0]) << 8 | ord(mb_string[1]))
        except NameError:
            # Use Python 3.x-compatible code
            unicode_char = chr(mb_string[0] << 8 | mb_string[1])

    # Load font using mapped encoding
    font = ImageFont.truetype(font_file, glyph_size, encoding=ENCODING_MAP[font_encoding])

    # Now draw the glyph
    img = Image.new('L', (glyph_size, glyph_size), 'white')
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), text=unicode_char, font=font)
    return img


# Save an image for each glyph we want to draw
for (font_file, font_encoding), unicode_char in GLYPHS:
    img = draw_glyph(font_file, font_encoding, unicode_char)
    filename = '%s-%s.png' % (font_file, hex(ord(unicode_char)))
    img.save(filename)

Note that I renamed the font files to the same names as the 7zip files. I try to avoid using non-ASCII characters in code examples, since they sometimes get screwed up when copy/pasting.

This example should work fine for the encodings declared in ENCODING_MAP, which can be extended if needed (see the FreeType encoding strings for valid FreeType encodings), but you'll need to change some of the code in cases where Python's str.encode() doesn't produce a multi-byte string of length 2.
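
One possible way to lift the length-2 restriction is to pack however many bytes str.encode() returns into a single code point. This is only a sketch for Python 3. It assumes the font's character map expects the encoded bytes packed big-endian, as in the Big5 case above, and the helper names to_font_charcode and to_font_text are hypothetical:

```python
def to_font_charcode(char, font_encoding):
    """Return a one-character string whose code point is the packed
    byte sequence of `char` in `font_encoding` (big-endian)."""
    if font_encoding == 'unicode':
        return char                      # Unicode fonts need no translation
    encoded = char.encode(font_encoding)
    # One byte gives 0x41, two bytes give 0xa7da, and so on.
    # chr() raises ValueError above 0x10FFFF, so some sequences of three
    # or more bytes cannot be represented this way.
    return chr(int.from_bytes(encoded, 'big'))


def to_font_text(text, font_encoding):
    """Translate every character of `text` for drawing with a non-Unicode font."""
    return ''.join(to_font_charcode(c, font_encoding) for c in text)


print(hex(ord(to_font_charcode('\u6211', 'big5hkscs'))))
```

The result of to_font_text() would then be passed to draw.text() in place of the original string.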


Update

If the problem is in the ttf file, how could you find the answer in the PIL and FreeType source code? Above, you seem to be saying PIL is to blame, but why should one have to pass unicode_char.encode(...).decode(...) when you just want unicode_char?

As I understand it, the TrueType font format was developed before Unicode became widely adopted, so if you wanted to create a Chinese font back then, you'd have had to use one of the encodings in use at the time, and Big5 had been the dominant encoding for Traditional Chinese (in Taiwan and Hong Kong) since the mid-1980s.

It stands to reason, then, that there had to be a way to retrieve glyphs from a Big5-encoded TTF using the Big5 character encodings.

The C code for rendering a string with PIL starts with the font_render() function, and ultimately calls FT_Get_Char_Index() to locate the correct glyph, given the character code as an unsigned long.

However, PIL's font_getchar() function, which produces that unsigned long, only accepts Python string and unicode types, and it doesn't seem to translate character encodings itself. So it seemed that the only way to get the correct value for the Big5 character set was to coerce a Python unicode character into the correct unsigned long value, exploiting the fact that u'\ua7da' is stored internally as the integer 0xa7da, in either 16 or 32 bits depending on how Python was compiled.

TBH, there was a fair amount of guesswork involved, since I didn't bother to investigate what exactly the effect of ImageFont.truetype()'s encoding parameter is, but by the looks of it, it's not supposed to do any translation of character encodings, but rather to allow a single TTF file to support multiple character encodings of the same glyphs, using the FT_Select_Charmap() function to switch between them.

So, as I understand it, the FreeType library's interaction with the TTF files works something like this...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

class TTF(object):
    glyphs = {}
    encoding_maps = {}

    def __init__(self, encoding='unic'):
        self.set_encoding(encoding)

    def set_encoding(self, encoding):
        self.current_encoding = encoding

    def get_glyph(self, charcode):
        try:
            return self.glyphs[self.encoding_maps[self.current_encoding][charcode]]
        except KeyError:
            return ' '


class MyTTF(TTF):
    glyphs = {1: '我',
              2: '曎'}
    encoding_maps = {'unic': {0x6211: 1, 0x66ce: 2},
                     'big5': {0xa7da: 1, 0x93be: 2}}


font = MyTTF()
# print() with a single argument behaves the same on Python 2 and Python 3
print('Get via Unicode map: %s' % font.get_glyph(0x6211))
font.set_encoding('big5')
print('Get via Big5 map: %s' % font.get_glyph(0xa7da))

...but it's up to each TTF to provide the encoding_maps variable, and there's no requirement for a TTF to provide one for Unicode. Indeed, it's unlikely that a font created prior to the adoption of Unicode would have.
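
To check which character maps a given font actually provides, something like the following might help. It is a sketch only. It assumes the third-party fontTools package is installed, and the names describe_cmap, has_unicode_cmap and list_cmaps are made up here. The (platform ID, encoding ID) pairs come from the TrueType 'cmap' table specification:

```python
# (platformID, encodingID) pairs defined for the TrueType 'cmap' table
CMAP_NAMES = {
    (0, 3): 'Unicode (BMP)',
    (1, 0): 'Macintosh Roman',
    (3, 0): 'Windows Symbol',
    (3, 1): 'Windows Unicode BMP',
    (3, 2): 'Windows Shift-JIS',
    (3, 3): 'Windows PRC (GB)',
    (3, 4): 'Windows Big5',
    (3, 5): 'Windows Wansung',
    (3, 6): 'Windows Johab',
    (3, 10): 'Windows Unicode full repertoire',
}


def describe_cmap(platform_id, encoding_id):
    """Human-readable name for one cmap subtable."""
    return CMAP_NAMES.get((platform_id, encoding_id),
                          'unknown (%d, %d)' % (platform_id, encoding_id))


def has_unicode_cmap(pairs):
    """True if any (platformID, encodingID) pair denotes a Unicode map."""
    return any(p == 0 or (p == 3 and e in (1, 10)) for p, e in pairs)


def list_cmaps(font_path):
    """Read the cmap subtables of a font file (requires fontTools)."""
    from fontTools.ttLib import TTFont   # imported lazily: third-party
    font = TTFont(font_path)
    return [(t.platformID, t.platEncID) for t in font['cmap'].tables]
```

If has_unicode_cmap(list_cmaps(path)) is False for a font, that would be consistent with the behaviour described above, where the glyphs can only be reached through the Big5 map.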

Assuming all that is correct, then there's nothing wrong with the TTF - the problem is just with PIL making it a little awkward to access glyphs for fonts which don't have a Unicode mapping, and for which the required glyph's unsigned long character code is greater than 255.

Aya answered Oct 18 '22 20:10


The problem is that the fonts do not strictly conform to the TrueType specification. A quick solution is to use FontForge (you are using it already) and let it sanitize the fonts.

  1. Open a font file
  2. Go to Encoding, then select Reencode
  3. Choose ISO 10646-1 (Unicode BMP)
  4. Go to File then Generate Fonts
  5. Save as TTF
  6. Run your script with the newly generated fonts
  7. Voila! It prints 我 in a beautiful font!
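
The steps above could probably be scripted with FontForge's own Python module, which would keep the workflow in Python. This is an untested sketch. It assumes the fontforge module is importable (it ships with FontForge, not with pip) and that 'UnicodeBmp' is the encoding name corresponding to "ISO 10646-1 (Unicode BMP)". The function names and the '-unicode' suffix are hypothetical:

```python
import os


def reencoded_path(font_path, suffix='-unicode'):
    """Build an output file name next to the original font."""
    root, ext = os.path.splitext(font_path)
    return root + suffix + ext


def reencode_to_unicode(font_path):
    """Open a font, reencode it to Unicode BMP and save it as a new TTF."""
    import fontforge                     # only available with FontForge installed
    font = fontforge.open(font_path)     # step 1: open the font file
    font.encoding = 'UnicodeBmp'         # steps 2-3: reencode
    out_path = reencoded_path(font_path)
    font.generate(out_path)              # steps 4-5: generate and save as TTF
    font.close()
    return out_path
```

The returned path can then be passed to ImageFont.truetype() in place of the original font file.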
Kenji Noguchi answered Oct 18 '22 21:10