
Python: extract Cyrillic string from EXIF

I am a complete beginner in Python, and would like to start learning it by doing. Namely, I'd love to correct some EXIF information in a huge bunch of family photos I have. To start with, I want to just get this information out of JPEG files properly.

Some of them have a title written in EXIF. It can be obtained e.g. by

import pyexiv2
# filename is the path to a JPEG file
metadata = pyexiv2.ImageMetadata(filename)
metadata.read()  # load the EXIF data from the file
title = metadata['Exif.Image.XPTitle']

This is as far as I've got. Now comes the problem: some of the titles contain Cyrillic letters. If I do print title.human_value, I get for example

`Милой Мамуле от Майи, 11 ÑÐ½Ð²Ð°Ñ€Ñ 1944.`

while with print title, it is

<Exif.Image.XPTitle [Byte] = 28 4 56 4 59 4 62 4 57 4 32 0 28 4 48 4 60 4 67 4 59 4 53 4 32 0 62 4 66 4 32 0 28 4 48 4 57 4 56 4 44 0 32 0 49 0 49 0 32 0 79 4 61 4 50 4 48 4 64 4 79 4 32 0 49 0 57 0 52 0 52 0 46 0 0 0>

The actual string I'd love to see is

Милой Мамуле от Майи, 11 января 1944.

It seems to be a Unicode problem, but after trying a dozen different methods found here and elsewhere, I just cannot cope with it. Is it possible to see Russian letters in the console at all? I am using Python(x,y) on Windows 7 (English), so my IDE is Spyder 2. It is just the default installation, to which I added pyexiv2. TIA!

texnic asked Jul 19 '12 18:07


2 Answers

The bytes are UTF-16, little-endian (28 4 is 0x041C, the letter М), with a trailing null terminator.

In Python 3:

>>> b = [28, 4, 56, 4, 59, 4, 62, 4, 57, 4, 32, 0, 28, 4, 48, 4, 60, 4, 67, 4, 59, 4, 53, 4, 32, 0, 62, 4, 66, 4, 32, 0, 28, 4, 48, 4, 57, 4, 56, 4, 44, 0, 32, 0, 49, 0, 49, 0, 32, 0, 79, 4, 61, 4, 50, 4, 48, 4, 64, 4, 79, 4, 32, 0, 49, 0, 57, 0, 52, 0, 52, 0, 46, 0, 0, 0]
>>> bytes(b).decode("utf-16")
'Милой Мамуле от Майи, 11 января 1944.\x00'
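
A slightly more defensive variant of the Python 3 snippet above, as a sketch only. It names the byte order explicitly (plain "utf-16" without a byte-order mark falls back to the platform's native order) and strips the trailing null terminator. The helper name decode_xp_field is made up for illustration, and the space-separated input is simply the form shown by print title in the question, not a claim about what pyexiv2 returns.

```python
def decode_xp_field(raw_text):
    """Decode space-separated byte values as UTF-16 little-endian.

    The input format is an assumption: it mirrors the printed tag
    in the question ("28 4 56 4 ...").
    """
    data = bytes(int(n) for n in raw_text.split())
    # The data in the question ends with a null terminator (0 0);
    # drop it after decoding.
    return data.decode("utf-16-le").rstrip("\x00")


# First word and the year from the title in the question.
sample = "28 4 56 4 59 4 62 4 57 4 32 0 49 0 57 0 52 0 52 0 46 0 0 0"
print(decode_xp_field(sample))
```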

In Python 2:

>>> b = [28, 4, 56, 4, 59, 4, 62, 4, 57, 4, 32, 0, 28, 4, 48, 4, 60, 4, 67, 4, 59, 4, 53, 4, 32, 0, 62, 4, 66, 4, 32, 0, 28, 4, 48, 4, 57, 4, 56, 4, 44, 0, 32, 0, 49, 0, 49, 0, 32, 0, 79, 4, 61, 4, 50, 4, 48, 4, 64, 4, 79, 4, 32, 0, 49, 0, 57, 0, 52, 0, 52, 0, 46, 0, 0, 0]
>>> "".join(chr(c) for c in b).decode("utf-16")
u'\u041c\u0438\u043b\u043e\u0439 \u041c\u0430\u043c\u0443\u043b\u0435 \u043e\u0442 \u041c\u0430\u0439\u0438, 11 \u044f\u043d\u0432\u0430\u0440\u044f 1944.\x00'

MRAB answered Sep 22 '22 03:09


I think the title.human_value data is in UTF-8, having already been converted from the raw UTF-16 bytes of title.

In the Python 2 shell, running in a terminal window on OS X:

>>> # this should be the same as your title.human_value:
>>> print ''.join( chr(x) for x in [208, 156, 208, 184, 208, 
              187, 208, 190, 208, 185, 32, 208, 156, 208, 
              176, 208, 188, 209, 131, 208, 187, 208, 181, 
              32, 208, 190, 209, 130, 32, 208, 156, 208, 
              176, 208, 185, 208, 184, 44, 32, 49, 49, 32, 
              209, 143, 208, 189, 208, 178, 208, 176, 209, 
              128, 209, 143, 32, 49, 57, 52, 52, 46])

Милой Мамуле от Майи, 11 января 1944.
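
To illustrate why correct UTF-8 bytes can still show up as garbage, here is a small Python 3 sketch. It assumes the console interpreted the bytes in a single-byte Western code page. Latin-1 is used only because it maps every byte value; the code page actually used by the asker's console is not known.

```python
text = "Милой Мамуле от Майи, 11 января 1944."

utf8_bytes = text.encode("utf-8")       # the byte values listed above
garbled = utf8_bytes.decode("latin-1")  # a wrong single-byte decoding

# Latin-1 maps every byte to exactly one character,
# so the wrong decoding can be undone without loss.
repaired = garbled.encode("latin-1").decode("utf-8")

print(list(utf8_bytes[:4]))  # UTF-8 byte values of the first two letters
print(repaired == text)
```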

Your console may not support Cyrillic characters. You might try setting the font in the Command Prompt to "Lucida Console" -- a more modern vector font is more likely to support it correctly than the historical bitmapped fonts that cmd defaults to.
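
Besides the font, the console's encoding has to be able to represent the characters. A rough Python 3 check, as a sketch: can_encode is a hypothetical helper, and it says nothing about whether the chosen font actually has the glyphs.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded in the given encoding.

    Returns False when a character is missing from the encoding
    or when the encoding name is unknown.
    """
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        return False
    return True


# sys.stdout.encoding can be None when output is redirected.
console_encoding = sys.stdout.encoding or "ascii"
print(console_encoding, can_encode("января", console_encoding))
```

For example, the Western European code page cp1252 cannot encode Cyrillic letters, while cp866 and UTF-8 can.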

Russell Borogove answered Sep 25 '22 03:09