Python: extract Cyrillic string from EXIF

Question

I am a complete beginner in Python, and would like to start learning it by doing. Namely, I'd love to correct some EXIF information in a huge bunch of family photos I have. To start with, I want to just get this information out of JPEG files properly.

Some of them have a title written in EXIF. It can be obtained e.g. by

import pyexiv2
metadata = pyexiv2.ImageMetadata(filename)
metadata.read()
title = metadata['Exif.Image.XPTitle']

This is as far as I've got. Now comes the problem: some of the titles contain Cyrillic letters. If I do print title.human_value, I get for example

`ÐœÐ¸Ð»Ð¾Ð¹ ÐœÐ°Ð¼ÑƒÐ»Ðµ Ð¾Ñ‚ ÐœÐ°Ð¹Ð¸, 11 ÑÐ½Ð²Ð°Ñ€Ñ 1944.`

while with print title, it is

<Exif.Image.XPTitle [Byte] = 28 4 56 4 59 4 62 4 57 4 32 0 28 4 48 4 60 4 67 4 59 4 53 4 32 0 62 4 66 4 32 0 28 4 48 4 57 4 56 4 44 0 32 0 49 0 49 0 32 0 79 4 61 4 50 4 48 4 64 4 79 4 32 0 49 0 57 0 52 0 52 0 46 0 0 0>

The actual string I'd love to see is

Милой Мамуле от Майи, 11 января 1944.

It seems to be a Unicode problem, but after trying a dozen different methods found here and elsewhere, I still cannot solve it. Is it possible to see Russian letters in the console at all? I am using Python(x,y) on Windows 7 (English), so my IDE is Spyder 2. It is just the default installation, to which I added pyexiv2. TIA!

MRAB · Accepted Answer

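The bytes are UTF-16, low byte first (little-endian): the pair 28 4 is U+041C, the letter М. The final 0 0 is a null terminator.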

In Python 3:

>>> b = [28, 4, 56, 4, 59, 4, 62, 4, 57, 4, 32, 0, 28, 4, 48, 4, 60, 4, 67, 4, 59, 4, 53, 4, 32, 0, 62, 4, 66, 4, 32, 0, 28, 4, 48, 4, 57, 4, 56, 4, 44, 0, 32, 0, 49, 0, 49, 0, 32, 0, 79, 4, 61, 4, 50, 4, 48, 4, 64, 4, 79, 4, 32, 0, 49, 0, 57, 0, 52, 0, 52, 0, 46, 0, 0, 0]
>>> bytes(b).decode("utf-16")
'Милой Мамуле от Майи, 11 января 1944.\x00'

In Python 2:

>>> b = [28, 4, 56, 4, 59, 4, 62, 4, 57, 4, 32, 0, 28, 4, 48, 4, 60, 4, 67, 4, 59, 4, 53, 4, 32, 0, 62, 4, 66, 4, 32, 0, 28, 4, 48, 4, 57, 4, 56, 4, 44, 0, 32, 0, 49, 0, 49, 0, 32, 0, 79, 4, 61, 4, 50, 4, 48, 4, 64, 4, 79, 4, 32, 0, 49, 0, 57, 0, 52, 0, 52, 0, 46, 0, 0, 0]
>>> "".join(chr(c) for c in b).decode("utf-16")
u'\u041c\u0438\u043b\u043e\u0439 \u041c\u0430\u043c\u0443\u043b\u0435 \u043e\u0442 \u041c\u0430\u0439\u0438, 11 \u044f\u043d\u0432\u0430\u0440\u044f 1944.\x00'
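
Both snippets above leave a trailing \x00 in the result, and "utf-16" without a byte-order mark falls back to the machine's native byte order. A minimal Python 3 sketch that names the byte order explicitly and strips the terminator might look like the following. The helper name decode_xp_field is hypothetical, and the byte values are copied from the printed tag in the question rather than read through pyexiv2; it also assumes the XP* fields are always little-endian, which matches the bytes shown here.

```python
def decode_xp_field(raw):
    """Decode an XP* EXIF field given as an iterable of byte values (0-255).

    Hypothetical helper: assumes little-endian UTF-16 with an optional
    trailing null terminator, as in the bytes shown in the question.
    """
    text = bytes(raw).decode("utf-16-le")
    # Drop the null terminator(s) at the end, if any.
    return text.rstrip("\x00")


# Byte values copied from the output of "print title" in the question.
printed = (
    "28 4 56 4 59 4 62 4 57 4 32 0 28 4 48 4 60 4 67 4 59 4 53 4 32 0 "
    "62 4 66 4 32 0 28 4 48 4 57 4 56 4 44 0 32 0 49 0 49 0 32 0 "
    "79 4 61 4 50 4 48 4 64 4 79 4 32 0 49 0 57 0 52 0 52 0 46 0 0 0"
)
raw = [int(n) for n in printed.split()]

title_text = decode_xp_field(raw)
print(title_text)
```

Whether the text then displays correctly still depends on the console's encoding and font.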

Russell Borogove · Answer

I think the title.human_value data is in UTF-8, having already been decoded from the raw UTF-16 bytes of title.

In the Python 2 shell, running in a terminal window on OS X:

>>> # this should be the same as your title.human_value:
>>> print ''.join( chr(x) for x in [208, 156, 208, 184, 208, 
              187, 208, 190, 208, 185, 32, 208, 156, 208, 
              176, 208, 188, 209, 131, 208, 187, 208, 181, 
              32, 208, 190, 209, 130, 32, 208, 156, 208, 
              176, 208, 185, 208, 184, 44, 32, 49, 49, 32, 
              209, 143, 208, 189, 208, 178, 208, 176, 209, 
              128, 209, 143, 32, 49, 57, 52, 52, 46])

Милой Мамуле от Майи, 11 января 1944.

Your console may not support Cyrillic characters. You might try setting the font in the Command Prompt to "Lucida Console" -- a more modern vector font is more likely to support it correctly than the historical bitmapped fonts that cmd defaults to.
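
To illustrate why UTF-8 data can show up as strings like "Ð¸Ð»Ð¾Ð¹", here is a small Python 3 sketch using the same byte values as above. It assumes the garbling comes from UTF-8 bytes being read with a single-byte Western codec. Latin-1 is used here only because it maps every byte value to a character; the console in the question more likely uses a Windows code page such as cp1252, where a few of these bytes have no printable character.

```python
# UTF-8 bytes of the title, as listed in the example above.
utf8_bytes = bytes([
    208, 156, 208, 184, 208, 187, 208, 190, 208, 185, 32, 208, 156, 208,
    176, 208, 188, 209, 131, 208, 187, 208, 181, 32, 208, 190, 209, 130,
    32, 208, 156, 208, 176, 208, 185, 208, 184, 44, 32, 49, 49, 32,
    209, 143, 208, 189, 208, 178, 208, 176, 209, 128, 209, 143, 32,
    49, 57, 52, 52, 46,
])

# Reading the bytes with the right codec gives the intended text.
correct = utf8_bytes.decode("utf-8")

# Reading them one byte per character gives the garbled form:
# each Cyrillic letter turns into two Latin-range characters.
garbled = utf8_bytes.decode("latin-1")

# Undoing the wrong step recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(correct)
print(len(correct), len(garbled))
```

The round trip only works when the wrong decoding lost no bytes, which is why Latin-1 is the safe choice for this demonstration.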
