Python's glob module and Unix's find command don't recognize non-ASCII characters

Question

I am on Mac OS X 10.8.2.

When I try to find files whose names contain non-ASCII characters, I get no results, although I know for sure that they exist. Take for example the console input

> find */Bärlauch*

I get no results. But if I replace the umlaut with a wildcard, I get

> find */B*rlauch*
images/Bärlauch1.JPG

So the file definitely exists. If I rename the file, replacing 'ä' with 'ae', it is found.

Similarly, the Python module glob is not able to find the file:

>>> glob.glob('*/B*rlauch*')
['images/Bärlauch1.JPG']
>>> glob.glob('*/Bärlauch*')
[]

I figured it must have something to do with the encoding, but my terminal is set to UTF-8 and I am using Python 3.3.0, which uses Unicode strings.

Martijn Pieters · Accepted Answer

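On HFS+, Mac OS X always stores filenames in denormalized (decomposed) form, so 'ä' is stored as 'a' followed by a combining diaeresis, whereas the 'ä' you type is usually a single precomposed character. The two look identical but do not compare equal, which is why the pattern matches nothing. Use unicodedata.normalize('NFD', pattern) to denormalize the glob pattern.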

import glob
import unicodedata

# Decompose the pattern so it matches the form HFS+ stores on disk.
glob.glob(unicodedata.normalize('NFD', '*/Bärlauch*'))
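
To see why the literal pattern fails, it can help to compare the two forms directly. The following is only a sketch: the helper name glob_nfd is hypothetical (it is not part of the glob module), and it assumes the filesystem stores names in decomposed form, as HFS+ does.

```python
import glob
import unicodedata

composed = 'B\u00e4rlauch'  # 'ä' as one code point (NFC), as typically typed
decomposed = unicodedata.normalize('NFD', composed)  # 'a' + U+0308 COMBINING DIAERESIS

# The two strings render identically but are not equal.
print(composed == decomposed)                 # False
print(len(composed), len(decomposed))         # 8 9
print([hex(ord(c)) for c in decomposed[:3]])  # ['0x42', '0x61', '0x308']


def glob_nfd(pattern):
    """Hypothetical helper: convert the pattern to NFD, then glob."""
    return glob.glob(unicodedata.normalize('NFD', pattern))
```

On a filesystem that stores names in another form (or as raw bytes), the NFD pattern may be the one that fails, so this is not a portable fix.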

Katriel · Answer

Python programs are fundamentally text files. Conventionally, people write them using only characters from the ASCII character set, and thus do not have to think about the encoding they write them in: all character sets agree on how ASCII characters should be decoded.

You have written a Python program using a non-ASCII character. Your program thus comes with an implicit encoding (which you haven't mentioned): to save such a file, you have to decide how you are going to represent a-umlaut on disk. I would guess that perhaps your editor has chosen something non-Unicode for you.

Anyway, there are two ways around such a problem: either you can restrict yourself to using only ASCII characters in the source code of your program, or you can declare to Python that you want it to read the text file with a specific encoding.

To do the former, you should replace the a-umlaut with its Unicode escape sequence ('\u00e4', or equivalently '\xe4'; 228 is the same code point in decimal). To do the latter, you should add a coding declaration at the top of the file:

# -*- coding: <your encoding> -*-
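
As a sketch of the first option, the pattern can be spelled with ASCII-only escapes, which sidesteps the question of how the source file is encoded. Note that on HFS+ the decomposed spelling is the one that matches the stored filenames, as the accepted answer explains; the variable names here are purely illustrative.

```python
import unicodedata

# Precomposed 'ä' is U+00E4 (decimal 228); both escapes below denote it.
assert '\xe4' == '\u00e4' == chr(228)

# ASCII-only spellings of the same pattern, composed and decomposed.
pattern_nfc = '*/B\u00e4rlauch*'
pattern_nfd = '*/Ba\u0308rlauch*'  # 'a' + U+0308 COMBINING DIAERESIS

print(unicodedata.normalize('NFD', pattern_nfc) == pattern_nfd)  # True
```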

Tags: python, unix, encoding

LifeIsHealthy

