
Python's glob module and the Unix find command don't recognize non-ASCII characters

I am on Mac OS X 10.8.2.

When I try to find files whose names contain non-ASCII characters, I get no results, although I know for sure that the files exist. Take for example the console input

> find */Bärlauch*

I get no results. But if I try without the umlaut I get

> find */B*rlauch*
images/Bärlauch1.JPG

So the file definitely exists. If I rename the file, replacing 'ä' with 'ae', it is found.

Similarly, the Python module glob is not able to find the file:

>>> glob.glob('*/B*rlauch*')
['images/Bärlauch1.JPG']
>>> glob.glob('*/Bärlauch*')
[]

I figured it must have something to do with the encoding, but my terminal is set to UTF-8 and I am using Python 3.3.0, which uses Unicode strings.

LifeIsHealthy asked Jan 06 '13 18:01



2 Answers

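Mac OS X always stores filenames on HFS+ in decomposed (denormalized) Unicode form, so 'ä' is stored as 'a' followed by the combining diaeresis U+0308. A pattern typed on the keyboard usually contains the precomposed character U+00E4 instead, so the two never compare equal. Use unicodedata.normalize('NFD', pattern) to denormalize the glob pattern.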

import glob
import unicodedata

# Decompose the pattern so it matches the form HFS+ stores on disk
glob.glob(unicodedata.normalize('NFD', '*/Bärlauch*'))
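To make the difference between the two forms visible, here is a minimal sketch. The helper name glob_any_normalization is hypothetical (it is not part of glob or unicodedata); it simply tries the pattern in both NFC and NFD form, on the assumption that you do not know which form the filesystem uses.

```python
import glob
import unicodedata


def glob_any_normalization(pattern):
    """Hypothetical helper: glob with the pattern in NFC and in NFD form."""
    matches = []
    for form in ("NFC", "NFD"):
        for path in glob.glob(unicodedata.normalize(form, pattern)):
            if path not in matches:  # avoid duplicates if both forms match
                matches.append(path)
    return matches


# The same visible word in composed and decomposed form
composed = unicodedata.normalize("NFC", "B\u00e4rlauch")    # 'ä' is the single code point U+00E4
decomposed = unicodedata.normalize("NFD", "B\u00e4rlauch")  # 'ä' is 'a' followed by U+0308

print(len(composed), len(decomposed))          # 8 9
print(composed == decomposed)                  # False
print([hex(ord(c)) for c in decomposed[:3]])   # ['0x42', '0x61', '0x308']
```

Calling glob_any_normalization('*/Bärlauch*') should then find the file regardless of which form the name was stored in.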
Martijn Pieters answered Sep 18 '22 04:09



Python programs are fundamentally text files. Conventionally, people write them using only characters from the ASCII character set, and thus do not have to think about the encoding they write them in: the common encodings (UTF-8, Latin-1 and so on) all agree on how ASCII characters should be decoded.

You have written a Python program using a non-ASCII character. Your program thus comes with an implicit encoding (which you haven't mentioned): to save such a file, you have to decide how you are going to represent a-umlaut on disk. Python 3 reads source files as UTF-8 unless told otherwise, so I would guess that perhaps your editor has chosen something non-Unicode for you.

Anyway, there are two ways around such a problem: either you can restrict yourself to using only ASCII characters in the source code of your program, or you can declare to Python that you want it to read the text file with a specific encoding.

To do the former, you should replace the a-umlaut with its Unicode escape sequence, which is \u00e4 (code point 228 in decimal; \xe4 also works in a string literal). To do the latter, you should add a coding declaration at the top of the file:

# -*- coding: <your encoding> -*-
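As a rough illustration of the escape-sequence route, the sketch below writes the a-umlaut with ASCII-only source text and shows how the same character ends up as different bytes under two encodings. The variable names are made up for the example.

```python
# 'ä' written with ASCII-only source text (code point 228 decimal, 0xE4 hex)
a_umlaut = "\u00e4"

print(a_umlaut == "\xe4")          # True: both escapes name the same code point
print(ord(a_umlaut))               # 228
print(a_umlaut.encode("utf-8"))    # b'\xc3\xa4'
print(a_umlaut.encode("latin-1"))  # b'\xe4'

# A glob pattern that no longer depends on how the source file is encoded
pattern = "*/B\u00e4rlauch*"
print(ascii(pattern))              # '*/B\xe4rlauch*'
```

Note that the escape only removes the source-encoding question: the resulting string still holds the precomposed character, so it does not change the Unicode normalization form of the pattern.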
Katriel answered Sep 20 '22 04:09
