I am storing data about files that exist on an OSX HFS+ filesystem. I later want to iterate over the stored data and figure out if each file still exists. For my purposes, I care about filename case sensitivity, so if the case of a filename has changed I would consider the file to no longer exist.
I started out by trying
os.path.isfile(filename)
but on a normal install of OSX on HFS+, this returns True even if the filename case does not match. I am looking for a way to write an isfile() function that cares about case even when the filesystem does not.
os.path.normcase() and os.path.realpath() both return the filename in whatever case I pass into them.
Edit:
I now have two functions that seem to work on filenames limited to ASCII. I don't know how unicode or other characters might affect this.
The first is based off answers given here by omz and Alex L.
import os

def does_file_exist_case_sensitive1a(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
    # os.listdir() returns names in the case stored on disk
    for name in os.listdir(search_path):
        if name == filename: return True
    return False
The second is probably even less efficient.
import glob, os, re

def does_file_exist_case_sensitive2(fname):
    if not os.path.isfile(fname): return False
    # find the last letter in the name
    m = re.search(r'[a-zA-Z][^a-zA-Z]*\Z', fname)
    if m:
        # swap exactly that letter for a wildcard so glob has to read the directory
        test = fname[:m.start()] + '?' + fname[m.start() + 1:]
        actual = glob.glob(test)
        return len(actual) == 1 and actual[0] == fname
    else:
        return True # no letters in file, case sensitivity doesn't matter
Here is a third based off DSM's answer.
def does_file_exist_case_sensitive3(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
    # map inode number -> name as stored on disk; stat the entry inside search_path
    inodes = {os.stat(os.path.join(search_path, x)).st_ino: x
              for x in os.listdir(search_path)}
    return inodes[os.stat(fname).st_ino] == filename
I don't expect that these will perform well if I have thousands of files in a single directory. I'm still hoping for something that feels more efficient.
Another shortcoming I noticed while testing these is that they only check the filename for a case match. If I pass them a path that includes directory names none of these functions so far check the case of the directory names.
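One way to address both concerns (many files per directory, and directory names not being checked) might be to cache each directory listing and to test every path component against it. The following is only a sketch under assumptions: it targets Python 3, the names make_case_checker and exists are made up for illustration, and names are normalized to NFC before comparison so that composed and decomposed spellings are treated as equal. The cache goes stale if files are renamed while the checker is in use.

```python
import os
import unicodedata

def make_case_checker():
    """Return an exists(path) function that caches directory listings.

    Hypothetical helper: each directory is listed at most once, so checking
    thousands of stored names in one directory costs a single os.listdir().
    """
    listings = {}  # directory -> set of NFC-normalized entry names as stored

    def entries(directory):
        if directory not in listings:
            try:
                names = os.listdir(directory)
            except OSError:  # unreadable or vanished directory
                names = []
            listings[directory] = {unicodedata.normalize('NFC', n) for n in names}
        return listings[directory]

    def exists(path):
        if not os.path.lexists(path):
            return False
        current = os.path.normpath(path)
        while True:
            parent, leaf = os.path.split(current)
            if not leaf:  # reached a root such as "/"
                return True
            if leaf not in (os.curdir, os.pardir):
                # the component must appear in its parent with exactly this case
                wanted = unicodedata.normalize('NFC', leaf)
                if wanted not in entries(parent or os.curdir):
                    return False
            if not parent:  # relative path fully consumed
                return True
            current = parent

    return exists
```

A typical use would be to create one checker with exists = make_case_checker() and call exists(name) for every stored filename, discarding the checker afterwards so that stale listings are not reused.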
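This answer complements the existing ones by providing functions, adapted from Alex L's answer, that also handle non-ASCII filenames and check the case of every component of the path, not just the last one.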
import os, sys, unicodedata

def gettruecasepath(path): # IMPORTANT: <path> must be a Unicode string
    if not os.path.lexists(path): # use lexists to also find broken symlinks
        raise OSError(2, u'No such file or directory', path)
    isosx = sys.platform == u'darwin'
    if isosx: # convert to NFD for comparison with os.listdir() results
        path = unicodedata.normalize('NFD', path)
    parentpath, leaf = os.path.split(path)
    # find true case of leaf component
    if leaf not in [ u'.', u'..' ]: # skip . and .. components
        leaf_lower = leaf.lower() # if you use Py3.3+: change .lower() to .casefold()
        found = False
        for leaf in os.listdir(u'.' if parentpath == u'' else parentpath):
            if leaf_lower == leaf.lower(): # see .casefold() comment above
                found = True
                if isosx:
                    leaf = unicodedata.normalize('NFC', leaf) # convert to NFC for return value
                break
        if not found:
            # should only happen if the path was just deleted
            raise OSError(2, u'Unexpectedly not found in ' + parentpath, leaf_lower)
    # recurse on parent path
    if parentpath not in [ u'', u'.', u'..', u'/', u'\\' ] and \
            not (sys.platform == u'win32' and
                 os.path.splitdrive(parentpath)[1] in [ u'\\', u'/' ]):
        parentpath = gettruecasepath(parentpath) # recurse
    return os.path.join(parentpath, leaf)

def istruecasepath(path): # IMPORTANT: <path> must be a Unicode string
    return gettruecasepath(path) == unicodedata.normalize('NFC', path)
gettruecasepath() gets the case-exact representation, as stored in the filesystem, of the specified path (absolute or relative), if it exists:
- The input path must be a Unicode string: use the u prefix for literals, e.g., u'Motörhead'; convert str variables with, e.g., strVar.decode('utf8').
- The returned path keeps the form of the input path (including components such as . and ..), except that multiple path separators are collapsed, and, on Windows, the returned path always uses \ as the path separator.
- An OSError exception is thrown if the path does not exist, or if you do not have permission to access it.
istruecasepath() uses gettruecasepath() to compare the input path to the path as stored in the filesystem.
Caveat: Since these functions need to examine all directory entries at every level of the input path (as specified), they will be slow - unpredictably so, as performance will correspond to how many items the directories examined contain. Read on for background information.
It is curious that neither OSX nor Windows provides a native API method that directly solves this problem.
While on Windows you can cleverly combine two API methods to solve the problem, on OSX there is no alternative that I'm aware of to the - unpredictably - slow enumeration of directory contents on each level of the path examined, as employed above.
HFS+ (OSX's filesystem) stores filenames in decomposed Unicode form (NFD), which causes problems when comparing such names to in-memory Unicode strings in most programming languages, which are usually in composed Unicode form (NFC).
For instance, a path with the non-ASCII character ü that you specify as a literal in your source code will be represented as a single Unicode codepoint, U+00FC; this is an example of NFC: the 'C' stands for composed, because the base letter u and its diacritic ¨ (a combining diaeresis) form a single letter.
By contrast, if you use ü as a part of an HFS+ filename, it is translated to NFD form, which results in 2 Unicode codepoints: the base letter u (U+0075), followed by the combining diaeresis (U+0308) as a separate codepoint; the 'D' stands for decomposed, because the character is decomposed into the base letter and its associated diacritic.
Even though the Unicode standard deems these 2 representations (canonically) equivalent, most programming languages, including Python, do not recognize such equivalence.
In the case of Python, you must use unicodedata.normalize() to convert both strings to the same form before comparing.
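To make the comparison issue concrete, here is a small sketch, assuming Python 3 (where every str is Unicode). The helper name same_name is invented for this illustration; the two spellings of Motörhead are written with escape sequences so that the normal form of each literal is unambiguous.

```python
import unicodedata

nfc_name = "Mot\u00f6rhead"    # ö as one code point (composed, NFC)
nfd_name = "Moto\u0308rhead"   # o followed by a combining diaeresis (decomposed, NFD)

def same_name(a, b, form="NFC"):
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(nfc_name == nfd_name)           # False: plain comparison sees different code points
print(same_name(nfc_name, nfd_name))  # True: equal once both are normalized
```

Which form is used for the comparison does not matter, as long as both sides use the same one.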
(Side note: Unicode normal forms are separate from Unicode encodings, though the differing numbers of Unicode code points typically also impact the number of bytes needed to encode each form. In the example above, the single-codepoint ü (NFC) requires 2 bytes to encode in UTF-8 (U+00FC -> 0xC3 0xBC), whereas the two-codepoint ü (NFD) requires 3 bytes (U+0075 -> 0x75, and U+0308 -> 0xCC 0x88)).
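The byte counts in the side note can be checked with a few lines of Python 3 (an assumption, since the code elsewhere on this page targets Python 2). The variable names are only illustrative.

```python
import unicodedata

composed = "\u00fc"                                   # ü in NFC: one code point
decomposed = unicodedata.normalize("NFD", composed)   # u (U+0075) + U+0308

print(len(composed), composed.encode("utf-8"))       # 1 b'\xc3\xbc'
print(len(decomposed), decomposed.encode("utf-8"))   # 2 b'u\xcc\x88'
```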
Following on from omz's post - something like this might work:
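import os

def getcase(filepath):
    path, filename = os.path.split(filepath)
    # list the current directory when filepath has no directory part
    for fname in os.listdir(path or '.'):
        if filename.lower() == fname.lower():
            return os.path.join(path, fname)
    # falls through and returns None if no entry matches

print(getcase('/usr/myfile.txt'))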
Here's a crazy thought I had. Disclaimer: I don't know nearly enough about filesystems to consider edge cases, so take this merely as something which happened to work. Once.
>>> !ls
A.txt b.txt
>>> inodes = {os.stat(x).st_ino: x for x in os.listdir(".")}
>>> inodes
{80827580: 'A.txt', 80827581: 'b.txt'}
>>> inodes[os.stat("A.txt").st_ino]
'A.txt'
>>> inodes[os.stat("a.txt").st_ino]
'A.txt'
>>> inodes[os.stat("B.txt").st_ino]
'b.txt'
>>> inodes[os.stat("b.txt").st_ino]
'b.txt'
You could use something like os.listdir and check if the list contains the file name you're looking for.