
In Python on OSX with HFS+, how can I get the correct case of an existing filename?

I am storing data about files that exist on an OSX HFS+ filesystem. I later want to iterate over the stored data and figure out whether each file still exists. For my purposes, I care about filename case sensitivity, so if the case of a filename has changed I would consider the file to no longer exist.

I started out by trying

os.path.isfile(filename)

but on a normal install of OSX on HFS+, this returns True even if the filename case does not match. I am looking for a way to write an isfile() function that cares about case even when the filesystem does not.

os.path.normcase() and os.path.realpath() both return the filename in whatever case I pass into them.

Edit:

I now have two functions that seem to work on filenames limited to ASCII. I don't know how Unicode or other non-ASCII characters might affect this.

The first is based off answers given here by omz and Alex L.

import os

def does_file_exist_case_sensitive1a(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
    # os.listdir() returns names in the case stored on disk
    return filename in os.listdir(search_path)

The second is probably even less efficient.

import glob, os, re

def does_file_exist_case_sensitive2(fname):
    if not os.path.isfile(fname): return False
    # find the last ASCII letter in the path
    m = re.search(r'[a-zA-Z][^a-zA-Z]*\Z', fname)
    if m:
        # replace only that letter with '?' so that glob has to read the directory
        test = fname[:m.start()] + '?' + fname[m.start() + 1:]
        # the wildcard can match other files too, so test for membership
        return fname in glob.glob(test)
    else:
        return True  # no letters in file, case sensitivity doesn't matter

Here is a third based off DSM's answer.

def does_file_exist_case_sensitive3(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
    # join each name with search_path so os.stat() also works outside the current directory
    inodes = {os.stat(os.path.join(search_path, x)).st_ino: x
              for x in os.listdir(search_path)}
    return inodes[os.stat(fname).st_ino] == filename

I don't expect that these will perform well if I have thousands of files in a single directory. I'm still hoping for something that feels more efficient.

Another shortcoming I noticed while testing these is that they only check the filename for a case match. If I pass them a path that includes directory names none of these functions so far check the case of the directory names.
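One way to cover both shortcomings is to compare every path component against its parent's directory listing and to reuse listings across calls. The following is only a sketch for Python 3: the name exists_exact_case and the listing_cache parameter are made up for illustration, and it assumes that normalizing both sides to NFC is enough to line up the input with what os.listdir() returns.

```python
import os
import unicodedata


def exists_exact_case(path, listing_cache=None):
    """Return True if every component of `path` exists in exactly this case.

    `listing_cache` is a hypothetical optional dict mapping a directory to the
    set of names it contains; pass the same dict for many checks to avoid
    re-reading large directories.
    """
    if listing_cache is None:
        listing_cache = {}
    if not os.path.lexists(path):
        return False
    remaining = os.path.normpath(path)
    while remaining:
        parent, leaf = os.path.split(remaining)
        if not leaf:  # reached a root such as '/'
            break
        if leaf not in (os.curdir, os.pardir):
            directory = parent or os.curdir
            if directory not in listing_cache:
                # normalize to NFC so that NFD names from HFS+ compare equal
                listing_cache[directory] = {
                    unicodedata.normalize('NFC', name)
                    for name in os.listdir(directory)
                }
            if unicodedata.normalize('NFC', leaf) not in listing_cache[directory]:
                return False
        remaining = parent
    return True
```

Sharing one dict as listing_cache while iterating over many stored paths means each directory is read only once, at the cost of not noticing changes made in between.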

Keith asked Jan 25 '13 03:01



4 Answers

This answer complements the existing ones by providing functions, adapted from Alex L's answer, that:

  • also work with non-ASCII characters
  • process all path components (not just the last)
  • work with both Python 2.x and 3.x
  • as a bonus, also work on Windows (there are better Windows-specific solutions - see https://stackoverflow.com/a/2114975/45375 - but the functions here are cross-platform and require no additional packages)
import os, sys, unicodedata

def gettruecasepath(path): # IMPORTANT: <path> must be a Unicode string
  if not os.path.lexists(path): # use lexists to also find broken symlinks
    raise OSError(2, u'No such file or directory', path)
  isosx = sys.platform == u'darwin'
  if isosx: # convert to NFD for comparison with os.listdir() results
    path = unicodedata.normalize('NFD', path)
  parentpath, leaf = os.path.split(path)
  # find true case of leaf component
  if leaf not in [ u'.', u'..' ]: # skip . and .. components
    leaf_lower = leaf.lower() # if you use Py3.3+: change .lower() to .casefold()
    found = False
    for leaf in os.listdir(u'.' if parentpath == u'' else parentpath):
      if leaf_lower == leaf.lower(): # see .casefold() comment above
          found = True
          if isosx:
            leaf = unicodedata.normalize('NFC', leaf) # convert to NFC for return value
          break
    if not found:
      # should only happen if the path was just deleted
      raise OSError(2, u'Unexpectedly not found in ' + parentpath, leaf_lower)
  # recurse on parent path
  if parentpath not in [ u'', u'.', u'..', u'/', u'\\' ] and \
                not (sys.platform == u'win32' and 
                     os.path.splitdrive(parentpath)[1] in [ u'\\', u'/' ]):
      parentpath = gettruecasepath(parentpath) # recurse
  return os.path.join(parentpath, leaf)


def istruecasepath(path): # IMPORTANT: <path> must be a Unicode string
  return gettruecasepath(path) == unicodedata.normalize('NFC', path)
  • gettruecasepath() gets the case-exact representation, as stored in the filesystem, of the specified path (absolute or relative), if it exists:

    • The input path must be a Unicode string:
      • Python 3.x: strings are natively Unicode - no extra action needed.
      • Python 2.x: literals: prefix with u; e.g., u'Motörhead'; str variables: convert with, e.g., strVar.decode('utf8')
    • The string returned is a Unicode string in NFC (composed normal form). NFC is returned even on OSX, where the filesystem (HFS+) stores names in NFD (decomposed normal form).
      NFC is returned, because it is far more common than NFD, and Python doesn't recognize equivalent NFC and NFD strings as (conceptually) identical. See below for background information.
    • The path returned retains the structure of the input path (relative vs. absolute, components such as . and ..), except that multiple path separators are collapsed, and, on Windows, the returned path always uses \ as the path separator.
    • On Windows, a drive / UNC-share component, if present, is retained as-is.
    • An OSError exception is raised if the path does not exist, or if you do not have permission to access it.
    • If you use this function on a case-sensitive filesystem, e.g., on Linux with ext4, it effectively degrades to indicating whether the input path exists in the exact case specified or not.
  • istruecasepath() uses gettruecasepath() to compare the input path to the path as stored in the filesystem.

Caveat: Since these functions need to examine all directory entries at every level of the input path (as specified), they will be slow - unpredictably so, as performance will correspond to how many items the directories examined contain. Read on for background information.


Background

Native API support (lack thereof)

It is curious that neither OSX nor Windows provides a native API method that directly solves this problem.

While on Windows you can cleverly combine two API methods to solve the problem, on OSX there is no alternative that I'm aware of to the - unpredictably - slow enumeration of directory contents on each level of the path examined, as employed above.

Unicode normal forms: NFC vs. NFD

HFS+ (OSX's filesystem) stores filenames in decomposed Unicode form (NFD), which causes problems when comparing such names to in-memory Unicode strings in most programming languages, which are usually in composed Unicode form (NFC).

For instance, a path with the non-ASCII character ü that you specify as a literal in your source code will be represented as a single Unicode codepoint, U+00FC; this is an example of NFC: the 'C' stands for composed, because the base letter u and its diacritic ¨ (a combining diaeresis) form a single letter.

By contrast, if you use ü as a part of an HFS+ filename, it is translated to NFD form, which results in 2 Unicode codepoints: the base letter u (U+0075), followed by the combining diaeresis (̈, U+0308) as a separate codepoint; the 'D' stands for decomposed, because the character is decomposed into the base letter and its associated diacritic.

Even though the Unicode standard deems these 2 representations (canonically) equivalent, most programming languages, including Python, do not recognize such equivalence.
In the case of Python, you must use unicodedata.normalize() to convert both strings to the same form before comparing.
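To make that concrete, here is a small sketch of such a comparison. The helper name same_name is made up for illustration; unicodedata.normalize() is the standard-library call mentioned above.

```python
import unicodedata

nfc = u'\u00fc'    # ü as a single code point (composed, NFC)
nfd = u'u\u0308'   # u followed by a combining diaeresis (decomposed, NFD)


def same_name(a, b, form='NFC'):
    """Compare two names after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

A plain nfc == nfd comparison is False, whereas same_name(nfc, nfd) is True, which is why the functions above normalize before comparing against os.listdir() results.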

(Side note: Unicode normal forms are separate from Unicode encodings, though the differing numbers of Unicode code points typically also impact the number of bytes needed to encode each form. In the example above, the single-codepoint ü (NFC) requires 2 bytes to encode in UTF-8 (U+00FC -> 0xC3 0xBC), whereas the two-codepoint ü (NFD) requires 3 bytes (U+0075 -> 0x75, and U+0308 -> 0xCC 0x88)).
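The byte counts from the side note can be checked with a few lines; this is just an illustration using the same two strings as in the ü example.

```python
nfc = u'\u00fc'    # ü, composed: one code point
nfd = u'u\u0308'   # ü, decomposed: two code points

nfc_bytes = nfc.encode('utf-8')   # 0xC3 0xBC
nfd_bytes = nfd.encode('utf-8')   # 0x75 0xCC 0x88

code_point_counts = (len(nfc), len(nfd))         # 1 and 2
byte_counts = (len(nfc_bytes), len(nfd_bytes))   # 2 and 3
```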

mklement0 answered Nov 18 '22 09:11



Following on from omz's post - something like this might work:

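import os

def getcase(filepath):
    path, filename = os.path.split(filepath)
    # os.listdir('') fails, so fall back to the current directory
    for fname in os.listdir(path or '.'):
        if filename.lower() == fname.lower():
            return os.path.join(path, fname)
    return None  # no entry matches, even when ignoring case

print(getcase('/usr/myfile.txt'))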
Alex L answered Nov 18 '22 09:11



Here's a crazy thought I had. Disclaimer: I don't know nearly enough about filesystems to consider edge cases, so take this merely as something which happened to work. Once.

>>> !ls
A.txt   b.txt
>>> inodes = {os.stat(x).st_ino: x for x in os.listdir(".")}
>>> inodes
{80827580: 'A.txt', 80827581: 'b.txt'}
>>> inodes[os.stat("A.txt").st_ino]
'A.txt'
>>> inodes[os.stat("a.txt").st_ino]
'A.txt'
>>> inodes[os.stat("B.txt").st_ino]
'b.txt'
>>> inodes[os.stat("b.txt").st_ino]
'b.txt'
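One edge case that is easy to run into: hard links share an inode number, so a dict keyed on st_ino keeps only one of the names. A hedged sketch that keeps every name per inode instead (the function names are hypothetical, and it still reads the whole directory):

```python
import os
from collections import defaultdict


def names_by_inode(directory=os.curdir):
    """Map each inode number to all names in `directory` that refer to it."""
    names = defaultdict(list)
    for name in os.listdir(directory):
        # lstat so that a symlink is recorded under its own inode
        inode = os.lstat(os.path.join(directory, name)).st_ino
        names[inode].append(name)
    return names


def stored_names(path):
    """Return the on-disk names in the parent directory that share `path`'s inode."""
    directory = os.path.dirname(path) or os.curdir
    return names_by_inode(directory)[os.lstat(path).st_ino]
```

A case-sensitive check would then be os.path.basename(path) in stored_names(path), rather than an equality test against a single stored name.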
DSM answered Nov 18 '22 10:11



You could use something like os.listdir and check if the list contains the file name you're looking for.

omz answered Nov 18 '22 10:11
