I have a directory with thousands of descendants (at least 1,000, probably no more than 20,000). Given a file path (which is guaranteed to exist), I want to know where that file can be found inside that directory -- including via symlinks.
For example, given:
/base
./elsewhere/myfile
./base
is a symlink to /realbase
/realbase/foo
is a symlink to /elsewhere
./realbase/bar/baz
is a symlink to /elsewhere/myfile
.I want to find the paths /base/foo/myfile
and /base/bar/baz
.
I could do this by recursively checking every symlink in /base
, but this would be very slow. I'm hoping that there's a more graceful solution.
This is for a Sublime Text plugin. When the user saves a file, we want to detect whether it is in the Sublime configuration directory. In particular, we want to do so even if the file is symlinked from inside the config directory and the user is editing the file at its physical path (e.g. inside their Dropbox directory). There may be other applications as well.
Sublime runs on Linux, Windows, and the Mac OS, and so ideally should the solution.
As the most common shells expands paths, to create a relative symlink you should quote the "source path". Also, the -s parameter means --symbolic . ln creates hard links by default.
Symbolic link can be of two types Relative or Absolute.
Symlink, also known as a symbolic link in Linux, creates a link to a file or a directory for easier access. To put it in another way, symlinks are links that points to another file or folder in your system, quite similar to the shortcuts in Windows. Some users refer to symlinks as soft-links.
File paths are a common stumbling block for novice programmers. First, what’s the difference between a relative file path and an absolute file path? A relative path describes the location of a file relative to the current (working) directory*. An absolute path describes the location from the root directory.
I think what you want is: As the most common shells expands paths, to create a relative symlink you should quote the "source path". Also, the -s parameter means --symbolic. ln creates hard links by default. If you are certain there is no chance of your mount points changing you can achieve what you are trying to do in a bash shell like this:
The PATHNAME function returns the physical path for a given fileref or libref. Determining location of file using filref. Using the pathname function on a fileref returns the path of the filename. The LOG shows the location of the files as below: Using PATHNAME on a libref returns the path to that directory. And here is the LOG below.
The Resolve-Path cmdlet has a -Relative parameter that will return a path relative to the current directory: Show activity on this post. In case the directory doesn't exist, or the base directory should be different from the current working directory, there's a .NET method [IO.Path]::GetRelativePath (String from, String to).
This, like many things, is more complex than it might appear on the surface.
Each entity in the file system points at an inode
, which describes the content of the file. Entities are the things you see - files, directories, sockets, block devices, character devices, etc...
The content of a single "file" can be accessed via one or more paths - each of these paths is called a "hard link". Hard links can only point at files on the same filesystem, they cannot cross the boundary of a filesystem.
It is also possible for a path to address a "symbolic link", which can point at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path producing an infinite loop.
It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.
Before we get into this... some comments:
stat()
on every file at some point, you're going to struggle coming up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces)As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...
To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.
The following function will produce this structure for us:
def get_map(scan_root):
# this dict will have device IDs at the first level (major / minor) ...
# ... and inodes IDs at the second level
# each inode will have the following keys:
# - 'type' the entity's type - i.e: dir, file, socket, etc...
# - 'links' a list of all found hard links to the inode
# - 'symlinks' a list of all found symlinks to the inode
# e.g: entities[2049][4756]['links'][0] path to a hard link for inode 4756
# entities[2049][4756]['symlinks'][0] path to a symlink that points at an entity with inode 4756
entity_map = {}
for root, dirs, files in os.walk(scan_root):
root = '.' + root[len(scan_root):]
for path in [ os.path.join(root, _) for _ in files ]:
try:
p_stat = os.stat(path)
except OSError as e:
if e.errno == 2:
print('Broken symlink [%s]... skipping' % ( path ))
continue
if e.errno == 40:
print('Too many levels of symbolic links [%s]... skipping' % ( path ))
continue
raise
p_dev = p_stat.st_dev
p_ino = p_stat.st_ino
if p_dev not in entity_map:
entity_map[p_dev] = {}
e_dev = entity_map[p_dev]
if p_ino not in e_dev:
e_dev[p_ino] = {
'type': get_type(p_stat.st_mode),
'links': [],
'symlinks': [],
}
e_ino = e_dev[p_ino]
if os.lstat(path).st_ino == p_ino:
e_ino['links'].append(path)
else:
e_ino['symlinks'].append(path)
return entity_map
I've produced an example tree that looks like this:
$ tree --inodes
.
├── [ 67687] 4 -> 5
├── [ 67676] 5 -> 4
├── [ 67675] 6 -> dead
├── [ 67676] a
│ └── [ 67679] 1
├── [ 67677] b
│ └── [ 67679] 2 -> ../a/1
├── [ 67678] c
│ └── [ 67679] 3
└── [ 67687] d
└── [ 67688] 4
4 directories, 7 files
The output of this function is:
$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
'symlinks': ['./b/2'],
'type': 'file'},
67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}
If we are interested in ./c/3
, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1
...
By subsequently searching for the path we are interested in, we can find all other references within this tree:
def filter_map(entity_map, filename):
for dev, inodes in entity_map.items():
for inode, info in inodes.items():
if filename in info['links'] or filename in info['symlinks']:
return info
$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}
The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding link
... that's an exercise for the reader.
It might also be an idea to collect the data while you're filling the tree (if that's something that would work with your process)... you can use inotify
to deal with this nicely - there's even a python module.
#!/usr/bin/env python3
import os, sys, stat
from pprint import pprint
def get_type(mode):
if stat.S_ISDIR(mode):
return 'directory'
if stat.S_ISCHR(mode):
return 'character'
if stat.S_ISBLK(mode):
return 'block'
if stat.S_ISREG(mode):
return 'file'
if stat.S_ISFIFO(mode):
return 'fifo'
if stat.S_ISLNK(mode):
return 'symlink'
if stat.S_ISSOCK(mode):
return 'socket'
return 'unknown'
def get_map(scan_root):
# this dict will have device IDs at the first level (major / minor) ...
# ... and inodes IDs at the second level
# each inode will have the following keys:
# - 'type' the entity's type - i.e: dir, file, socket, etc...
# - 'links' a list of all found hard links to the inode
# - 'symlinks' a list of all found symlinks to the inode
# e.g: entities[2049][4756]['links'][0] path to a hard link for inode 4756
# entities[2049][4756]['symlinks'][0] path to a symlink that points at an entity with inode 4756
entity_map = {}
for root, dirs, files in os.walk(scan_root):
root = '.' + root[len(scan_root):]
for path in [ os.path.join(root, _) for _ in files ]:
try:
p_stat = os.stat(path)
except OSError as e:
if e.errno == 2:
print('Broken symlink [%s]... skipping' % ( path ))
continue
if e.errno == 40:
print('Too many levels of symbolic links [%s]... skipping' % ( path ))
continue
raise
p_dev = p_stat.st_dev
p_ino = p_stat.st_ino
if p_dev not in entity_map:
entity_map[p_dev] = {}
e_dev = entity_map[p_dev]
if p_ino not in e_dev:
e_dev[p_ino] = {
'type': get_type(p_stat.st_mode),
'links': [],
'symlinks': [],
}
e_ino = e_dev[p_ino]
if os.lstat(path).st_ino == p_ino:
e_ino['links'].append(path)
else:
e_ino['symlinks'].append(path)
return entity_map
def filter_map(entity_map, filename):
for dev, inodes in entity_map.items():
for inode, info in inodes.items():
if filename in info['links'] or filename in info['symlinks']:
return info
entity_map = get_map(os.getcwd())
if len(sys.argv) == 2:
entity_info = filter_map(entity_map, sys.argv[1])
pprint(entity_info)
else:
pprint(entity_map)
I've run this on my system out of curiosity. It's a 6x disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...
Some benchmarks to consider:
Using simple maths, that's about 1140 stat()
calls per second with an empty cache, or ~90k stat()
calls per second once the cache has been filled - I don't think that stat()
is as slow as you think it is!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With