Python glob but against a list of strings rather than the filesystem

I want to be able to match a pattern in glob format to a list of strings, rather than to actual files in the filesystem. Is there any way to do this, or convert a glob pattern easily to a regex?

Jason S asked Dec 31 '14 21:12



1 Answer

The glob module uses the fnmatch module for individual path elements.

That means the path is split into the directory name and the filename; if the directory name contains metacharacters (any of [, * or ?), it is expanded recursively in the same way.

If you have a list of strings that are simple filenames, then just using the fnmatch.filter() function is enough:

    import fnmatch

    matching = fnmatch.filter(filenames, pattern)

If they contain full paths, however, you need to do more work, because the regular expression that fnmatch generates doesn't take path segments into account: wildcards don't exclude the path separators, nor are they adjusted for cross-platform path matching.
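To see why the separators matter, here is a small sketch (the sample names are made up for illustration). It shows that `*` in fnmatch matches straight across `/`, and that `fnmatch.translate()` gives you a regular expression for a glob pattern if you would rather work with the `re` module directly:

```python
import fnmatch
import re

# Hypothetical sample data, not taken from the question
names = ["baz.txt", "foo/baz.txt", "foo/bar/baz.py"]

# '*' is not separator-aware, so it also swallows the 'foo/' prefix
crossing = fnmatch.filter(names, "*.txt")
print(crossing)  # ['baz.txt', 'foo/baz.txt']

# fnmatch.translate() turns the glob into a regular expression string;
# the exact text of that string differs between Python versions
regex = re.compile(fnmatch.translate("*.txt"))
via_regex = [n for n in names if regex.match(n)]
print(via_regex == crossing)  # True
```

So for plain filenames either route works, but neither knows anything about directory levels.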

You can construct a simple trie from the paths, then match your pattern against that:

    import fnmatch
    import glob
    import os.path
    from itertools import product


    # Cross-Python dictionary views on the keys
    if hasattr(dict, 'viewkeys'):
        # Python 2
        def _viewkeys(d):
            return d.viewkeys()
    else:
        # Python 3
        def _viewkeys(d):
            return d.keys()


    def _in_trie(trie, path):
        """Determine if path is completely in trie"""
        current = trie
        for elem in path:
            try:
                current = current[elem]
            except KeyError:
                return False
        return None in current


    def find_matching_paths(paths, pattern):
        """Produce a list of paths that match the pattern.

        * paths is a list of strings representing filesystem paths
        * pattern is a glob pattern as supported by the fnmatch module

        """
        if os.altsep:  # normalise
            pattern = pattern.replace(os.altsep, os.sep)
        pattern = pattern.split(os.sep)

        # build a trie out of path elements; efficiently search on prefixes
        path_trie = {}
        for path in paths:
            if os.altsep:  # normalise
                path = path.replace(os.altsep, os.sep)
            _, path = os.path.splitdrive(path)
            elems = path.split(os.sep)
            current = path_trie
            for elem in elems:
                current = current.setdefault(elem, {})
            current.setdefault(None, None)  # sentinel

        matching = []

        current_level = [path_trie]
        for subpattern in pattern:
            if not glob.has_magic(subpattern):
                # plain element, element must be in the trie or there are
                # 0 matches
                if not any(subpattern in d for d in current_level):
                    return []
                matching.append([subpattern])
                current_level = [d[subpattern] for d in current_level if subpattern in d]
            else:
                # match all next levels in the trie that match the pattern
                matched_names = fnmatch.filter({k for d in current_level for k in d}, subpattern)
                if not matched_names:
                    # nothing found
                    return []
                matching.append(matched_names)
                current_level = [d[n] for d in current_level for n in _viewkeys(d) & set(matched_names)]

        return [os.sep.join(p) for p in product(*matching)
                if _in_trie(path_trie, p)]

This mouthful can quickly find matches using globs anywhere along the path:

    >>> paths = ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar']
    >>> find_matching_paths(paths, '/foo/bar/*')
    ['/foo/bar/baz', '/foo/bar/bar']
    >>> find_matching_paths(paths, '/*/bar/b*')
    ['/foo/bar/baz', '/foo/bar/bar']
    >>> find_matching_paths(paths, '/*/[be]*/b*')
    ['/foo/bar/baz', '/foo/bar/bar', '/spam/eggs/baz']
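If the list is short and you don't need the prefix sharing a trie gives you, a much smaller variation on the same idea may be enough. The sketch below is not the function above: `match_by_segment` is a hypothetical helper that assumes `/` as the separator, splits both the pattern and each path into segments, and matches them pairwise with `fnmatch.fnmatchcase()` so that a wildcard can never cross a separator:

```python
import fnmatch


def match_by_segment(paths, pattern, sep="/"):
    """Return the paths whose segments each match the corresponding pattern segment.

    Hypothetical helper for illustration; assumes a single fixed separator.
    """
    pattern_parts = pattern.split(sep)
    matches = []
    for path in paths:
        parts = path.split(sep)
        # a pattern with N segments can only match a path with N segments
        if len(parts) != len(pattern_parts):
            continue
        # fnmatchcase() is case-sensitive on every platform
        if all(fnmatch.fnmatchcase(part, pat)
               for part, pat in zip(parts, pattern_parts)):
            matches.append(path)
    return matches


paths = ['/foo/bar/baz', '/spam/eggs/baz', '/foo/bar/bar']
print(match_by_segment(paths, '/foo/bar/*'))   # ['/foo/bar/baz', '/foo/bar/bar']
print(match_by_segment(paths, '/*/[be]*/b*'))  # all three, in input order
```

This checks every path against every pattern segment, so it trades the trie's speed on large inputs for brevity, and it keeps the results in input order.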
Martijn Pieters answered Sep 29 '22 01:09