Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use a regular expression (glob) to search a file tree

How do I adapt a search tree to handle limited regular expressions?

Given a file name, I need to find all nodes matching that file name. Nodes may contain usual file name globs (* and ?). Since this is a search tree, speed is of the essence.

I should add that the most important case for speed is the average time to rule out a match. In most cases matching will fail.

If the tree contained the following nodes:

foo, bar, foo*, *bar, foo?bar 
  • Searching for "foo" would return nodes 1 and 3.
  • Searching for "bar" would return nodes 2 and 4.
  • Searching for "fob" would return no nodes.
  • Searching for "fooxbar" would return node 5.
  • Searching for "foobar" would return nodes 3 and 4.
like image 949
Kris Braun Avatar asked Feb 25 '09 18:02

Kris Braun


People also ask

How do I use glob to find a file?

Python glob() Method to Search Files We need to import Python's built-in glob module to use the glob() function. Python glob. glob() method returns a list of files or folders that matches the path specified in the pathname argument. This function takes two arguments, namely pathname, and recursive flag.

How do I use glob to find files recursively?

We can use the function glob. glob() or glob. iglob() directly from glob module to retrieve paths recursively from inside the directories/files and subdirectories/subfiles. Note: When recursive is set True “ ** ” followed by path separator ('./**/') will match any files or directories.

What does glob glob () do in Python?

glob (short for global) is used to return all file paths that match a specific pattern. We can use glob to search for a specific file pattern, or perhaps more usefully, search for files where the filename matches a certain pattern by using wildcard characters.

What is glob search?

The glob() function searches for all the pathnames matching pattern according to the rules used by the libc glob() function, which is similar to the rules used by common shells.


1 Answers

An Aho-Corasick search tree would fit the bill. "Tries" is a very good article about this sort of thing and the Etrie implementation used in Evolution to replace regex searching.

To do the whole string matching, you can add beginning and ending anchor states. If scanning multi-line data, you could add the newline to the begin and end. You could also remove the part where it adds the cross linking for the partial matching starting a different match. This also allows faster exclusion.

Another algorithm for checking for membership in a string set is CritBit. This doesn't have Regex, but it is simple and tests complete strings.

like image 161
sfossen Avatar answered Sep 25 '22 12:09

sfossen