
A general regex to extract file paths (not urls)

Tags: python, regex

I am trying to parse urls and filepaths from files using Python. I already have a url regex.

Issue

I want a regex pattern that extracts file paths from a string. Requirements:

  • exclusive (does not include urls)
  • OS-independent, i.e. both Windows- and UNIX-style paths (e.g. C:\, \\, /)
  • all path types, i.e. both absolute and relative paths (e.g. /, ../)

Please assist by modifying my attempt below or suggesting an improved pattern.

Attempt

Here is the regex I have so far:

(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*

Description

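  • (?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+): a leading uppercase drive letter (e.g. C:), a backslash, or one or more dotted segments (./ or ../)
  • [\w+\\\s_\(\)\/]+: any path-like characters: alphanumerics and underscores (\w), a literal +, backslashes, whitespace, parentheses and forward slashes
  • (?:\.\w+)*: zero or more extensions (e.g. .txt or .tar.gz)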
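
To make the behaviour concrete, here is a minimal sketch that runs the pattern above through Python's re module. The sample strings and the names ATTEMPT, samples and first_match are made up for illustration and are not the inputs behind the screenshot.

```python
import re

# The attempt from above, unchanged, as a raw string.
ATTEMPT = re.compile(r'(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*')

# Hypothetical sample inputs, one candidate per string.
samples = [
    r'C:\Users\me\file.txt',
    '../relative/path.py',
    '/foo/bar.txt',
    'https://example.com/index.html',
]


def first_match(text):
    """Return the first substring the pattern matches, or None."""
    m = ATTEMPT.search(text)
    return m.group(0) if m else None


results = {s: first_match(s) for s in samples}
for s, r in results.items():
    print(f'{s!r} -> {r!r}')
```

With these samples, the Windows path and the ../ path are matched in full, while both '/foo/bar.txt' and the url give None: none of the three prefix alternatives accepts a leading forward slash.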

Result

[Image: screenshot of the match results]

Note: I have confirmed these results in Python using an input list of strings and the re module.

Expected

This regex satisfies most of my requirements - namely excluding urls while extracting most file paths. However, I would like to match all paths (including UNIX-style paths that begin with a single slash, e.g. /foo/bar.txt) without matching urls.

Research

I have not found a general solution. Most work tends to satisfy specific cases.

SO Posts

  • How to write a regex to match multiple file path
  • Regex for extracting filename from path
  • regex for finding file paths
  • Python regular expression for Windows file path

External sites

  • Validate a Windows Path
  • Regex that matches path, filename and extension
pylang asked Oct 17 '22 06:10


1 Answer

You could split the problem into three alternative patterns (note that I didn't implement every character exclusion for path/file names):

  • Non-quoted Windows paths
  • Quoted Windows paths
  • Unix paths

This would give something like this:

((((?<!\w)[A-Za-z]:)|(\.{1,2}\\))([^\b%\/\|:\n\"]*))|("(?:[A-Za-z]:|\.{1,2}\\)([^%\/\|:\n\"]*)")|((?<!\w)(\.{1,2})?(?<!\/)(\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+\/?)

Broken down:

Wind-Non-Quoted: ((((?<!\w)[A-Za-z]:)|(\.{1,2}\\))([^\b%\/\|:\n\"]*))
Wind-Quoted:     ("(?:[A-Za-z]:|\.{1,2}\\)([^%\/\|:\n\"]*)")
Unix:            ((?<!\w)(\.{1,2})?(?<!\/)(\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+\/?)


Wind-Non-Quoted:
    prefix: (((?<!\w)[A-Za-z]:)|(\.{1,2}\\))
         drive: ((?<!\w)[A-Za-z]:)  *Lookbehind to ensure a single letter*
      relative: (\.{1,2}\\)
      path: ([^\b%\/\|:\n\"]*)      *Excluding invalid name characters (the list is not complete)*

Wind-Quoted:
    prefix: (?:[A-Za-z]:|\.{1,2}\\)  *Same alternatives as the non-quoted prefix*
      path: ([^%\/\|:\n\"]*)         *Same as above, but without the \b exclusion*

Unix:
    prefix: (?<!\w)(\.{1,2})?                 optional . or .., not preceded by a word character
      path: (?<!\/)                           not preceded by /
            (\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+ repeated /name (exclusions as above)
            \/?                               optionally ending with /

            *(excluding the double slashes is intended to prevent matching urls)*
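
As a rough sketch of how the three alternatives could be combined in Python: the names WIN_UNQUOTED, WIN_QUOTED, UNIX, PATH_RE and find_paths are hypothetical, the groups are non-capturing, and the sample inputs are made up. Two assumptions are specific to this sketch: the drive class is written as [A-Za-z], and the quoted branch spells out its prefix, because in Python's re a backreference to a group that did not take part in the match never matches.

```python
import re

# Assumption: the three alternatives, rewritten with non-capturing groups.
WIN_UNQUOTED = r'(?:(?<!\w)[A-Za-z]:|\.{1,2}\\)[^\b%\/\|:\n\"]*'
WIN_QUOTED = r'"(?:[A-Za-z]:|\.{1,2}\\)[^%\/\|:\n\"]*"'
UNIX = r'(?<!\w)(?:\.{1,2})?(?<!\/)(?:\/(?:\\\b|[^ \b%\|:\n\"\\\/])+)+\/?'

# Join the alternatives into a single pattern, tried left to right.
PATH_RE = re.compile('|'.join([WIN_UNQUOTED, WIN_QUOTED, UNIX]))


def find_paths(text):
    """Return every substring of text that one of the three alternatives matches."""
    return [m.group(0) for m in PATH_RE.finditer(text)]


for sample in [
    '/foo/bar.txt',
    '../foo/bar',
    r'C:\Users\me\file.txt',
    r'"C:\Program Files\app.exe"',
    'https://example.com/index.html',
    'see /etc/hosts and http://example.com/a/b',
]:
    print(f'{sample!r} -> {find_paths(sample)}')
```

In this sketch the url-only input yields an empty list and the mixed sentence yields only /etc/hosts. Note that the non-quoted Windows class does not exclude spaces (inside a character class, Python reads \b as a backspace), so a non-quoted Windows path embedded in running text will extend to the next excluded character or the end of the line.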
Alain T. answered Oct 20 '22 10:10