I am trying to parse urls and filepaths from files using Python. I already have a url regex.
Issue
I want a regex pattern that extracts file paths from a string. Requirements (leading forms the pattern should handle): C:\, \\, /, and dotted relative prefixes such as ../
Please assist by modifying my attempt below or suggesting an improved pattern.
Attempt
Here is the regex I have so far:
(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*
Description
(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+) : any preceding drive letter, backslash or dotted path
[\w+\\\s_\(\)\/]+ : any path-like characters - alphanumerics, slashes, parens, underscores, ...
(?:\.\w+)* : optional extension
Result
Note: I have confirmed these results in Python using an input list of strings and the re module.
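A minimal sketch of how such a check can be run with the re module. The sample strings are hypothetical stand-ins, not the original input list, and first_match is a helper name invented for this example.

```python
import re

# The pattern from the attempt above, unchanged.
ATTEMPT = re.compile(r"(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*")

# Hypothetical sample inputs (assumed for illustration only).
samples = [
    r"C:\Users\me\notes.txt",
    "../data/input.csv",
    "/foo/bar.txt",
    "http://example.com/index.html",
]


def first_match(text):
    """Return the first substring matched by ATTEMPT, or None if nothing matches."""
    m = ATTEMPT.search(text)
    return m.group(0) if m else None


# Map each sample to what the attempt extracts from it.
results = {s: first_match(s) for s in samples}

for sample, found in results.items():
    print(f"{sample!r} -> {found!r}")
```

With these samples, the drive-letter path and the dotted relative path are returned whole, while /foo/bar.txt and the url both yield None.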
Expected
This regex satisfies most of my requirements - namely excluding urls while extracting most file paths. However, I would like to match all paths (including UNIX-style paths that begin with a single slash, e.g. /foo/bar.txt) without matching urls.
Research
I have not found a general solution. Most work tends to satisfy specific cases.
SO Posts
External sites
You could split the problem into three alternative patterns (note that I didn't implement all character exclusions for path/file names).
This would give something like this:
((((?<!\w)[A-Za-z]:)|(\.{1,2}\\))([^\b%\/\|:\n\"]*))|("\2([^%\/\|:\n\"]*)")|((?<!\w)(\.{1,2})?(?<!\/)(\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+\/?)
Broken down:
Wind-Non-Quoted: ((((?<!\w)[A-Za-z]:)|(\.{1,2}\\))([^\b%\/\|:\n\"]*))
Wind-Quoted: ("\2([^%\/\|:\n\"]*)")
Unix: ((?<!\w)(\.{1,2})?(?<!\/)(\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+\/?)
Wind-Non-Quoted:
    prefix: (((?<!\w)[A-Za-z]:)|(\.{1,2}\\))
        drive: ((?<!\w)[A-Za-z]:) *Lookbehind to ensure a single letter*
        relative: (\.{1,2}\\)
    path: ([^\b%\/\|:\n\"]*) *Excluding invalid name characters (the list is not complete)*
Wind-Quoted:
    prefix: \2 *Meant to reuse the prefix from non-Quoted. Note that in Python's re a backreference matches the text the group captured, not its pattern, so the prefix may need to be written out again here*
    path: ([^%\/\|:\n\"]*) *Same as above but does not exclude spaces*
Unix:
    prefix: (?<!\w)(\.{1,2})? *. or .. not preceded by letters*
    path: (?<!\/) *not preceded by /*
          (\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+ *repeated /name (exclusions as above)*
          \/? *optionally ending with /*
*(excluding the double slashes is intended to prevent matching urls)*
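The three-alternative idea can be sketched in Python roughly as follows. This is an adaptation under stated assumptions, not the exact pattern above: the Windows prefix is written out again for the quoted case instead of using a backreference, whitespace is excluded with \s in the unquoted cases, and the names WIN_PREFIX, ALTERNATIVES, PATH_RE and the sample text are made up for this example.

```python
import re

# Windows prefix: a drive letter not preceded by a word character,
# or a relative ".\" / "..\" prefix. (Assumed simplification.)
WIN_PREFIX = r'(?:(?<!\w)[A-Za-z]:|\.{1,2}\\)'

ALTERNATIVES = [
    # Quoted Windows path: spaces are allowed between the quotes.
    r'"' + WIN_PREFIX + r'[^%/|:\n"]*"',
    # Unquoted Windows path: stops at the first whitespace character.
    WIN_PREFIX + r'[^\s%/|:"]*',
    # Unix path: optional . or .. prefix, then one or more /name segments.
    # (?<!\w) and (?<!/) keep the "//host/..." part of a url from matching.
    r'(?<!\w)(?:\.{1,2})?(?<!/)(?:/[^\s%|:"\\/]+)+/?',
]

PATH_RE = re.compile("|".join(ALTERNATIVES))

# Hypothetical sample text (assumed for illustration only).
text = (
    r'Open "C:\Program Files\app\run.exe" or C:\temp\log.txt '
    r'then /usr/local/bin/tool and ../rel/file.py '
    r'but not http://example.com/index.html'
)

# Collect the full text of every match, in order of appearance.
paths = [m.group(0) for m in PATH_RE.finditer(text)]

for p in paths:
    print(p)
```

On this sample the sketch returns the quoted path (quotes included), the unquoted Windows path, and the two Unix-style paths, and skips the url. Unquoted paths that are followed directly by punctuation such as a comma would still need extra handling.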