
Getting file extension using pattern matching in python

Tags: python, regex

I am trying to find the extension of a file, given its name as a string. I know I can use os.path.splitext, but it does not do what I want when the extension is .tar.gz or .tar.bz2: it returns .gz and .bz2 instead of .tar.gz and .tar.bz2 respectively.
So I decided to find the extension of files myself using pattern matching.

import re
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
# prints: gz     (I want this to be 'tar.gz')
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.bz2').group('ext'))
# prints: bz2    (I want this to be 'tar.bz2')

I am using the named group (?P<ext>...) in the pattern because I also want to capture the extension itself.

Please help.

Pushpak Dagade asked Jun 29 '11 18:06



2 Answers

import os
root, ext = os.path.splitext('a.tar.gz')
# For compressed files, prepend the inner extension (e.g. '.tar') as well.
if ext in ('.gz', '.bz2'):
    ext = os.path.splitext(root)[1] + ext

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
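For reuse, the splitext approach above could be wrapped in a small function. This is only a sketch: the function name get_extension and the COMPRESSED_SUFFIXES set are illustrative choices, not part of the original answer, and the result keeps the leading dot the way os.path.splitext does.

```python
import os

# Assumed set of suffixes that usually wrap another extension (illustrative).
COMPRESSED_SUFFIXES = ('.gz', '.bz2')


def get_extension(filename):
    """Return the extension with its leading dot, keeping '.tar.gz' whole."""
    root, ext = os.path.splitext(filename)
    if ext in COMPRESSED_SUFFIXES:
        # Split once more to pick up the inner extension, if there is one.
        ext = os.path.splitext(root)[1] + ext
    return ext


for name in ('a.tar.gz', 'a.tar.bz2', 'a.txt', 'README'):
    print(name, repr(get_extension(name)))
```

Note that this joins any inner extension, not just .tar, so a name like 'a.txt.gz' would yield '.txt.gz'.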

phihag answered Oct 20 '22 00:10


>>> import re
>>> print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
gz
>>> print(re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
tar.gz

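The ? after .* makes the quantifier non-greedy, so .*? matches as few characters as possible. A greedy .* first consumes as much as it can and then backtracks only as far as the last dot, which leaves just "gz" for the ext group. With .*? the engine tries the first dot first, and there the tar\.gz alternative matches the whole of "tar.gz".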
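To compare the two behaviours side by side, here is a minimal sketch that compiles both patterns once and runs them over a few names. The names GREEDY, LAZY and ext_of are made up for this illustration; the patterns themselves are the ones from the question and from this answer.

```python
import re

# Greedy prefix: .* backtracks from the end, so it stops at the last dot.
GREEDY = re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')
# Lazy prefix: .*? grows from the start, so earlier dots are tried first.
LAZY = re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')


def ext_of(pattern, filename):
    """Return the 'ext' group for filename, or None if there is no match."""
    match = pattern.match(filename)
    return match.group('ext') if match else None


for name in ('a.tar.gz', 'a.tar.bz2', 'archive.v1.tar.gz', 'a.txt', 'README'):
    print(name, ext_of(GREEDY, name), ext_of(LAZY, name))
```

Both patterns agree on single extensions such as 'a.txt'; they differ only when one of the longer alternatives can match from an earlier dot.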

Omri Barel answered Oct 20 '22 01:10