
How do I extract the file extension from a file name in Python?

Tags:

python

string

The file names are dynamic and I need to extract the file extension. The file names look like this: parallels-workstation-parallels-en_US-6.0.13976.769982.run.sh

20090209.02s1.1_sequence.txt
SRR002321.fastq.bz2
hello.tar.gz
ok.txt

For the first one I want to extract txt, for the second one I want to extract fastq.bz2, for the third one I want to extract tar.gz.

I am using the os module to get the file extension:

import os.path
extension = os.path.splitext('hello.tar.gz')[1][1:]

This gives me only gz. That would be fine for a file name like ok.txt, but for hello.tar.gz I want the extension to be tar.gz.

pynovice asked Mar 23 '23 16:03


2 Answers

import os

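def splitext(path):
    # Check the known compound extensions first, because
    # os.path.splitext only splits off the part after the last dot.
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    # Anything else falls back to the standard behaviour.
    return os.path.splitext(path)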

assert splitext('20090209.02s1.1_sequence.txt')[1] == '.txt'
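# '.fastq.bz2' is not in the list above, so only '.bz2' comes back;
# add it to the list if you want 'fastq.bz2' here.
assert splitext('SRR002321.fastq.bz2')[1] == '.bz2'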
assert splitext('hello.tar.gz')[1] == '.tar.gz'
assert splitext('ok.txt')[1] == '.txt'

To drop the leading dot from the extension:

import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            path, ext = path[:-len(ext)], path[-len(ext):]
            break
    else:
        path, ext = os.path.splitext(path)
    return path, ext[1:]

assert splitext('20090209.02s1.1_sequence.txt')[1] == 'txt'
assert splitext('SRR002321.fastq.bz2')[1] == 'bz2'
assert splitext('hello.tar.gz')[1] == 'tar.gz'
assert splitext('ok.txt')[1] == 'txt'
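
The same idea can be sketched with a configurable list of compound extensions. This is only an illustration, not part of the answer above: the name splitext_compound, the COMPOUND_EXTENSIONS tuple (including the '.fastq.bz2' entry, added because the question asks for it) and the case-insensitive matching are all assumptions.

```python
import os.path

# Hypothetical list of compound extensions; extend it to suit your files.
COMPOUND_EXTENSIONS = ('.tar.gz', '.tar.bz2', '.fastq.bz2')

def splitext_compound(path, compound=COMPOUND_EXTENSIONS):
    """Split path into (root, ext), treating known compound extensions as one unit."""
    lowered = path.lower()
    # Try longer extensions first so the most specific match wins.
    for ext in sorted(compound, key=len, reverse=True):
        if lowered.endswith(ext):
            # Slice the original string so the caller gets the original casing.
            return path[:-len(ext)], path[-len(ext):]
    # No compound extension matched: fall back to the standard library.
    return os.path.splitext(path)
```

With this list, splitext_compound('SRR002321.fastq.bz2')[1] gives '.fastq.bz2', while names that match nothing in the list behave exactly as they do with os.path.splitext.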
falsetru answered Apr 06 '23 06:04


Your rules are arbitrary: how is the computer supposed to guess when it's OK for the extension to have a . in it?

At best, you'll need a set of exceptional extensions, e.g. {'.bz2', '.gz'}, and some extra logic of your own:

>>> paths = """20090209.02s1.1_sequence.txt
... SRR002321.fastq.bz2
... hello.tar.gz
... ok.txt""".splitlines()
>>> import os
>>> def my_split_ext(path):
...     name, ext = os.path.splitext(path)
...     if ext in {'.bz2', '.gz'}:
...         name, ext2 = os.path.splitext(name)
...         ext = ext2 + ext
...     return name, ext
... 
>>> list(map(my_split_ext, paths))
[('20090209.02s1.1_sequence', '.txt'), ('SRR002321', '.fastq.bz2'), ('hello', '.tar.gz'), ('ok', '.txt')]
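
For comparison, here is a rough sketch of the same rule using pathlib instead of os.path. The function name split_ext_pathlib and the COMPRESSION_SUFFIXES set are made up for illustration, and unlike os.path.splitext this version returns only the file name without its directory part.

```python
from pathlib import PurePath

# Assumption: only these suffixes should pull in the suffix before them.
COMPRESSION_SUFFIXES = {'.gz', '.bz2'}

def split_ext_pathlib(path):
    """Return (stem, ext) for the final path component, keeping the leading dot."""
    p = PurePath(path)
    suffixes = p.suffixes  # e.g. ['.tar', '.gz'] for 'hello.tar.gz'
    if len(suffixes) >= 2 and suffixes[-1] in COMPRESSION_SUFFIXES:
        # A compression suffix takes the previous suffix along with it.
        ext = ''.join(suffixes[-2:])
    else:
        ext = p.suffix
    name = p.name
    # Strip the extension from the end of the file name.
    stem = name[:len(name) - len(ext)]
    return stem, ext
```

Note that PurePath('20090209.02s1.1_sequence.txt').suffixes contains three entries, which is why blindly joining every suffix does not work and some explicit set of exceptions is still needed.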
John La Rooy answered Apr 06 '23 04:04