Extracting the file extensions from file names in pandas

I have a column FileName in a pandas DataFrame that holds file names. The names themselves can contain dots ('.'); for example, a.b.c.d.txt is a txt file. I want to add another column, FileType, containing only the file extension.

Sample DataFrame:

FileName

a.b.c.d.txt

j.k.l.exe

After processing:

FileName    FileType

a.b.c.d.txt txt

j.k.l.exe   exe

I tried the following:

X['FileType'] = X.FileName.str.split(pat='.')

This helps me split the string on '.'. But how do I get the last element, i.e. the file extension?

Something like

X['FileType'] = X.FileName.str.split(pat='.')[-1]

X['FileType'] = X.FileName.str.split(pat='.').pop(-1)

did not give the desired output.

Shridhar R Kulkarni asked May 17 '18 03:05



2 Answers

Option 1
apply

df['FileType'] = df.FileName.apply(lambda x: x.split('.')[-1])

Option 2
Use str twice

df['FileType'] = df.FileName.str.split('.').str[-1]

Option 2b
Use rsplit (thanks @cᴏʟᴅsᴘᴇᴇᴅ)

df['FileType'] = df.FileName.str.rsplit('.', n=1).str[-1]

All result in:

      FileName FileType
0  a.b.c.d.txt      txt
1    j.k.l.exe      exe

Python 3.6.4, Pandas 0.22.0
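
For anyone who wants to try these side by side, here is a minimal self-contained sketch that rebuilds the sample data from the question and checks that the three options agree. The variable names (via_apply and so on) are only illustrative. The key point is that .str[-1] indexes into each list element-wise, whereas a plain [-1] after .str.split() indexes the Series itself, which is why the attempts in the question did not work.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"FileName": ["a.b.c.d.txt", "j.k.l.exe"]})

# Option 1: plain Python split inside apply
via_apply = df.FileName.apply(lambda x: x.split(".")[-1])

# Option 2: .str.split gives a Series of lists; .str[-1] takes the last item of each list
via_split = df.FileName.str.split(".").str[-1]

# Option 2b: split only once, from the right (n passed by keyword)
via_rsplit = df.FileName.str.rsplit(".", n=1).str[-1]

df["FileType"] = via_rsplit
print(df)
```

Passing n by keyword is a deliberate choice here: it works on old and recent pandas versions alike.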

user3483203 answered Sep 28 '22 16:09


If you only need the extension and do not need the name as a separate column, then I would recommend a list comprehension.

comprehension with str.rsplit

df['FileType'] = [f.rsplit('.', 1)[-1] for f in df.FileName.tolist()]
df

      FileName FileType
0  a.b.c.d.txt      txt
1    j.k.l.exe      exe

If you want to split each value into the name and the extension, there are a couple of options.

os.path.splitext

import os

pd.DataFrame(
    [os.path.splitext(f) for f in df.FileName], 
    columns=['Name', 'Type']
)
 
      Name  Type
0  a.b.c.d  .txt
1    j.k.l  .exe
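
As a rough, self-contained sketch of how the splitext result could be written back to the original frame: the extra row "README" is a made-up file name, not part of the question's data, added only to show how a name without any dot behaves. The names parts and naive are illustrative too.

```python
import os
import pandas as pd

# Sample data from the question, plus one hypothetical name with no extension
df = pd.DataFrame({"FileName": ["a.b.c.d.txt", "j.k.l.exe", "README"]})

# os.path.splitext returns (root, ext); ext keeps its leading dot
parts = pd.DataFrame(
    [os.path.splitext(f) for f in df.FileName],
    columns=["Name", "Type"],
    index=df.index,
)

# Drop the leading dot to get the bare extension
df["FileType"] = parts["Type"].str.lstrip(".")

# For comparison: rsplit returns the whole name when there is no dot
naive = [f.rsplit(".", 1)[-1] for f in df.FileName]

print(df)
print(naive)
```

With splitext, a name without a dot yields an empty extension, while the rsplit-based approach returns the whole name, so which one fits depends on how such rows should be treated.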

str.extract

df.FileName.str.extract(r'(?P<Name>.*)(?P<Type>\..*)', expand=True)

      Name  Type
0  a.b.c.d  .txt
1    j.k.l  .exe
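
A small runnable sketch of the str.extract variant, assuming the same sample data. The group names Name and Type are picked here to match the column headers in the output above, since the named groups become the column names of the result. The greedy first group backtracks only as far as the last dot, so just the final suffix ends up in Type.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"FileName": ["a.b.c.d.txt", "j.k.l.exe"]})

# Named groups become the column names of the extracted frame
parts = df.FileName.str.extract(r"(?P<Name>.*)(?P<Type>\..*)", expand=True)

# Attach the two new columns next to the original one
result = df.join(parts)
print(result)
```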

cs95 answered Sep 28 '22 17:09