Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove ends of string entries in pandas DataFrame column

I have a pandas Dataframe with one column a list of files

import pandas as pd
df = pd.read_csv('fname.csv')

df.head()

filename    A    B    C
fn1.txt   2    4    5
fn2.txt   1    2    1
fn3.txt   ....
....

I would like to delete the file extension .txt from each entry in filename. How do I accomplish this?

I tried:

df['filename'] = df['filename'].map(lambda x: str(x)[:-4])

but when I look at the column entries afterwards with df.head(), nothing has changed.

How does one do this?

like image 483
ShanZhengYang Avatar asked May 03 '16 10:05

ShanZhengYang


People also ask

How do I remove part of a string in pandas?

Another option you have when it comes to removing unwanted parts from strings in pandas, is pandas. Series. str. extract() method that is used to extract capture groups in the regex pat as columns in a DataFrame.

How do I remove suffix from columns in pandas?

Remove Suffix from column names in Pandas You can use the string rstrip() function or the string replace() function to remove suffix from column names.

How do you remove the last character of a string in a column in Python?

Using the rstrip function to Remove Last Character From String in Python. The string method rstrip is used to remove the characters from the right side of the string that is given to it.


5 Answers

I think you can use str.replace with regex .txt$' ( $ - matches the end of the string):

import pandas as pd

df = pd.DataFrame({'A': {0: 2, 1: 1}, 
                   'C': {0: 5, 1: 1}, 
                   'B': {0: 4, 1: 2}, 
                   'filename': {0: "txt.txt", 1: "x.txt"}}, 
                columns=['filename','A','B', 'C'])

print df
  filename  A  B  C
0  txt.txt  2  4  5
1    x.txt  1  2  1

df['filename'] = df['filename'].str.replace(r'.txt$', '')
print df
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

df['filename'] = df['filename'].map(lambda x: str(x)[:-4])
print df
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

df['filename'] = df['filename'].str[:-4]
print df
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

EDIT:

rstrip can remove more characters, if the end of strings contains some characters of striped string (in this case ., t, x):

Example:

print df
  filename  A  B  C
0  txt.txt  2  4  5
1    x.txt  1  2  1

df['filename'] = df['filename'].str.rstrip('.txt')

print df
  filename  A  B  C
0           2  4  5
1           1  2  1
like image 64
jezrael Avatar answered Oct 12 '22 15:10

jezrael


You can use str.rstrip to remove the endings:

df['filename'] = df['filename'].str.rstrip('.txt')

should work

like image 27
EdChum Avatar answered Oct 12 '22 17:10

EdChum


You may want:

df['filename'] = df.apply(lambda x: x['filename'][:-4], axis = 1)

like image 37
Paweł Kordek Avatar answered Oct 12 '22 17:10

Paweł Kordek


use list comprehension

df['filename'] = [x[:-4] for x in df['filename']]
like image 33
Blue Moon Avatar answered Oct 12 '22 15:10

Blue Moon


update 2021 + speedtest

Starting from pandas 1.4, the equivalent of str.removesuffix, the pandas.Series.str.removesuffix is implemented, so one can use

df['filename'].str.removesuffix('.txt')

speed test

tl;dr: the fastest is

dat["fname"].map(lambda x: x[:-4] if x[-4:] == ".txt" else x)

In the speed test, I wanted to consider the different methods collected in this SO page. I excluded rstrip, because it would strip other than .txt endings too, and as regexp contains conditional, therefore it would be fair to modify the other functions too so that they remove the last 4 chars only if they are .txt.

The testing code is

import pandas as pd
import time

ITER = 10


def rm_re(dat: pd.DataFrame) -> pd.Series:
    """Use regular expression."""
    return dat["fname"].str.replace(r'.txt$', '', regex=True)


def rm_map(dat: pd.DataFrame) -> pd.Series:
    """Use pandas map, find occurrences and remove with []"""
    where = dat["fname"].str.endswith(".txt")
    dat.loc[where, "fname"] = dat["fname"].map(lambda x: x[:-4])
    return dat["fname"]


def rm_map2(dat: pd.DataFrame) -> pd.Series:
    """Use pandas map with lambda conditional."""
    return dat["fname"].map(lambda x: x[:-4] if x[-4:] == ".txt" else x)


def rm_apply_str_suffix(dat: pd.DataFrame) -> pd.Series:
    """Use str method suffix with pandas apply"""
    return dat["fname"].apply(str.removesuffix, args=(".txt",))


def rm_suffix(dat: pd.DataFrame) -> pd.Series:
    """Use pandas removesuffix from version 1.6"""
    return dat["fname"].str.removesuffix(".txt")


functions = [rm_map2, rm_apply_str_suffix, rm_map, rm_suffix, rm_re]
for base in range(12, 23):
    size = 2**base
    data = pd.DataFrame({"fname": ["fn"+str(i) for i in range(size)]})
    data.update(data.sample(frac=.5)["fname"]+".txt")
    for func in functions:
        diff = 0
        for _ in range(ITER):
            data_copy = data.copy()
            start = time.process_time()
            func(data_copy)
            diff += time.process_time() - start

        print(diff, end="\t")

The output is plotted below:

runtime for different functions

It can be seen from the plot that the slowest solution is the regexp, and the fastest is the pandas.Series.map with a conditional. In later versions of pandas, this may change and I'd expect an improvement in pandas.Series.str.removesuffix, as it has a greater potential in vectorization.

Pandas had to be installed from the source as of 2021-11-30, because version 1.4 is in the developement stage only. I installed it by following the instructions from pandas dev repo, by cloning the project and installing with python setup.py install.

My machine:

  • AMD Ryzen 5 2400G with Radeon Vega Graphics, 3.60 GHz
  • Windows 10 20H2
  • Python 3.10.0, pandas.version '1.4.0.dev0+1267.gaee662a7e3', numpy.version '1.21.4'
like image 36
DanielTuzes Avatar answered Oct 12 '22 17:10

DanielTuzes