Oftentimes I am tasked with performing some sort of replace or substitution operation on data in a Series or a DataFrame's column(s).
For example, given a Series of strings,
s = pd.Series(['foo', 'another foo bar', 'baz'])
0 foo
1 another foo bar
2 baz
dtype: object
The goal would be to replace all occurrences of "foo" with "bar", to get
0 bar
1 another bar bar
2 baz
dtype: object
At this point I am usually confused, as there are two options I can use to solve this: replace and str.replace. The confusion arises from the fact that I am unsure which is the right method to use, or what the difference (if any) between them is.
What are the main differences between replace and str.replace, and what are the benefits/caveats of using either?
Skip to the TLDR; at the bottom of this answer for a brief summary of the differences.
It is easy to understand the difference if you think of these two methods in terms of their utility.
.str.replace is a method with a very specific purpose: to perform string or regex substitution on string data.
OTOH, .replace is more of an all-purpose Swiss Army knife which can replace anything with anything else (and yes, this includes string and regex).
Consider the simple DataFrame below; it will form the basis of our forthcoming discussion.
# Setup
df = pd.DataFrame({
'A': ['foo', 'another foo bar', 'baz'],
'B': [0, 1, 0]
})
df
A B
0 foo 0
1 another foo bar 1
2 baz 0
The main differences between the two functions can be summarised in terms of purpose, usage, and default behavior.
Purpose. Use str.replace for substring replacements on a single string column, and replace for any general replacement on one or more columns.
The docs market str.replace as a method for "simple string replacement", so this should be your first choice when performing string/regex substitution on a pandas Series or column. Think of it as a "vectorised" equivalent to Python's string replace() function (or re.sub(), to be more accurate).
# simple substring replacement
df['A'].str.replace('foo', 'bar', regex=False)
0 bar
1 another bar bar
2 baz
Name: A, dtype: object
# simple regex replacement (regex=True is passed explicitly; since pandas 2.0 the default is regex=False)
df['A'].str.replace('ba.', 'xyz', regex=True)
0 foo
1 another foo xyz
2 xyz
Name: A, dtype: object
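To make the "vectorised equivalent" comparison concrete, here is a minimal sketch that checks str.replace against plain Python's str.replace() and re.sub() applied element by element. The variable names are made up for illustration, and regex is passed explicitly so the result does not depend on the installed pandas version.

```python
import re
import pandas as pd

s = pd.Series(['foo', 'another foo bar', 'baz'])

# Literal replacement: same result as calling str.replace() on each element.
literal_pandas = s.str.replace('foo', 'bar', regex=False).tolist()
literal_python = [x.replace('foo', 'bar') for x in s]

# Regex replacement: same result as calling re.sub() on each element.
regex_pandas = s.str.replace(r'ba.', 'xyz', regex=True).tolist()
regex_python = [re.sub(r'ba.', 'xyz', x) for x in s]

print(literal_pandas == literal_python)
print(regex_pandas == regex_python)
```

Both comparisons should hold for string data without missing values; the pandas version additionally passes NaN through untouched, which the list comprehensions would not.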
replace works for string as well as non-string replacement. What's more, it is also meant to work for multiple columns at a time (you can access replace as a DataFrame method, df.replace(), as well, if you need to replace values across the entire DataFrame).
# DataFrame-wide replacement
df.replace({'foo': 'bar', 1: -1})
A B
0 bar 0
1 another foo bar -1
2 baz 0
Usage. str.replace can replace one thing at a time. replace lets you perform multiple independent replacements, i.e., replace many things at once.
You can only specify a single substring or regex pattern to str.replace. repl can be a callable (see the docs), so there's room to get creative with regex to somewhat simulate multiple substring replacements, but these solutions are hacky at best.
A common pandaic (pandorable, pandonic) pattern is to use str.replace to remove multiple unwanted substrings by joining them with the regex OR pipe |, with the empty string '' as the replacement.
replace should be preferred when you have multiple independent replacements of the form {'pat1': 'repl1', 'pat2': 'repl2', ...}. There are various ways of specifying independent replacements (lists, Series, dicts, etc). See the documentation.
To illustrate the difference,
df['A'].str.replace('foo', 'text1').str.replace('bar', 'text2')
0 text1
1 another text1 text2
2 baz
Name: A, dtype: object
Would be better expressed as
df['A'].replace({'foo': 'text1', 'bar': 'text2'}, regex=True)
0 text1
1 another text1 text2
2 baz
Name: A, dtype: object
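For completeness, the pipe-separated removal pattern and the callable repl trick mentioned above could look roughly like this. It is only a sketch: the mapping dict and variable names are hypothetical, and regex=True is spelled out because both techniques rely on the pattern being treated as a regular expression.

```python
import pandas as pd

s = pd.Series(['foo', 'another foo bar', 'baz'])

# Remove several unwanted substrings in one pass using the regex OR pipe.
removed = s.str.replace('foo|bar', '', regex=True)

# Simulate multiple independent replacements with a callable repl:
# the callable receives each regex match and looks up its replacement.
mapping = {'foo': 'text1', 'bar': 'text2'}  # hypothetical lookup table
mapped = s.str.replace('foo|bar', lambda m: mapping[m.group(0)], regex=True)

print(removed.tolist())
print(mapped.tolist())
```

Note that removing substrings leaves the surrounding whitespace in place, so a follow-up .str.strip() or whitespace-aware pattern may be needed depending on the data.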
Default behavior. In the context of string operations, str.replace enabled regex replacement by default before pandas 2.0; since 2.0 its default is regex=False. replace only performs a full match unless the regex=True switch is used.
Everything you do with str.replace, you can do with replace as well. However, it is important to note the following differences in the default behaviour of both methods.
str.replace will replace every occurrence of the substring; replace will only match whole values by default.
Before pandas 2.0, str.replace interpreted the first argument as a regular expression unless you specified regex=False, and replace was the exact opposite. Since 2.0 both default to regex=False, so pass regex=True explicitly when you want a pattern.
Contrast the difference between
df['A'].replace('foo', 'bar')
0 bar
1 another foo bar
2 baz
Name: A, dtype: object
And
df['A'].replace('foo', 'bar', regex=True)
0 bar
1 another bar bar
2 baz
Name: A, dtype: object
It is also worth mentioning that you can only perform string replacement when regex=True. So, for example, df.replace({'foo': 'bar', 1: -1}, regex=True) would be invalid.
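One way to sidestep the mixed string/non-string restriction is to split the work into two calls: a regex pass for the string patterns, then a plain value pass for everything else. This is a sketch of that idea rather than the only approach, and the intermediate variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'another foo bar', 'baz'],
    'B': [0, 1, 0],
})

# Pass 1: regex (substring) replacement, with string keys only.
strings_done = df.replace({'foo': 'bar'}, regex=True)

# Pass 2: ordinary whole-value replacement for the non-string values.
result = strings_done.replace({1: -1})

print(result)
```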
To summarise, the main differences are:
Purpose. Use str.replace for substring replacements on a single string column, and replace for any general replacement on one or more columns.
Usage. str.replace can replace one thing at a time. replace lets you perform multiple independent replacements, i.e., replace many things at once.
Default behavior. str.replace matches substrings, and treated the pattern as a regex by default before pandas 2.0 (the default is now regex=False). replace only performs a full match unless the regex=True switch is used.
If you are comparing str.replace with replace, I would assume that you are thinking of replacing strings only.
The two rules of thumb that help (especially when using .apply() and lambda) are:
df.replace({dict}). Remember the defaults as mentioned by cs95 or in the docs.
str.replace(): lambda x: x.str.replace('^default$', '', regex=True, case=False).
One final thing to note is that the inplace parameter is only available in the replace function and not in str.replace, which may be a deciding factor in your code, especially if you are chaining.
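A small sketch of what that looks like in practice, using two throwaway Series (the names are illustrative only): replace can modify the object in place, whereas str.replace always returns a new Series that has to be assigned or chained.

```python
import pandas as pd

s = pd.Series(['foo', 'another foo bar', 'baz'])

# replace accepts inplace=True and modifies s directly.
# With the default regex=False, only whole values equal to 'foo' change.
s.replace('foo', 'bar', inplace=True)

t = pd.Series(['foo', 'another foo bar', 'baz'])

# str.replace has no inplace parameter; it returns a new Series,
# which makes it a natural fit for method chaining.
chained = t.str.replace('foo', 'bar', regex=False).str.upper()

print(s.tolist())
print(chained.tolist())
```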