
Pandas .str.replace and case insensitivity

Making the replace case insensitive does not seem to have an effect in the following example (I want to replace jr. or Jr. with jr):

In [0]: pd.Series('Jr. eng').str.replace('jr.', 'jr', regex=False, case=False)
Out[0]: 0    Jr. eng

Why? What am I misunderstanding?

Toby asked Dec 20 '18 07:12



1 Answer

The case argument is actually a convenience as an alternative to specifying flags=re.IGNORECASE. It has no bearing on replacement if the replacement is not regex-based.

So, when regex=True, these are your possible choices:

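import re  # needed for the flags=re.IGNORECASE variant
import pandas as pd
pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', regex=True, case=False)
# pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', case=False)  # pandas < 2.0 only, where regex defaulted to True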

0    jr eng
dtype: object

Or,

pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', regex=True, flags=re.IGNORECASE)
# pd.Series('Jr. eng').str.replace(r'jr\.', 'jr', flags=re.IGNORECASE)  # pandas < 2.0 only, where regex defaulted to True

0    jr eng
dtype: object

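You can also get cheeky and bypass both keyword arguments by embedding the case-insensitivity flag in the pattern itself as (?i). Pass regex=True explicitly, since pandas 2.0 and later treat the pattern as a literal string by default:

pd.Series('Jr. eng').str.replace(r'(?i)jr\.', 'jr', regex=True)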
0    jr eng
dtype: object

Note
You will need to escape the period \. in regex mode, because the unescaped dot is a meta-character with a different meaning (match any character). If you want to dynamically escape meta-chars in patterns, you can use re.escape.
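
The difference the escape makes can be seen with Python's re module alone, which is what pandas compiles regex-mode patterns with. This is a minimal sketch; the sample string is invented for illustration and is not from the question.

```python
import re

text = 'Jr. eng, jrs, JR.'  # hypothetical sample input

# Unescaped dot: "jr" followed by ANY character, so "jrs" is rewritten too.
loose = re.sub('jr.', 'jr', text, flags=re.IGNORECASE)

# re.escape turns the literal 'jr.' into a pattern that matches a real period.
strict = re.sub(re.escape('jr.'), 'jr', text, flags=re.IGNORECASE)

print(loose)   # jr eng, jr, jr
print(strict)  # jr eng, jrs, jr
```

The escaped pattern can be passed to .str.replace with regex=True and case=False in the same way.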

For more information on flags and anchors, see the Python re module documentation and the Regular Expression HOWTO.


From the pandas source code at the time of writing, it is clear that the case argument is ignored if regex=False:

# Check whether repl is valid (GH 13438, GH 15055)
if not (is_string_like(repl) or callable(repl)):
    raise TypeError("repl must be a string or callable")

is_compiled_re = is_re(pat)
if regex:
    if is_compiled_re:
        if (case is not None) or (flags != 0):
            raise ValueError("case and flags cannot be set"
                             " when pat is a compiled regex")
    else:
        # not a compiled regex
        # set default case
        if case is None:
            case = True

        # add case flag, if provided
        if case is False:
            flags |= re.IGNORECASE
    if is_compiled_re or len(pat) > 1 or flags or callable(repl):
        n = n if n >= 0 else 0
        compiled = re.compile(pat, flags=flags)
        f = lambda x: compiled.sub(repl=repl, string=x, count=n)
    else:
        f = lambda x: x.replace(pat, repl, n)

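The case argument is only checked inside the if regex: branch, so it never reaches the plain string replacement path.

In other words, in the version shown, the only way to make case take effect is to pass regex=True so that the replacement is regex-based.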
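
To make the control flow concrete, here is a simplified stand-in for the excerpt, written against plain Python strings. The function name build_replacer is hypothetical, the compiled-regex checks are left out, and the regex=False fallback to str.replace is an assumption, since the excerpt does not show that branch. It models the quoted snippet only, not whatever a current pandas release does.

```python
import re

def build_replacer(pat, repl, n=-1, case=None, flags=0, regex=True):
    """Simplified model of the excerpt's branch logic (not pandas' real code)."""
    if regex:
        # `case` is consulted only here, inside the regex branch.
        if case is None:
            case = True
        if case is False:
            flags |= re.IGNORECASE
        if len(pat) > 1 or flags or callable(repl):
            count = n if n >= 0 else 0  # re treats count=0 as "replace all"
            compiled = re.compile(pat, flags=flags)
            return lambda x: compiled.sub(repl, x, count=count)
        return lambda x: x.replace(pat, repl, n)
    # Assumed literal path (not shown in the excerpt): `case` and `flags`
    # are never looked at, so matching stays case-sensitive.
    return lambda x: x.replace(pat, repl, n)

literal = build_replacer('jr.', 'jr', case=False, regex=False)
pattern = build_replacer(r'jr\.', 'jr', case=False, regex=True)

print(literal('Jr. eng'))  # Jr. eng
print(pattern('Jr. eng'))  # jr eng
```

With regex=False the case=False argument is silently dropped, which reproduces the behaviour in the question.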

cs95 answered Sep 24 '22 00:09