
How to remove a character from some rows in a dataframe column?

I have a large DataFrame that I need to clean. As a sample, consider this DataFrame:

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
        'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])

print(df)

I want to remove '.T' from the end of the values, and only the '.' from the beginning of the values that start with one.

With the following line of code, I could remove the '.T':

df['Price'].replace('.T', '', regex=True)

but it also removed the '.T' from '.TPX', as the output below shows.

Any advice on this is appreciated.

0    22000
1    25000
2    27000
3       PX
4    .NKM1
Name: Price, dtype: object

Also, to remove the '.', I added this line:

df['Price'].replace('.', '', regex=True)

but I get a different result than I expected:

0    
1    
2    
3    
4    
Name: Price, dtype: object
sam_sam asked Mar 19 '21 13:03


6 Answers

You can match either a dot at the start of the string, or match .T at the end. Then use an empty string in the replacement.

\A\.|\.T\Z

For example

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
        'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
df['Price'] = df['Price'].replace(r"\A\.|\.T\Z", "", regex=True)
print(df)

Output

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
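
As a quick check outside pandas, the same pattern can be tried with Python's built-in re module, which is what pandas relies on when regex=True. This is only a minimal sketch: the names PATTERN, prices, cleaned and both are illustrative, and the value '.ABC.T' is a made-up extra case that is not in the question's data.

```python
import re

# Same alternation as above: a dot at the very start OR ".T" at the very end.
PATTERN = re.compile(r"\A\.|\.T\Z")

prices = ["22000.T", "25000.T", "27000", ".TPX", ".NKM1"]
cleaned = [PATTERN.sub("", p) for p in prices]
print(cleaned)  # ['22000', '25000', '27000', 'TPX', 'NKM1']

# Hypothetical value with both a leading dot and a trailing ".T":
# sub() replaces every non-overlapping match, so both parts go.
both = PATTERN.sub("", ".ABC.T")
print(both)  # ABC
```

The '.T' inside '.TPX' survives because it is not at the end of the string, so only the first alternative matches there.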
The fourth bird answered Oct 27 '22 13:10


Another way would be to use numpy.where and evaluate your conditions using str.startswith and str.endswith:

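import numpy as np

p = df['Price'].str
# Escape the dot and anchor each pattern so that only a leading '.'
# or a trailing '.T' is removed; unmatched rows keep their value.
df['Price'] = np.where(p.startswith('.'), p.replace(r'^\.', '', regex=True),
                       np.where(p.endswith('.T'), p.replace(r'\.T$', '', regex=True), df['Price']))

This checks whether each value in df['Price'] starts with '.' or ends with '.T' and removes that part.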

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
sophocles answered Oct 27 '22 12:10


Series.str.replace

df['Price'] = df['Price'].str.replace(r'^(?:\.)?(.*?)(?:\.T)?$', r'\g<1>', regex=True)

Series.str.extract

df['Price'] = df['Price'].str.extract(r'^(?:\.)?(.*?)(?:\.T)?$', expand=False)

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1

Regex details:

  • ^ : Assert position at the start of line
  • (?:\.) : Non capturing group which matches the character .
  • ? : Matches the previous non capturing group zero or one time
  • (.*?) : Capturing group which matches any character except line terminators zero or more times but as few times as possible (lazy match)
  • (?:\.T) : Non capturing group which matches .T
  • ? : Matches the previous non capturing group zero or one time
  • $ : Asserts position at the end of the line
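
To see what the capturing group actually holds, the pattern can be run through Python's re module on the sample values. This is a rough sketch rather than part of the pandas solution; the helper name core and the list prices are made up for illustration.

```python
import re

# Optional leading dot, lazy capture of the core, optional trailing ".T".
PATTERN = re.compile(r"^(?:\.)?(.*?)(?:\.T)?$")

def core(value):
    """Return capture group 1: the value without a leading '.' or a trailing '.T'."""
    return PATTERN.match(value).group(1)

prices = ["22000.T", "25000.T", "27000", ".TPX", ".NKM1"]
cores = [core(p) for p in prices]
print(cores)  # ['22000', '25000', '27000', 'TPX', 'NKM1']
```

The lazy (.*?) is what lets the optional (?:\.T)? claim the suffix. With a greedy (.*) the group would swallow '.T' as well.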

Shubham Sharma answered Oct 27 '22 13:10


You should be able to do what you want with anchors and what's called a positive lookbehind.

df['Price'].replace(r'(?<=.)\.T$', '', regex=True)

In regular expressions, some characters have special meaning. Here, '$' anchors the match to the end of the string, so to affect only strings that end in '.T' you put it at the end of the pattern. The lookbehind is the '(?<=.)' part. The parentheses delimit a group.

I don't really know how to explain groups other than that they are loosely similar to how CSS classes work, which admittedly isn't a great analogy.

The '?<=' at the start of the group turns it into a lookbehind, and the '.' after it is its content: it tells the regex engine that some character (an unescaped '.' matches any character) must appear immediately before the '.T', without making that character part of the match.

Removing a '.' from the start of a value is simpler. It just uses the opposite anchor, '^':

df['Price'].replace(r'^\.', '', regex=True)
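
The two replacements can be chained, and the effect of the lookbehind is easiest to see on a value that consists of '.T' alone. Below is a minimal sketch using Python's re module directly; the function name clean and the test value '.T' are assumptions for illustration and do not come from the question.

```python
import re

def clean(value):
    """Strip a trailing '.T' (only if preceded by a character), then a leading '.'."""
    value = re.sub(r"(?<=.)\.T$", "", value)
    return re.sub(r"^\.", "", value)

prices = ["22000.T", "25000.T", "27000", ".TPX", ".NKM1"]
result = [clean(p) for p in prices]
print(result)  # ['22000', '25000', '27000', 'TPX', 'NKM1']

# Effect of the lookbehind on a string that is exactly ".T":
with_lookbehind = re.sub(r"(?<=.)\.T$", "", ".T")     # nothing precedes ".T", so no match
without_lookbehind = re.sub(r"\.T$", "", ".T")        # plain anchor removes everything
print(repr(with_lookbehind), repr(without_lookbehind))  # '.T' ''
```

Whether that edge case matters depends on the data. For the sample in the question, the pattern works the same with or without the lookbehind.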

https://regex101.com/ is a great website to help build your regexes. It will also explain what your regex does.

zelarian answered Oct 27 '22 12:10


You can also use numpy.select:

In [178]: import numpy as np

In [179]: conds = [df.Price.str.endswith('.T'), df.Price.str.startswith('.')]
In [182]: choices = [df.Price.str.replace(r'\.T$', '', regex=True), df.Price.str.replace(r'^\.', '', regex=True)]

In [189]: df.Price = np.select(conds, choices, default=df.Price)

In [190]: df
Out[190]: 
            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
Mayank Porwal answered Oct 27 '22 12:10


I want to explain why you got that result. It happens because . has a special meaning when used in a pattern. The list of special characters in the re docs starts with:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

So when you mean a literal . you need to escape it. Consider the following example:

import pandas as pd

df = pd.DataFrame({"col1":["3.45"]})
df["unescaped"] = df.col1.replace(r'.','X',regex=True)
df["escaped"] = df.col1.replace(r'\.','X',regex=True)
print(df)

output

   col1 unescaped escaped
0  3.45      XXXX    3X45

Note that I used a so-called raw string here, which gives a more readable way of escaping characters with special meaning in a pattern (without a raw string I would have to write '\\.'; consult the re docs for more information). If you struggle with regular expression patterns, I suggest using regex101.com to get an explanation of them.
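
If you would rather not escape by hand, the standard library's re.escape can build the escaped pattern from a literal string. The sketch below uses plain re.sub instead of pandas to keep it self-contained; the variable names are illustrative only.

```python
import re

literal = ".T"
escaped = re.escape(literal)   # backslash-escapes the dot, leaves 'T' alone
print(escaped)                 # \.T

# Anchored at the end, the escaped pattern only strips a trailing ".T".
trailing = re.sub(escaped + "$", "", "22000.T")
untouched = re.sub(escaped + "$", "", ".TPX")
print(trailing, untouched)     # 22000 .TPX

# Unescaped vs. escaped dot on the example value from above.
unescaped_result = re.sub(".", "X", "3.45")
escaped_result = re.sub(re.escape("."), "X", "3.45")
print(unescaped_result, escaped_result)  # XXXX 3X45
```

The resulting string can be passed as the pattern wherever regex=True is used.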

Daweo answered Oct 27 '22 13:10