How to remove a character from some rows in a dataframe column?

Tags:

python

regex

pandas

dataframe

data-cleaning

I have a large DataFrame that I need to clean. As a sample, consider this DataFrame:

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
        'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])

print (df)

I want to remove '.T' from the end of the values, and only the '.' from the beginning of the values that start with it.

With the following line of code, I could remove the '.T':

df['Price'].replace('.T', '', regex=True)

but it also stripped the leading '.T' from '.TPX', leaving 'PX'.

Any advice on this is appreciated.

0    22000
1    25000
2    27000
3       PX
4    .NKM1
Name: Price, dtype: object

Also, to remove the '.', when I add this line

df['Price'].replace('.', '', regex=True)

I get a different result from what I expected:

0    
1    
2    
3    
4    
Name: Price, dtype: object

asked Mar 19 '21 13:03

sam_sam

6 Answers

You can match either a dot at the start of the string, or match .T at the end. Then use an empty string in the replacement.

\A\.|\.T\Z

For example

import pandas as pd

cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
        'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
        }

df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
df['Price'] = df['Price'].replace(r"\A\.|\.T\Z", "", regex=True)
print(df)

Output

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
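
The alternation can also be checked outside pandas. The following is a minimal sketch using Python's standard-library re module on the sample values from the question; the names pattern, prices and cleaned are illustrative and not part of the answer, and it assumes pandas applies the pattern the same way re does.

```python
import re

# Either a literal dot at the very start, or a literal '.T' at the very end
pattern = re.compile(r"\A\.|\.T\Z")

prices = ['22000.T', '25000.T', '27000', '.TPX', '.NKM1']
cleaned = [pattern.sub("", value) for value in prices]
print(cleaned)
```

Because both alternatives are anchored, a '.T' in the middle or at the start of a value (as in '.TPX') is left alone apart from its leading dot.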

answered Oct 27 '22 13:10

The fourth bird

Another way would be to use numpy.where and evaluate your conditions using str.startswith and str.endswith:

import numpy as np

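price = df['Price']
# Anchor and escape the patterns so that '.' is treated as a literal dot
df['Price'] = np.where(price.str.startswith('.'), price.str.replace(r'^\.', '', regex=True),
                       np.where(price.str.endswith('.T'), price.str.replace(r'\.T$', '', regex=True), price))

This checks whether each value in df['Price'] starts with a '.' or ends with '.T' and strips that part. Values that match neither condition are kept unchanged.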

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1
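
If a regular expression is not needed, the same two conditions can be expressed with plain string methods. This is only a rough sketch: clean_price is a hypothetical helper (not from the answer) that could, for instance, be passed to df['Price'].map. Unlike the nested np.where, it strips both the leading '.' and the trailing '.T' when a value happens to have both.

```python
def clean_price(value):
    """Hypothetical helper: drop one leading '.' and one trailing '.T'."""
    if value.startswith('.'):
        value = value[1:]
    if value.endswith('.T'):
        value = value[:-2]
    return value


prices = ['22000.T', '25000.T', '27000', '.TPX', '.NKM1']
cleaned = [clean_price(value) for value in prices]
print(cleaned)
```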

answered Oct 27 '22 12:10

sophocles

`Series.str.replace`

df['Price'] = df['Price'].str.replace(r'^(?:\.)?(.*?)(?:\.T)?$', r'\g<1>', regex=True)

`Series.str.extract`

df['Price'] = df['Price'].str.extract(r'^(?:\.)?(.*?)(?:\.T)?$', expand=False)

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1

Regex details:

^ : Assert position at the start of line
(?:\.) : Non capturing group which matches the character .
? : Matches the previous non capturing group zero or one time
(.*?) : Capturing group which matches any character except line terminators zero or more times but as few times as possible (lazy match)
(?:\.T) : Non capturing group which matches .T
? : Matches the previous non capturing group zero or one time
$ : Asserts position at the end of the line

See the Regex demo
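
To see what the capturing group keeps, here is a small sketch with the standard-library re module, run on the sample values from the question. The variable names are illustrative only, and it assumes pandas evaluates the pattern the same way re does.

```python
import re

pattern = re.compile(r'^(?:\.)?(.*?)(?:\.T)?$')

prices = ['22000.T', '25000.T', '27000', '.TPX', '.NKM1']
# group(1) is the lazily matched middle part, i.e. what \g<1> refers to
kept = [pattern.match(value).group(1) for value in prices]
print(kept)
```

The lazy quantifier matters here: it lets the optional '.T' group claim the suffix instead of the capturing group swallowing it.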

answered Oct 27 '22 13:10

Shubham Sharma

You should be able to do what you want with anchors and what's called a positive lookbehind.

df['Price'].replace(r'(?<=.)\.T$', '', regex=True)

In regular expressions, some characters have special meaning. Here, '$' anchors the match to the end of the string, so if you want to affect only strings that end in '.T', you put it at the end of the pattern. The part of the expression that is the lookbehind is '(?<=.)'. The parentheses signify a group.

I don't really know how to explain it other than it's kind of similar to how CSS classes work, which really isn't that great of an example.

The '?<=.' is the actual lookbehind: it tells the regex engine that some character (the '.') must come immediately before the part matched outside the group (the '\.T'), without that character becoming part of the match.

Removing a '.' at the start of a string is very simple. It just uses the opposite anchor, '^':

df['Price'].replace(r'^\.', '', regex=True)
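
The two substitutions can be tried on plain strings first. The sketch below uses the standard-library re module; strip_markers is a hypothetical helper that simply chains the two patterns from this answer, and the last two lines show what the lookbehind changes for a value that is exactly '.T'.

```python
import re


def strip_markers(value):
    # Hypothetical helper chaining the two substitutions from the answer
    value = re.sub(r'(?<=.)\.T$', '', value)  # trailing '.T', only if a character precedes it
    return re.sub(r'^\.', '', value)          # leading '.'


prices = ['22000.T', '25000.T', '27000', '.TPX', '.NKM1']
cleaned = [strip_markers(value) for value in prices]
print(cleaned)

# With the lookbehind, a bare '.T' is not treated as a suffix to remove
bare_with_lookbehind = re.sub(r'(?<=.)\.T$', '', '.T')
bare_without_lookbehind = re.sub(r'\.T$', '', '.T')
```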

https://regex101.com/ is a great website to help build your regexes. It will also explain what your regex does.

answered Oct 27 '22 12:10

zelarian

You can also use numpy.select:

In [178]: import numpy as np

In [179]: conds = [df.Price.str.endswith('.T'), df.Price.str.startswith('.')]
In [182]: choices = [df.Price.str.replace(r'\.T$', '', regex=True), df.Price.str.replace(r'^\.', '', regex=True)]

In [189]: df.Price = np.select(conds, choices, default=df.Price)

In [190]: df
Out[190]: 
            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4    TPX
4          Suzuki   NKM1

answered Oct 27 '22 12:10

Mayank Porwal

I want to explain why you got that result. It is because . has a special meaning when used in a pattern. The list of special characters in the re docs starts with

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

So when you mean a literal . you need to escape it. Consider the following example:

df = pd.DataFrame({"col1":["3.45"]})
df["unescaped"] = df.col1.replace(r'.','X',regex=True)
df["escaped"] = df.col1.replace(r'\.','X',regex=True)
print(df)

output

   col1 unescaped escaped
0  3.45      XXXX    3X45

Note that I used a so-called raw string here, which allows a more readable form of escaping characters with special meaning in a pattern (without a raw string I would have to write '\\.'; consult the re docs for more information). If you struggle with regular expression patterns, I suggest using regex101.com to get an explanation of them.
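
When the text to remove comes from a variable rather than being typed by hand, re.escape from the standard library can do the escaping. A minimal sketch, with a made-up input string chosen only to show the difference between the unescaped and escaped pattern:

```python
import re

literal = re.escape('.T')  # backslash-escapes the dot so it matches only a real '.'

sample = '.TPX 9T'  # made-up input for illustration
unescaped = re.sub('.T', '', sample)     # '.' matches any character, so '9T' is removed too
escaped = re.sub(literal, '', sample)    # only the literal '.T' is removed
print(repr(unescaped), repr(escaped))
```

The escaped pattern can then be combined with anchors such as '^' or '$' in the same way as a hand-written one.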

answered Oct 27 '22 13:10

Daweo
