I have a large DataFrame that I need to clean. As a sample, please look at this DataFrame:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
}
df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
print (df)
I want to remove '.T' from the end of the values, and only the '.' from the beginning of the values that start with it.
With the following line of code, I could remove the '.T'
df['Price'].replace('.T', '', regex=True)
but it also removed the '.T' from the start of '.TPX', as the output shows. Any advice on this is appreciated.
0 22000
1 25000
2 27000
3 PX
4 .NKM1
Name: Price, dtype: object
Also, for removing the '.', when I add this line
df['Price'].replace('.', '', regex=True)
I get a different result from what I expected:
0
1
2
3
4
Name: Price, dtype: object
You can match either a dot at the start of the string or .T at the end, then use an empty string as the replacement.
\A\.|\.T\Z
For example
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'],
'Price': ['22000.T','25000.T','27000','.TPX','.NKM1']
}
df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
df['Price'] = df['Price'].replace(r"\A\.|\.T\Z", "", regex=True)
print(df)
Output
Brand Price
0 Honda Civic 22000
1 Toyota Corolla 25000
2 Ford Focus 27000
3 Audi A4 TPX
4 Suzuki NKM1
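To see how the two alternatives behave on their own, here is a minimal sketch using only the standard re module instead of pandas. The value '.99.T' is a hypothetical addition, not part of the original data; it shows that both branches can apply to the same string.

```python
import re

# Same alternation as above: a dot at the very start, or ".T" at the very end
pattern = re.compile(r"\A\.|\.T\Z")

# '.99.T' is a hypothetical value added for illustration only
samples = ['22000.T', '27000', '.TPX', '.NKM1', '.99.T']

# sub() removes every non-overlapping match, so a string can lose both parts
cleaned = [pattern.sub('', s) for s in samples]
print(cleaned)  # ['22000', '27000', 'TPX', 'NKM1', '99']
```

Because \A and \Z only match at the very start and very end of the string, the '.T' inside '.TPX' is left alone and only its leading dot is removed.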
Another way would be to use numpy.where and evaluate your conditions using str.startswith and str.endswith:
import numpy as np
p = df['Price'].str
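# Escape the dot and anchor each pattern so only a leading '.' or a trailing '.T' is removed;
# an unescaped '.' would match every character and leave an empty string
df['Price'] = np.where(p.startswith('.'), p.replace(r'^\.', '', regex=True),
              np.where(p.endswith('.T'), p.replace(r'\.T$', '', regex=True), df['Price']))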
This checks whether each value in df['Price'] starts with a . or ends with a .T and removes that part.
Brand Price
0 Honda Civic 22000
1 Toyota Corolla 25000
2 Ford Focus 27000
3 Audi A4 TPX
4 Suzuki NKM1
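One thing to keep in mind with the conditional approach is that each row goes through only the first branch whose condition holds. The sketch below compares it with a single anchored regex; the value '.99.T' is hypothetical and not part of the original data, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# '.99.T' is a hypothetical value that has both a leading '.' and a trailing '.T'
prices = pd.Series(['22000.T', '.TPX', '.99.T'])
p = prices.str

lead = p.replace(r'^\.', '', regex=True)    # leading dot removed
trail = p.replace(r'\.T$', '', regex=True)  # trailing .T removed

# Conditional approach: a row gets one treatment or the other, never both
either = np.where(p.startswith('.'), lead,
                  np.where(p.endswith('.T'), trail, prices))

# Single regex with alternation: both parts are removed when both are present
both = p.replace(r'^\.|\.T$', '', regex=True)

print(list(either))  # ['22000', 'TPX', '99.T']
print(list(both))    # ['22000', 'TPX', '99']
```

For the sample data in the question the two give the same result, since no value there has both a leading dot and a trailing .T.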
Series.str.replace
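# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['Price'] = df['Price'].str.replace(r'^(?:\.)?(.*?)(?:\.T)?$', r'\g<1>', regex=True)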
Series.str.extract
df['Price'] = df['Price'].str.extract(r'^(?:\.)?(.*?)(?:\.T)?$', expand=False)
Brand Price
0 Honda Civic 22000
1 Toyota Corolla 25000
2 Ford Focus 27000
3 Audi A4 TPX
4 Suzuki NKM1
Regex details:
^ : Asserts position at the start of the line
(?:\.) : Non-capturing group which matches the character .
? : Matches the previous non-capturing group zero or one time
(.*?) : Capturing group which matches any character except line terminators zero or more times, but as few times as possible (lazy match)
(?:\.T) : Non-capturing group which matches .T
? : Matches the previous non-capturing group zero or one time
$ : Asserts position at the end of the line
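The breakdown above can be checked with the standard re module. This is a small sketch, not part of the original answer; it also shows why the capturing group has to be lazy, by comparing it with a greedy variant (the name greedy is only for illustration).

```python
import re

# Pattern from the answer: optional leading dot, lazy capture, optional trailing .T
pattern = re.compile(r'^(?:\.)?(.*?)(?:\.T)?$')

samples = ['22000.T', '25000.T', '27000', '.TPX', '.NKM1']

# group(1) is the part that the capturing group keeps
kept = [pattern.match(s).group(1) for s in samples]
print(kept)  # ['22000', '25000', '27000', 'TPX', 'NKM1']

# With a greedy (.*) the capture swallows the trailing .T,
# because the optional (?:\.T)? is then allowed to match nothing
greedy = re.compile(r'^(?:\.)?(.*)(?:\.T)?$')
print(greedy.match('22000.T').group(1))  # 22000.T
```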
You should be able to do what you want with anchors and what's called a positive lookbehind.
df['Price'].replace(r'(?<=.)\.T$', '', regex=True)
In regular expressions, some characters have special meaning. Here, the '$' anchors the match to the end of the string, so if you only want to affect strings that end in '.T' you add it to the end of the pattern. The lookbehind part of the expression is '(?<=.)'. The parentheses form a group, but a lookbehind group only checks what comes before the current position; it is not part of the match itself, so whatever it checks is not replaced.
The '.' inside the lookbehind stands for any character, so '(?<=.)' tells the regex engine that at least one character must come before the part outside the group ( '\.T' ).
Removing a '.' at the start of a value is very simple. It just uses the opposite anchor, '^':
df['Price'].replace(r'^\.', '', regex=True)
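As a rough illustration of the two patterns working together, here is a sketch with the standard re module. The helper name clean and the bare value '.T' are hypothetical; '.T' is included only to show what the lookbehind changes.

```python
import re

def clean(value):
    # Hypothetical helper: strip a trailing '.T' (only if something precedes it),
    # then strip a leading '.'
    value = re.sub(r'(?<=.)\.T$', '', value)
    return re.sub(r'^\.', '', value)

samples = ['22000.T', '27000', '.TPX', '.NKM1']
results = [clean(s) for s in samples]
print(results)  # ['22000', '27000', 'TPX', 'NKM1']

# Effect of the lookbehind on a string that is nothing but '.T' (hypothetical value)
with_lookbehind = re.sub(r'(?<=.)\.T$', '', '.T')
without_lookbehind = re.sub(r'\.T$', '', '.T')
print(repr(with_lookbehind))     # '.T'  (no character before the match, so nothing is removed)
print(repr(without_lookbehind))  # ''
```

For the sample data the lookbehind makes no difference, so whether to keep it depends on how a value consisting only of '.T' should be treated.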
https://regex101.com/ is a great website to help build your regexes. It will also explain what your regex does.
You can also use numpy.select:
In [178]: import numpy as np
In [179]: conds = [df.Price.str.endswith('.T'), df.Price.str.startswith('.')]
In [182]: choices = [df.Price.str.replace(r'\.T$', '', regex=True), df.Price.str.replace(r'^\.', '', regex=True)]
In [189]: df.Price = np.select(conds, choices, default=df.Price)
In [190]: df
Out[190]:
Brand Price
0 Honda Civic 22000
1 Toyota Corolla 25000
2 Ford Focus 27000
3 Audi A4 TPX
4 Suzuki NKM1
I want to explain why you got that result. This is because . has a special meaning when used in a pattern. The list of special characters in the re docs starts with:
. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
So when you mean a literal . you need to escape it. Consider the following example:
df = pd.DataFrame({"col1":["3.45"]})
df["unescaped"] = df.col1.replace(r'.','X',regex=True)
df["escaped"] = df.col1.replace(r'\.','X',regex=True)
print(df)
output
col1 unescaped escaped
0 3.45 XXXX 3X45
Note that I used a so-called raw string here, which allows a more readable form of escaping characters with special meaning in a pattern (without a raw string I would have to write '\\.'; consult the re docs for more information). If you struggle with a regular expression pattern, I suggest using regex101.com to get an explanation of it.
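The same point can be checked without pandas. The following is a small sketch with the standard re module; the use of re.escape is an addition here, shown as one way to avoid escaping by hand.

```python
import re

# A raw string and a doubled backslash spell the same two-character pattern
same = (r'\.' == '\\.')
print(same)  # True

unescaped = re.sub('.', 'X', '3.45')   # '.' matches every character
escaped = re.sub(r'\.', 'X', '3.45')   # '\.' matches only a literal dot
print(unescaped)  # XXXX
print(escaped)    # 3X45

# re.escape builds the escaped pattern from a literal string
literal = re.escape('.T')
print(literal)  # \.T
```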