I have a dataframe which looks like this:
A B C
1 red78 square big235
2 green circle small123
3 blue45 triangle big657
I need to be able to remove the non-numeric characters from all the rows in column C so that my dataframe looks like:
A B C
1 red78 square 235
2 green circle 123
3 blue45 triangle 657
I tried using the following but get the error expected string or buffer:
import re
dfOutput.imgID = dfOutput.imgID.apply(re.sub('[^0-9]','', dfOutput.imgID), axis = 0)
What should I do instead?
Code to create dataframe:
dfObject = pd.DataFrame()
dfObject.set_value(1, 'A', 'red78')
dfObject.set_value(1, 'B', 'square')
dfObject.set_value(1, 'C', 'big235')
dfObject.set_value(2, 'A', 'green')
dfObject.set_value(2, 'B', 'circle')
dfObject.set_value(2, 'C', 'small123')
dfObject.set_value(3, 'A', 'blue45')
dfObject.set_value(3, 'B', 'triangle')
dfObject.set_value(3, 'C', 'big657')
In order to remove all non-numeric characters from a string, replace() function is used. replace() Function: This function searches a string for a specific value, or a RegExp, and returns a new string where the replacement is done.
To remove a character in an R data frame column, we can use gsub function which will replace the character with blank. For example, if we have a data frame called df that contains a character column say x which has a character ID in each value then it can be removed by using the command gsub("ID","",as.
Use str.extract
and pass a regex pattern to extract just the numeric parts:
In[40]:
dfObject['C'] = dfObject['C'].str.extract('(\d+)', expand=False)
dfObject
Out[40]:
A B C
1 red78 square 235
2 green circle 123
3 blue45 triangle 657
If needed you can cast to int
:
dfObject['C'] = dfObject['C'].astype(int)
To remove all non-digit characters from strings in a Pandas column you should use str.replace
with \D+
or [^0-9]+
patterns:
dfObject['C'] = dfObject['C'].str.replace(r'\D+', '')
Or, since in Python 3, \D
is fully Unicode-aware by default and thus does not match non-ASCII digits (like ۱۲۳۴۵۶۷۸۹
, see proof) you should consider
dfObject['C'] = dfObject['C'].str.replace(r'[^0-9]+', '')
So,
import re
print ( re.sub( r'\D+', '', '1۱۲۳۴۵۶۷۸۹0') ) # => 1۱۲۳۴۵۶۷۸۹0
print ( re.sub( r'[^0-9]+', '', '1۱۲۳۴۵۶۷۸۹0') ) # => 10
You can use .str.replace
with a regex:
dfObject['C'] = dfObject.C.str.replace(r"[a-zA-Z]",'')
output:
A B C
1 red78 square 235
2 green circle 123
3 blue45 triangle 657
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With