I'm attempting to clean up some of the Data that I have from an excel file. The file contains 7400 rows and 18 columns, which includes a list of customers with their respective addresses and other data. The problem that I'm encountering is that some of the cities are misspelled which distorts the information and makes it difficult for further processing.
SURNAME | ADDRESS | CITY
0 Jenson | 252 Des Chênes | D.DO
1 Jean | 236 Gouin | DOLLARD
2 Denis | 993 Boul. Gouin | DOLLARD-DES-ORMEAUX
3 Bradford | 1690 Dollard #7 | DDO
4 Alisson | 115 Du Buisson | IL PERROT
5 Abdul | 9877 Boul. Gouin | Pierrefonds
6 O'Neil | 5 Du College | Ile Bizard
7 Bundy | 7345 Sherbrooke | ILLE Perot
8 Darcy | 8671 Anthony #2 | ILE Perrot
9 Adams | 845 Georges | Pierrefonds
In the above example D.DO, DOLLARD, DDO should be spelled DOLLARD-DES-ORMEAUX and IL PERROT, ILLE PEROT, ILE PERROT should be spelled ILE-PERROT.
I've been able to replace the values using:
df["CITY"].replace(to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", regex=True)
df["CITY"].replace(to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT", regex=True)
Is there some way of combining the above operations into one? I've tried:
df["CITY"].replace({to_replace={"D.DO", "DOLLARD", "DDO"}, value="DOLLARD-DES-ORMEAUX", to_replace={"IL PERROT", "ILLE PEROT", "ILE PERROT"}, value="ILE-PERROT"}, regex=True)
but I've had no luck
Pandas replace multiple values in column replace. By using DataFrame. replace() method we will replace multiple values with multiple new strings or text for an individual DataFrame column. This method searches the entire Pandas DataFrame and replaces every specified value.
Method 3: Replace multiple characters using re.subn() is similar to sub() in all ways, except in its way of providing output. It returns a tuple with a count of the total of replacement and the new string rather than just the string.
You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.
try .replace({}, regex=True)
method:
replacements = {
'CITY': {
r'(D.*DO|DOLLARD.*)': 'DOLLARD-DES-ORMEAUX',
r'I[lL]*[eE]*.*': 'ILLE Perot'}
}
df.replace(replacements, regex=True, inplace=True)
print(df)
Output:
SURNAME ADDRESS CITY
0 Jenson 252 Des Chênes DOLLARD-DES-ORMEAUX
1 Jean 236 Gouin DOLLARD-DES-ORMEAUX
2 Denis 993 Boul. Gouin DOLLARD-DES-ORMEAUX
3 Bradford 1690 Dollard #7 DOLLARD-DES-ORMEAUX
4 Alisson 115 Du Buisson ILLE Perot
5 Abdul 9877 Boul. Gouin Pierrefonds
6 O'Neil 5 Du College ILLE Perot
7 Bundy 7345 Sherbrooke ILLE Perot
8 Darcy 8671 Anthony #2 ILLE Perot
9 Adams 845 Georges Pierrefonds
You can create a dictionary of replacements and then iterate through them, using 'loc' for replacement.
target_for_values = {
'DOLLARD-DES-ORMEAUX': ['D.DO', 'DOLLARD', 'DDO'],
'ILE-PERROT': ['IL PERROT', 'ILLE PEROT', 'ILE PERROT']}
for k, v in target_for_values.iteritems():
df.loc[df.CITY.str.upper().isin(v), 'CITY'] = k
>>> df.CITY
CITY
0 C.DO
1 DOLLARD-DES-ORMEAUX
2 DOLLARD-DES-ORMEAUX
3 DOLLARD-DES-ORMEAUX
4 ILE-PERROT
5 Pierrefonds
6 Ile Bizard
7 ILE-PERROT
8 ILE-PERROT
9 Pierrefonds
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With