Tokenize data in Python (converting data into patterns)

I have a dataframe which is like the one below:

Name      | City
----------|---------
Apple     | Tokyo
Papaya    | Pune
TimGru334 | Shanghai
236577    | Delhi

I need to iterate through each value and tokenise it in Python, that is, convert it into a pattern. In detail:

  • The value 'Apple' should be converted to 'ccccc', where c indicates a character.
  • The value 'TimGru334' should be converted to 'ccccccddd'.
  • The value '236577' should be converted to 'dddddd', where d indicates a digit.

Can someone help me out please?

P.S: I'm new to the platform, so please excuse me if I'm wrong in any manner. Thanks in advance :)

Proton asked Mar 04 '23 17:03

1 Answer

Use Series.replace with regex=True, replacing non-digit characters first and then digits. The order of the values in the lists is important:

# Raw strings keep the backslashes in the regex patterns intact
df['Name'] = df['Name'].replace([r'\D', r'\d'], ['c', 'd'], regex=True)
print(df)
        Name      City
0      ccccc     Tokyo
1     cccccc      Pune
2  ccccccddd  Shanghai
3     dddddd     Delhi
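
To see why the order matters, here is a minimal sketch in plain Python that uses re.sub instead of pandas. It assumes the substitutions are applied one after another to the same string. The helper name apply_in_order is hypothetical and is not part of pandas.

```python
import re

def apply_in_order(value, patterns, replacements):
    """Apply each regex substitution in turn to the same string (hypothetical helper)."""
    for pattern, replacement in zip(patterns, replacements):
        value = re.sub(pattern, replacement, value)
    return value

# Non-digits first, then digits: the 'c' written in step 1 is not a digit,
# so step 2 leaves it alone.
good = apply_in_order("TimGru334", [r"\D", r"\d"], ["c", "d"])

# Digits first, then non-digits: the 'd' written in step 1 is itself a
# non-digit, so step 2 overwrites it with 'c'.
bad = apply_in_order("TimGru334", [r"\d", r"\D"], ["d", "c"])

print(good)  # ccccccddd
print(bad)   # ccccccccc
```

With the digits replaced first, the digit positions are lost, which is why the non-digit pattern has to come first in the lists.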

If you need to replace the values in all columns:

# Same replacement applied to every column of the DataFrame
df = df.replace([r'\D', r'\d'], ['c', 'd'], regex=True)
print(df)
        Name      City
0      ccccc     ccccc
1     cccccc      cccc
2  ccccccddd  cccccccc
3     dddddd     ccccc
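
Note that \D matches every non-digit, so spaces and punctuation also become c. If only letters should map to c, one possible variant is a small per-character function. This is only a sketch: the name to_pattern and the choice to leave other characters unchanged are assumptions, not part of the approach above.

```python
def to_pattern(value):
    """Map letters to 'c' and digits to 'd'; keep every other character as-is."""
    text = str(value)  # also handles values stored as numbers
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("c")
        else:
            out.append(ch)  # assumption: spaces, punctuation etc. stay unchanged
    return "".join(out)

print(to_pattern("TimGru334"))   # ccccccddd
print(to_pattern("New York-7"))  # ccc cccc-d
print(to_pattern(236577))        # dddddd
```

With pandas, this could be applied to a single column with df['Name'] = df['Name'].map(to_pattern).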
jezrael answered Mar 16 '23 20:03
