
How to standardize strings between rows in a pandas DataFrame?

I have the following pandas DataFrame in Python 3.x:

import pandas as pd

dict1 = {
    'ID':['first', 'second', 'third', 'fourth', 'fifth'], 
    'pattern':['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE']
}

df = pd.DataFrame(dict1)

>>> df
       ID   pattern
0   first  AAABCDEE
1  second    ABBBBD
2   third     CCCDE
3  fourth        AA
4   fifth     ABCDE

There are two columns, ID and pattern. The longest string in pattern is in the first row: len('AAABCDEE') is 8.

My goal is to standardize the strings so that they all have the same length, with the missing trailing positions filled with ?.

Here is what the output should look like:

>>> df
       ID   pattern
0   first  AAABCDEE
1  second  ABBBBD??
2   third  CCCDE???
3  fourth  AA??????
4   fifth  ABCDE???

If I were able to make the trailing positions NaN, then I could try something like:

df = df.applymap(lambda x: int(x) if pd.notnull(x) else str("?"))

But I'm not sure how one would efficiently (1) find the longest string in pattern and (2) then add NaN at the end of the strings up to this length. This may be a convoluted approach...

ShanZhengYang asked Nov 29 '22 21:11



1 Answer

You can use Series.str.ljust for this, after acquiring the max string length in the column.

df.pattern.str.ljust(df.pattern.str.len().max(), '?')

# 0    AAABCDEE
# 1    ABBBBD??
# 2    CCCDE???
# 3    AA??????
# 4    ABCDE???
# Name: pattern, dtype: object
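Note that str.ljust returns a new Series rather than modifying the DataFrame in place. A minimal end-to-end sketch, reusing the data from the question and assuming you want to overwrite the pattern column (the max_len variable name is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['first', 'second', 'third', 'fourth', 'fifth'],
    'pattern': ['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE'],
})

# Length of the longest string in the column (8 for this data).
max_len = df['pattern'].str.len().max()

# ljust returns a new Series, so assign it back to keep the padded values.
df['pattern'] = df['pattern'].str.ljust(max_len, '?')

print(df)
```

Strings that already have the maximum length are left unchanged, so the first row stays as AAABCDEE.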

In the source for pandas 0.22.0, it can be seen that Series.str.ljust is entirely equivalent to Series.str.pad with side='right', so pick whichever you find clearer.
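As a quick check of that equivalence, here is a small sketch that pads the same data both ways and compares the results (the variable names are illustrative only):

```python
import pandas as pd

patterns = pd.Series(['AAABCDEE', 'ABBBBD', 'CCCDE', 'AA', 'ABCDE'])
width = patterns.str.len().max()

# Pad on the right with '?' using ljust...
left_justified = patterns.str.ljust(width, '?')

# ...and the same thing spelled with pad and an explicit side.
padded = patterns.str.pad(width, side='right', fillchar='?')

# Both calls should produce identical Series.
print(left_justified.equals(padded))
```

The pad spelling is a little more verbose, but it makes the padding side explicit, which some readers may find easier to follow.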

miradulo answered Dec 04 '22 01:12
