How to sort strings with numbers in Pandas?

I have a pandas DataFrame in which a column named status contains three kinds of values: ok, must read x more books, and does not read any book yet, where x is an integer greater than 0.

I want to sort status values according to the order above.

Example:

  name    status
0 Paul    ok
1 Jean    must read 1 more books
2 Robert  must read 2 more books
3 John    does not read any book yet

I've found some interesting hints using pandas Categorical and map, but I don't know how to handle strings that contain a variable number.

How can I achieve that?

Kfcaio asked Dec 24 '22 05:12


2 Answers

Use:

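# Extract the integer from each status; rows without digits become NaN
a = df['status'].str.extract(r'(\d+)', expand=False).astype(float)

# Sort keys for the fixed statuses: 'ok' above every number, the last status below
d = {'ok': a.max() + 1, 'does not read any book yet': -1}

# Negate the keys so that the ascending argsort puts the largest key first
df1 = df.iloc[(-df['status'].map(d).fillna(a)).argsort()]
print(df1)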
     name                      status
0    Paul                          ok
2  Robert      must read 2 more books
1    Jean      must read 1 more books
3    John  does not read any book yet

Explanation:

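  1. Extract the integers with the regex \d+; rows without digits become NaN.
  2. Dynamically build a dictionary that maps the non-numeric values to sort keys.
  3. Map the statuses through the dictionary and replace the resulting NaNs with the extracted numbers using fillna.
  4. Negate the keys and get the sorted positions with argsort, so larger keys come first.
  5. Select the rows in that order with iloc.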
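
The steps above can be traced one at a time on the question's sample data. This is only a sketch: the DataFrame is rebuilt from the example in the question, and the intermediate names keys and order are introduced here for illustration and are not part of the answer.

```python
import pandas as pd

# Sample data rebuilt from the question's example
df = pd.DataFrame({
    'name': ['Paul', 'Jean', 'Robert', 'John'],
    'status': ['ok', 'must read 1 more books',
               'must read 2 more books', 'does not read any book yet'],
})

# Step 1: pull the integer out of each status; rows without digits become NaN
a = df['status'].str.extract(r'(\d+)', expand=False).astype(float)

# Step 2: give the fixed statuses keys outside the numeric range
d = {'ok': a.max() + 1, 'does not read any book yet': -1}

# Step 3: map the fixed statuses, then fill the remaining NaNs with the numbers
keys = df['status'].map(d).fillna(a)

# Steps 4 and 5: argsort the negated keys (largest key first), reorder with iloc
order = (-keys).argsort().to_numpy()
df1 = df.iloc[order]

print(keys.tolist())         # [3.0, 1.0, 2.0, -1.0]
print(df1['name'].tolist())  # ['Paul', 'Robert', 'Jean', 'John']
```

Because the keys are negated, rows with a larger x come first among the must read rows; dropping the negation and swapping the two dictionary keys would presumably give the ascending variant.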
jezrael answered Dec 25 '22 20:12


You can use sorted with a custom key function to compute the indices that would sort the column (much like numpy.argsort), then pass them to pd.DataFrame.iloc:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Paul', 'Jean', 'Robert', 'John'],
                   'status': ['ok', 'must read 20 more books',
                              'must read 3 more books', 'does not read any book yet']})

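def sort_key(x):
    # x is an (index, status) pair produced by enumerate
    if x[1] == 'ok':
        return -1
    elif x[1] == 'does not read any book yet':
        return np.inf
    else:
        # 'must read N more books': N is the third word
        return int(x[1].split()[2])

# Row positions in sorted order
idx = [i for i, _ in sorted(enumerate(df['status']), key=sort_key)]

df = df.iloc[idx, :]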

print(df)

     name                      status
0    Paul                          ok
2  Robert      must read 3 more books
1    Jean     must read 20 more books
3    John  does not read any book yet
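
A possible variant of the same idea, assuming pandas 1.1 or later, is to pass the ranking through the key argument of sort_values instead of building the index list by hand. The helper name status_rank is hypothetical and only mirrors the sort_key logic above; treat this as a sketch, not as part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Paul', 'Jean', 'Robert', 'John'],
                   'status': ['ok', 'must read 20 more books',
                              'must read 3 more books',
                              'does not read any book yet']})

def status_rank(status):
    """Map one status string to a sortable number (hypothetical helper)."""
    if status == 'ok':
        return -1.0
    if status == 'does not read any book yet':
        return np.inf
    # 'must read N more books': N is the third word
    return float(status.split()[2])

# key receives the whole column as a Series and must return one of equal length
result = df.sort_values('status', key=lambda s: s.map(status_rank))

print(result['name'].tolist())  # ['Paul', 'Robert', 'Jean', 'John']
```

This keeps the original index labels, so the printed frame would look the same as the iloc result above.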
jpp answered Dec 25 '22 20:12