Sort a string array by element lengths using NumPy

I want to sort a string array using numpy by the length of the elements.

>>> import numpy as np
>>> arr = ["year","month","eye","i","stream","key","house"]
>>> x = np.sort(arr, axis=-1, kind='mergesort')
>>> print(x)
['eye' 'house' 'i' 'key' 'month' 'stream' 'year']

But this sorts them in alphabetical order. How can I sort them by their length using NumPy?

GGG asked Sep 18 '25 13:09


2 Answers

Add a helper array containing the lengths of the strings, then use NumPy's argsort, which returns the indices that would sort those lengths. Index the original data with these indices:

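import numpy as np
arr = np.array(["year","month","eye","i","stream","key","house"])  # NumPy array needed for the index-array lookup below
lengths = [len(s) for s in arr]  # a list rather than map(): in Python 3, map() returns an iterator that argsort cannot handle
x = arr[np.argsort(lengths)]  # indices that order the lengths, applied to the original array
print(x)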
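As a minimal sketch of the same idea, the steps can be wrapped in a function. The name sort_by_length is hypothetical, and passing kind='stable' to argsort is an addition not in the answer above; it is assumed here so that strings of equal length keep their original relative order, as sorted(words, key=len) would.

```python
import numpy as np

def sort_by_length(words):
    """Return the strings in `words` as a NumPy array ordered by length.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(words)
    # Helper array holding the length of each string
    lengths = np.array([len(w) for w in arr])
    # kind='stable' keeps equal-length strings in their original relative order
    order = np.argsort(lengths, kind='stable')
    return arr[order]

words = ["year", "month", "eye", "i", "stream", "key", "house"]
result = sort_by_length(words)
print(result)
```

With the stable sort, 'eye' stays ahead of 'key' and 'month' ahead of 'house', because that is the order in which they appear in the input.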
sascha answered Sep 21 '25 02:09


If I expand your list to arr1=arr*1000 (7000 strings), the plain Python sort, sorted with len as the key function, is the fastest.

In [77]: len(arr1)
Out[77]: 7000

In [78]: timeit sarr=sorted(arr1,key=len)
100 loops, best of 3: 3.03 ms per loop

In [79]: %%timeit
arrA=np.array(arr1)
larr=[len(i) for i in arrA]  # list comprehension works same as map
sarr=arrA[np.argsort(larr)]
   ....: 
100 loops, best of 3: 7.77 ms per loop

Converting the list to an array takes about 1 ms; that conversion adds significant overhead, especially for small lists. Even starting from an already created array and using np.char.str_len, the NumPy version is still slower than the Python sort.

In [83]: timeit sarr=arrA[np.argsort(np.char.str_len(arrA))]
100 loops, best of 3: 6.51 ms per loop

The np.char functions can be convenient, but they still basically iterate over the array, applying the corresponding str method to each element.
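A small check, assuming the same seven strings as in the question, that np.char.str_len returns the same lengths as a plain list comprehension over the array. The names lens_char and lens_loop are only for illustration.

```python
import numpy as np

arrA = np.array(["year", "month", "eye", "i", "stream", "key", "house"])

# Element-wise string length via the np.char helper
lens_char = np.char.str_len(arrA)

# The same lengths computed with an explicit Python-level loop
lens_loop = np.array([len(s) for s in arrA])

print(lens_char)
print(np.array_equal(lens_char, lens_loop))
```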

In general argsort gives you much of the same power as the key function.
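One way to read this, as a rough sketch rather than anything from the timings above: a compound key such as key=lambda s: (len(s), s) can be imitated with two stable argsort passes, sorting by the secondary key first and the primary key last. The variable names are illustrative.

```python
import numpy as np

arr = np.array(["year", "month", "eye", "i", "stream", "key", "house"])

# Pass 1: order alphabetically (the secondary key)
alpha = arr[np.argsort(arr, kind='stable')]

# Pass 2: stable sort by length (the primary key); ties keep the
# alphabetical order established in pass 1
by_len_then_alpha = alpha[np.argsort(np.char.str_len(alpha), kind='stable')]

print(by_len_then_alpha)

# Pure-Python equivalent using a tuple key, for comparison
expected = sorted(arr.tolist(), key=lambda s: (len(s), s))
print(expected)
```

The second pass has to be stable; otherwise the alphabetical order among strings of equal length is not guaranteed to survive.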

hpaulj answered Sep 21 '25 04:09