
How to sort a 2D numpy object array based on a list

I have a 2D numpy object array:

import numpy as np

aa = np.array([["aaa","05","1","a"],
               ["ccc","30","2","v"],
               ["ddd","50","2","v"],
               ["bbb","10","1","v"]])

and the following list:

sample_ids = ["aaa", "bbb", "ccc", "ddd"]

I would like to sort the numpy array based on the list so that I get the following:

[["aaa","05","1","a"],
 ["bbb","10","1","v"],
 ["ccc","30","2","v"],
 ["ddd","50","2","v"]]
 

Edit:

If some keys in sample_ids are not present in the array, the resulting array should simply omit them (i.e. no empty rows are added). So if we have the following:

sample_ids = ["aaa", "bbb", "ccc", "ddd", "eee"]

The final array would still be the same.

Also, if the array contains a row whose key (first column) is missing from sample_ids, that row should be left out of the resulting array as well.

Edit 2: Starting from Nick's answer, I came up with this to deal with absent keys.

sample_ids2 = ["aaa", "bbb", "eee", "ccc", "ddd"]

keys = list(aa[:, 0])  # first column holds the row keys
idxs = []
for v in sample_ids2:
    # exact membership test; searching str(list(...)) would also match substrings
    if v in keys:
        idxs.append(keys.index(v))
    else:
        print(v + " was not found!!!")

print(aa[idxs])

Output:

[['aaa' '05' '1' 'a']
 ['bbb' '10' '1' 'v']
 ['ccc' '30' '2' 'v']
 ['ddd' '50' '2' 'v']]
julio514 asked Mar 13 '26 14:03



2 Answers

Here are a couple of possible solutions. Using numpy:

subs = list(aa.T[0])
idxs = [subs.index(i) for i in sample_ids if i in subs]
res = aa[idxs]
# array([['aaa', '05', '1', 'a'],
#        ['bbb', '10', '1', 'v'],
#        ['ccc', '30', '2', 'v'],
#        ['ddd', '50', '2', 'v']], dtype='<U3')

Using pandas:

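import pandas as pd

# reindex adds all-NaN rows for ids missing from the array; dropna removes them again
res = np.array(pd.DataFrame(aa).set_index(0).reindex(sample_ids).dropna().reset_index())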
# array([['aaa', '05', '1', 'a'],
#        ['bbb', '10', '1', 'v'],
#        ['ccc', '30', '2', 'v'],
#        ['ddd', '50', '2', 'v']], dtype=object)

For both cases, if sample_ids = ["aaa", "bbb", "ccc", "ddd", "eee"], the output will be the same.

If sample_ids = ["ddd", "aaa", "bbb"], the output will be:

array([['ddd', '50', '2', 'v'],
       ['aaa', '05', '1', 'a'],
       ['bbb', '10', '1', 'v']])
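
For larger arrays, the repeated list.index calls in the numpy version can get slow, since each one scans the whole first column. A minimal sketch of the same idea with a dictionary lookup follows; the helper name reorder_rows and the variable row_of are only illustrative, and it assumes the keys in the first column are unique (with duplicates, the last matching row wins, whereas list.index picks the first).

```python
import numpy as np

aa = np.array([["aaa", "05", "1", "a"],
               ["ccc", "30", "2", "v"],
               ["ddd", "50", "2", "v"],
               ["bbb", "10", "1", "v"]])


def reorder_rows(arr, ids):
    """Return the rows of arr in the order given by ids.

    Hypothetical helper: ids that do not occur in column 0 are skipped,
    and rows whose key is not listed in ids are left out.
    """
    # Map each key in the first column to its row index.
    row_of = {k: i for i, k in enumerate(arr[:, 0].tolist())}
    idxs = [row_of[k] for k in ids if k in row_of]
    return arr[idxs]


res = reorder_rows(aa, ["aaa", "bbb", "eee", "ccc", "ddd"])
print(res)
```

Here "eee" is silently skipped, and calling reorder_rows(aa, ["ddd", "aaa", "bbb"]) leaves out the "ccc" row.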
Nick answered Mar 15 '26 04:03



Inspired by @Nick's first approach:

# first build a dictionary of value: position
key = {k: i for i, k in enumerate(sample_ids)}
# {'aaa': 0, 'bbb': 1, 'ccc': 2, 'ddd': 3}

# then sort based on this key
out = aa[np.argsort(np.vectorize(key.get)(aa[:, 0]))]

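If the array may contain rows whose key is missing from sample_ids, give them a default sort key. With -1 they are sorted first; to sort them last, use a large integer such as len(key) (np.inf is a float, so it would also need otypes=[float] passed to np.vectorize). Note that such rows are kept in the output, not dropped: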

out = aa[np.argsort(np.vectorize(lambda x: key.get(x, -1))(aa[:, 0]))]

Output:

array([['aaa', '05', '1', 'a'],
       ['bbb', '10', '1', 'v'],
       ['ccc', '30', '2', 'v'],
       ['ddd', '50', '2', 'v']], dtype='<U3')
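
If rows whose key is not in sample_ids should be dropped rather than sorted first or last (as the question's edit asks), one option is to filter with np.isin before sorting. This is only a sketch under that assumption; the names kept and ranks are illustrative, and the sample_ids used here is a made-up example that includes one id ("zzz") absent from the array.

```python
import numpy as np

aa = np.array([["aaa", "05", "1", "a"],
               ["ccc", "30", "2", "v"],
               ["ddd", "50", "2", "v"],
               ["bbb", "10", "1", "v"]])

# Example ids: "zzz" is not in the array, "ccc" and "bbb" are not requested.
sample_ids = ["ddd", "aaa", "zzz"]

# Position of each id in the requested order.
key = {k: i for i, k in enumerate(sample_ids)}

# Keep only rows whose first column appears in sample_ids.
kept = aa[np.isin(aa[:, 0], sample_ids)]

# Rank each remaining row by the position of its key, then sort by rank.
ranks = np.array([key[k] for k in kept[:, 0].tolist()], dtype=int)
out = kept[np.argsort(ranks, kind="stable")]
print(out)
```

Because every remaining key is guaranteed to be in the dictionary, no default value is needed in the lookup.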
mozway answered Mar 15 '26 04:03
