
How to use numpy.char.join?

A critical portion of my script relies on concatenating a large number of fixed-length strings, so I would like to use the low-level numpy.char.join function instead of Python's built-in str.join.

However, I can't get it to work right:

import numpy as np

# Example array.
array = np.array([
    ['a', 'b', 'c'],
    ['d', 'e', 'f'],
    ['g', 'h', 'i'],
    ], dtype='<U1')

# Now I wish to get:
# array(['abc', 'def', 'ghi'], dtype='<U3')

# But none of these is successful :(
np.char.join('', array)
np.char.join('', array.astype('<U3'))
np.char.join(np.array(''), array.astype('<U3'))
np.char.join(np.array('').astype('<U3'), array.astype('<U3'))
np.char.join(np.array(['', '', '']).astype('<U3'), array.astype('<U3'))
np.char.join(np.char.asarray(['', '', '']).astype('<U3'), np.char.asarray(array))
np.char.asarray(['', '', '']).join(array)
np.char.asarray(['', '', '']).astype('<U3').join(array.astype('<U3'))

... and every call just returns my initial array unchanged.

What am I missing here?
What's numpy's most efficient way to concatenate each line of a large 2D <U1 array?


[EDIT]: Since performance is a concern, I have benchmarked the proposed solutions. But I still don't know how to call np.char.join properly.

import numpy as np
import numpy.random as rd
from string import ascii_lowercase as letters
from time import time

# Build up an array with many random letters
n_lines = int(1e7)
n_columns = 4
array = np.array(list(letters))[rd.randint(0, len(letters), n_lines * n_columns)]
array = array.reshape((n_lines, n_columns))

# One quick-n-dirty way to benchmark.
class MeasureTime(object):
    def __enter__(self):
        self.tic = time()
    def __exit__(self, type, value, traceback):
        toc = time()
        print(f"{toc-self.tic:0.3f} seconds")


# And test three concatenation procedures.
with MeasureTime():
    # Involves str.join
    cat = np.apply_along_axis("".join, 1, array)

with MeasureTime():
    # Involves str.join
    cat = np.array(["".join(row) for row in array])

with MeasureTime():
    # Involves low-level np functions instead.
    # Here np.char.add for example.
    cat = np.char.add(
        np.char.add(np.char.add(array[:, 0], array[:, 1]), array[:, 2]), array[:, 3]
    )

outputs

41.722 seconds
19.921 seconds
15.206 seconds

on my machine.

Would np.char.join do better? How can I make it work?

asked Oct 24 '25 02:10 by iago-lito


1 Answer

Timings on the original (3,3) array (they may scale differently on larger arrays):

The chained np.char.add:

In [88]: timeit np.char.add(np.char.add(arr[:,0],arr[:,1]),arr[:,2])                           
29 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

An equivalent approach uses object dtype: for Python strings, '+' is concatenation, so summing along an axis joins each row.

In [89]: timeit arr.astype(object).sum(axis=1)                                                 
14.1 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For a list of strings, ''.join() is supposed to be faster than summing the strings, and it also lets you specify a delimiter:

In [90]: timeit np.array([''.join(row) for row in arr])                                        
13.8 µs ± 41.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Without the conversion back to array:

In [91]: timeit [''.join(row) for row in arr]                                                      
10.2 µs ± 15.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Better yet, use tolist to convert the array to a list of lists of strings:

In [92]: timeit [''.join(row) for row in arr.tolist()]                                         
1.01 µs ± 1.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The list comprehension equivalent of the nested np.char.add:

In [97]: timeit [row[0]+row[1]+row[2] for row in arr.tolist()]                                 
1.19 µs ± 2.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

numpy does not have low-level string code, at least not in the same sense that it has low-level compiled numeric code. It still depends on Python string code, even if it calls it from the C-API.
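Since the timings above are for a (3,3) array only, a small harness along these lines could be used to check how the same approaches scale. This is a rough sketch, not a rigorous benchmark: the array size, the seed, and the names `candidates` and `results` are arbitrary choices made for illustration.

```python
import timeit
from string import ascii_lowercase

import numpy as np

# Hypothetical test data: 20,000 rows of 4 random lowercase letters.
rng = np.random.default_rng(0)
letters = np.array(list(ascii_lowercase))
arr = letters[rng.integers(0, len(letters), size=(20_000, 4))]

# The approaches timed above, written for a 4-column array.
candidates = {
    "chained np.char.add": lambda: np.char.add(
        np.char.add(np.char.add(arr[:, 0], arr[:, 1]), arr[:, 2]), arr[:, 3]
    ),
    "object sum": lambda: arr.astype(object).sum(axis=1).astype(str),
    "join over rows": lambda: np.array(["".join(row) for row in arr]),
    "join over tolist()": lambda: np.array(["".join(row) for row in arr.tolist()]),
}

results = {}
for name, func in candidates.items():
    results[name] = func()  # keep the output so the approaches can be compared
    best = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{name:20s} {best:.4f} s")
```

Comparing the entries of `results` with np.array_equal is a cheap way to confirm that every approach produces the same strings before trusting the timings.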

====

Since the strings are U1 and each row is contiguous in memory, we can view each row of three characters as a single U3 string:

In [106]: arr.view('U3')                                                                       
Out[106]: 
array([['abc'],
       ['def'],
       ['ghi']], dtype='<U3')
In [107]: arr.view('U3').ravel()                                                               
Out[107]: array(['abc', 'def', 'ghi'], dtype='<U3')
In [108]: timeit arr.view('U3').ravel()                                                        
1.04 µs ± 9.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
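The view trick can be wrapped so that it works for any number of columns. The helper below is a minimal sketch; `join_rows_by_view` is a made-up name, and it assumes a 2D array of dtype U1. It calls np.ascontiguousarray first because reinterpreting the item size needs each row to be contiguous in memory, which is not the case for a transposed array, for example.

```python
import numpy as np

def join_rows_by_view(chars):
    """Concatenate each row of a 2D 'U1' array by reinterpreting its memory."""
    chars = np.asarray(chars)
    if chars.ndim != 2 or chars.dtype != np.dtype('U1'):
        raise ValueError("expected a 2D array of dtype 'U1'")
    n_rows, n_columns = chars.shape
    # Copies only when the input is not already C-contiguous.
    chars = np.ascontiguousarray(chars)
    # Each row of n_columns single characters becomes one 'U<n_columns>' item.
    return chars.view(f'U{n_columns}').reshape(n_rows)

arr = np.array([['a', 'b', 'c'],
                ['d', 'e', 'f'],
                ['g', 'h', 'i']], dtype='<U1')

joined = join_rows_by_view(arr)      # rows of the original array
joined_t = join_rows_by_view(arr.T)  # rows of the transpose (non-contiguous input)
print(joined, joined_t)
```

One caveat to keep in mind: an empty cell is stored as a null character, so arrays containing empty strings may not give the same result as str.join.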

====

To use np.char.join we have to collect the rows into some sort of tuple, list, etc. One way to do that is make an object dtype array, and fill it from the array:

In [110]: temp = np.empty(arr.shape[0], object)                                                
In [111]: temp                                                                                 
Out[111]: array([None, None, None], dtype=object)
In [112]: temp[:] = list(arr)                                                                  
In [113]: temp                                                                                 
Out[113]: 
array([array(['a', 'b', 'c'], dtype='<U1'),
       array(['d', 'e', 'f'], dtype='<U1'),
       array(['g', 'h', 'i'], dtype='<U1')], dtype=object)
In [114]: np.char.join('',temp)                                                                
Out[114]: array(['abc', 'def', 'ghi'], dtype='<U3')

or filling it with a list of lists:

In [115]: temp[:] = arr.tolist()                                                               
In [116]: temp                                                                                 
Out[116]: 
array([list(['a', 'b', 'c']), list(['d', 'e', 'f']),
       list(['g', 'h', 'i'])], dtype=object)
In [117]: np.char.join('',temp)                                                                
Out[117]: array(['abc', 'def', 'ghi'], dtype='<U3')

In [122]: %%timeit  
     ...: temp = np.empty(arr.shape[0], object) 
     ...: temp[:] = arr.tolist() 
     ...: np.char.join('', temp) 
     ...:
22.1 µs ± 69.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
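This also explains why the attempts in the question returned the array unchanged: np.char.join calls sep.join(x) on each element, not on each row. A short illustration (the `words` array is made up for the example):

```python
import numpy as np

arr = np.array([['a', 'b', 'c'],
                ['d', 'e', 'f'],
                ['g', 'h', 'i']], dtype='<U1')

# Every element is a one-character string, and ''.join('a') is just 'a',
# so the result has the same shape and contents as the input.
unchanged = np.char.join('', arr)

# With longer elements the per-element behaviour becomes visible:
# the separator is inserted between the characters of each string.
words = np.array(['abc', 'def', 'ghi'])
dashed = np.char.join('-', words)
print(dashed)
```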

====

To get a better idea of what np.char.join can do, compare it with split:

In [132]: temp                                                                                 
Out[132]: 
array([list(['a', 'b', 'c']), list(['d', 'e', 'f']),
       list(['g', 'h', 'i'])], dtype=object)
In [133]: b = np.char.join(',',temp)                                                           
In [134]: b                                                                                    
Out[134]: array(['a,b,c', 'd,e,f', 'g,h,i'], dtype='<U5')
In [135]: np.char.split(b,',')                                                                 
Out[135]: 
array([list(['a', 'b', 'c']), list(['d', 'e', 'f']),
       list(['g', 'h', 'i'])], dtype=object)

Another way to apply ''.join to the elements of the object array:

In [136]: np.frompyfunc(lambda s: ','.join(s), 1,1)(temp)                                      
Out[136]: array(['a,b,c', 'd,e,f', 'g,h,i'], dtype=object)
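Putting the pieces together, the object-array construction and np.frompyfunc can be combined into one helper. This is only a sketch under the assumption of a 2D string array; `join_rows_frompyfunc` is a hypothetical name. It also shows why the np.empty step is needed: building the object array directly from the nested list gives a 2D array of strings, not a 1D array of lists.

```python
import numpy as np

def join_rows_frompyfunc(chars, sep=''):
    """Join each row of a 2D string array with sep, via a 1D object array of lists."""
    chars = np.asarray(chars)
    # Pre-allocate a 1D object array, then fill each slot with one row as a list.
    rows = np.empty(chars.shape[0], dtype=object)
    rows[:] = chars.tolist()
    # Apply sep.join to every list; the result is an object array of str.
    joined = np.frompyfunc(sep.join, 1, 1)(rows)
    return joined.astype(str)

arr = np.array([['a', 'b', 'c'],
                ['d', 'e', 'f'],
                ['g', 'h', 'i']], dtype='<U1')

# Without the np.empty step, NumPy unpacks the nested list into a 2D array.
naive = np.array(arr.tolist(), dtype=object)

plain = join_rows_frompyfunc(arr)
with_commas = join_rows_frompyfunc(arr, sep=',')
print(naive.shape, plain, with_commas)
```

As the timings above suggest, this still runs Python's str.join once per row, so it is unlikely to beat the plain list comprehension over arr.tolist().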

answered Oct 26 '25 19:10 by hpaulj