How can I slice each element of a numpy array of strings?

Tags: python, arrays, string, slice, numpy

Numpy has some very useful string operations, which vectorize the usual Python string operations.

Compared to these operations and to pandas.str, the numpy strings module seems to be missing a very important one: the ability to slice each string in the array. For example,

a = numpy.array(['hello', 'how', 'are', 'you'])
numpy.char.sliceStr(a, slice(1, 3))
>>> numpy.array(['el', 'ow', 're', 'ou'])

Am I missing some obvious method in the module with this functionality? Otherwise, is there a fast vectorized way to achieve this?

asked Aug 19 '16 15:08

5 Answers

Here's a vectorized approach -

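import numpy as np

def slicer_vectorized(a,start,end):
    # View as single characters (one row per string), slice the columns, then rebuild.
    b = a.view((str,1)).reshape(len(a),-1)[:,start:end]
    return np.frombuffer(b.tobytes(),dtype=(str,end-start))  # tobytes/frombuffer replace the deprecated tostring/fromstring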

Sample run -

In [68]: a = np.array(['hello', 'how', 'are', 'you'])

In [69]: slicer_vectorized(a,1,3)
Out[69]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='|S2')

In [70]: slicer_vectorized(a,0,3)
Out[70]: 
array(['hel', 'how', 'are', 'you'], 
      dtype='|S3')
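
The view trick works because NumPy stores every string in a fixed-width slot, so it slices the padded buffer rather than each string on its own. Below is a minimal sketch of a slightly more defensive variant, assuming a 1-D array of dtype `U` or `S`. The name `slice_fixed_width` is hypothetical and not part of NumPy.

```python
import numpy as np

def slice_fixed_width(a, start, stop):
    """Slice each element of a 1-D fixed-width string array (hypothetical helper)."""
    a = np.ascontiguousarray(a)
    kind = a.dtype.kind                       # 'U' (unicode) or 'S' (bytes)
    if kind not in ("U", "S") or a.ndim != 1:
        raise TypeError("expected a 1-D array of dtype 'U' or 'S'")
    char = np.dtype(kind + "1")               # dtype of a single character
    width = a.dtype.itemsize // char.itemsize
    # Clamp the bounds to the padded slot width, as ordinary slicing would.
    start, stop, _ = slice(start, stop).indices(width)
    n = stop - start
    if n <= 0:
        return np.zeros(len(a), dtype=char)   # every result is an empty string
    chars = a.view(char).reshape(len(a), width)[:, start:stop]
    # The column slice is non-contiguous, so copy before reinterpreting it.
    return np.ascontiguousarray(chars).view(f"{kind}{n}").reshape(len(a))

a = np.array(['hello', 'how', 'are', 'you'])
print(slice_fixed_width(a, 1, 3))
print(slice_fixed_width(a, -2, None))
```

Note the caveat in the second call: negative indices count from the end of the padded slot, not from the end of each string, so shorter strings yield padding (which NumPy shows as empty strings) instead of their own last characters.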

Runtime test -

I tested all the approaches posted by other authors that I could run on my end, along with the vectorized approach from earlier in this post.

Here are the timings -

In [53]: # Setup input array
    ...: a = np.array(['hello', 'how', 'are', 'you'])
    ...: a = np.repeat(a,10000)
    ...: 

# @Alberto Garcia-Raboso's answer
In [54]: %timeit slicer(1, 3)(a)
10 loops, best of 3: 23.5 ms per loop

# @hpaulj's answer
In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
100 loops, best of 3: 11.6 ms per loop

# Using a list comprehension
In [56]: %timeit np.array([i[1:3] for i in a])
100 loops, best of 3: 12.1 ms per loop

# From this post
In [57]: %timeit slicer_vectorized(a,1,3)
1000 loops, best of 3: 787 µs per loop
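
Before comparing timings, it is worth confirming that the candidates return the same values. The following is a minimal sketch for reproducing the comparison outside IPython, assuming Python 3 and a current NumPy (so `tobytes`/`frombuffer` stand in for `tostring`/`fromstring`). The function names `by_comprehension`, `by_frompyfunc` and `by_view` are hypothetical labels, and absolute timings will differ by machine and version.

```python
import timeit
import numpy as np

a = np.repeat(np.array(['hello', 'how', 'are', 'you']), 10000)

def by_comprehension(a, start, end):
    return np.array([s[start:end] for s in a])

def by_frompyfunc(a, start, end):
    out = np.frompyfunc(lambda s: s[start:end], 1, 1)(a)
    return out.astype(f'U{end - start}')

def by_view(a, start, end):
    b = a.view((str, 1)).reshape(len(a), -1)[:, start:end]
    return np.frombuffer(b.tobytes(), dtype=(str, end - start))

reference = by_comprehension(a, 1, 3)
for f in (by_comprehension, by_frompyfunc, by_view):
    # Check correctness first, then time ten calls and report the mean.
    assert np.array_equal(f(a, 1, 3), reference)
    seconds = timeit.timeit(lambda: f(a, 1, 3), number=10) / 10
    print(f"{f.__name__}: {seconds * 1e3:.3f} ms per call")
```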

answered Sep 26 '22 08:09

Divakar

Most, if not all, of the functions in np.char apply existing str methods to each element of the array. That is a little faster than direct iteration (or vectorize), but not drastically so.

There isn't a string slicer, at least not by that sort of name. The closest is indexing with a slice:

In [274]: 'astring'[1:3]
Out[274]: 'st'
In [275]: 'astring'.__getitem__
Out[275]: <method-wrapper '__getitem__' of str object at 0xb3866c20>
In [276]: 'astring'.__getitem__(slice(1,4))
Out[276]: 'str'

An iterative approach can use frompyfunc (which vectorize also uses):

In [277]: a = np.array(['hello', 'how', 'are', 'you'])
In [278]: np.frompyfunc(lambda x:x[1:3],1,1)(a)
Out[278]: array(['el', 'ow', 're', 'ou'], dtype=object)
In [279]: np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
Out[279]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='<U2')
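
Because `frompyfunc` applies ordinary Python slicing to each element, negative indices and steps are resolved against each string's own length rather than the padded width. Here is a minimal sketch that wraps this in a helper; `slice_each` is a hypothetical name, not a NumPy function.

```python
import numpy as np

def slice_each(a, start=None, stop=None, step=None):
    """Apply the same Python slice to every string in `a` (hypothetical helper)."""
    sl = slice(start, stop, step)
    out = np.frompyfunc(lambda s: s[sl], 1, 1)(a)   # object array of Python str
    return out.astype(str)                          # NumPy picks a wide-enough 'U' dtype

a = np.array(['hello', 'how', 'are', 'you'])
print(slice_each(a, 1, 3))
print(slice_each(a, -2))              # last two characters of each string
print(slice_each(a, None, None, -1))  # each string reversed
```

This stays a Python-level loop, so it trades speed for generality.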

I could also view it as an array of single characters and slice that:

In [289]: a.view('U1').reshape(4,-1)[:,1:3]
Out[289]: 
array([['e', 'l'],
       ['o', 'w'],
       ['r', 'e'],
       ['o', 'u']], 
      dtype='<U1')

Converting it back to 'U2' requires a copy before the view:

In [290]: a.view('U1').reshape(4,-1)[:,1:3].copy().view('U2')
Out[290]: 
array([['el'],
       ['ow'],
       ['re'],
       ['ou']], 
      dtype='<U2')

The initial view step shows the data buffer as Py3 characters (these would be bytes in the S or Py2 string case):

In [284]: a.view('U1')
Out[284]: 
array(['h', 'e', 'l', 'l', 'o', 'h', 'o', 'w', '', '', 'a', 'r', 'e', '',
       '', 'y', 'o', 'u', '', ''], 
      dtype='<U1')

Picking the 1:3 columns amounts to picking a.view('U1')[[1,2,6,7,11,12,16,17]] and then reshaping and viewing. Without getting into details, I'm not surprised that it requires a copy.
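
The equivalence with flat indexing can be checked directly. This is a minimal sketch, assuming the same four-string array; the variable names are illustrative only.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
width = a.dtype.itemsize // np.dtype('U1').itemsize   # 5 characters per slot
flat = a.view('U1')                                   # 20 single characters

# Flat positions of columns 1 and 2 within every 5-character row.
idx = (np.arange(len(a))[:, None] * width + np.arange(1, 3)).ravel()
print(idx.tolist())        # [1, 2, 6, 7, 11, 12, 16, 17]

picked = flat[idx]         # fancy indexing always returns a copy
result = picked.view('U2') # pairs of characters reinterpreted as 2-char strings
print(result)              # ['el' 'ow' 're' 'ou']
```

The selected characters are not adjacent in the original buffer, which is why a contiguous copy is needed before they can be reinterpreted as 'U2'.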

answered Sep 22 '22 08:09

I completely agree that this is an omission, which is why I opened up PR #20694. If that gets accepted, you will be able to do exactly what you propose, but under the slightly more conventional name of np.char.slice_:

>>> a = np.array(['hello', 'how', 'are', 'you'])
>>> np.char.slice_(a, 1, 3)
array(['el', 'ow', 're', 'ou'])

The code in the PR is fully functional, so you can make a working copy of it, but it uses a couple of hacks to get around some limitations.

For this simple case, you can use simple slicing. Starting with numpy 1.23.0, you can view non-contiguous arrays under a dtype of different size (PR #20722). That means you can do

>>> a[:, None].view('U1')[:, 1:3].view('U2').squeeze()
array(['el', 'ow', 're', 'ou'], dtype='<U2')
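
For code that may also run on older NumPy versions, the same expression can be wrapped with a copying fallback. This is a minimal sketch, assuming a 1-D unicode array and 0 <= start < stop within the padded width; `slice_view` is a hypothetical name, and the fallback assumes older versions raise ValueError when viewing a non-contiguous array under a dtype of different size.

```python
import numpy as np

def slice_view(a, start, stop):
    """Slice each string via a character view (hypothetical helper)."""
    chars = a[:, None].view('U1')[:, start:stop]   # shape (len(a), stop - start)
    new_dtype = f'U{stop - start}'
    try:
        out = chars.view(new_dtype)                # no copy on numpy >= 1.23
    except ValueError:
        out = chars.copy().view(new_dtype)         # older numpy: copy first
    return out.squeeze(axis=1)

a = np.array(['hello', 'how', 'are', 'you'])
print(slice_view(a, 1, 3))
```

When the no-copy path is taken, the result shares memory with `a`, so writing to one changes the other.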

answered Sep 23 '22 08:09

Mad Physicist
