How can I slice each element of a numpy array of strings?

Numpy has some very useful string operations, which vectorize the usual Python string operations.

Compared to these operations and to pandas.str, the numpy strings module seems to be missing a very important one: the ability to slice each string in the array. For example,

a = numpy.array(['hello', 'how', 'are', 'you'])
numpy.char.sliceStr(a, slice(1, 3))
>>> numpy.array(['el', 'ow', 're', 'ou'])

Am I missing some obvious method in the module with this functionality? Otherwise, is there a fast vectorized way to achieve this?

Martín Fixman asked Aug 19 '16 15:08


5 Answers

Here's a vectorized approach -

import numpy as np

def slicer_vectorized(a,start,end):
    # View each string as a row of single characters, then slice the columns
    b = a.view((str,1)).reshape(len(a),-1)[:,start:end]
    return np.frombuffer(b.tobytes(),dtype=(str,end-start))

Sample run -

In [68]: a = np.array(['hello', 'how', 'are', 'you'])

In [69]: slicer_vectorized(a,1,3)
Out[69]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='|S2')

In [70]: slicer_vectorized(a,0,3)
Out[70]: 
array(['hel', 'how', 'are', 'you'], 
      dtype='|S3')

Runtime test -

I tested all the approaches posted by other authors that I could run on my end, along with the vectorized approach from earlier in this post.

Here are the timings -

In [53]: # Setup input array
    ...: a = np.array(['hello', 'how', 'are', 'you'])
    ...: a = np.repeat(a,10000)
    ...: 

# @Alberto Garcia-Raboso's answer
In [54]: %timeit slicer(1, 3)(a)
10 loops, best of 3: 23.5 ms per loop

# @hpaulj's answer
In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
100 loops, best of 3: 11.6 ms per loop

# Using a list comprehension
In [56]: %timeit np.array([i[1:3] for i in a])
100 loops, best of 3: 12.1 ms per loop

# From this post
In [57]: %timeit slicer_vectorized(a,1,3)
1000 loops, best of 3: 787 µs per loop
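
%timeit is an IPython magic, so the session above will not run in a plain script. A comparison of this kind can be sketched with the standard-library timeit module instead. This is only an illustration: the function names are hypothetical, Python 3 is assumed (so the view uses 'U1' and the result is copied before re-viewing), and the numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

# Same setup as in the session above
a = np.repeat(np.array(['hello', 'how', 'are', 'you']), 10000)

def by_comprehension(arr):
    # Plain Python loop over the elements
    return np.array([s[1:3] for s in arr])

def by_frompyfunc(arr):
    # Element-wise call through frompyfunc, then back to a fixed-width dtype
    return np.frompyfunc(lambda s: s[1:3], 1, 1)(arr).astype('U2')

def by_view(arr):
    # Single-character view, column slice, copy, then view as 2-char strings
    chars = arr.view('U1').reshape(len(arr), -1)[:, 1:3]
    return np.ascontiguousarray(chars).view('U2').ravel()

candidates = {
    'comprehension': by_comprehension,
    'frompyfunc': by_frompyfunc,
    'view': by_view,
}

for name, fn in candidates.items():
    # Best of 3 repeats, 3 calls per repeat, reported per call
    best = min(timeit.repeat(lambda: fn(a), number=3, repeat=3)) / 3
    print('%-14s %.3f ms per call' % (name, best * 1e3))
```

Checking that every candidate returns the same array before comparing timings is a cheap way to avoid benchmarking a wrong result.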
Divakar answered Sep 26 '22 08:09


Most, if not all, of the functions in np.char apply existing str methods to each element of the array. They are a little faster than direct iteration (or vectorize), but not drastically so.

There isn't a string slicer, at least not by that sort of name. The closest is indexing with a slice:

In [274]: 'astring'[1:3]
Out[274]: 'st'
In [275]: 'astring'.__getitem__
Out[275]: <method-wrapper '__getitem__' of str object at 0xb3866c20>
In [276]: 'astring'.__getitem__(slice(1,4))
Out[276]: 'str'

An iterative approach can use frompyfunc (which is also used by vectorize):

In [277]: a = np.array(['hello', 'how', 'are', 'you'])
In [278]: np.frompyfunc(lambda x:x[1:3],1,1)(a)
Out[278]: array(['el', 'ow', 're', 'ou'], dtype=object)
In [279]: np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
Out[279]: 
array(['el', 'ow', 're', 'ou'], 
      dtype='<U2')

I could view it as a single-character array and slice that:

In [289]: a.view('U1').reshape(4,-1)[:,1:3]
Out[289]: 
array([['e', 'l'],
       ['o', 'w'],
       ['r', 'e'],
       ['o', 'u']], 
      dtype='<U1')

Converting it back to 'U2' takes a copy followed by another view:

In [290]: a.view('U1').reshape(4,-1)[:,1:3].copy().view('U2')
Out[290]: 
array([['el'],
       ['ow'],
       ['re'],
       ['ou']], 
      dtype='<U2')
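
The view, slice, copy, view sequence can be wrapped in a small helper. This is a minimal sketch rather than anything NumPy provides: the name slice_fixed_width is hypothetical, and it assumes a non-empty 1-D array of 'U' dtype, non-negative start and stop, and a non-empty slice. Shorter strings are null-padded to the common width, so negative indices would count from the padded end rather than from the end of each string.

```python
import numpy as np

def slice_fixed_width(a, start, stop):
    """Slice every string of a 1-D unicode ('U') array with [start:stop]."""
    a = np.ascontiguousarray(a)
    if a.ndim != 1 or a.dtype.kind != 'U':
        raise TypeError("expected a 1-D array of 'U' dtype")
    # One row per string, one column per (null-padded) character
    chars = a.view('U1').reshape(len(a), -1)
    # The column slice is a non-contiguous view, so copy it before re-viewing
    sub = np.ascontiguousarray(chars[:, start:stop])
    # Merge each row of single characters back into one string
    return sub.view('U%d' % sub.shape[1]).ravel()

words = np.array(['hello', 'how', 'are', 'you'])
result = slice_fixed_width(words, 1, 3)
print(result)
```

The final ravel() turns the (n, 1) result of the last view back into a 1-D array.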

The initial view step shows the data buffer as Py3 characters (these would be bytes for an S dtype or a Py2 string):

In [284]: a.view('U1')
Out[284]: 
array(['h', 'e', 'l', 'l', 'o', 'h', 'o', 'w', '', '', 'a', 'r', 'e', '',
       '', 'y', 'o', 'u', '', ''], 
      dtype='<U1')

Picking the 1:3 columns amounts to picking a.view('U1')[[1,2,6,7,11,12,16,17]] and then reshaping and viewing. Without getting into details, I'm not surprised that it requires a copy.
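
The equivalence can be checked directly. The sketch below (variable names are illustrative) confirms that the column slice selects the same characters as the explicit index list, that it is still a view on the original buffer, and that it is not C-contiguous, which is why a copy is needed before the characters can be merged back into 2-character strings.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
flat = a.view('U1')                    # 20 single characters, null-padded
cols = flat.reshape(4, -1)[:, 1:3]     # columns 1 and 2 of each row
picked = flat[[1, 2, 6, 7, 11, 12, 16, 17]].reshape(4, 2)

same_values = np.array_equal(cols, picked)
is_view = np.shares_memory(cols, a)
is_contiguous = cols.flags['C_CONTIGUOUS']
print(same_values, is_view, is_contiguous)  # True True False
```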

hpaulj answered Sep 22 '22 08:09


To solve this, so far I've been transforming the numpy array to a pandas Series and back. It is not a pretty solution, but it works and it works relatively fast.

a = numpy.array(['hello', 'how', 'are', 'you'])
pandas.Series(a).str[1:3].values
array(['el', 'ow', 're', 'ou'], dtype=object)
Martín Fixman answered Sep 25 '22 08:09


Interesting omission... I guess you can always write your own:

import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

a = np.array(['hello', 'how', 'are', 'you'])
print(slicer(1, 3)(a))    # => ['el' 'ow' 're' 'ou']

EDIT: Here are some benchmarks using the text of Ulysses by James Joyce. It seems the clear winner is @hpaulj's last strategy. @Divakar gets into the race improving on @hpaulj's last strategy.

import numpy as np
import requests

ulysses = requests.get('http://www.gutenberg.org/files/4300/4300-0.txt').text
a = np.array(ulysses.split())

# Ufunc
def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

%timeit slicer(1, 3)(a)
# => 1 loop, best of 3: 221 ms per loop

# Non-mutating loop
def loop1(a):
    out = np.empty(len(a), dtype=object)
    for i, word in enumerate(a):
        out[i] = word[1:3]
    return out

%timeit loop1(a)
# => 1 loop, best of 3: 262 ms per loop

# Mutating loop
def loop2(a):
    for i in range(len(a)):
        a[i] = a[i][1:3]

b = a.copy()
%timeit -n 1 -r 1 loop2(b)
# => 1 loop, best of 1: 285 ms per loop

# From @hpaulj's answer
%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
# => 10 loops, best of 3: 141 ms per loop

%timeit np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
# => 1 loop, best of 3: 170 ms per loop

%timeit a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1)
# => 10 loops, best of 3: 60.7 ms per loop

def slicer_vectorized(a,start,end):
    b = a.view('S1').reshape(len(a),-1)[:,start:end]
    return np.fromstring(b.tostring(),dtype='S'+str(end-start))

%timeit slicer_vectorized(a,1,3)
# => The slowest run took 5.34 times longer than the fastest.
#    This could mean that an intermediate result is being cached.
#    10 loops, best of 3: 16.8 ms per loop
A. Garcia-Raboso answered Sep 23 '22 08:09


I completely agree that this is an omission, which is why I opened up PR #20694. If that gets accepted, you will be able to do exactly what you propose, but under the slightly more conventional name of np.char.slice_:

>>> a = np.array(['hello', 'how', 'are', 'you'])
>>> np.char.slice_(a, 1, 3)
array(['el', 'ow', 're', 'ou'])

The code in the PR is fully functional, so you can make a working copy of it, but it uses a couple of hacks to get around some limitations.

For this simple case, you can use simple slicing. Starting with numpy 1.23.0, you can view non-contiguous arrays under a dtype of different size (PR #20722). That means you can do

>>> a[:, None].view('U1')[:, 1:3].view('U2').squeeze()
array(['el', 'ow', 're', 'ou'])
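
Because the direct view depends on the NumPy version, a defensive variant may be useful. The sketch below is an illustration under stated assumptions: it first looks for np.strings.slice, which I believe newer NumPy releases (2.3 and later) provide but which should be checked against the installed version, and otherwise falls back to the view-based expression, copying only if the direct view raises.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])

strings_mod = getattr(np, 'strings', None)
if strings_mod is not None and hasattr(strings_mod, 'slice'):
    # Assumed API: np.strings.slice(array, start, stop)
    out = strings_mod.slice(a, 1, 3)
else:
    chars = a[:, None].view('U1')[:, 1:3]
    try:
        # Works when views of non-contiguous arrays may change itemsize
        out = chars.view('U2').squeeze()
    except ValueError:
        # Older NumPy: make the slice contiguous first
        out = np.ascontiguousarray(chars).view('U2').squeeze()

print(out)
```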
Mad Physicist answered Sep 23 '22 08:09