This is one of those "mostly asked out of pure curiosity (in the possibly futile hope that I will learn something)" questions.
I was investigating ways of saving memory on operations on massive numbers of strings, and for some scenarios it seems like string operations in numpy could be useful. However, I got somewhat surprising results:
import random
import string

import numpy as np

milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]
npmstr = np.array(milstr, dtype=np.unicode_)  # dtype comes out as '<U10'
Memory consumption, measured using memory_profiler:
%memit [x.upper() for x in milstr]
peak memory: 420.96 MiB, increment: 61.02 MiB
%memit np.core.defchararray.upper(npmstr)
peak memory: 391.48 MiB, increment: 31.52 MiB
So far, so good; however, timing results are surprising for me:
%timeit [x.upper() for x in milstr]
129 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.core.defchararray.upper(npmstr)
373 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is that? I expected that, since NumPy uses contiguous chunks of memory for its arrays, its operations are vectorized (as the numpy doc page above says), and its string arrays apparently use less memory (so operating on them should at least potentially be more CPU-cache-friendly), performance on arrays of strings would be at least comparable to pure Python.
Environment:
Python 3.6.3 x64, Linux
numpy==1.14.1
"Vectorized" is used in two ways when talking about numpy, and it's not always clear which is meant:

1. Operations that are applied to a whole array at once rather than element by element (a matter of syntax)
2. Operations that dispatch to optimized, compiled, and sometimes multithreaded code internally (a matter of performance)
The second point is what makes vectorized operations much faster than a for loop in Python, and the multithreaded part is what makes them faster than a list comprehension. When commenters here state that vectorized code is faster, they are referring to the second case as well. However, in the numpy documentation, "vectorized" only refers to the first case: it means you can use a function directly on an array, without having to loop through all the elements and call it on each element. In this sense it makes code more concise, but it isn't necessarily any faster. Some vectorized operations do call multithreaded code, but as far as I know this is limited to linear algebra routines. Personally, I prefer using vectorized operations since I think they are more readable than list comprehensions, even when performance is identical.
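To make the distinction concrete, here is a small sketch (my own example, not from the numpy docs) using np.vectorize, which gives you the first kind of "vectorized" (array-at-once syntax) with none of the second (it still calls the Python function once per element, and its own docs say it is provided primarily for convenience, not for performance):

```python
import numpy as np

# np.vectorize wraps a plain Python function so it can be applied to a
# whole array at once -- vectorized *syntax* only, no compiled fast path.
vec_upper = np.vectorize(str.upper)

words = np.array(['alpha', 'beta', 'gamma'])
print(vec_upper(words))             # same result as the loop below
print([w.upper() for w in words])   # explicit element-by-element loop
```

Both calls do essentially the same per-element work; only the spelling differs.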
Now, for the code in question: the documentation for np.char (which is an alias for np.core.defchararray) states:
The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.
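Following that advice, a minimal sketch of the recommended pattern (my own example): plain fixed-width arrays plus the free functions in np.char, with no chararray subclass involved.

```python
import numpy as np

# Plain fixed-width arrays; numpy infers the itemsize from the data.
su = np.array(['foo', 'bar'], dtype='U')   # unicode, dtype '<U3'
sb = np.array(['foo', 'bar'], dtype='S')   # bytes,   dtype 'S3'

# The free functions in np.char operate on ordinary arrays directly.
print(np.char.upper(su))   # unicode result
print(np.char.upper(sb))   # bytes result
```

The same functions accept either the bytes or the unicode dtype, which is what makes the benchmarks below possible with one line of code per datatype.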
So there are four ways (one not recommended) to handle strings in numpy. Some testing is in order, since each way is likely to have different advantages and disadvantages. Using arrays defined as follows:
npob = np.array(milstr, dtype=np.object_)
npuni = np.array(milstr, dtype=np.unicode_)
npstr = np.array(milstr, dtype=np.string_)
npchar = npstr.view(np.chararray)
npcharU = npuni.view(np.chararray)
This creates arrays (or chararrays for the last two) with the following datatypes:
In [68]: npob.dtype
Out[68]: dtype('O')
In [69]: npuni.dtype
Out[69]: dtype('<U10')
In [70]: npstr.dtype
Out[70]: dtype('S10')
In [71]: npchar.dtype
Out[71]: dtype('S10')
In [72]: npcharU.dtype
Out[72]: dtype('<U10')
The benchmarks give quite a range of performance across these datatypes:
%timeit [x.upper() for x in test]
%timeit np.char.upper(test)
# test = milstr
103 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
377 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# test = npob
110 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
<error on second test, vectorized operations don't work with object arrays>
# test = npuni
295 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# test = npstr
125 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# test = npchar
663 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# test = npcharU
887 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
325 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Surprisingly, using a plain old list of strings is still the fastest. Numpy is competitive when the datatype is string_ or object_, but once unicode is included performance becomes much worse. The chararray is by far the slowest, whether handling unicode or not. It should be clear why it's not recommended for use.
Using unicode strings has a significant performance penalty. The docs state the following about the differences between these types:

For backward compatibility with Python 2 the S and a typestrings remain zero-terminated bytes and np.string_ continues to map to np.bytes_. To use actual strings in Python 3 use U or np.unicode_. For signed bytes that do not need zero-termination b or i1 can be used.
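That difference is easy to see directly (a small sketch of my own, using the 'S' and 'U' dtype codes that np.string_ and np.unicode_ map to):

```python
import numpy as np

# 'S' stores zero-terminated bytes; 'U' stores actual Python 3 strings.
b = np.array(['abc'], dtype='S')   # dtype 'S3'
u = np.array(['abc'], dtype='U')   # dtype '<U3'

print(type(b[0]), b[0])   # a bytes scalar: b'abc'
print(type(u[0]), u[0])   # a str scalar:   'abc'

# Unicode costs 4 bytes per character (UCS-4) vs 1 for bytes, which is
# part of why the U arrays in the benchmarks are larger and slower.
print(b.dtype.itemsize, u.dtype.itemsize)   # 3 vs 12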
In this case, where the character set does not require unicode, it would make sense to use the faster string_ type. If unicode is needed, you may get better performance by using a list, or a numpy array of type object_ if other numpy functionality is needed.
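For instance (a sketch of my own), an object_ array stores references to ordinary Python str objects, so the string methods themselves run at plain-Python speed, while numpy conveniences such as boolean masking still work:

```python
import numpy as np

# object_ array: elements are ordinary Python strings.
names = np.array(['ada', 'grace', 'alan', 'margaret'], dtype=np.object_)

# String work happens on plain str objects, as in a list comprehension...
upper = [s.upper() for s in names]

# ...while numpy-style boolean indexing is still available.
lengths = np.array([len(s) for s in names])
long_names = names[lengths > 4]
print(long_names)
```

This is the middle ground the docs suggest: pay the pointer-indirection cost, keep both fast string methods and array semantics.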
Another good example of when a list may be better is appending lots of data.
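The reason, sketched below with a toy example of my own: list.append grows the list in place with amortized O(1) cost, while np.append allocates a brand-new array and copies everything on every call, making element-by-element growth O(n**2) overall.

```python
import numpy as np

# Growing a list: the list over-allocates and appends in place.
grow_list = []
for i in range(5):
    grow_list.append(str(i))

# Growing an array: np.append copies the whole array on each call.
grow_arr = np.array([], dtype='U1')
for i in range(5):
    grow_arr = np.append(grow_arr, str(i))

print(grow_list)   # same contents either way; very different cost at scale
print(grow_arr)
```

If the final size is known, preallocating the array (or building the list first and converting once) avoids the repeated copies.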
So, takeaways from this:

- "Vectorized" in the numpy docs refers to syntax, not necessarily to speed.
- A plain list of Python strings is still the fastest option for string operations.
- If you need numpy and your data fits in ASCII, use the string_ dtype; if you need unicode, prefer a list or an object_ array.
- Avoid chararray; it exists only for backwards compatibility.