Applying string operations to numpy arrays?

Tags: python, numpy

Are there better ways to apply string operations to ndarrays than iterating over them? I would like to use a "vectorized" operation, but I can only think of using map (example shown) or list comprehensions.

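import numpy

# Build a record array with an integer column and a string column.
Arr = numpy.rec.fromrecords(list(zip(range(5), 'as far as i know'.split())),
                            names='name, strings')

# Take the first letter of each word, upper-case it, and append a period.
print(''.join(map(lambda x: x[0].upper() + '.', Arr['strings'])))
=> A.F.A.I.K.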

For instance, in the R language, string operations are also vectorized:

> (string <- unlist(strsplit("as far as i know"," ")))
[1] "as"   "far"  "as"   "i"    "know"
> paste(sprintf("%s.",toupper(substr(string,1,1))),collapse="")
[1] "A.F.A.I.K."
hatmatrix asked Nov 11 '11 05:11


2 Answers

Yes, recent versions of NumPy have vectorized string operations in the numpy.char module. For example, to find all strings starting with a B in an array of strings:

>>> import numpy as np
>>> y = np.asarray("B-PER O O B-LOC I-LOC O B-ORG".split())
>>> y
array(['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-ORG'], dtype='<U5')
>>> np.char.startswith(y, 'B')
array([ True, False, False,  True, False, False,  True])
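As a rough sketch of how the question's own example might look with numpy.char (this is not part of the original answer; the variable names are made up for illustration, and truncating with astype('U1') is just one assumed way to grab the first character):

```python
import numpy as np

# The words from the question, as a NumPy string array.
strings = np.array('as far as i know'.split())

# Casting to a 1-character string dtype keeps only the first character.
first_letters = strings.astype('U1')

# Element-wise upper-casing and concatenation, with no Python-level loop.
initials = np.char.add(np.char.upper(first_letters), '.')

# The final join still happens in plain Python.
result = ''.join(initials.tolist())
print(result)
```

The element-wise steps stay inside NumPy, but collapsing the array into one string is still a job for str.join.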
Fred Foo answered Sep 21 '22 08:09


Update: See Larsman's answer to this question: Numpy recently added a numpy.char module for basic string operations.

Short answer: Numpy doesn't provide vectorized string operations. The idiomatic way is to do something like (where Arr is your numpy array):

print(''.join(item[0].upper() + '.' for item in Arr['strings']))

Long answer: here's why numpy doesn't provide vectorized string operations (with a good bit of rambling in between).

One size does not fit all when it comes to data structures.

Your question probably seems odd to people coming from a non-domain-specific programming language, but it makes a lot of sense to people coming from a domain-specific language.

Python gives you a wide variety of choices of data structures. Some data structures are better at some tasks than others.

First off, numpy arrays aren't the default "hold-all" container in python. Python's builtin containers are very good at what they're designed for. Often, a list or a dict is what you want.

Numpy's ndarrays are for homogeneous data.

In a nutshell, numpy doesn't have vectorized string operations.

ndarrays are a specialized container focusing on storing N-dimensional homogeneous groups of items in the minimum amount of memory possible. The emphasis is really on minimizing memory usage. (I'm biased, because that's mostly what I need them for, but it's a useful way to think of it.) Vectorized mathematical operations are just a nice side effect of having things stored in a contiguous block of memory.

Strings are usually of different lengths.

E.g. ['Dog', 'Cat', 'Horse']. Numpy takes the database-like approach of requiring you to define a length for your strings, but the simple fact that strings aren't expected to be a fixed length has a lot of implications.
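A small sketch of what that fixed length means in practice (the array name is made up for illustration):

```python
import numpy as np

# NumPy picks one width for the whole array: the longest string present.
animals = np.array(['Dog', 'Cat', 'Horse'])
print(animals.dtype)   # a unicode dtype that is 5 characters wide

# Assigning a longer string does not grow the array; the value is
# silently truncated to fit the existing width.
animals[0] = 'Elephant'
print(animals[0])
```

A plain Python list would simply hold the longer string, which is part of why lists are the more natural container here.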

Most useful string operations return variable length strings. (e.g. '.'.join(...) in your example)

Those that don't (e.g. upper, etc) you can mimic with other operations if you want to. (E.g. for an array x of lowercase ASCII byte strings, upper is roughly (x.view(np.uint8) - 32).view('S1'). I don't recommend that you do that, but you can...)
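To make the byte-level trick concrete, here is a hedged sketch. It assumes ASCII byte strings and adds a mask (not in the expression above) so that padding bytes and non-lowercase characters are left alone; the names are illustrative only.

```python
import numpy as np

words = np.array([b'dog', b'cat', b'horse'])   # fixed-width byte strings

# Reinterpret the string buffer as raw bytes (copy so words is untouched).
codes = words.view(np.uint8).copy()

# Only shift bytes that are lowercase ASCII letters; the null padding
# after 'dog' and 'cat' must not be modified.
is_lower = (codes >= ord('a')) & (codes <= ord('z'))
codes[is_lower] -= 32

# Reinterpret the bytes as strings of the original width.
upper = codes.view(words.dtype)
print(upper)

# The supported route gives the same result with far less ceremony.
print(np.char.upper(words))
```

This is mainly a curiosity: it only handles ASCII, and np.char.upper (or plain str.upper in a comprehension) is the readable choice.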

As a basic example: 'A' + 'B' yields 'AB'. 'AB' is not the same length as 'A' or 'B'. Numpy deals with other things that do this (e.g. np.uint8(4) + np.float64(3.4)), but strings are much more flexible in length than numbers. ("Upcasting" and "downcasting" rules for numbers are pretty simple.)
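The contrast can be sketched as follows, using np.char.add for the string case (the array names are hypothetical):

```python
import numpy as np

# Numeric upcasting: the result type follows a small set of promotion rules.
total = np.uint8(4) + np.float64(3.4)
print(type(total))

# String concatenation: the result width depends on the widths of both inputs.
left = np.array(['A', 'Dog'])      # up to 3 characters per element
right = np.array(['B', 'Horse'])   # up to 5 characters per element
joined = np.char.add(left, right)
print(joined, joined.dtype)
```

The numeric result fits one of a handful of types, while the string result needs a width that has to be worked out from the operands.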

Another reason numpy doesn't do it is that the focus is on numerical operations. 'A'**2 has no particular definition in python (You can certainly make a string class that does, but what should it be?). String arrays are second class citizens in numpy. They exist, but most operations aren't defined for them.
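A quick illustration of that second-class status, as a sketch with made-up names: comparison works on string arrays, but arithmetic does not.

```python
import numpy as np

words = np.array(['A', 'B'])

# Comparison is one of the few operators defined for string arrays.
equal_to_a = (words == 'A')

# Arithmetic such as ** has no string meaning, so NumPy raises a TypeError,
# just as plain python does for 'A' ** 2.
try:
    words ** 2
    power_defined = True
except TypeError:
    power_defined = False

print(equal_to_a, power_defined)
```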

Python is already really good at handling string processing

The other (and really, the main) reason numpy doesn't try to offer string operations is that python is already really good at it.

Lists are fantastic flexible containers. Python has a huge set of very nice, very fast string operations. List comprehensions and generator expressions are fairly fast, and they don't suffer any overhead from trying to guess what the type or size of the returned item should be, as they don't care. (They just store a pointer to it.)

Also, iterating over numpy arrays in python is slower than iterating over a list or tuple, but for string operations you're still best off just using normal list comprehensions or generator expressions (e.g. print(''.join(item[0].upper() + '.' for item in Arr['strings'])) in your example). Better yet, don't use numpy arrays to store strings in the first place. It makes sense if you have a single column of a structured array with strings, but that's about it. Python gives you very rich and flexible data structures. Numpy arrays aren't the be-all and end-all, and they're a specialized case, not a generalized case.
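If the strings already live in a structured array, one option (a sketch, not something the question requires) is to convert the column to a plain list once and do the string work there:

```python
import numpy as np

# Same record array as in the question.
Arr = np.rec.fromrecords(list(zip(range(5), 'as far as i know'.split())),
                         names='name, strings')

# tolist() turns the column into ordinary Python str objects in one step,
# so the loop below iterates over a list rather than over the ndarray.
words = Arr['strings'].tolist()

result = ''.join(word[0].upper() + '.' for word in words)
print(result)
```

For five words the difference is negligible; the conversion only starts to matter when the column is large.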

Also, keep in mind that most of what you'd want to do with a numpy array

Learn Python, not just Numpy

I'm not trying to be cheeky here, but working with numpy arrays is very similar to a lot of things in Matlab or R or IDL, etc.

It's a familiar paradigm, and anyone's first instinct is to try to apply that same paradigm to the rest of the language.

Python is a lot more than just numpy. It's a multi-paradigm language, so it's easy to stick to the paradigms that you're already used to. Try to learn to "think in python" as well as just "thinking in numpy". Numpy provides a specific paradigm to python, but there's a lot more there, and some paradigms are a better fit for some tasks than others.

Part of this is becoming familiar with the strengths and weaknesses of different data containers (lists vs dicts vs tuples, etc), as well as different programming paradigms (e.g. object-oriented vs functional vs procedural, etc).

All in all, python has several different types of specialized data structures. This makes it somewhat different from domain-specific languages like R or Matlab, which have a few types of data structures, but focus on doing everything with one specific structure. (My experience with R is limited, so I may be wrong there, but that's my impression of it. It's certainly true of Matlab.)

At any rate, I'm not trying to rant here, but it took me quite a while to stop writing Fortran in Matlab, and it took me even longer to stop writing Matlab in python. This rambling answer is very short on concrete examples, but hopefully it makes at least a little bit of sense, and helps somewhat.

Joe Kington answered Sep 22 '22 08:09