 

Simple pandas / numpy 'indexing' in vectorized calcs

Sorry for a basic question. I'm sure the answer is quite simple, but I've banged my head against a wall for a while trying to figure it out. I'm new to Python, but I understand the concept of vectorized computations. For instance, in the following (pretty trivial) piece of code:

import pandas as pd

ndx = ['a', 'b', 'c', 'd', 'e', 'f']
first = [3, 7, 2, 5, 9, 4]
second = [8, 9, 7, 3, 3, 7]

first = pd.DataFrame(first, index = ndx)
second = pd.DataFrame(second, index = ndx)

I know that first > second will return a boolean DataFrame, True wherever an element of first is greater than the corresponding element of second, matching the indexes. I understand that this rigid index matching is one of the benefits of using pandas, but...

Question: how can I efficiently reference "offset" indexes in a vectorized operation? For instance, what if I want to compare the next value in second with the current value in first (first['a'] > second['b'], first['b'] > second['c'], ...)? Along the same lines, what if I want to return True only if first['a'] is greater than both second['a'] and second['b']?

I've written some code that does things like this iterating through the array by index. Here's an example:

        if next.at[curr.index[i], 'OI'] > curr.OI[i] and \
        next.at[curr.index[i+1], 'OI'] > curr.OI[i+1] and \
        next.at[curr.index[i], 'Vol'] > curr.Vol[i] and \
        next.at[curr.index[i+1], 'Vol'] > curr.Vol[i+1]:

(next and curr are DataFrames. OI and Vol are columns in those dataframes and i is my counter.) I know this is not pythonic and it's also super slow (which... hmmm... might be why it's not pythonic? lol)

Thank you in advance.

Summary: general question is how to reference offset elements in pandas (and numpy).

EDIT: Thank you to Jaime and TomAugspurger for the answers for np and pd below. Got it... makes sense.

Followup question: How can I use the pandas shift approach with DataFrames of different lengths? Imagine I have two time series that overlap, but one starts earlier and the other ends later. So there are non-matching values in each index, and the indexes are (almost certainly) different lengths. pandas will not allow the comparison and raises this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-35914edbe0ff> in <module>()
----> 1 aa = q['OI'] > r['OI']

C:\Python27\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other)
    540             name = _maybe_match_name(self, other)
    541             if len(self) != len(other):
--> 542                 raise ValueError('Series lengths must match to compare')
    543             return self._constructor(na_op(self.values, other.values),
    544                                      index=self.index, name=name)

ValueError: Series lengths must match to compare

I suppose I could add a step and take the set defined by the union of the indexes, but that seems like an inefficient extra step. (I'm trying to learn correct coding practice as much as (or more than) simply getting my project to work.) Any ideas on this? Thank you in advance.
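One possible way to handle the different-length case is to restrict both series to their shared index labels before comparing. The sketch below uses Series.align with join="inner" for that. The dates and values are made up for illustration, and q and r are assumed to be Series taken from the 'OI' columns mentioned above; this is not necessarily what the referenced answers proposed.

```python
import pandas as pd

# Hypothetical overlapping series: q starts earlier, r ends later.
q = pd.Series([10, 12, 11, 15],
              index=pd.date_range("2014-01-01", periods=4), name="OI")
r = pd.Series([12, 13, 14, 10],
              index=pd.date_range("2014-01-03", periods=4), name="OI")

# Keep only the labels present in both indexes (2014-01-03 and 2014-01-04).
q_common, r_common = q.align(r, join="inner")

# Both series now have the same index, so the comparison is well defined.
aa = q_common > r_common
print(aa)
```

After alignment the two series have identical indexes, so an offset comparison such as q_common > r_common.shift(-1) should work the same way as with equal-length data.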

user3241893 asked Jan 24 '26 03:01


1 Answer

Not sure about pandas, but in numpy you do this type of thing by comparing offset slices. With your example of "comparing the next value with the current one" you could do something like:

>>> import numpy as np
>>> first = np.array([3, 7, 2, 5, 9, 4])
>>> second = np.array([8, 9, 7, 3, 3, 7])
>>> first[:-1] > second[1:]
array([False, False, False,  True,  True], dtype=bool)

The slicing obviously does not compare the last element of first, or the first of second, to anything.
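On the pandas side, which this answer does not cover, a rough counterpart to the offset slices is Series.shift, the method the question's edit refers to. The sketch below is only an illustration under one assumption: it builds first and second as Series rather than the single-column DataFrames in the question, so that labels like 'a' index rows directly.

```python
import pandas as pd

ndx = ['a', 'b', 'c', 'd', 'e', 'f']
# Assumption: Series instead of the question's one-column DataFrames.
first = pd.Series([3, 7, 2, 5, 9, 4], index=ndx)
second = pd.Series([8, 9, 7, 3, 3, 7], index=ndx)

# shift(-1) moves every value up one row, so label 'a' now holds second['b'].
next_second = second.shift(-1)

# first['a'] > second['b'], first['b'] > second['c'], ...
# The last row is compared against NaN, which evaluates to False.
gt_next = first > next_second

# True only where first beats both the current and the next value of second.
gt_both = (first > second) & gt_next

print(gt_next)
print(gt_both)
```

Unlike the numpy slices, the result keeps all six labels; the final row has nothing to compare against and comes out False.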

Jaime answered Jan 25 '26 18:01