Numpy array conditional matching

Tags: python, arrays, numpy

I need to match two very large NumPy arrays (one is 20000 rows, the other about 100000 rows) and I am trying to build a script that does it efficiently. Simply looping over the arrays is incredibly slow. Can someone suggest a better way?

Here is what I am trying to do. The arrays datesSecondDict and pwfs2Dates contain datetime values. For each datetime value in pwfs2Dates (the smaller array), I need to check whether datesSecondDict contains a datetime value within plus or minus 5 minutes of it (there might be more than one). If there is one (or more), I fill the corresponding position of a new array (of the same size as pwfs2Dates) with one of the matching values from valsSecondDict, which is the array of numerical values corresponding to datesSecondDict.

Here is a solution by @unutbu and @joaquin that worked for me (thanks guys!):

import time
import datetime as dt
import numpy as np

def combineArs(dict1, dict2):
   """Combine data from 2 dictionaries into a list.
   dict1 contains primary data (e.g. seeing parameter).
   The function compares each timestamp in dict1 to dict2
   to see if there is a matching timestamp record(s)
   in dict2 (plus/minus 5 minutes).
   ==If yes: a list called data gets appended with the
   corresponding parameter value from dict2.
   (Note that if more than one record matches,
   the first occurring value gets appended to the list).
   ==If no: a list called data gets appended with 0."""
   # Specify the keys to use    
   pwfs2Key = 'pwfs2:dc:seeing'
   dimmKey = 'ws:seeFwhm'

   # Create an iterator for primary dict 
   datesPrimDictIter = iter(dict1[pwfs2Key]['datetimes'])

   # Take the first timestamp value in primary dict
   nextDatePrimDict = next(datesPrimDictIter)

   # Split the second dictionary into lists
   datesSecondDict = dict2[dimmKey]['datetime']
   valsSecondDict  = dict2[dimmKey]['values']

   # Define time window
   fiveMins = dt.timedelta(minutes = 5)
   data = []
   #st = time.time()
   for i, nextDateSecondDict in enumerate(datesSecondDict):
       try:
           while nextDatePrimDict < nextDateSecondDict - fiveMins:
               # If there is no match: append zero and move on
               data.append(0)
               nextDatePrimDict = next(datesPrimDictIter)
           while nextDatePrimDict < nextDateSecondDict + fiveMins:
               # If there is a match: append the value of second dict
               data.append(valsSecondDict[i])
               nextDatePrimDict = next(datesPrimDictIter)
       except StopIteration:
           break
   data = np.array(data)   
   #st = time.time() - st    
   return data
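
For comparison, the same window test can be sketched without a Python-level loop by using np.searchsorted. This is only a sketch under assumptions: the function name match_within_window and its parameters are hypothetical, the secondary timestamps are assumed to be sorted in ascending order, and the window is taken as (t - 5 min, t + 5 min] to mirror the comparisons in the loop above. Unlike the loop, it always returns one value per primary timestamp.

```python
import numpy as np

def match_within_window(prim_dates, sec_dates, sec_vals,
                        window=np.timedelta64(5, "m")):
    """Hypothetical helper: for each primary timestamp t, return the value of
    the first secondary timestamp in (t - window, t + window], or 0 if none.
    Assumes sec_dates is sorted in ascending order."""
    prim_dates = np.asarray(prim_dates, dtype="datetime64[s]")
    sec_dates = np.asarray(sec_dates, dtype="datetime64[s]")
    sec_vals = np.asarray(sec_vals, dtype=float)

    # Nothing to match against: every primary timestamp gets 0
    if sec_dates.size == 0:
        return np.zeros(prim_dates.shape, dtype=float)

    # Index of the first secondary timestamp strictly later than t - window
    idx = np.searchsorted(sec_dates, prim_dates - window, side="right")
    # Clip so the index stays valid where no later secondary timestamp exists
    safe_idx = np.minimum(idx, sec_dates.size - 1)
    # A match requires that candidate to be no later than t + window
    matched = (idx < sec_dates.size) & (sec_dates[safe_idx] <= prim_dates + window)
    return np.where(matched, sec_vals[safe_idx], 0.0)
```

Lists of datetime.datetime objects should convert through np.asarray in the same way as the ISO strings used for testing here, but that is worth checking against the real data before relying on it.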

Thanks, Aina.

asked Dec 19 '11 17:12

Aina

1 Answer

Are the date arrays sorted?

If yes, you can speed up the comparison by breaking out of the inner loop as soon as its dates are later than the date given by the outer loop. That way you make a single pass over the data instead of looping over the dimVals items len(pwfs2Vals) times.
If no, you could transform the current pwfs2Dates array into, for example, an array of pairs [(date, array_index), ...] and then sort all your arrays by date. That lets you make the one-pass comparison described above while still keeping the original indexes needed to set data[i].

For example, if the arrays were already sorted (I use lists here, not sure you need arrays for that). (Edited: now using an iterator so as not to loop over pwfs2Dates from the beginning on each step):

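pdates = iter(enumerate(pwfs2Dates))
i, datei = pdates.next()

for datej, valuej in zip(dimmDates, dimVals):
    # Skip pwfs2 dates that fall before the window around datej
    while datei < datej - fiveMinutes:
        i, datei = pdates.next()
    # Assign valuej to every pwfs2 date inside the window around datej
    while datei < datej + fiveMinutes:
        data[i] = valuej
        i, datei = pdates.next()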
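
The snippet above leaves fiveMinutes and data undefined and uses the Python 2 iterator method. A self-contained Python 3 sketch of the same one-pass idea might look like this. The function name match_sorted and its parameters are hypothetical, both date lists are assumed to be sorted in ascending order, and unmatched positions are assumed to stay 0.

```python
import datetime as dt

def match_sorted(pwfs2_dates, dimm_dates, dimm_vals,
                 window=dt.timedelta(minutes=5)):
    """Hypothetical helper: one-pass match of two ascending date lists."""
    # One slot per primary date; positions without a match keep the value 0
    data = [0] * len(pwfs2_dates)
    pdates = iter(enumerate(pwfs2_dates))
    try:
        i, datei = next(pdates)
        for datej, valuej in zip(dimm_dates, dimm_vals):
            # Skip primary dates that fall before the window around datej
            while datei < datej - window:
                i, datei = next(pdates)
            # Assign valuej to every primary date inside the window
            while datei < datej + window:
                data[i] = valuej
                i, datei = next(pdates)
    except StopIteration:
        # The primary dates ran out; the remaining slots stay 0
        pass
    return data
```

Because data is preallocated, the result has the same length as pwfs2_dates even when the iterator is exhausted early or the secondary list ends first.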

Otherwise, if they were not ordered and you created the sorted, indexed lists like this:

pwfs2Dates = sorted([(date, idx) for idx, date in enumerate(pwfs2Dates)])
dimmDates = sorted([(date, idx) for idx, date in enumerate(dimmDates)])

the code would be:
(Edited: now using an iterator so as not to loop over pwfs2Dates from the beginning on each step):

pdates = iter(pwfs2Dates)
datei, i = pdates.next()

for datej, j in dimmDates:
    while datei < datej - fiveMinutes:
        datei, i = pdates.next()
    while datei < datej + fiveMinutes:
        data[i] = dimVals[j]
        datei, i = pdates.next()
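
Put together for unsorted input, the approach could be sketched in Python 3 as below. The name match_unsorted and its parameters are hypothetical, and positions without a match are assumed to stay 0. The original indexes carried in the (date, idx) pairs are what let the results land in the caller's order.

```python
import datetime as dt

def match_unsorted(pwfs2_dates, dimm_dates, dimm_vals,
                   window=dt.timedelta(minutes=5)):
    """Hypothetical helper: sort both date lists with their original
    indexes, then run the one-pass comparison."""
    p_sorted = sorted((date, idx) for idx, date in enumerate(pwfs2_dates))
    d_sorted = sorted((date, idx) for idx, date in enumerate(dimm_dates))
    data = [0] * len(pwfs2_dates)
    pdates = iter(p_sorted)
    try:
        datei, i = next(pdates)
        for datej, j in d_sorted:
            # Skip primary dates that fall before the window around datej
            while datei < datej - window:
                datei, i = next(pdates)
            # i and j are positions in the original, unsorted lists
            while datei < datej + window:
                data[i] = dimm_vals[j]
                datei, i = next(pdates)
    except StopIteration:
        # The primary dates ran out; the remaining slots stay 0
        pass
    return data
```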

great!

Note that dimVals:
```
dimVals  = np.array(dict1[dimmKey]['values'])
```
is not used in your code and can be eliminated.
Note also that your code becomes much simpler if you loop over the array itself instead of using xrange.

Edit: The answer from unutbu addresses some weak parts in the code above. I indicate them here for completeness:

Use of next: next(iterator) is preferred to iterator.next(). iterator.next() is an exception to a conventional naming rule, which was fixed in py3k by renaming the method to iterator.__next__().
Check for the end of the iterator with a try/except. Once all the items in the iterator have been consumed, the next call to next() raises a StopIteration exception. Use try/except to break out of the loop gracefully when that happens. For the specific case of the OP's question this is not an issue, because the two arrays are the same size, so the for loop finishes at the same time as the iterator and no exception is raised. However, there could be cases where dict1 and dict2 are not the same size, and then an exception may be raised. The question is which is better: to use try/except, or to prepare the arrays before looping by trimming them to the length of the shorter one.

answered Sep 18 '22 14:09

11 revs
