I need to match two very large NumPy arrays (one has 20000 rows, the other about 100000 rows) and I am trying to build a script to do it efficiently. Simply looping over the arrays is incredibly slow; can someone suggest a better way?

Here is what I am trying to do: array datesSecondDict and array pwfs2Dates contain datetime values. I need to take each datetime value from array pwfs2Dates (the smaller array) and see if there is a datetime value like that (plus or minus 5 minutes) in array datesSecondDict (there might be more than one). If there is one (or more), I populate a new array (of the same size as array pwfs2Dates) with the value (one of the values) from array valsSecondDict (which is just the array with the numerical values corresponding to datesSecondDict). Here is a solution by @unutbu and @joaquin that worked for me (thanks guys!):
import time
import datetime as dt
import numpy as np

def combineArs(dict1, dict2):
    """Combine data from 2 dictionaries into a list.
    dict1 contains primary data (e.g. seeing parameter).
    The function compares each timestamp in dict1 to dict2
    to see if there is a matching timestamp record(s)
    in dict2 (plus/minus 5 minutes).
    ==If yes: a list called data gets appended with the
    corresponding parameter value from dict2.
    (Note that if there are more than 1 record matching,
    the first occurring value gets appended to the list).
    ==If no: a list called data gets appended with 0."""
    # Specify the keys to use
    pwfs2Key = 'pwfs2:dc:seeing'
    dimmKey = 'ws:seeFwhm'

    # Create an iterator for primary dict
    datesPrimDictIter = iter(dict1[pwfs2Key]['datetimes'])

    # Take the first timestamp value in primary dict
    nextDatePrimDict = next(datesPrimDictIter)

    # Split the second dictionary into lists
    datesSecondDict = dict2[dimmKey]['datetime']
    valsSecondDict = dict2[dimmKey]['values']

    # Define time window
    fiveMins = dt.timedelta(minutes=5)
    data = []
    #st = time.time()
    for i, nextDateSecondDict in enumerate(datesSecondDict):
        try:
            while nextDatePrimDict < nextDateSecondDict - fiveMins:
                # If there is no match: append zero and move on
                data.append(0)
                nextDatePrimDict = next(datesPrimDictIter)
            while nextDatePrimDict < nextDateSecondDict + fiveMins:
                # If there is a match: append the value of second dict
                data.append(valsSecondDict[i])
                nextDatePrimDict = next(datesPrimDictIter)
        except StopIteration:
            break
    data = np.array(data)
    #st = time.time() - st
    return data
Thanks, Aina.
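For comparison, here is a minimal sketch of a different approach that is not taken from the answers: a binary search with the standard-library bisect module instead of the one-pass merge. It assumes secondary_dates is already sorted in ascending order, and the function and parameter names are made up for illustration. It treats both ends of the window as inclusive, so exact boundary cases can differ from the loop above.

```python
import bisect
import datetime as dt


def match_by_bisect(primary_dates, secondary_dates, secondary_vals,
                    window=dt.timedelta(minutes=5), fill=0):
    """Return one value per primary date (hypothetical helper).

    Assumes secondary_dates is sorted ascending and lines up
    index-for-index with secondary_vals.
    """
    data = []
    for date in primary_dates:
        # Index of the first secondary date >= date - window.
        k = bisect.bisect_left(secondary_dates, date - window)
        # It is a match only if that date is also <= date + window.
        if k < len(secondary_dates) and secondary_dates[k] <= date + window:
            data.append(secondary_vals[k])
        else:
            data.append(fill)
    return data
```

This costs one binary search per primary date, and the primary dates do not need to be sorted. np.searchsorted performs the same kind of search for a whole array of query values at once, which may be worth trying if the dates are stored as NumPy datetime64 arrays.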
Are the array dates sorted?

If yes, you can make a one-pass comparison instead of looping over the dimVals items len(pwfs2Vals) times.

If not, you could transform the pwfs2Dates array to, for example, an array of pairs [(date, array_index), ...]. You can then sort all your arrays by date to make the one-pass comparison indicated above and, at the same time, still be able to get the original indexes needed to set data[i].
For example, if the arrays were already sorted (I use lists here, not sure you need arrays for that). (Edited: now using an iterator so as not to loop over pwfs2Dates from the beginning on each step.)

pdates = iter(enumerate(pwfs2Dates))
# .next() is the Python 2 spelling; next(pdates) works on Python 2.6+ and 3
i, datei = pdates.next()

for datej, valuej in zip(dimmDates, dimVals):
    # Skip primary dates that are too early to match datej
    while datei < datej - fiveMinutes:
        i, datei = pdates.next()
    # Primary dates inside the window take the value for datej
    while datei < datej + fiveMinutes:
        data[i] = valuej
        i, datei = pdates.next()
Otherwise, if they were not ordered and you created the sorted, indexed lists like this:
pwfs2Dates = sorted([(date, idx) for idx, date in enumerate(pwfs2Dates)])
dimmDates = sorted([(date, idx) for idx, date in enumerate(dimmDates)])
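the code would be (edited: now using an iterator so as not to loop over pwfs2Dates from the beginning on each step):

pdates = iter(pwfs2Dates)
datei, i = pdates.next()  # Python 2 spelling, see above

for datej, j in dimmDates:
    # Skip primary dates that are too early to match datej
    while datei < datej - fiveMinutes:
        datei, i = pdates.next()
    # i and j are the original positions, so data keeps the input order
    while datei < datej + fiveMinutes:
        data[i] = dimVals[j]
        datei, i = pdates.next()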
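Putting the two snippets above together, a self-contained sketch for the unsorted case could look like the following. The function name, the fill value of 0 for unmatched dates, and the try/except around the loop are assumptions added here for illustration, not part of the answer as posted.

```python
import datetime as dt


def match_unsorted(primary_dates, secondary_dates, secondary_vals,
                   window=dt.timedelta(minutes=5), fill=0):
    """One-pass match on unsorted inputs (hypothetical helper).

    Returns a list aligned with primary_dates, in its original order.
    """
    data = [fill] * len(primary_dates)

    # Sort by date while remembering each item's original position.
    primary = sorted((d, i) for i, d in enumerate(primary_dates))
    secondary = sorted((d, j) for j, d in enumerate(secondary_dates))

    it = iter(primary)
    try:
        date_i, i = next(it)
        for date_j, j in secondary:
            # Primary dates too early for date_j keep the fill value.
            while date_i < date_j - window:
                date_i, i = next(it)
            # Primary dates inside the window take the value for date_j.
            while date_i < date_j + window:
                data[i] = secondary_vals[j]
                date_i, i = next(it)
    except StopIteration:
        # All primary dates have been handled.
        pass
    return data
```

Because each primary date is consumed as soon as it matches, it receives the value of the earliest secondary date whose window contains it.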
great!
..
Note that dimVals:
dimVals = np.array(dict1[dimmKey]['values'])
is not used in your code and can be eliminated.
Edit: The answer from unutbu addresses some weak parts in the code above. I point them out here for completeness:

next: next(iterator) is preferred to iterator.next(). iterator.next() is an exception to a conventional naming rule that has been fixed in py3k by renaming this method to iterator.__next__().

try/except: After all the items in the iterator are consumed, the next call to next() raises a StopIteration exception. Use try/except to break out of the loop cleanly when that happens. For the specific case of the OP's question this is not an issue, because the two arrays are the same size, so the for loop finishes at the same time as the iterator and no exception is raised. However, there could be cases where dict1 and dict2 are not the same size, and in that case there is the possibility of an exception being raised.
The question is: which is better, using try/except or preparing the arrays before looping by equalizing them to the shorter one?
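A possible third option, offered as a sketch rather than a recommendation: the built-in next() accepts a default as a second argument, which is returned instead of raising StopIteration once the iterator is exhausted. The helper below (count_before, a hypothetical name) uses None as that marker in a toy one-pass loop, so it assumes None never appears in the data.

```python
def count_before(values, limits):
    """For each limit, count the sorted values consumed before it.

    Toy one-pass loop: both inputs are assumed sorted ascending.
    """
    it = iter(values)
    current = next(it, None)  # None marks an exhausted iterator
    counts = []
    for limit in limits:
        n = 0
        while current is not None and current < limit:
            n += 1
            current = next(it, None)
        counts.append(n)
    return counts
```

With this form the loop needs neither a try/except nor arrays trimmed to the same length; once the values run out, the remaining limits simply get a count of 0.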