I was wondering if numpy could be used to build the most basic cube model, where all cross-combinations and their computed values are stored. Let's take the following example data:
AUTHOR       BOOK     YEAR  SALES
Shakespeare  Hamlet   2000  104.2
Shakespeare  Hamlet   2001   99.0
Shakespeare  Romeo    2000   27.0
Shakespeare  Romeo    2001   19.0
Dante        Inferno  2000   11.6
Dante        Inferno  2001   12.6
And I'd like to be able to build something like:
                          YEAR
AUTHOR       BOOK         2000   2001  TOTAL
(ALL)        (ALL)       142.8  130.6  273.4
Shakespeare  (ALL)       131.2  118.0  249.2
Dante        (ALL)        11.6   12.6   24.2
Shakespeare  Hamlet      104.2   99.0  203.2
Shakespeare  Romeo        27.0   19.0   46.0
Dante        Inferno      11.6   12.6   24.2
I'm hoping that something like meshgrid might get me 75% of the way there. Basically, I'd like to see if it's possible to build a structure of all pre-computed values with numpy (not pandas), so that I could retrieve the above result for any possible combination. For the sake of simplicity, let's consider SUM as the only calculation. Perhaps this is a roundabout way of asking, but could numpy be the backbone of doing this, or do I need to use something else? And finally, if it's not possible in numpy, how might this be stored in an MDA (multidimensional array)?
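To make the target concrete, here's a minimal sketch of what I mean for a single 2-D slice of the cube, with SUM hard-coded (np.add.at is just my guess at the right primitive, not a settled choice):

import numpy as np

# Minimal sketch of one 2-D slice of the desired cube, SUM only
author = np.array(4*["Shakespeare"] + 2*["Dante"])
year = np.array(3*["2000", "2001"])
sales = np.array([104.2, 99.0, 27.0, 19.0, 11.6, 12.6])

authors, ai = np.unique(author, return_inverse=True)  # sorted labels + codes
years, yi = np.unique(year, return_inverse=True)
cell = np.zeros((authors.size, years.size))
np.add.at(cell, (ai, yi), sales)   # SUM of sales per (author, year) cell
# cell.sum(axis=1) is the TOTAL column; cell.sum() is the (ALL, ALL) cell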
I think numpy record arrays can be used for this task. Below is my solution based on record arrays.
import numpy as np

class rec_array():
    def __init__(self, author=(), book=(), year=(), sales=()):
        # Empty defaults let us build an empty container and fill it later
        self.dtype = [('author', '<U20'), ('book', '<U20'), ('year', '<U20'), ('sales', float)]
        self.rec_array = np.rec.fromarrays((author, book, year, sales), dtype=self.dtype)

    def add_record(self, author, book, year, sales):
        new_rec = np.rec.fromarrays((author, book, year, sales), dtype=self.dtype)
        if not self.rec_array.shape == ():
            self.rec_array = np.hstack((self.rec_array, new_rec))
        else:
            self.rec_array = new_rec

    def get_view(self, conditions):
        """
        conditions:
            A list of conditions, for example
            [["author", "==", "Shakespeare"], ["year", "<=", "2000"]]
        """
        # AND together one boolean mask per condition
        mask = np.ones(self.rec_array.shape[0]).astype(bool)
        for field, op, target in conditions:
            # Build the comparison as a string and evaluate it
            # (convenient, but don't pass untrusted input to eval)
            field_op = "self.rec_array['%s'] %s '%s'" % (field, op, target)
            mask &= eval(field_op)
        selected_sales = self.rec_array['sales'][mask]
        return np.sum(selected_sales)
Based on this rec_array, given the data
author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]
we create an instance
test = rec_array()
test.add_record(author,book,year,sales)
If, for example, you want the sales of Shakespeare's Romeo, you can simply do this
test.get_view([["author","==","Shakespeare"],["book","==","Romeo"]])
the output is 46.0
or, you can also do
test.get_view([["author","==","Shakespeare"],["year","<=","2000"]])
the output is 131.2
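To tie this back to the table in the question: each "(ALL)" row is just a view with fewer conditions, so the pre-computed rows can be reproduced by looping over label combinations, for example:

# The "(ALL)" rows of the desired table are views with fewer conditions
for y in ("2000", "2001"):
    print(y, test.get_view([["author", "==", "Shakespeare"],
                            ["year", "==", y]]))
# 2000 131.2
# 2001 118.0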
For the data structure, you could define the following class:
class Cube:
    def __init__(self, row_index, col_index, data):
        self.row_index = {r: i for i, r in enumerate(row_index)}
        self.col_index = {c: i for i, c in enumerate(col_index)}
        self.data = data

    def __getitem__(self, item):
        row, col = item
        return self.data[self.row_index[row], self.col_index[col]]

    def __repr__(self):
        return repr(self.data)
Basically a light wrapper around a two-dimensional numpy array. For computing the cross tabulation you could do something like this:
import numpy as np


def _x_tab(rows, columns, values):
    """Compute the cross-tab of simple arrays."""
    unique_values_all_cols, idx = zip(*(np.unique(col, return_inverse=True) for col in [rows, columns]))
    shape_xt = [uniq_vals_col.size for uniq_vals_col in unique_values_all_cols]
    xt = np.zeros(shape_xt, dtype=float)  # plain float: np.float was removed from numpy
    np.add.at(xt, idx, values)
    return unique_values_all_cols, xt


def make_index(a, r):
    """Make an array of tuples from the selected columns."""
    l = [tuple(row) for row in a[:, r]]
    return make_object_array(l)


def make_object_array(l):
    # Build a 1-D object array of tuples (np.array alone would make it 2-D)
    a = np.empty(len(l), dtype=object)
    a[:] = l
    return a


def fill_label(ar, le):
    """Fill missing parts with the ALL label."""
    missing = tuple(["ALL"] * le)
    return [(e + missing)[:le] for e in ar]


def x_tab(rows, cols, values):
    """Main function for cross tabulation."""
    _, l_cols = rows.shape
    total_agg = []
    total_idx = []
    for i in range(l_cols + 1):
        # Aggregate over the first i index columns; i == 0 is the (ALL, ALL) row
        (idx, _), agg = _x_tab(make_index(rows, list(range(i))), cols, values)
        total_idx.extend(fill_label(idx, l_cols))
        total_agg.append(agg)
    stacked_agg = np.vstack(total_agg)
    stacked_agg_total = stacked_agg.sum(axis=1).reshape(-1, 1)
    # Column labels must follow np.unique's sorted order used in _x_tab;
    # the extra 'TOTAL' label makes the appended grand-total column addressable
    return Cube(total_idx, list(np.unique(cols)) + ["TOTAL"],
                np.concatenate((stacked_agg, stacked_agg_total), axis=1))
Suppose we have the following arr array as input:
[['Shakespeare' 'Hamlet' 2000 104.2]
['Shakespeare' 'Hamlet' 2001 99.0]
['Shakespeare' 'Romeo' 2000 27.0]
['Shakespeare' 'Romeo' 2001 19.0]
['Dante' 'Inferno' 2000 11.6]
['Dante' 'Inferno' 2001 12.6]]
Then x_tab can be called like this:
result = x_tab(arr[:, [0, 1]], arr[:, 2], arr[:, 3].astype(float))
print(result)
Output
array([[142.8, 130.6, 273.4],
[ 11.6, 12.6, 24.2],
[131.2, 118. , 249.2],
[ 11.6, 12.6, 24.2],
[104.2, 99. , 203.2],
[ 27. , 19. , 46. ]])
Note that this representation (repr) is just for showing the results; you can change it as you see fit. Then you can access the cells of the cube as follows:
print(result[('Dante', 'ALL'), 2001])
print(result[('Dante', 'Inferno'), 2001])
print(result[('Shakespeare', 'Hamlet'), 2000])
Output
12.6
12.6
104.2
Notice that the bulk of the operations happen in the _x_tab function, which uses pure numpy functions. At the same time it provides a flexible interface for any aggregation function you choose: just replace the ufunc in this line:
np.add.at(xt, idx, values)
with any other from this list. For more information see the documentation on the at operator.
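For instance, a MAX cube is the same two lines inside _x_tab with a different ufunc and a matching identity value (a sketch):

xt = np.full(shape_xt, -np.inf)  # identity for max, instead of zeros
np.maximum.at(xt, idx, values)   # per-cell running maximum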
A working copy of the code can be found here. The above is based on this gist.
Note: this assumes you are passing multiple columns for the index (the rows parameter must be two-dimensional).
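If you only have one grouping column, you can still satisfy that assumption by keeping the selection two-dimensional; and since the listing above registers a 'TOTAL' label for the appended grand-total column, that column is addressable by key too. A quick sketch:

# A single index column works as long as the selection stays 2-D (note [0])
result_by_author = x_tab(arr[:, [0]], arr[:, 2], arr[:, 3].astype(float))
print(result_by_author[('Shakespeare',), 2000])   # -> 131.2 (keys are 1-tuples)
print(result[('Shakespeare', 'ALL'), 'TOTAL'])    # -> 249.2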
Here is a sketch of a solution; obviously you'd wrap it in helper functions and classes to provide an easy interface. The idea is that you map each unique name to an index (sequential here, for simplicity) and then use that index to store the value in an array. It is sub-optimal in that you have to pad the array up to the largest number of distinct items in any dimension. The unused entries stay zero, so they don't get included in sums. You could use masked arrays and masked sums if you want to keep those zero elements out of other aggregates; a sketch of that follows the code.
import numpy as np

def get_dict(x):
    # Note: set() ordering is arbitrary, but consistent within a run
    return {a: i for i, a in enumerate(set(x))}

# Map names to unique contiguous numbers (obviously put in a fn or class)
author = 4*["Shakespeare"] + 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]

# Define dictionary of indices
d = get_dict(author)
d.update(get_dict(book))
d.update(get_dict(year))

# Index values to put in the multi-dimensional array
ai = [d[i] for i in author]
bi = [d[i] for i in book]
yi = [d[i] for i in year]

# Pad array up to maximum size
A = np.zeros([np.max(ai)+1, np.max(bi)+1, np.max(yi)+1])

# Store each element at its unique indices in the 3D datacube
# (assumes each (author, book, year) combination appears only once)
for n in range(len(sales)):
    i = ai[n]; j = bi[n]; k = yi[n]
    A[i, j, k] = sales[n]

# Now we can get the various sums, for example all sales
print("Total =", np.sum(A))
# All Shakespeare
print("All Shakespeare =", np.sum(A[d["Shakespeare"], :, :]))
# All year 2001
print("All year 2001 =", np.sum(A[:, :, d["2001"]]))
# All Shakespeare in 2000
print("All Shakespeare in 2000 =", np.sum(A[d["Shakespeare"], :, d["2000"]]))
Just the class initialization:
import numpy as np

class Olap:
    def __init__(self, values, headers, *locators):
        values = np.asarray(values)   # accept plain lists as well
        self.labels = []
        self.indices = []
        self.headers = headers
        for loc in locators:
            unique, ix = np.unique(loc, return_inverse=True)
            self.labels.append(unique)
            self.indices.append(ix)
        # One axis per locator, sized by its number of unique labels
        self.shape = tuple(len(u) for u in self.labels)
        self.arr = np.zeros(self.shape)
        self.count = np.zeros(self.shape, dtype=int)
        np.add.at(self.arr, tuple(self.indices), values)
        np.add.at(self.count, tuple(self.indices), np.ones(values.shape))
author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]
olap = Olap(sales, ["author", "book", "year"], author, book, year)
From there you can create summing functions using self.arr.sum() along different axes, and you can even compute averages by combining it with self.count.sum(). You'll probably want some way to add more data (once again using np.add.at to put them into arr), but your data structure is now n-dimensional instead of tabular, which should give it the same benefits for high-dimensional data that pivot does.
Not about to put all that into code (even for 400 rep), but it doesn't seem too complex once you make the multidimensional data structure.
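For illustration, here is roughly what those axis sums look like on the example data (axis order follows the locators, so 0 = author, 1 = book, 2 = year, and np.unique sorts the labels):

total_by_year = olap.arr.sum(axis=(0, 1))      # [142.8, 130.6]
grand_total = olap.arr.sum()                   # 273.4
s = list(olap.labels[0]).index("Shakespeare")  # position on the author axis
shakespeare_by_year = olap.arr[s].sum(axis=0)  # [131.2, 118.0]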