
Build a basic cube with numpy?

I was wondering whether numpy could be used to build the most basic cube model, where all cross-combinations and their computed values are stored.

Let's take the following example of data:

AUTHOR         BOOK          YEAR        SALES
Shakespeare    Hamlet        2000        104.2
Shakespeare    Hamlet        2001        99.0
Shakespeare    Romeo         2000        27.0
Shakespeare    Romeo         2001        19.0
Dante          Inferno       2000        11.6
Dante          Inferno       2001        12.6

And to be able to build something like:

                             YEAR                  TOTAL
AUTHOR            BOOK       2000       2001         
(ALL)             (ALL)      142.8      130.6      273.4
Shakespeare       (ALL)      131.2      118.0      249.2
Dante             (ALL)      11.6       12.6       24.2
Shakespeare       Hamlet     104.2      99.0       203.2
Shakespeare       Romeo      27.0       19.0       46.0
Dante             Inferno    11.6       12.6       24.2

I'm hoping that something like meshgrid might get me 75% of the way there. Basically, I'd like to see whether it's possible to build a structure of all pre-computed values with numpy (not pandas), so that I could retrieve any of the combinations in the result above. For the sake of simplicity, let's consider SUM as the only possible calculation. Perhaps this is a roundabout way of asking, but could numpy be the backbone of doing this, or do I need to use something else?

And finally, if it's not possible in numpy, how might this be stored in an MDA (multi-dimensional array)?

asked by David542, Nov 14 '20




4 Answers

I think numpy record arrays can be used for this task; below is my solution based on record arrays.

import numpy as np

class rec_array():
    
    def __init__(self,author=None,book=None,year=None,sales=None):
        self.dtype = [('author','<U20'), ('book','<U20'),('year','<U20'),('sales',float)]
        self.rec_array = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
        
    def add_record(self,author,book,year,sales):
        new_rec = np.rec.fromarrays((author,book,year,sales),dtype=self.dtype)
        if not self.rec_array.shape == ():
            self.rec_array = np.hstack((self.rec_array,new_rec))
        else:
            self.rec_array = new_rec
    
    def get_view(self,conditions):
        """
        conditions:
            A list of [field, operator, value] conditions, for example
            [["author", "==", "Shakespeare"], ["year", "<=", "2000"]]
        """
        mask = np.ones(self.rec_array.shape[0], dtype=bool)
        for item in conditions:
            field, op, target = item
            # Note: building expressions as strings and eval-ing them is fragile
            # and unsafe with untrusted input; it's fine for this sketch.
            field_op = "self.rec_array['%s'] %s '%s'" % (field, op, target)
            mask &= eval(field_op)

        selected_sales = self.rec_array['sales'][mask]

        return np.sum(selected_sales)

Based on this rec_array, given the data

author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]

we create an instance

test = rec_array()
test.add_record(author,book,year,sales)

If, for example, you want the sales of Shakespeare's Romeo, you can simply do this

test.get_view([["author","==","Shakespeare"],["book","==","Romeo"]])

the output is 46.0

or, you can also do

test.get_view([["author","==","Shakespeare"],["year","<=","2000"]])

the output is 131.2
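
If you want to materialize the whole rollup table rather than query one cell at a time, you can loop over the key combinations, treating a missing key as the (ALL) placeholder. Below is a minimal self-contained sketch of that idea; the roll_up helper is hypothetical (not part of the class above) and rebuilds the record array from the sample data:

```python
import numpy as np

# Rebuild the record array from the sample data above
dtype = [('author', '<U20'), ('book', '<U20'), ('year', '<U20'), ('sales', float)]
author = 4*["Shakespeare"] + 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]
rec = np.rec.fromarrays((author, book, year, sales), dtype=dtype)

def roll_up(rec, author=None, book=None, year=None):
    """Sum sales over all records matching the given keys; None acts as (ALL)."""
    mask = np.ones(rec.shape[0], dtype=bool)
    if author is not None:
        mask &= rec['author'] == author
    if book is not None:
        mask &= rec['book'] == book
    if year is not None:
        mask &= rec['year'] == year
    return rec['sales'][mask].sum()

# Each cell of the table in the question is one roll_up call
print(roll_up(rec))                                     # grand total
print(roll_up(rec, author="Shakespeare", year="2000"))  # one (author, year) cell
print(roll_up(rec, book="Romeo"))                       # a (ALL)-author book total
```

The None-as-(ALL) convention mirrors how OLAP rollups collapse a dimension.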

answered by meTchaikovsky, Oct 19 '22


For the data structure you could define the following class:

import numpy as np

class Cube:

    def __init__(self, row_index, col_index, data):
        self.row_index = {r: i for i, r in enumerate(row_index)}
        self.col_index = {c: i for i, c in enumerate(col_index)}
        self.data = data

    def __getitem__(self, item):
        row, col = item
        return self.data[self.row_index[row] , self.col_index[col]]

    def __repr__(self):
        return repr(self.data)

Basically a light wrapper around a two-dimensional numpy array. For computing the cross tabulation you could do something like this:

def _x_tab(rows, columns, values):
    """Function for computing the cross-tab of simple arrays"""
    unique_values_all_cols, idx = zip(*(np.unique(col, return_inverse=True) for col in [rows, columns]))

    shape_xt = [uniq_vals_col.size for uniq_vals_col in unique_values_all_cols]

    xt = np.zeros(shape_xt, dtype=float)
    np.add.at(xt, idx, values)

    return unique_values_all_cols, xt


def make_index(a, r):
    """Make array of tuples"""
    l = [tuple(row) for row in a[:, r]]
    return make_object_array(l)


def make_object_array(l):
    a = np.empty(len(l), dtype=object)
    a[:] = l
    return a


def fill_label(ar, le):
    """Fill missing parts with ALL label"""
    missing = tuple(["ALL"] * le)
    return [(e + missing)[:le] for e in ar]

def x_tab(rows, cols, values):
    """Main function for cross tabulation"""
    _, l_cols = rows.shape

    total_agg = []
    total_idx = []
    for i in range(l_cols + 1):
        (idx, _), agg = _x_tab(make_index(rows, list(range(i))), cols, values)
        total_idx.extend(fill_label(idx, l_cols))
        total_agg.append(agg)

    stacked_agg = np.vstack(total_agg)
    stacked_agg_total = stacked_agg.sum(axis=1).reshape(-1, 1)

    return Cube(total_idx, list(dict.fromkeys(cols)), np.concatenate((stacked_agg, stacked_agg_total), axis=1))

Suppose we have an input array arr:

[['Shakespeare' 'Hamlet' 2000 104.2]
 ['Shakespeare' 'Hamlet' 2001 99.0]
 ['Shakespeare' 'Romeo' 2000 27.0]
 ['Shakespeare' 'Romeo' 2001 19.0]
 ['Dante' 'Inferno' 2000 11.6]
 ['Dante' 'Inferno' 2001 12.6]]

Then x_tab can be called like this:

result = x_tab(arr[:, [0, 1]], arr[:, 2], arr[:, 3])
print(result)

Output

array([[142.8, 130.6, 273.4],
       [ 11.6,  12.6,  24.2],
       [131.2, 118. , 249.2],
       [ 11.6,  12.6,  24.2],
       [104.2,  99. , 203.2],
       [ 27. ,  19. ,  46. ]])

Note that this representation (repr) is just for the purpose of showing the results; you can change it as you see fit. Then you can access the cells of the cube as follows:

print(result[('Dante', 'ALL'), 2001])
print(result[('Dante', 'Inferno'), 2001])
print(result[('Shakespeare', 'Hamlet'), 2000])

Output

12.6
12.6
104.2

Notice that the bulk of the operations are in the _x_tab function, which uses pure numpy functions. At the same time it provides a flexible interface for any aggregation function you choose: just replace the ufunc on this line:

np.add.at(xt, idx, values)

with any other from this list. For more information see the documentation on the ufunc.at method.
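
For example, switching the aggregation from SUM to MAX amounts to swapping the ufunc and the fill value used to initialize the table. A standalone sketch (with a cut-down version of the shape computation from _x_tab; the sample rows are assumptions for illustration):

```python
import numpy as np

rows = np.array(["Hamlet", "Hamlet", "Romeo", "Romeo"])
cols = np.array(["2000", "2001", "2000", "2001"])
values = np.array([104.2, 99.0, 27.0, 19.0])

# Same index computation as _x_tab
uniques, idx = zip(*(np.unique(c, return_inverse=True) for c in [rows, cols]))

# -inf is the identity element for max, as 0 is for sum
xt = np.full([u.size for u in uniques], -np.inf)
np.maximum.at(xt, idx, values)  # the only line that changed vs. np.add.at
print(xt)
```

Any ufunc with an .at method (np.minimum, np.multiply, ...) can be dropped in the same way, as long as the initial fill value is that ufunc's identity.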

A working copy of the code can be found here. The above is based on this gist.

Note This assumes you are passing multiple columns for the index (rows parameter).

answered by Dani Mesejo, Oct 19 '22


Here is a sketch of a solution; obviously you'd wrap it in helper functions and classes to provide an easy interface. The idea is that you map each unique name to an index (sequential here for simplicity) and then use that index to store the value in an array. It is sub-optimal in that you have to pad the array up to the largest number of distinct items per dimension. The padded entries are zero, so they don't get included in sums. You could consider masked arrays and masked sums if you want to avoid adding the zero elements.

import numpy as np

def get_dict(x):
    return {a:i for i, a in enumerate(set(x))}

#Mapping name to unique contiguous numbers (obviously put in a fn or class)
author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]

#Define dictionary of indices
d = get_dict(author)
d.update(get_dict(book))
d.update(get_dict(year)) 

#Index values to put in multi-dimension array
ai = [d[i] for i in author]
bi = [d[i] for i in book]
yi = [d[i] for i in year]

#Pad array up to maximum size
A = np.zeros([np.max(ai)+1, np.max(bi)+1, np.max(yi)+1])

#Store elements with unique name as index in 3D datacube
for n in range(len(sales)):
    i = ai[n]; j = bi[n]; k = yi[n]
    A[i,j,k] = sales[n]

#Now we can get the various sums, for example all sales
print("Total=", np.sum(A))

#All shakespeare (0)
print("All shakespeare=", np.sum(A[d["Shakespeare"],:,:]))

#All year 2001
print("All year 2001", np.sum(A[:,:,d["2001"]]))

#All Shakespeare in 2000
print("All Shakespeare in 2000", np.sum(A[d["Shakespeare"],:,d["2000"]]))
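
The masked-array idea mentioned at the start might look like the following sketch: a mask marks the padding cells so that reductions like the mean are taken over real data only (for sums the zero padding is already harmless). The 2x2 toy array here is an assumption for illustration:

```python
import numpy as np

# A padded 2x2 slice of the cube where the second row is empty padding
A = np.array([[104.2, 99.0],
              [0.0, 0.0]])
filled = np.array([[True, True],
                   [False, False]])  # which cells hold real data

# masked_array hides cells where mask is True, so invert `filled`
Am = np.ma.masked_array(A, mask=~filled)
print(Am.sum())   # padding excluded from the sum
print(Am.mean())  # mean over the 2 real cells, not all 4
```

Without the mask, A.mean() would divide by 4 and silently dilute the average with padding zeros.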

answered by Ed Smith, Oct 19 '22


Just the class initialization:

import numpy as np

class Olap:
    def __init__(self, values, headers, *locators):
        values = np.asarray(values)
        self.labels = []
        self.indices = []
        self.headers = headers
        for loc in locators:
            unique, ix = np.unique(loc, return_inverse = True)
            self.labels.append(unique)
            self.indices.append(ix)
        # shape = number of unique labels along each dimension
        self.shape = tuple(len(lab) for lab in self.labels)
        self.arr   = np.zeros(self.shape)
        self.count = np.zeros(self.shape, dtype = int)
        np.add.at(self.arr, tuple(self.indices), values)
        np.add.at(self.count, tuple(self.indices), 1)

author = 4*["Shakespeare"]+ 2*["Dante"]
book = 2*["Hamlet"] + 2*["Romeo"] + 2*["Inferno"]
year = 3*["2000", "2001"]
sales = [104.2, 99.0, 27.0, 19.0, 11.6, 12.6]


olap = Olap(sales, ["author", "book", "year"], author, book, year)

From there you can create summing functions using self.arr.sum() along different axes, and can even average by using self.count.sum() as well. You'll probably want some way to add more data (once again using np.add.at to put them into arr) - but your data structure is now Nd instead of tabular, which should give it the same benefits for high-dimensional data that pivot does.
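
To illustrate the axis-sum idea, here is a standalone sketch with the sample data laid out in an (author, book, year) array and the index positions hard-coded for brevity (the label-to-index mapping is assumed, not computed):

```python
import numpy as np

# (author, book, year) cube for the sample data, with assumed index order:
# authors: Dante=0, Shakespeare=1; books: Hamlet=0, Inferno=1, Romeo=2; years: 2000=0, 2001=1
arr = np.zeros((2, 3, 2))
arr[1, 0] = [104.2, 99.0]   # Shakespeare / Hamlet
arr[1, 2] = [27.0, 19.0]    # Shakespeare / Romeo
arr[0, 1] = [11.6, 12.6]    # Dante / Inferno

print(arr.sum())             # grand total
print(arr.sum(axis=(1, 2)))  # per-author totals (Dante, Shakespeare)
print(arr.sum(axis=(0, 1)))  # per-year totals (2000, 2001)
```

Each row of the rollup table in the question corresponds to summing out some subset of axes, which is exactly what pivot tables do under the hood.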

Not about to put all that into code (even for 400 rep) but it doesn't seem too complex once you make the multidimensional data structure.

answered by Daniel F, Oct 19 '22