
Is there a way of performing arithmetic operations on an entire Frame in Python datatable?

This question is about the recent h2o datatable package. I want to replace pandas code with this library to enhance performance.

The question is simple: I need to divide, add to, multiply, or subtract from an entire Frame, or several selected columns, by a number.

In pandas, to divide all the columns excluding the first by 3, one could write:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "C0": np.random.randn(10000), 
    "C1": np.random.randn(10000)
})
df.iloc[:,1:] = df.iloc[:,1:]/3

In the datatable package, I can only do this one column at a time:

import datatable as dt
from datatable import f
import numpy as np

df = dt.Frame(np.random.randn(1000000))
df[:, "C1"] = dt.Frame(np.random.randn(1000000))
# Divide every column except the first by 3, one column at a time
for i in range(1, df.shape[1]):
    df[:, i] = df[:, f[i]/3]

As of now, in Python 3.6 (I don't know about the 3.7 version), the FrameProxy f doesn't accept slices. Is there a better way to perform this kind of Frame arithmetic than a loop? I haven't found one in the documentation.

EDIT:

The latest commit, #1962, added a feature related to this question. If I manage to run the latest source version, I'll add an answer myself covering that new feature.

carrasco asked Mar 03 '23 21:03



2 Answers

You are correct that the f-symbol does not currently support slice expressions (an interesting idea, by the way; perhaps it could be added in the future).

However, the right-hand side of the assignment can be a list of expressions, allowing you to write the following:

df = dt.Frame(C0=np.random.randn(1000000),
              C1=np.random.randn(1000000))

df[:, 1:] = [f[i]/3 for i in range(1, df.ncols)]
Pasha answered Apr 07 '23 10:04



As of January 2019, the versions of datatable installed via pip for both Python 3.6 and 3.7 support slices in f-expressions, and this is documented. The solution is therefore straightforward:

import datatable as dt
from datatable import f
import numpy as np

# generate some data to test
df = dt.Frame(C0=np.random.randn(1000000),
              C1=np.random.randn(1000000))

df[:, 1:] = df[:, f[1:]/3]
carrasco answered Apr 07 '23 10:04
