Python numpy or pandas equivalent of the R function sweep()

What is the numpy or pandas equivalent of the R function sweep()?

To elaborate: in R, let's say we have a coefficient vector, beta (numeric type), and an array, data (20x5, numeric type). I want to superimpose the vector on each row of the array and multiply the corresponding elements, then return the resulting (20x5) array. I could achieve this using sweep().

Equivalent sample R code:

beta <-  c(10, 20, 30, 40)
data <- array(1:20,c(5,4))
sweep(data,MARGIN=2,beta,`*`)
#---------------
 > data
      [,1] [,2] [,3] [,4]
 [1,]    1    6   11   16
 [2,]    2    7   12   17
 [3,]    3    8   13   18
 [4,]    4    9   14   19
 [5,]    5   10   15   20

 > beta
 [1] 10 20 30 40

 > sweep(data,MARGIN=2,beta,`*`)
      [,1] [,2] [,3] [,4]
 [1,]   10  120  330  640
 [2,]   20  140  360  680
 [3,]   30  160  390  720
 [4,]   40  180  420  760
 [5,]   50  200  450  800

I have heard exciting things about numpy and pandas in Python, and they seem to have a lot of R-like commands. What would be the fastest way to achieve the same using these libraries? The actual data has millions of rows and around 50 columns. The beta vector is of course conformable with data.

sriramn asked Apr 16 '14 18:04


1 Answer

Pandas has an apply() method too, which can do what R's sweep() does. (Note that R's MARGIN argument is roughly "equivalent" to the axis argument in many pandas functions, except that axis takes the values 0 and 1 where MARGIN takes 1 and 2.)

import numpy as np
import pandas as pd

np.random.seed(1)  # call seed(); assigning to it would overwrite the function
beta = pd.Series(np.random.randn(5))
data = pd.DataFrame(np.random.randn(20, 5))

You can use apply with a function that is called on each row:

data.apply(lambda row: row * beta, axis=1)

Note that axis=0 would apply the function to each column. This is the default because data is stored column-wise, so column-wise operations are more efficient.
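To make the axis argument concrete, here is a minimal sketch using the small 5x4 example from the question rather than the random data above. The names swept_rows and col_sums are just illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the R example: array(1:20, c(5, 4)) is filled column by column
data = pd.DataFrame(np.arange(1, 21).reshape((5, 4), order="F"))
beta = pd.Series([10, 20, 30, 40])

# axis=1: the lambda receives one row at a time (a Series of length 4)
swept_rows = data.apply(lambda row: row * beta, axis=1)

# axis=0 (the default): the lambda receives one column at a time (length 5)
col_sums = data.apply(lambda col: col.sum())

print(swept_rows)
print(col_sums)
```

With axis=1 the function is called once per row, which is why this approach becomes slow on millions of rows.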

However, in this case it is significantly faster (and more readable) to vectorize by multiplying row-wise with mul, which gives the same result as apply:

In [21]: data.apply(lambda row: row * beta, axis=1).head()
Out[21]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001

In [22]: data.mul(beta, axis=1).head()  # just show first few rows with head
Out[22]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001

Note: using mul is slightly more robust and allows more control than using * (for example, through its axis and fill_value arguments).
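As a rough illustration of that extra control, the sketch below uses labelled columns and a deliberately shuffled beta, both invented for this example. It shows that mul aligns on labels, and that axis=0 sweeps a vector down the rows instead.

```python
import numpy as np
import pandas as pd

# Hypothetical labelled version of the question's 5x4 data
data = pd.DataFrame(
    np.arange(1, 21).reshape((5, 4), order="F"),
    columns=list("abcd"),
)

# beta is given in a different order; pandas matches it to columns by label
beta = pd.Series([40, 30, 20, 10], index=list("dcba"))

# axis=1 matches beta's index against the column labels
by_label = data.mul(beta, axis=1)

# axis=0 matches a Series against the row index instead
row_weights = pd.Series([1, 2, 3, 4, 5])  # illustrative per-row factors
by_row = data.mul(row_weights, axis=0)

print(by_label)
print(by_row)
```

Because of the label alignment, column "a" is multiplied by 10 and column "d" by 40 even though beta lists them in reverse.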

You can do the same in numpy (i.e. on data.values here), either by multiplying directly, which will be faster because it doesn't worry about data alignment, or by using np.vectorize rather than apply.
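A minimal sketch of the direct numpy multiplication, reproducing the R example from the question. It relies on numpy broadcasting; row_stats is a made-up vector used only to show the MARGIN=1 case. For a DataFrame, the same applies to data.values.

```python
import numpy as np

# R fills arrays column by column, so order="F" reproduces array(1:20, c(5, 4))
data = np.arange(1, 21).reshape((5, 4), order="F")
beta = np.array([10, 20, 30, 40])

# Like sweep(data, MARGIN=2, beta, `*`): beta (length 4) is broadcast
# across every row of the (5, 4) array
swept_cols = data * beta

# Like sweep(data, MARGIN=1, row_stats, `*`): reshape the length-5 vector
# to (5, 1) so it is broadcast across every column
row_stats = np.array([1, 2, 3, 4, 5])  # illustrative values
swept_rows = data * row_stats[:, np.newaxis]

print(swept_cols)
```

No Python-level loop is involved, so this should scale to millions of rows far better than a per-row apply.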

Andy Hayden answered Sep 27 '22 22:09