Python numpy or pandas equivalent of the R function sweep()

Tags: python, arrays, pandas, r, numpy

What is the numpy or pandas equivalent of the R function sweep()?

To elaborate: in R, let's say we have a coefficient vector, say beta (numeric type), and an array, say data (20x5 numeric type). I want to superimpose the vector on each row of the array, multiply the corresponding elements, and return the resulting (20x5) array. I could achieve this using sweep().

Equivalent sample R code:

beta <-  c(10, 20, 30, 40)
data <- array(1:20,c(5,4))
sweep(data,MARGIN=2,beta,`*`)
#---------------
 > data
      [,1] [,2] [,3] [,4]
 [1,]    1    6   11   16
 [2,]    2    7   12   17
 [3,]    3    8   13   18
 [4,]    4    9   14   19
 [5,]    5   10   15   20

 > beta
 [1] 10 20 30 40

 > sweep(data,MARGIN=2,beta,`*`)
      [,1] [,2] [,3] [,4]
 [1,]   10  120  330  640
 [2,]   20  140  360  680
 [3,]   30  160  390  720
 [4,]   40  180  420  760
 [5,]   50  200  450  800

I have heard exciting things about numpy and pandas in Python, and they seem to have a lot of R-like commands. What would be the fastest way to achieve the same using these libraries? The actual data has millions of rows and around 50 columns. The beta vector is of course conformable with data.

asked Apr 16 '14 18:04

sriramn

1 Answer

pandas has an apply() method too, analogous to R's apply(), and it can do what sweep() does. (Note that R's MARGIN argument is "equivalent" to the axis argument in many pandas functions, except that axis takes the values 0 and 1 rather than 1 and 2.)

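import numpy as np
import pandas as pd

np.random.seed(1)  # seed the generator so the random draws are repeatable
beta = pd.Series(np.random.randn(5))         # one coefficient per column
data = pd.DataFrame(np.random.randn(20, 5))  # 20 rows, 5 columns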

You can use an apply with a function which is called on each row:

data.apply(lambda row: row * beta, axis=1)

Note that axis=0 would apply the function to each column. This is the default because the data is stored column-wise, so column-wise operations are more efficient.

However, in this case it is easy to make the operation significantly faster (and more readable) by vectorizing it, simply multiplying row-wise with mul(). Both calls give the same result:

In [21]: data.apply(lambda row: row * beta, axis=1).head()
Out[21]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001

In [22]: data.mul(beta, axis=1).head()  # just show first few rows with head
Out[22]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001
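To check the equivalence and the speed difference on your own machine, a minimal sketch along these lines may help. It assumes a smaller 1000x5 frame so that the apply() call finishes quickly, and the names via_apply and via_mul are only illustrative. Exact timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
beta = pd.Series(rng.standard_normal(5))
data = pd.DataFrame(rng.standard_normal((1000, 5)))

# Row-by-row multiplication through apply (a Python-level loop over rows)
via_apply = data.apply(lambda row: row * beta, axis=1)
# Vectorized multiplication, beta aligned against the column labels
via_mul = data.mul(beta, axis=1)

same = np.allclose(via_apply, via_mul)

# Time a few repetitions of each approach
t_apply = timeit.timeit(lambda: data.apply(lambda row: row * beta, axis=1), number=3)
t_mul = timeit.timeit(lambda: data.mul(beta, axis=1), number=3)
print(f"same result: {same}; apply: {t_apply:.4f}s, mul: {t_mul:.4f}s")
```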

Note: mul() is slightly more robust and allows more control than the * operator, because you can state explicitly which axis the Series is aligned against.
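As a rough illustration of that extra control, the sketch below rebuilds the 5x4 array from the question with column labels a to d (the labels and the row_weights Series are assumptions made up for this example). It shows that mul() matches the Series to the frame by label, and that the axis argument selects whether the Series runs along the columns or the rows.

```python
import numpy as np
import pandas as pd

# Same values as the R example: 1..20 filled column by column
data = pd.DataFrame(np.arange(1, 21).reshape(5, 4, order="F"),
                    columns=list("abcd"))

# beta is deliberately listed in a different order: mul aligns on labels
beta = pd.Series({"d": 40, "c": 30, "b": 20, "a": 10})

# One coefficient per column, like sweep(data, MARGIN=2, beta, `*`)
by_column = data.mul(beta, axis=1)

# One weight per row, like sweep(data, MARGIN=1, row_weights, `*`)
row_weights = pd.Series([1, 2, 3, 4, 5], index=data.index)
by_row = data.mul(row_weights, axis=0)

print(by_column)
print(by_row)
```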

You can do the same in NumPy (i.e. on data.values here), either by multiplying directly, which will be faster because it does not worry about data alignment, or by using vectorize rather than apply.
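For the direct multiplication in NumPy, a minimal sketch using the numbers from the question could look like this. It relies on NumPy broadcasting: a length-4 vector multiplied against a (5, 4) array is applied to every row. The row_scale vector is a hypothetical addition to show the MARGIN=1 counterpart.

```python
import numpy as np

# Same values as the R example: array(1:20, c(5, 4)) fills column by column
data = np.arange(1, 21).reshape(5, 4, order="F")
beta = np.array([10, 20, 30, 40])

# Broadcasting stretches beta across all rows,
# like sweep(data, MARGIN=2, beta, `*`)
result = data * beta
print(result)

# Counterpart of MARGIN=1: one factor per row, reshaped to a column vector
row_scale = np.array([1, 2, 3, 4, 5])
row_result = data * row_scale[:, np.newaxis]
print(row_result)
```

With a pandas DataFrame, the same multiplication can be run on the underlying array (data.values, as mentioned above) and wrapped back into a DataFrame if the labels are needed.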

answered Sep 27 '22 22:09

Andy Hayden
