
Numpy Array, Data must be 1-dimensional

I am attempting to reproduce MatLab code in Python and am stumbling with a MatLab matrix. The block of code in MatLab is below:

for i = 1:Np
    y = returns(:,i);
    sgn = modified_sign(y); 
    X = [ones(Tp,1) sgn.*log(prices(:,i).*volumes(:,i))];

I am having a hard time creating 'X' without getting the "Data must be 1-dimensional" error. Below is one of my many attempts to reproduce this section of code:

lam = np.empty([Tp,Np]) * np.nan
for i in range(0,Np):
    y=returns.iloc[:,i]
    sgn = modified_sign(y)
    #X = np.array([[np.ones([Tp,1]),np.multiply(np.multiply(sgn,np.log(prices.iloc[:,i])),volumes.iloc[:,i])]])
    X = np.concatenate([np.ones([Tp,1]),np.column_stack(np.array([sgn*np.log(prices.iloc[:,i])*volumes[:,i]]))],axis=1)

Tp and Np are the length and width of the prices series

crsp['PRC'].to_frame().shape  # (9455, 1)
Tp, Np = crsp['PRC'].to_frame().shape 

Tr and Nr are the length and width of the returns series

crsp['RET'].to_frame().shape  # (9455, 1)
Tr, Nr = crsp['RET'].to_frame().shape

Tv and Nv are the length and width of the volume series

crsp['VOL'].to_frame().shape  # (9455, 1)
Tv, Nv = crsp['VOL'].to_frame().shape

The ones array:

np.ones([Tp,1])

would be (9455,1)

Sample Volume Data:

    DATE    VOLAVG
1979-12-04  8880.9912591051
1979-12-05  8867.545284586622
1979-12-06  8872.264687564875
1979-12-07  8876.922134551494
1979-12-10  8688.765365448506
1979-12-11  8695.279567657451
1979-12-12  8688.865033222592
1979-12-13  8684.095435684647
1979-12-14  8684.534550736667
1979-12-17  8879.694444444445

Sample Price Data

    DATE    AVGPRC
1979-12-04  25.723484200567693
1979-12-05  25.839463450495863
1979-12-06  26.001899852224145
1979-12-07  25.917628864251874
1979-12-10  26.501898917349788
1979-12-11  26.448652367425804
1979-12-12  26.475906537182407
1979-12-13  26.519610746585908
1979-12-14  26.788873713159944
1979-12-17  26.38583047822484

Sample Return Data

    DATE    RET
1979-12-04  0.008092780873338423
1979-12-05  0.004498557619416754
1979-12-06  0.006266692192175238
1979-12-07  -0.0032462182943131523
1979-12-10  0.022292999386413825
1979-12-11  -0.002011180868938034
1979-12-12  0.001029925340138238
1979-12-13  0.0016493553247958206
1979-12-14  0.010102153877941776
1979-12-17  -0.015159499602784175

What I am ultimately trying to achieve is a (9455,2) array where X[:,0]=1 and X[:,1]=log(price)*volume for each row.

I referenced the MatLab to Numpy document online (https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html) and checked out various other StackOverflow posts to no avail.

For context, modified_sign is an external function, and prices and returns are both DataFrame slices. Np is the width (think df.shape[1]) of the price DataFrame and Tp is df.shape[0]. This is essentially creating a column of 1s and a column of log(price)*volume to be used in a regression for each series of returns, where each df is (TxN), T is dates and N is securities. Any guidance you can provide would be greatly appreciated.

asked Jul 14 '17 23:07 by Robert Garrison



1 Answer

The problem is that numpy has true 1D arrays (vectors) while MATLAB does not. So when you create the np.ones([Tp,1]) array, you get a 2D array in which one dimension has a size of 1. MATLAB considers that a "vector", but numpy does not.

So what you need to do is give np.ones a single value. This will result in a vector (unlike in MATLAB where it will result in a 2D square matrix). The same rule applies to np.zeros and any other function that takes dimensions as inputs.
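The shape difference is easy to check directly. This is a minimal sketch, with Tp set to a small illustrative value rather than the 9455 from the question:

```python
import numpy as np

Tp = 5  # illustrative length; the question's data has Tp = 9455

col_2d = np.ones([Tp, 1])  # 2D array with a single column
vec_1d = np.ones(Tp)       # true 1D array (a vector)

print(col_2d.shape)  # (5, 1)
print(vec_1d.shape)  # (5,)
print(col_2d.ndim, vec_1d.ndim)  # 2 1
```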

So this should work:

X = np.column_stack([np.ones(Tp), sgn*np.log(prices.iloc[:,i])*volumes.iloc[:,i]])
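To see the column_stack approach run end to end, here is a rough sketch on the first four sample rows from the question (values rounded). The modified_sign below is only a hypothetical stand-in built on np.sign, since the real function is external to the question, and the loop assumes the three frames have the same shape:

```python
import numpy as np
import pandas as pd

def modified_sign(y):
    # Hypothetical stand-in: the real modified_sign is not shown in the question
    return np.sign(y)

dates = pd.to_datetime(["1979-12-04", "1979-12-05", "1979-12-06", "1979-12-07"])
returns = pd.DataFrame({"RET": [0.0081, 0.0045, 0.0063, -0.0032]}, index=dates)
prices = pd.DataFrame({"AVGPRC": [25.72, 25.84, 26.00, 25.92]}, index=dates)
volumes = pd.DataFrame({"VOLAVG": [8880.99, 8867.55, 8872.26, 8876.92]}, index=dates)

Tp, Np = prices.shape

for i in range(Np):
    y = returns.iloc[:, i]
    sgn = modified_sign(y)
    # Both inputs are 1D, so column_stack places them side by side as columns
    X = np.column_stack([np.ones(Tp),
                         sgn * np.log(prices.iloc[:, i]) * volumes.iloc[:, i]])

print(X.shape)  # (4, 2)
```

The first column of X is all ones and the second carries the signed log(price)*volume term, so the last row is negative because that day's return is negative.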

That being said, you are losing most of the advantage of using pandas by doing it this way. It would be much better to combine the DataFrames into one using the dates as indices, then create a new column with the calculation. Assuming the dates are the indices, something like this should work (if the dates are not already the indices, use set_index to make them the indices):

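import numpy as np
import pandas as pd

# Align the three frames column-wise on their shared date index
data = pd.concat([returns, prices, volumes], axis=1)
data['sign'] = modified_sign(data['RET'])
data['X0'] = 1  # constant column
data['X1'] = data['sign']*np.log(data['AVGPRC'])*data['VOLAVG']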

Of course you would replace X0 and X1 with more informative names, and I am not sure you even need X0 using this approach, but that would get you a much easier-to-work-with data structure.
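If you still need the (T, 2) array from the question, you can pull it out of the combined frame afterwards. This is a sketch on rounded sample rows, again with a hypothetical np.sign stand-in for modified_sign and with the column names RET, AVGPRC and VOLAVG taken from the sample data:

```python
import numpy as np
import pandas as pd

def modified_sign(y):
    # Hypothetical stand-in for the question's external function
    return np.sign(y)

dates = pd.to_datetime(["1979-12-04", "1979-12-05", "1979-12-06", "1979-12-07"])
returns = pd.DataFrame({"RET": [0.0081, 0.0045, 0.0063, -0.0032]}, index=dates)
prices = pd.DataFrame({"AVGPRC": [25.72, 25.84, 26.00, 25.92]}, index=dates)
volumes = pd.DataFrame({"VOLAVG": [8880.99, 8867.55, 8872.26, 8876.92]}, index=dates)

# Combine on the shared date index, then add the derived columns
data = pd.concat([returns, prices, volumes], axis=1)
data["sign"] = modified_sign(data["RET"])
data["X0"] = 1.0
data["X1"] = data["sign"] * np.log(data["AVGPRC"]) * data["VOLAVG"]

# Extract the two regressor columns as a plain 2D numpy array
X = data[["X0", "X1"]].to_numpy()
print(X.shape)  # (4, 2)
```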

Also, if your dates are strings you should convert them to pandas dates. They are much nicer to work with than strings.
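A minimal sketch of that conversion, assuming the dates sit in a regular column named DATE as in the sample data (the frame here is illustrative, not your actual data):

```python
import pandas as pd

# Illustrative frame with the dates stored as strings in a regular column
prices = pd.DataFrame({
    "DATE": ["1979-12-04", "1979-12-05", "1979-12-06"],
    "AVGPRC": [25.72, 25.84, 26.00],
})

# Parse the strings into pandas datetimes, then make them the index
prices["DATE"] = pd.to_datetime(prices["DATE"])
prices = prices.set_index("DATE")

print(type(prices.index).__name__)  # DatetimeIndex
```

With a DatetimeIndex in place, pd.concat aligns the frames by date, and you can slice by date, for example prices.loc["1979-12-05":].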

answered Oct 11 '22 19:10 by TheBlackCat
