Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do

I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.

Unfortunately, I am getting the error in the title when I try to run hstack to add this single column to my other numpy array.

Here is my code :

  #reading in test/train data for TF-IDF
  traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
  testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

  #reading in labels for training
  y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

  #reading in single integer column to join
  AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
  AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
  AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None) #Classifier
  X_all = traindata + testdata #adding test and train data to put into tf-idf
  lentrain = len(traindata) #find length of train data
  tfv.fit(X_all) #fit tf-idf on all our text
  X_all = tfv.transform(X_all) #transform it
  X = X_all[:lentrain] #reduce to size of training set
  AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
  X_test = X_all[lentrain:] #remaining rows are the test set

  #printing debug info, output below : 
  print "X.shape => " + str(X.shape)
  print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
  print "X_all.shape => " + str(X_all.shape)

  #line we get error on
  X = np.hstack((X, AllAlexaAndGoogleInfo))

Below is the output and error message :

X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
     31 print "X_all.shape => " + str(X_all.shape)
     32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
     34 sc = preprocessing.StandardScaler().fit(X)
     35 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
    271     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    272     if arrs[0].ndim == 1:
--> 273         return _nx.concatenate(arrs, 0)
    274     else:
    275         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns? What have I misunderstood?

Thank you.

Edit :

Using np.column_stack, as suggested in one of the answers below, gives the following error :

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
     37 sc = preprocessing.StandardScaler().fit(X)
     38 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
    294             arr = array(arr,copy=False,subok=True,ndmin=2).T
    295         arrays.append(arr)
--> 296     return _nx.concatenate(arrays,1)
    297 
    298 def dstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Interestingly, I tried to print the dtype of X and this worked fine :

X.dtype => float64

However, trying to print the dtype of AllAlexaAndGoogleInfo like so :

print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)

produces :

'DataFrame' object has no attribute 'dtype'

asked Mar 07 '14 18:03

Simon Kiely

2 Answers

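X is a SciPy sparse matrix (the output of the TF-IDF transform), not a NumPy array, so use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is misleading here: X.shape reports two dimensions, but NumPy cannot interpret a sparse matrix as a 2-D array, so its dimension check fails.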

This minimal example illustrates the situation:

import numpy as np
from scipy import sparse

X = sparse.rand(10, 10000)
xt = np.random.random((10, 1))
print 'X shape:', X.shape
print 'xt shape:', xt.shape
print 'Stacked shape:', np.hstack((X,xt)).shape
#print 'Stacked shape:', sparse.hstack((X,xt)).shape #This works

Based on the following output

X shape: (10, 10000)
xt shape: (10, 1)

one might expect the np.hstack call on the next line of the example to work, but it raises this error:

ValueError: all the input arrays must have same number of dimensions

So, use scipy.sparse.hstack when you have a sparse array to stack.
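
For readers on Python 3, here is a minimal sketch of the working call. The matrix and the extra column are made-up stand-ins (not the asker's data), and keeping the result in CSR format is my own choice, picked only because row slicing is convenient on it.

```python
import numpy as np
from scipy import sparse

# Stand-in for a TF-IDF matrix: 10 rows, 10000 mostly-zero columns.
X = sparse.random(10, 10000, density=0.01, format="csr")

# Stand-in for the extra feature: one dense column with the same number of rows.
xt = np.arange(10, dtype=np.float64).reshape(-1, 1)

# sparse.hstack accepts a mix of sparse and dense blocks and returns a sparse result.
stacked = sparse.hstack((X, xt), format="csr")

print("X shape:", X.shape)
print("xt shape:", xt.shape)
print("Stacked shape:", stacked.shape)
```

The result stays sparse, so the large TF-IDF block is never converted to a dense array.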

In fact, I answered this in a comment on another of your questions, and you mentioned that a different error message then appears:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

First of all, AllAlexaAndGoogleInfo does not have a dtype attribute because it is a DataFrame. To get its underlying numpy array, use AllAlexaAndGoogleInfo.values, then check that array's dtype. Based on the error message, its dtype is object, which means it probably contains non-numerical elements such as strings.
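
One way to run that check is sketched below in Python 3. The DataFrame is a hypothetical stand-in for the alexarank column, and the "unknown" entry is an assumption about what a bad value might look like, not something taken from the actual CSV.

```python
import pandas as pd

# Hypothetical stand-in for the "alexarank" column; "unknown" is an assumed bad value.
df = pd.DataFrame({"alexarank": [12, 4031, "unknown", 87]})

# .values exposes the underlying numpy array, which does have a dtype.
values = df.values
print("shape:", values.shape, "dtype:", values.dtype)

# errors="coerce" turns anything that cannot be parsed as a number into NaN,
# so the NaN positions point at the offending rows.
numeric = pd.to_numeric(df["alexarank"], errors="coerce")
bad_rows = df[numeric.isna()]
print(bad_rows)
```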

This is a minimal example that reproduces this situation:

X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print 'X:', X.shape, X.dtype
print 'xt:', xt.shape, xt.dtype
print 'Stacked shape:', sparse.hstack((X,xt)).shape

The error message:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

So, check whether AllAlexaAndGoogleInfo contains any non-numerical values and repair them before stacking.
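
A possible repair-then-stack sequence in Python 3 is sketched here. The data is invented, and filling unparseable ranks with the column median is only an assumption for illustration; dropping those rows or using a sentinel value may suit the real data better.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for the TF-IDF features: 4 rows, 50 columns.
X = sparse.random(4, 50, density=0.1, format="csr")

# Hypothetical rank column read in as strings, with one unparseable entry.
df = pd.DataFrame({"alexarank": ["12", "4031", "unknown", "87"]})

# Convert to numbers; unparseable entries become NaN.
rank = pd.to_numeric(df["alexarank"], errors="coerce")

# Assumption: replace missing ranks with the median of the valid ones.
rank = rank.fillna(rank.median())

# A float64 column vector with one row per sample.
xt = rank.to_numpy(dtype=np.float64).reshape(-1, 1)

stacked = sparse.hstack((X, xt), format="csr")
print("xt dtype:", xt.dtype)
print("Stacked shape:", stacked.shape)
```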

answered Sep 23 '22 00:09

YS-L

Use np.column_stack, like so:

X = np.column_stack((X, AllAlexaAndGoogleInfo))

From the docs:

Take a sequence of 1-D arrays and stack them as columns to make a single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
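
A small Python 3 sketch of that documented behaviour, using made-up dense arrays. Note that it covers ordinary numpy arrays only; the edit to the question shows the same call failing when X is the sparse TF-IDF matrix.

```python
import numpy as np

# A 1-D array is treated as a single column...
a = np.array([1, 2, 3])
# ...while a 2-D array is stacked as-is.
b = np.array([[10, 20], [30, 40], [50, 60]])

stacked = np.column_stack((a, b))
print(stacked.shape)
print(stacked)
```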

answered Sep 23 '22 00:09

Drewness
