I am trying to run hstack to join a column of integer values to a list of columns created by a TF-IDF (so I can eventually use all of these columns/features in a classifier).
I'm reading in the column with pandas and replacing any NA values with the largest value in the column, like so:
OtherColumn = p.read_csv('file.csv', delimiter=";", na_values=['?'])[["OtherColumn"]]
OtherColumn = OtherColumn.fillna(OtherColumn.max())
# convert_objects is deprecated in recent pandas; to_numeric is the replacement
OtherColumn = OtherColumn.apply(p.to_numeric, errors='coerce')
Then I read in my text column and run TF-IDF to create loads of features:
X = list(np.array(p.read_csv('file.csv', delimiter=";"))[:,2])
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                      use_idf=True, smooth_idf=True, sublinear_tf=True)
tfv.fit(X)
X = tfv.transform(X)  # X becomes a sparse float64 matrix, as in the traceback
Finally, I want to join them all together; this is where the error occurs and the program fails. I am also unsure whether I am using the StandardScaler appropriately here:
X = sp.sparse.hstack((X, OtherColumn.values)) #error here
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
X_test = sc.transform(X_test)
Full error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-79d1e70bc1bc> in <module>()
---> 47 X = sp.sparse.hstack((X, OtherColumn.values))
48 sc = preprocessing.StandardScaler().fit(X)
49 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\scipy\sparse\construct.pyc in hstack(blocks, format, dtype)
421
422 """
--> 423 return bmat([blocks], format=format, dtype=dtype)
424
425
C:\Users\Simon\Anaconda\lib\site-packages\scipy\sparse\construct.pyc in bmat(blocks, format, dtype)
537 nnz = sum([A.nnz for A in blocks[block_mask]])
538 if dtype is None:
--> 539 dtype = upcast(*tuple([A.dtype for A in blocks[block_mask]]))
540
541 row_offsets = np.concatenate(([0], np.cumsum(brow_lengths)))
C:\Users\Simon\Anaconda\lib\site-packages\scipy\sparse\sputils.pyc in upcast(*args)
58 return t
59
---> 60 raise TypeError('no supported conversion for types: %r' % (args,))
61
62
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
As discussed in Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do, you may need to explicitly cast the inputs to `sparse.hstack`. The `sparse` code is not as robust as the core `numpy` code.
If `X` is a sparse array with `dtype=float`, and `A` is dense with `dtype=object`, several options are possible:
sparse.hstack((X, A))                 # error
sparse.hstack((X.astype(object), A))  # cast X to object; result is object
sparse.hstack((X, A.astype(float)))   # cast A to float; result is float
np.hstack((X.A, A))                   # make X dense; result is object
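To see the fix concretely, here is a minimal sketch with small stand-in arrays for `X` and `A` (not the question's actual data):

```python
import numpy as np
import scipy.sparse as sp

# Stand-ins for the question's data: X plays the sparse TF-IDF matrix
# (float64) and A plays the pandas column exported as an object-dtype array.
X = sp.csr_matrix(np.arange(12, dtype=np.float64).reshape(4, 3))
A = np.array([[1], [2], [3], [4]], dtype=object)

# sp.hstack((X, A)) raises TypeError: no supported conversion for types:
# (dtype('float64'), dtype('O')) -- exactly the error in the traceback.
joined = sp.hstack((X, A.astype(np.float64)))  # cast A to float first
print(joined.shape, joined.dtype)  # (4, 4) float64
```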
`A.astype(float)` will work if `A` contains some `NaN`. See http://pandas.pydata.org/pandas-docs/stable/gotchas.html regarding NaN. If `A` is object for some other reason (e.g. ragged lists), then we'll have to revisit the issue.
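A quick illustration of both cases, with illustrative arrays:

```python
import numpy as np

# An object column whose only oddity is a NaN: casting to float succeeds,
# because NaN is itself a float value.
a = np.array([1.0, np.nan, 3.0], dtype=object)
print(a.astype(float))

# A ragged object column (lists of unequal length): there is no sensible
# float cast, so astype raises.
b = np.array([[1, 2], [3]], dtype=object)
try:
    b.astype(float)
except (TypeError, ValueError) as e:
    print("cannot cast ragged data:", e)
```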
Another possibility is to use Pandas's `concat`: http://pandas.pydata.org/pandas-docs/stable/merging.html. I assume Pandas has paid more attention to these issues than the `sparse` coders.
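A sketch of that route (it densifies the TF-IDF matrix, so it is only viable for modest feature counts; the arrays and names here are illustrative):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# Illustrative stand-in for the sparse TF-IDF output.
tfidf = sp.csr_matrix(np.arange(12, dtype=np.float64).reshape(4, 3))

# Densify into a DataFrame, then concat the extra column alongside it.
tfidf_df = pd.DataFrame(tfidf.toarray())
other = pd.DataFrame({"OtherColumn": [1.0, 2.0, 3.0, 4.0]})
features = pd.concat([tfidf_df, other], axis=1)
print(features.shape)  # (4, 4)
```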