
Train `sklearn` ML model with scipy sparse matrix and numpy array

To explain my use case: A is a sparse matrix of tf-idf values and B is an array with some additional features of my data.

I have already split the data into training and test sets, so A and B in my example cover only the training set. I want to do the same for the test set after this code.

I want to concatenate these matrices/arrays so that I can pass them to a sklearn ML model for training, since I do not think I can pass them separately.

So I tried to do this:

C = np.concatenate((A, B.T), axis=1)

where A is a <class 'scipy.sparse.csr.csr_matrix'> and B is a <class 'numpy.ndarray'>.

However, when I run this I get the following error:

ValueError: zero-dimensional arrays cannot be concatenated

Also, I do not think that using `np.concatenate` on a numpy array and a sparse matrix is a good idea in my case because

  1. it is basically impossible to convert my sparse matrix A to a dense array because it is too big
  2. I may lose information (or maybe not?) if I convert my fully dense array B to a sparse matrix

What is the best way to pass a sparse matrix and a fully dense array, concatenated so that each row keeps both sets of features, to an sklearn ML model?

Outcast asked Feb 17 '26 20:02


1 Answer

  1. You can use hstack from scipy.sparse. hstack converts both inputs to scipy coo_matrix, merges them, and returns a coo_matrix by default.

  2. No information is lost when converting a dense array to sparse. Sparse matrices are just a compact data storage format. Also, unless you specify a value for the dtype argument of hstack, everything is upcast to a common type, so there is no possibility of data loss there either.
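Both points can be checked on a toy example. This is only a minimal sketch: the small `A` and `B` below are made-up stand-ins for the real tf-idf matrix and feature array, and the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Made-up stand-ins for A (sparse tf-idf) and B (dense extra features).
A = csr_matrix(np.array([[0.0, 0.5], [0.3, 0.0]], dtype=np.float64))
B = np.array([[1, 2], [3, 4]], dtype=np.int64)

# Dense -> sparse -> dense round trip: every value of B survives.
B_sparse = csr_matrix(B)
round_trip_ok = np.array_equal(B_sparse.toarray(), B)

# No dtype given: the integer block is upcast to match the float block.
X = hstack((A, B))
stacked_dtype = X.dtype
```

Here `X` has the columns of `A` followed by the columns of `B`, and the integer values of `B` are stored as floats rather than truncated.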

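Further, if you plan to use LogisticRegression from sklearn, pass the sparse matrix in CSR format: that is the format its fit method works with (other sparse formats are generally converted first, which costs an extra copy).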

The following code should work for your use case:

from scipy.sparse import hstack

X = hstack((A, B), format='csr')
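For a fuller picture, here is a rough end-to-end sketch under some assumptions: the data is random toy data, `combine_features` is a hypothetical helper (not part of scipy or sklearn), and LogisticRegression is used only because it was mentioned above. The same stacking is applied to the training and the test set so both end up with the same columns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression


def combine_features(sparse_part, dense_part):
    """Hypothetical helper: stack a sparse and a dense block column-wise as CSR."""
    dense_part = np.asarray(dense_part)
    if dense_part.ndim == 1:
        # A single extra feature must be a column with one value per row.
        dense_part = dense_part.reshape(-1, 1)
    return hstack((sparse_part, dense_part), format="csr")


# Random toy data standing in for the real tf-idf matrices and extra features.
rng = np.random.default_rng(0)
A_train = csr_matrix(rng.random((6, 4)))
B_train = rng.random((6, 2))
y_train = np.array([0, 1, 0, 1, 0, 1])
A_test = csr_matrix(rng.random((3, 4)))
B_test = rng.random((3, 2))

# Apply the same stacking to both splits.
X_train = combine_features(A_train, B_train)
X_test = combine_features(A_test, B_test)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Note that hstack needs both blocks to have the same number of rows, so B should have shape (n_samples, n_extra_features); whether a transpose such as `B.T` is needed depends on how B was built.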

Mohsin hasan answered Feb 19 '26 10:02