I have a spark dataframe with a column of short sentences, and a column with a categorical variable. I'd like to perform tf-idf
on the sentences, one-hot-encoding
on the categorical variable and then output it to a sparse matrix on my driver once it's much smaller in size (for a scikit-learn model).
What is the best way to get the data out of spark in sparse form? It seems like there is only a toArray()
method on sparse vectors, which outputs numpy
arrays. However, the docs do say that scipy sparse arrays can be used in the place of spark sparse arrays.
Keep in mind also that the tf_idf values are in fact a column of sparse arrays. Ideally it would be nice to get all these features into one large sparse matrix.
One possible solution can be expressed as follows:
convert features to RDD
and extract vectors:
from pyspark.ml.linalg import SparseVector
from operator import attrgetter
df = sc.parallelize([
(SparseVector(3, [0, 2], [1.0, 3.0]), ),
(SparseVector(3, [1], [4.0]), )
]).toDF(["features"])
features = df.rdd.map(attrgetter("features"))
add row indices:
indexed_features = features.zipWithIndex()
flatten to RDD of tuples (i, j, value)
:
def explode(row):
vec, i = row
for j, v in zip(vec.indices, vec.values):
yield i, j, v
entries = indexed_features.flatMap(explode)
collect and reshape:
row_indices, col_indices, data = zip(*entries.collect())
compute shape:
shape = (
df.count(),
df.rdd.map(attrgetter("features")).first().size
)
create sparse matrix:
from scipy.sparse import csr_matrix
mat = csr_matrix((data, (row_indices, col_indices)), shape=shape)
quick sanity check:
mat.todense()
With expected result:
matrix([[ 1., 0., 3.],
[ 0., 4., 0.]])
Another one:
convert each row of features
to matrix:
import numpy as np
def as_matrix(vec):
data, indices = vec.values, vec.indices
shape = 1, vec.size
return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)
mats = features.map(as_matrix)
and reduce with vstack
:
from scipy.sparse import vstack
mat = mats.reduce(lambda x, y: vstack([x, y]))
or collect
and vstack
mat = vstack(mats.collect())
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With