I'd like to find an efficient method to create sparse vectors in PySpark using DataFrames.
Say, given this transactional input:
df = spark.createDataFrame([
    (0, "a"),
    (1, "a"),
    (1, "b"),
    (1, "c"),
    (2, "a"),
    (2, "b"),
    (2, "b"),
    (2, "b"),
    (2, "c"),
    (0, "a"),
    (1, "b"),
    (1, "b"),
    (2, "cc"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])
+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       a|
|  1|       b|
|  1|       c|
|  2|       a|
|  2|       b|
|  2|       b|
|  2|       b|
|  2|       c|
|  0|       a|
|  1|       b|
|  1|       b|
|  2|      cc|
|  3|       a|
|  4|       a|
|  5|       c|
+---+--------+
In a summed-up format:
df.groupBy(df["id"], df["category"]).count().show()
+---+--------+-----+
| id|category|count|
+---+--------+-----+
|  1|       b|    3|
|  1|       a|    1|
|  1|       c|    1|
|  2|      cc|    1|
|  2|       c|    1|
|  2|       a|    1|
|  2|       b|    3|
|  0|       a|    2|
|  3|       a|    1|
|  4|       a|    1|
|  5|       c|    1|
+---+--------+-----+
My aim is to get this output by id (one example row shown):
+---+-----------------------------------------------+
| id|                                       features|
+---+-----------------------------------------------+
|  2|SparseVector({a: 1.0, b: 3.0, c: 1.0, cc: 1.0})|
+---+-----------------------------------------------+
Could you please point me in the right direction? With MapReduce in Java this seemed much easier.
A sparse vector is a vector with relatively few non-zero elements. To save space it stores only those entries, as two parallel arrays: one of integer indices (the locations, all non-negative) and one of floating-point values.
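For illustration, a minimal sketch using pyspark.ml.linalg (the vector type the solution below produces):

from pyspark.ml.linalg import SparseVector

# size 4, with non-zero values 3.0 and 1.0 at indices 1 and 3;
# the dense equivalent is [0.0, 3.0, 0.0, 1.0]
sv = SparseVector(4, [1, 3], [3.0, 1.0])
print(sv)            # (4,[1,3],[3.0,1.0])
print(sv.toArray())  # [0. 3. 0. 1.]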
This can be done pretty easily with pivot and VectorAssembler. First, replace the aggregation with a pivot:
pivoted = df.groupBy("id").pivot("category").count().na.fill(0)
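After the pivot, each distinct category becomes its own count column (na.fill(0) replaces the nulls for missing id/category combinations). For the sample data, pivoted.show() gives something like this (row order may differ):

+---+---+---+---+---+
| id|  a|  b|  c| cc|
+---+---+---+---+---+
|  0|  2|  0|  0|  0|
|  1|  1|  3|  1|  0|
|  2|  1|  3|  1|  1|
|  3|  1|  0|  0|  0|
|  4|  1|  0|  0|  0|
|  5|  0|  0|  1|  0|
+---+---+---+---+---+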
and assemble:
from pyspark.ml.feature import VectorAssembler

# keep only the pivoted category columns; exclude the id itself
input_cols = [x for x in pivoted.columns if x != "id"]
result = (VectorAssembler(inputCols=input_cols, outputCol="features")
    .transform(pivoted)
    .select("id", "features"))
with the result shown below. VectorAssembler picks the more compact representation per row, dense or sparse, depending on the number of non-zero entries:
+---+-----------------+
|id |features         |
+---+-----------------+
|0  |(4,[0],[2.0])    |
|5  |(4,[2],[1.0])    |
|1  |[1.0,3.0,1.0,0.0]|
|3  |(4,[0],[1.0])    |
|2  |[1.0,3.0,1.0,1.0]|
|4  |(4,[0],[1.0])    |
+---+-----------------+
but of course you can still convert everything to a single sparse representation:
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import udf
import numpy as np

def to_sparse(c):
    def to_sparse_(v):
        # already sparse: nothing to do
        if isinstance(v, SparseVector):
            return v
        # dense: keep only the non-zero entries
        vs = v.toArray()
        nonzero = np.nonzero(vs)[0]
        return SparseVector(v.size, nonzero, vs[nonzero])
    return udf(to_sparse_, VectorUDT())(c)
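For example, applied to the assembled result from above:

result.withColumn("features", to_sparse("features")).show(truncate=False)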
+---+-------------------------------+
|id |features                       |
+---+-------------------------------+
|0  |(4,[0],[2.0])                  |
|5  |(4,[2],[1.0])                  |
|1  |(4,[0,1,2],[1.0,3.0,1.0])      |
|3  |(4,[0],[1.0])                  |
|2  |(4,[0,1,2,3],[1.0,3.0,1.0,1.0])|
|4  |(4,[0],[1.0])                  |
+---+-------------------------------+