Background: I have URL data aggregated into a string array column, of the form [xyz.com, abc.com, efg.com].
1) I filter rows based on the URL count with
vectored_file.filter(size('agg_url_host') > 3)
2) In the next step I filter out URLs that do not occur frequently with
CountVectorizer(inputCol="agg_url_host", outputCol="vectors", minDF=10000)
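For reference, a minimal self-contained version of what I am doing looks roughly like this (df is my source DataFrame; column names and thresholds are as above, everything else is just placeholder naming):

from pyspark.sql.functions import size
from pyspark.ml.feature import CountVectorizer

# step 1: keep only rows whose raw URL array has more than 3 entries
vectored_file = df.filter(size("agg_url_host") > 3)

# step 2: drop URLs appearing in fewer than 10000 rows while building count vectors
cv = CountVectorizer(inputCol="agg_url_host", outputCol="vectors", minDF=10000)
model = cv.fit(vectored_file)
result = model.transform(vectored_file)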
The problem is that some rows have enough URLs to pass my size filter in step 1, but no longer do once the less frequent URLs are removed. So I end up with rows whose vectors column reads (68,[],[]) or (68,[4,56],[1.0,1.0]), even though I only want rows with counts higher than 3 for modeling.
So my question is: can I run a size function on a vector object like the output of CountVectorizer? Or is there a similar function that will remove rows with low counts?
Perhaps there is a way to create a new string array column from my original 'agg_url' column with the less frequent URLs removed? Then I could run CountVectorizer on that.
Any help appreciated.
The size of the output vector is always fixed, so the only thing you can do is count the non-zero elements:
from pyspark.ml.linalg import SparseVector
from pyspark.sql.functions import udf

# UDF returning the number of non-zero entries in a Vector
@udf("long")
def num_nonzeros(v):
    return v.numNonzeros()

df = spark.createDataFrame([
    (1, SparseVector(10, [1, 2, 4, 6], [0.1, 0.3, 0.1, 0.1])),
    (2, SparseVector(10, [], []))
], ("id", "vectors"))

# keep only rows whose vector has more than 3 non-zero entries
df.where(num_nonzeros("vectors") > 3).show()
# +---+--------------------+
# | id| vectors|
# +---+--------------------+
# | 1|(10,[1,2,4,6],[0....|
# +---+--------------------+
But operations like this are not a very useful feature engineering step in general. Remember that lack of information is information as well.
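If you do want the pruned string array you mention, so that the size filter and the CountVectorizer see the same URLs, one possibility is to filter each array against the vocabulary kept by the fitted CountVectorizerModel. A rough, untested sketch, where model is assumed to be the fitted CountVectorizerModel, df the input DataFrame, and agg_url_frequent just an illustrative column name:

from pyspark.sql.functions import udf, size
from pyspark.sql.types import ArrayType, StringType

# vocabulary of the fitted model: only terms that met the minDF threshold
vocab = set(model.vocabulary)

@udf(ArrayType(StringType()))
def keep_frequent(urls):
    # drop URLs that did not survive the minDF cut
    return [u for u in urls if u in vocab]

pruned = df.withColumn("agg_url_frequent", keep_frequent("agg_url_host"))

# now the size filter sees the same URLs the vectorizer will count
pruned = pruned.filter(size("agg_url_frequent") > 3)

From there you could run CountVectorizer on agg_url_frequent, although filtering on num_nonzeros as shown above removes the same rows without a second pass over the data.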