Background: I have URL data aggregated into a string array column, of the form [xyz.com, abc.com, efg.com].
1) I filter rows based on the URL count with
vectored_file.filter(size('agg_url_host') > 3)
2) In the next step I filter out URLs that do not occur frequently with
CountVectorizer(inputCol="agg_url_host", outputCol="vectors", minDF=10000)
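For reference, a minimal self-contained version of what I am doing looks roughly like this (df is my source DataFrame; column names and thresholds are as above, everything else is just placeholder naming):

from pyspark.sql.functions import size
from pyspark.ml.feature import CountVectorizer

# step 1: keep only rows whose raw URL array has more than 3 entries
vectored_file = df.filter(size("agg_url_host") > 3)

# step 2: drop URLs appearing in fewer than 10000 rows while building count vectors
cv = CountVectorizer(inputCol="agg_url_host", outputCol="vectors", minDF=10000)
model = cv.fit(vectored_file)
result = model.transform(vectored_file)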
The problem is that some rows have enough URLs to pass my size filter in step 1, but no longer do once the less frequent URLs are removed. So I end up with rows whose vectors column reads (68,[],[]) or (68,[4,56],[1.0,1.0]), even though I only want rows with counts higher than 3 for modeling.
So my question is: can I run a size function on a vector object like the output of CountVectorizer? Or is there a similar function that will remove rows with low counts?
Perhaps there is a way to create a new string array column from my original 'agg_url' column with the less frequent URLs removed? Then I could run CountVectorizer on that.
Any help appreciated.
The size of the output vector is always fixed, so the only thing you can do is count the non-zero elements:
from pyspark.ml.linalg import SparseVector
from pyspark.sql.functions import udf

# UDF returning the number of non-zero entries in a Vector
@udf("long")
def num_nonzeros(v):
    return v.numNonzeros()

df = spark.createDataFrame([
    (1, SparseVector(10, [1, 2, 4, 6], [0.1, 0.3, 0.1, 0.1])),
    (2, SparseVector(10, [], []))
], ("id", "vectors"))

# keep only rows whose vector has more than 3 non-zero entries
df.where(num_nonzeros("vectors") > 3).show()
# +---+--------------------+
# | id| vectors|
# +---+--------------------+
# | 1|(10,[1,2,4,6],[0....|
# +---+--------------------+
But operations like this are not a very useful feature engineering step in general. Remember that lack of information is information as well.
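If you do want the pruned string array you mention, so that the size filter and the CountVectorizer see the same URLs, one possibility is to filter each array against the vocabulary kept by the fitted CountVectorizerModel. A rough, untested sketch, where model is assumed to be the fitted CountVectorizerModel, df the input DataFrame, and agg_url_frequent just an illustrative column name:

from pyspark.sql.functions import udf, size
from pyspark.sql.types import ArrayType, StringType

# vocabulary of the fitted model: only terms that met the minDF threshold
vocab = set(model.vocabulary)

@udf(ArrayType(StringType()))
def keep_frequent(urls):
    # drop URLs that did not survive the minDF cut
    return [u for u in urls if u in vocab]

pruned = df.withColumn("agg_url_frequent", keep_frequent("agg_url_host"))

# now the size filter sees the same URLs the vectorizer will count
pruned = pruned.filter(size("agg_url_frequent") > 3)

From there you could run CountVectorizer on agg_url_frequent, although filtering on num_nonzeros as shown above removes the same rows without a second pass over the data.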