I'm new to PySpark and was playing around with tf-idf. I just wanted to check whether PySpark and scikit-learn give the same results, but they don't. Here's what I did.
# imports needed for the PySpark pipeline below
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf

# create the PySpark dataframe
sentenceData = sqlContext.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
]).toDF("label", "sentence")
# tokenize
tokenizer = Tokenizer().setInputCol("sentence").setOutputCol("words")
wordsData = tokenizer.transform(sentenceData)
# vectorize
vectorizer = CountVectorizer(inputCol='words', outputCol='vectorizer').fit(wordsData)
wordsData = vectorizer.transform(wordsData)
# calculate scores
idf = IDF(inputCol="vectorizer", outputCol="tfidf_features")
idf_model = idf.fit(wordsData)
wordsData = idf_model.transform(wordsData)
# convert the sparse tf-idf vectors to dense vectors
def to_dense(in_vec):
    return DenseVector(in_vec.toArray())

to_dense_udf = udf(lambda x: to_dense(x), VectorUDT())
# create dense vector
wordsData = wordsData.withColumn("tfidf_features_dense", to_dense_udf('tfidf_features'))
I converted the PySpark df to pandas
wordsData_pandas = wordsData.toPandas()
and then computed tf-idf with sklearn as follows,
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

# identity function so TfidfVectorizer uses the already-tokenized words as-is
def dummy_fun(doc):
    return doc

# create sklearn tfidf
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)
# transform and get idf scores
feature_matrix = tfidf.fit_transform(wordsData_pandas.words)
# create sklearn dtm matrix
sklearn_tfifdf = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names())
# create PySpark dtm matrix
spark_tfidf = pd.DataFrame([np.array(i) for i in wordsData_pandas.tfidf_features_dense], columns=vectorizer.vocabulary)
But unfortunately, I'm getting this for PySpark
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>i</th> <th>are</th> <th>logistic</th> <th>case</th> <th>spark</th> <th>hi</th> <th>about</th> <th>neat</th> <th>could</th> <th>regression</th> <th>wish</th> <th>use</th> <th>heard</th> <th>classes</th> <th>java</th> <th>models</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>0.287682</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.693147</td> <td>0.693147</td> <td>0.693147</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.693147</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> </tr> <tr> <th>1</th> <td>0.287682</td> <td>0.000000</td> <td>0.000000</td> <td>0.693147</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.693147</td> <td>0.000000</td> <td>0.693147</td> <td>0.693147</td> <td>0.000000</td> <td>0.693147</td> <td>0.693147</td> <td>0.000000</td> </tr> <tr> <th>2</th> <td>0.000000</td> <td>0.693147</td> <td>0.693147</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.693147</td> <td>0.000000</td> <td>0.693147</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.693147</td> </tr> </tbody></table>
and this for sklearn,
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>i</th> <th>are</th> <th>logistic</th> <th>case</th> <th>spark</th> <th>hi</th> <th>about</th> <th>neat</th> <th>could</th> <th>regression</th> <th>wish</th> <th>use</th> <th>heard</th> <th>classes</th> <th>java</th> <th>models</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>0.355432</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.467351</td> <td>0.467351</td> <td>0.467351</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.467351</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> </tr> <tr> <th>1</th> <td>0.296520</td> <td>0.000000</td> <td>0.000000</td> <td>0.389888</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.389888</td> <td>0.000000</td> <td>0.389888</td> <td>0.389888</td> <td>0.000000</td> <td>0.389888</td> <td>0.389888</td> <td>0.000000</td> </tr> <tr> <th>2</th> <td>0.000000</td> <td>0.447214</td> <td>0.447214</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.447214</td> <td>0.000000</td> <td>0.447214</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.000000</td> <td>0.447214</td> </tr> </tbody></table>
I did try out the use_idf and smooth_idf parameters, but no combination seems to make the two outputs the same. What am I missing? Any help is appreciated. Thanks in advance.
The sklearn and PySpark tf-idf scores are not fundamentally different; the discrepancy comes from the post-processing. As the sklearn TfidfVectorizer documentation notes, the key difference is that sklearn applies an l2 norm to each document vector by default, which PySpark's IDF does not. If we set norm=None in sklearn, that rescaling goes away and the two outputs become directly comparable (the only remaining difference is that sklearn's smoothed idf adds 1, i.e. ln((1+n)/(1+df)) + 1, whereas PySpark uses log((n+1)/(df+1))).
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
corpus = ["I heard about Spark","I wish Java could use case classes","Logistic regression models are neat"]
corpus = [sent.lower().split() for sent in corpus]
def dummy_fun(doc):
    return doc
tfidfVectorizer = TfidfVectorizer(norm=None, analyzer='word',
                                  tokenizer=dummy_fun, preprocessor=dummy_fun,
                                  token_pattern=None)
tf = tfidfVectorizer.fit_transform(corpus)
tf_df = pd.DataFrame(tf.toarray(), columns=tfidfVectorizer.get_feature_names())
tf_df
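To see exactly where the remaining gap comes from, here is a minimal NumPy sketch (my own illustration, with made-up variable names) that recomputes both tables directly from the raw term counts using the documented formulas: PySpark's IDF is log((n+1)/(df+1)) with no normalization, while sklearn's default is ln((1+n)/(1+df)) + 1 followed by an l2 norm per row.

import numpy as np

# tokenized corpus (Tokenizer lower-cases, so do the same here)
corpus = ["Hi I heard about Spark",
          "I wish Java could use case classes",
          "Logistic regression models are neat"]
docs = [s.lower().split() for s in corpus]

# document-term count matrix (what CountVectorizer produces)
vocab = sorted({w for d in docs for w in d})
counts = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)          # document frequency of each term

# PySpark-style scores: idf = ln((n+1)/(df+1)), no normalization
spark_tfidf = counts * np.log((n_docs + 1) / (df + 1))   # ln(4/2)=0.6931, ln(4/3)=0.2877

# sklearn-style scores (defaults): idf = ln((1+n)/(1+df)) + 1, then l2-normalize each row
sk_tfidf = counts * (np.log((1 + n_docs) / (1 + df)) + 1)
sk_tfidf = sk_tfidf / np.linalg.norm(sk_tfidf, axis=1, keepdims=True)

print(dict(zip(vocab, spark_tfidf[0].round(6))))   # matches the PySpark table above
print(dict(zip(vocab, sk_tfidf[0].round(6))))      # matches the sklearn table above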
Refer to my answer here to understand how norm works with the tf-idf vectorizer.
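Going the other way, if you would rather make the PySpark output look like sklearn's default, pyspark.ml.feature.Normalizer can apply the same l2 normalization to the tf-idf column. A minimal sketch continuing from the wordsData DataFrame in the question (the output column name is just an example); the values will still be slightly off because of sklearn's extra +1 in the idf.

from pyspark.ml.feature import Normalizer

# l2-normalize each tf-idf vector, mirroring sklearn's default norm='l2'
l2_normalizer = Normalizer(inputCol="tfidf_features", outputCol="tfidf_features_l2", p=2.0)
wordsData = l2_normalizer.transform(wordsData)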