I've created a pipeline for doing LDA in Spark 2.0 (via the PySpark API):
from pyspark.ml import Pipeline
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer, RegexTokenizer

def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern=r'[\W]+'):
    """
    Create a pipeline for running an LDA model on a corpus. This function does not need data and will not
    actually do any fitting until invoked by the caller.
    Args:
        minTokenLength: minimum token length to keep after tokenization
        minDF: minimum number of documents a word must appear in across the corpus
        minTF: minimum number of times a word must appear in a document
        numTopics: number of LDA topics to infer
        seed: random seed
        pattern: regular expression used to split text into words
    Returns:
        pipeline: a pyspark.ml.Pipeline (fitting it yields a pyspark.ml.PipelineModel)
    """
    reTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern=pattern, minTokenLength=minTokenLength)
    cntVec = CountVectorizer(inputCol=reTokenizer.getOutputCol(), outputCol="vectors", minDF=minDF, minTF=minTF)
    lda = LDA(k=numTopics, seed=seed, optimizer="em", featuresCol=cntVec.getOutputCol())
    pipeline = Pipeline(stages=[reTokenizer, cntVec, lda])
    return pipeline
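For context, the pipeline expects a DataFrame with a single string column named "text" and does no work until it is fit. A minimal sketch of how it gets used, assuming an existing SparkSession named spark (the toy rows below are just for illustration; my real data comes from a 20 newsgroups helper):

# Toy DataFrame illustrating the expected input schema: one string column "text".
df = spark.createDataFrame(
    [("the quick brown fox jumps over the lazy dog",),
     ("lorem ipsum dolor sit amet",)],
    ["text"],
)

pipeline = create_lda_pipeline(numTopics=2)
model = pipeline.fit(df)  # fitting the Pipeline yields a pyspark.ml.PipelineModel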
I want to calculate the perplexity on a dataset using the trained model with the LDAModel.logPerplexity()
method, so I tried running the following:
training = get_20_newsgroups_data(test_or_train='train')
pipeline = create_lda_pipeline(numTopics=20, minDF=3, minTokenLength=5)
model = pipeline.fit(training)  # train model on training data
testing = get_20_newsgroups_data(test_or_train='test')
perplexity = model.logPerplexity(testing)
pprint(perplexity)
This just results in the following AttributeError:

'PipelineModel' object has no attribute 'logPerplexity'

I understand why this error happens, since the logPerplexity method belongs to LDAModel, not PipelineModel, but I am wondering if there is a way to access the method from that stage.
All fitted transformers in the pipeline are stored in the stages property of the PipelineModel. Extract the stages, take the last one (the fitted LDAModel), and you're ready to go:
model.stages[-1].logPerplexity(testing)
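Note that logPerplexity is evaluated against the LDA features column ("vectors" in your pipeline), so the DataFrame you pass in has to contain it. If testing still holds only the raw text column, one option is to push it through the fitted pipeline first and then evaluate the last stage. A minimal sketch, assuming the column names and the get_20_newsgroups_data helper from your question:

from pprint import pprint

lda_model = model.stages[-1]  # the fitted LDAModel from the last pipeline stage

# Run the test data through the whole fitted pipeline so that the "vectors"
# column the LDA stage was trained on is present on the evaluated DataFrame.
processed = model.transform(testing)

pprint(lda_model.logPerplexity(processed))   # lower is better
pprint(lda_model.logLikelihood(processed))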