
Any way to access methods from individual stages in PySpark PipelineModel?

I've created a PipelineModel for doing LDA in Spark 2.0 (via the PySpark API):

from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA


def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern=r'[\W]+'):
    """
    Create a pipeline for running an LDA model on a corpus. This function does not touch
    any data; no fitting happens until the caller invokes fit().
    Args:
        minTokenLength: minimum token length to keep after tokenization
        minDF: minimum number of documents a word must appear in across the corpus
        minTF: minimum number of times a word must appear in a document
        numTopics: number of LDA topics (k)
        seed: random seed, for reproducibility
        pattern: regular expression used to split text into words

    Returns:
        pipeline: a pyspark.ml.Pipeline (fitting it yields a PipelineModel)
    """
    reTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern=pattern, minTokenLength=minTokenLength)
    cntVec = CountVectorizer(inputCol=reTokenizer.getOutputCol(), outputCol="vectors", minDF=minDF, minTF=minTF)
    lda = LDA(k=numTopics, seed=seed, optimizer="em", featuresCol=cntVec.getOutputCol())
    pipeline = Pipeline(stages=[reTokenizer, cntVec, lda])
    return pipeline
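
For context, a minimal usage sketch; the toy documents and the active SparkSession bound to spark are my assumptions, not part of the original question:

# Hypothetical usage: assumes an active SparkSession bound to `spark`.
docs = spark.createDataFrame(
    [(0, "spark pipelines compose transformers"),
     (1, "topic models group words into topics")],
    ["id", "text"])

pipeline = create_lda_pipeline(numTopics=2)
model = pipeline.fit(docs)  # returns a fitted pyspark.ml.PipelineModel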

I want to calculate the perplexity on a dataset using the trained model with the LDAModel.logPerplexity() method, so I tried running the following:

from pprint import pprint

training = get_20_newsgroups_data(test_or_train='train')
pipeline = create_lda_pipeline(numTopics=20, minDF=3, minTokenLength=5)
model = pipeline.fit(training)  # train the pipeline on the training data
testing = get_20_newsgroups_data(test_or_train='test')
perplexity = model.logPerplexity(testing)
pprint(perplexity)

This just results in the following AttributeError:

'PipelineModel' object has no attribute 'logPerplexity'

I understand why this error happens, since the logPerplexity method belongs to LDAModel, not PipelineModel, but I am wondering if there is a way to access the method from that stage.
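
For comparison, fitting LDA directly (outside a pipeline) does expose the method. A minimal sketch with hypothetical pre-vectorized data, again assuming a SparkSession bound to spark:

# Hypothetical data: two documents already converted to count vectors.
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors

vec_df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 2.0, 0.0])),
     (1, Vectors.dense([0.0, 1.0, 3.0]))],
    ["id", "vectors"])

lda_model = LDA(k=2, seed=42, featuresCol="vectors").fit(vec_df)
print(lda_model.logPerplexity(vec_df))  # LDAModel has the method; PipelineModel does not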

asked Jul 29 '16 by Evan Zamir



1 Answer

All transformers in the pipeline are stored in the stages property. Extract the stages, take the last one, and you're ready to go:

model.stages[-1].logPerplexity(testing)
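
One caveat worth adding (my note, not part of the answer): logPerplexity expects a DataFrame that already contains the LDA features column ("vectors" in this pipeline), so the raw test set should first be run through the fitted pipeline's transform:

# Transform first so the "vectors" column exists, then score the LDA stage.
transformed = model.transform(testing)
perplexity = model.stages[-1].logPerplexity(transformed)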
answered Sep 27 '22 by zero323