I've created a pipeline for doing LDA in Spark 2.0 (via the PySpark API):
from pyspark.ml import Pipeline
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer, RegexTokenizer

def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern=r'[\W]+'):
    """
    Create a pipeline for running an LDA model on a corpus. This function does not need data and will not
    actually do any fitting until invoked by the caller.
    Args:
        minTokenLength: minimum token length to keep after tokenization
        minDF: minimum number of documents a word must appear in across the corpus
        minTF: minimum number of times a word must appear in a document
        numTopics: number of LDA topics to infer
        seed: random seed
        pattern: regular expression used to split text into words
    Returns:
        pipeline: a pyspark.ml.Pipeline (fitting it yields a pyspark.ml.PipelineModel)
    """
    reTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern=pattern, minTokenLength=minTokenLength)
    cntVec = CountVectorizer(inputCol=reTokenizer.getOutputCol(), outputCol="vectors", minDF=minDF, minTF=minTF)
    lda = LDA(k=numTopics, seed=seed, optimizer="em", featuresCol=cntVec.getOutputCol())
    pipeline = Pipeline(stages=[reTokenizer, cntVec, lda])
    return pipeline
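For context, the pipeline expects a DataFrame with a single string column named "text" and does no work until it is fit. A minimal sketch of how it gets used, assuming an existing SparkSession named spark (the toy rows below are just for illustration; my real data comes from a 20 newsgroups helper):

# Toy DataFrame illustrating the expected input schema: one string column "text".
df = spark.createDataFrame(
    [("the quick brown fox jumps over the lazy dog",),
     ("lorem ipsum dolor sit amet",)],
    ["text"],
)

pipeline = create_lda_pipeline(numTopics=2)
model = pipeline.fit(df)  # fitting the Pipeline yields a pyspark.ml.PipelineModel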
I want to calculate the perplexity on a dataset using the trained model with the LDAModel.logPerplexity()
method, so I tried running the following:
training = get_20_newsgroups_data(test_or_train='train')
pipeline = create_lda_pipeline(numTopics=20, minDF=3, minTokenLength=5)
model = pipeline.fit(training)  # train model on training data
testing = get_20_newsgroups_data(test_or_train='test')
perplexity = model.logPerplexity(testing)
pprint(perplexity)
This just results in the following AttributeError:

'PipelineModel' object has no attribute 'logPerplexity'

I understand why this error happens, since the logPerplexity method belongs to LDAModel, not PipelineModel, but I am wondering if there is a way to access the method from that stage.
All fitted transformers in the pipeline are stored in the stages property of the PipelineModel. Extract the stages, take the last one (the fitted LDAModel), and you're ready to go:
model.stages[-1].logPerplexity(testing)
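Note that logPerplexity is evaluated against the LDA features column ("vectors" in your pipeline), so the DataFrame you pass in has to contain it. If testing still holds only the raw text column, one option is to push it through the fitted pipeline first and then evaluate the last stage. A minimal sketch, assuming the column names and the get_20_newsgroups_data helper from your question:

from pprint import pprint

lda_model = model.stages[-1]  # the fitted LDAModel from the last pipeline stage

# Run the test data through the whole fitted pipeline so that the "vectors"
# column the LDA stage was trained on is present on the evaluated DataFrame.
processed = model.transform(testing)

pprint(lda_model.logPerplexity(processed))   # lower is better
pprint(lda_model.logLikelihood(processed))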