maxCategories not working as expected in VectorIndexer when using RandomForestClassifier in pyspark.ml

Background: I'm doing a simple binary classification using RandomForestClassifier from pyspark.ml. Before training, I use VectorIndexer with the maxCategories argument to decide which features are treated as categorical and which as numerical.

Problem: Even though I set maxCategories to 30 in VectorIndexer, I still get an error when fitting the pipeline:

An error occurred while calling o15371.fit.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each categorical feature, but categorical feature 0 has 10765 values. Considering remove this and other categorical features with a large number of values, or add more training examples.

My code is simple: col_idx is a list of column names passed to StringIndexer, col_all is a list of column names passed to StringIndexer and OneHotEncoder, and col_num is a list of numeric column names.

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler, IndexToString, VectorIndexer
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

my_data.cache()

# stringindexers and encoders
stIndexers = [StringIndexer(inputCol = Col, outputCol = Col + 'Index').setHandleInvalid('keep') for Col in col_idx]
encoder = OneHotEncoderEstimator(inputCols = [Col + 'Index' for Col in col_all], outputCols = [Col + 'ClassVec' for Col in col_all]).setHandleInvalid('keep')

# vector assembler
col_into_assembler = [cols + 'Index' for cols in col_idx] + [cols + 'ClassVec' for cols in col_all] + col_num
assembler = VectorAssembler(inputCols = col_into_assembler, outputCol = "features")

# featureIndexer, labelIndexer, rf classifier and labelConverter
featureIndexer = VectorIndexer(inputCol = "features", outputCol = "indexedFeatures", maxCategories = 30)
# features with at most maxCategories distinct values => categorical; features with more => numerical / continuous.
# A smaller maxCategories yields fewer categorical features, a larger one yields more.
labelIndexer = StringIndexer(inputCol = "label", outputCol = "indexedLabel").fit(my_data)
rf = RandomForestClassifier(featuresCol = "indexedFeatures", labelCol = "indexedLabel")
labelConverter = IndexToString(inputCol = "prediction", outputCol = "predictedLabel", labels=labelIndexer.labels)

# chain all the estimators and transformers stages into a Pipeline estimator
rfPipeline = Pipeline(stages = stIndexers + [encoder, assembler, featureIndexer, labelIndexer, rf, labelConverter])

# split data, cache them
training, test = my_data.randomSplit([0.7, 0.3], seed = 100)
training.cache()
test.cache()

# fit the estimator with training dataset to get a compiled pipeline with transformers and fitted models.
ModelRF = rfPipeline.fit(training)

# make predictions
predictions = ModelRF.transform(test)
predictions.printSchema()
predictions.show(5)

So my question is: why is there still a high-cardinality categorical feature in my data even though I set maxCategories to 30 in VectorIndexer? I can set maxBins in the RandomForestClassifier to a higher value, but I'm curious why VectorIndexer is not working as expected (well, as I expected), that is, treating features with at most maxCategories distinct values as categorical and the rest as numerical.


asked May 22 '18 12:05

Yiming Wu

1 Answer

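It looks like, contrary to the documentation, which lists

Preserve metadata in transform; if a feature's metadata is already present, do not recompute.

among its TODO items, metadata is already preserved. A column produced by StringIndexer carries nominal metadata, and that metadata survives VectorAssembler and VectorIndexer, so the feature stays categorical regardless of maxCategories: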

from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

# assumes an active SparkSession named spark (as in the pyspark shell)
df = spark.range(10)

stages = [
    StringIndexer(inputCol="id", outputCol="idx"),
    VectorAssembler(inputCols=["idx"], outputCol="features"),
    # 10 distinct values > maxCategories=5, so one might expect a numeric feature
    VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=5),
]
Pipeline(stages=stages).fit(df).transform(df).schema["features"].metadata
# {'ml_attr': {'attrs': {'nominal': [{'vals': ['8',
#       '4',
#       '9',
#       '5',
#       '6',
#       '1',
#       '0',
#       '2',
#       '7',
#       '3'],
#      'idx': 0,
#      'name': 'idx'}]},
#   'num_attrs': 1}}

Pipeline(stages=stages).fit(df).transform(df).schema["features_indexed"].metadata

# {'ml_attr': {'attrs': {'nominal': [{'ord': False,
#      'vals': ['0.0',
#       '1.0',
#       '2.0',
#       '3.0',
#       '4.0',
#       '5.0',
#       '6.0',
#       '7.0',
#       '8.0',
#       '9.0'],
#      'idx': 0,
#      'name': 'idx'}]},
#   'num_attrs': 1}}

Under normal circumstances this is the desired behavior: you shouldn't use indexed categorical features as continuous variables.
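To find out which features in an assembled vector are marked nominal, and how many values each one has, the metadata dictionaries printed above can be inspected directly. The following is a minimal sketch in plain Python, not part of Spark: `nominal_arities` and `features_exceeding` are hypothetical helper names, the `vals` key is taken from the output above, and the `num_vals` fallback is an assumption about how Spark stores a bare category count.

```python
def nominal_arities(metadata):
    """Map feature index -> number of categories for features marked nominal.

    `metadata` is a dict shaped like df.schema["features"].metadata.
    """
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    arities = {}
    for attr in attrs.get("nominal", []):
        if "vals" in attr:
            # Explicit list of category values, as in the output above.
            arities[attr["idx"]] = len(attr["vals"])
        elif "num_vals" in attr:
            # Assumption: a bare category count stored under "num_vals".
            arities[attr["idx"]] = attr["num_vals"]
    return arities


def features_exceeding(metadata, max_bins=32):
    """Return the nominal features whose category count is larger than max_bins."""
    return {idx: n for idx, n in nominal_arities(metadata).items() if n > max_bins}


# Metadata copied from the "features_indexed" output above.
example = {
    "ml_attr": {
        "attrs": {
            "nominal": [
                {
                    "ord": False,
                    "vals": [str(float(i)) for i in range(10)],
                    "idx": 0,
                    "name": "idx",
                }
            ]
        },
        "num_attrs": 1,
    }
}

print(nominal_arities(example))                 # {0: 10}
print(features_exceeding(example, max_bins=5))  # {0: 10}
```

On a real DataFrame the input would be something like `df.schema["features"].metadata`; any feature returned by `features_exceeding` with the classifier's maxBins is a candidate for the error in the question.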

But if you still want to circumvent this behavior, you'll have to reset the metadata, for example:

pipeline1 = Pipeline(stages=stages[:1])
pipeline2 = Pipeline(stages=stages[1:])

dft1 = pipeline1.fit(df).transform(df).withColumn("idx", col("idx").alias("idx", metadata={}))
dft2 = pipeline2.fit(dft1).transform(dft1)


dft2.schema["features_indexed"].metadata

# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'idx'}]},
#   'num_attrs': 1}}
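The two outcomes in this answer can be summarized as a toy decision rule. This is only a plain-Python model of the behavior observed here, not Spark's implementation; `resolved_type` and its parameters are hypothetical names introduced for illustration.

```python
def resolved_type(num_distinct, max_categories, existing_type=None):
    """Toy model: how a feature ends up typed after VectorIndexer.

    existing_type is the attribute type already present in the column
    metadata ("nominal" after StringIndexer, None after a metadata reset).
    """
    if existing_type == "nominal":
        # Observed above: nominal metadata is kept, maxCategories is not applied.
        return "nominal"
    # Without prior metadata, the maxCategories threshold decides.
    return "nominal" if num_distinct <= max_categories else "numeric"


# First pipeline above: StringIndexer metadata present, 10 values, maxCategories=5.
print(resolved_type(10, 5, existing_type="nominal"))      # nominal
# Second pipeline above: metadata reset before VectorAssembler.
print(resolved_type(10, 5))                               # numeric
# The question's case: 10765 values, maxCategories=30, StringIndexer metadata present.
print(resolved_type(10765, 30, existing_type="nominal"))  # nominal
```

Under this model, the feature with 10765 values in the question stays categorical because it was produced by StringIndexer, which is why the maxBins check in the tree learner fails.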


answered Sep 29 '22 05:09

Alper t. Turker

Tags:

machine-learning

apache-spark

pyspark

random-forest
