
In Spark ML, why does fitting a StringIndexer on a column with millions of distinct values yield an OOM error?

I am trying to use Spark's StringIndexer feature transformer on a column with about 15,000,000 unique string values. No matter how many resources I throw at it, Spark always dies on me with some sort of Out Of Memory exception.

from pyspark.ml.feature import StringIndexer

data = spark.read.parquet("s3://example/data-raw").select("user", "count")

user_indexer = StringIndexer(inputCol="user", outputCol="user_idx")

indexer_model = user_indexer.fit(data) # This never finishes

indexer_model \
    .transform(data) \
    .write.parquet("s3://example/data-indexed")

An error file is produced on the driver; its beginning looks like this:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 268435456 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2657)

Now, if I manually index the values and store them in a DataFrame, everything works like a charm, all on a couple of Amazon c3.2xlarge workers.

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

data = spark.read.parquet("s3://example/data-raw").select("user", "count")

uid_map = data \
    .select("user") \
    .distinct() \
    .select("user", row_number().over(Window.orderBy("user")).alias("user_idx"))

data.join(uid_map, "user", "inner").write.parquet("s3://example/data-indexed")

I would really like to use the transformers provided by Spark, but at the moment this doesn't seem possible. Any ideas on how I can make this work?

asked Aug 24 '18 by Interfector



1 Answer

The reason you get an OOM error is that, behind the scenes, Spark's StringIndexer calls countByValue on the "user" column to collect all the distinct values.
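
Roughly, the fit step amounts to collecting a frequency map of every distinct user onto the driver, along the lines of the snippet below (an illustrative approximation, not the actual implementation):

# roughly what happens during fit(): a dict keyed by all 15M distinct
# values ends up in driver memory
label_counts = data.select("user").rdd.map(lambda row: row["user"]).countByValue()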

With 15M distinct values, you are building a huge map on the driver, and it runs out of memory. A straightforward workaround is to increase the driver's memory: with spark-submit you can pass --driver-memory 16g, or you can set the spark.driver.memory property in the config file.
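
For example (16g is only an illustration and the script name is a placeholder; size the value to your data):

spark-submit --driver-memory 16g your_indexing_job.py

# or, equivalently, in conf/spark-defaults.conf
spark.driver.memory 16g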

Yet the problem will simply occur again as the number of distinct values grows. Unfortunately, there is not much you can do about Spark's transformers, and here is why: after being fitted to the data, they are meant to be serialized for later use, so they are not designed to be this big (a map of 15M strings would weigh at least 100 MB). You should probably reconsider using a StringIndexer for that many categories; the hashing trick would likely be a better fit here, as sketched below.
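
A minimal sketch of what that could look like with FeatureHasher (the numFeatures value is illustrative; note that hashing produces a fixed-width feature vector rather than a single integer index, and collisions are possible):

from pyspark.ml.feature import FeatureHasher

# hash "user" into a sparse vector of fixed width; no per-category map
# ever has to be built or held on the driver
hasher = FeatureHasher(inputCols=["user"], outputCol="user_features", numFeatures=1 << 20)
hashed = hasher.transform(data)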

Finally, let me comment on your workaround. With your window, you put all 15M categories on a single partition and therefore on a single executor, so it won't scale as that number grows. Moreover, using a non-partitioned window is generally a bad idea, since it prevents parallel computation (on top of putting everything on the same partition, which can itself cause an OOM error). I would compute your uid_map like this:

from pyspark.sql.functions import monotonically_increasing_id

# if you don't need consecutive indices
uid_map = data \
    .select("user") \
    .distinct() \
    .withColumn("user_idx", monotonically_increasing_id())

# if you do, you need to use RDDs: zipWithIndex assigns consecutive ids
uid_rdd = data \
    .select("user") \
    .distinct() \
    .rdd.map(lambda x: x["user"]) \
    .zipWithIndex()
uid_map = spark.createDataFrame(uid_rdd, ["user", "user_idx"])
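
Either way, uid_map can then be joined back onto the original data exactly as in your manual version:

data.join(uid_map, "user", "inner").write.parquet("s3://example/data-indexed")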

answered Oct 01 '22 by Oli