 

Performance decrease for a huge number of columns (PySpark)

I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more).
Task:

  1. Create the wide DF via groupBy and pivot.
  2. Transform the columns to a vector and run KMeans from pyspark.ml on it.

So I built the wide frame, created the vector with VectorAssembler, cached it and trained KMeans on it.
On my PC in standalone mode, assembling the ~500x9000 frame took about 11 minutes, and KMeans for 7 different cluster counts took another 2 minutes. On the other hand, the same processing in pandas (pivot the df, then iterate over the cluster counts) takes less than one minute; a rough pandas sketch is included after the Spark config below.
Obviously I understand there is overhead from standalone mode, caching and so on, but it still really discourages me.
Could somebody explain how I can avoid this overhead?
How do people work with wide DFs instead of using VectorAssembler and taking the performance hit?
A more formal question (to fit SO rules) would be: how can I speed up this code?

%%time
# imports used by the snippets below
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

tmp = (df_states.select('ObjectPath', 'User', 'PropertyFlagValue')
       .groupBy('User')
       .pivot('ObjectPath')
       .agg({'PropertyFlagValue': 'max'})
       .fillna(0))
ignore = ['User']
assembler = VectorAssembler(
    inputCols=[x for x in tmp.columns if x not in ignore],
    outputCol='features')
Wall time: 36.7 s

print(tmp.count(), len(tmp.columns))
552, 9378

%%time
transformed = assembler.transform(tmp).select('User', 'features').cache()
Wall time: 10min 45s

%%time
# elbow search: fit KMeans for k = 3..13 and collect the clustering cost
lst_levels = []
for num in range(3, 14):
    kmeans = KMeans(k=num, maxIter=50)
    model = kmeans.fit(transformed)
    lst_levels.append(model.computeCost(transformed))
# differences between consecutive costs; refit at the first "elbow"
rs = [i - j for i, j in zip(lst_levels, lst_levels[1:])]
for i, j in zip(rs, rs[1:]):
    if i - j < j:
        print(rs.index(i))
        kmeans = KMeans(k=rs.index(i) + 3, maxIter=50)
        model = kmeans.fit(transformed)
        break
Wall time: 1min 32s

Config:

.config("spark.sql.pivotMaxValues", "100000") \
.config("spark.sql.autoBroadcastJoinThreshold", "-1") \
.config("spark.sql.shuffle.partitions", "4") \
.config("spark.sql.inMemoryColumnarStorage.batchSize", "1000") \
Asked Feb 20 '18 by Anton Alekseev


1 Answer

Actually, the solution was found in a map over the RDD.

  1. First of all, we create a map of values for each row.
  2. We also extract all distinct names.
  3. In the penultimate step, we look each name up in the row's map and return its value, or 0 if nothing was found.
  4. Vector assembler on the results.

Advantages:

  1. You don't have to create a wide dataframe with a huge number of columns, and hence you avoid that overhead. (Speed went up from 11 minutes to 1.)
  2. You still work on the cluster and execute your code in the Spark paradigm.

Example of code: Scala implementation.
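
A rough PySpark sketch of the same idea (this is not the linked Scala code; the helper names are mine, and spark / df_states are assumed from the question):

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

# 1. The distinct ObjectPath names define the (fixed) layout of the feature vector.
names = [r[0] for r in df_states.select('ObjectPath').distinct().collect()]
name_idx = {n: i for i, n in enumerate(names)}

# 2. Per user, collect a small map {ObjectPath: max(PropertyFlagValue)} on the RDD.
pairs = (df_states.select('User', 'ObjectPath', 'PropertyFlagValue').rdd
         .map(lambda r: ((r['User'], r['ObjectPath']), r['PropertyFlagValue']))
         .reduceByKey(max)
         .map(lambda kv: (kv[0][0], [(kv[0][1], kv[1])]))
         .reduceByKey(lambda a, b: a + b))

# 3. Look every name up in the user's map, fall back to 0, and build a dense vector.
def to_row(user, kvs):
    values = [0.0] * len(names)
    for name, value in kvs:
        values[name_idx[name]] = float(value)
    return Row(User=user, features=Vectors.dense(values))

transformed = spark.createDataFrame(pairs.map(lambda kv: to_row(*kv))).cache()

# 4. KMeans on the assembled vectors, as in the question.
model = KMeans(k=7, maxIter=50).fit(transformed)

The key point is that the wide pivot is never materialized as thousands of DataFrame columns; each row goes straight from its small map to a dense vector.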

Answered Oct 24 '22 by Anton Alekseev