I was trying to print the total number of elements in each partition of a DataFrame using Spark 2.2:
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)
    yield (splitIndex, n)
spark = SparkSession.builder.appName("tmp").getOrCreate()
num_parts = 3
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(num_parts)
print("df has partitions."+ str(df.rdd.getNumPartitions()))
print("Elements across partitions is:" + str(df.rdd.mapPartitionsWithIndex(lambda ind, x: count_elements(ind, x)).take(3)))
The code above kept failing with the following error:
    n = sum(1 for _ in iterator)
  File "/home/dev/wk/pyenv/py3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'
After removing the import below,
from pyspark.sql.functions import *
the code works fine and produces:
df has partitions.3
Elements across partitions is:[(0, 1), (1, 2), (2, 2)]
What is causing this error, and how can I fix it?
This is a great example of why you shouldn't use import *.
The line
from pyspark.sql.functions import *
will bring all the functions in the pyspark.sql.functions module into your namespace, including some that will shadow your builtins.
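You can see how many names collide by checking which entries in pyspark.sql.functions also exist as Python builtins (a quick diagnostic sketch, assuming pyspark is installed):

import builtins
import pyspark.sql.functions as F

# names defined in pyspark.sql.functions that also exist as builtins
shadowed = [name for name in dir(F)
            if not name.startswith('_') and hasattr(builtins, name)]
print(shadowed)  # includes 'sum', 'abs', 'min', 'max', 'round', 'filter', ...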
The specific issue is in the count_elements function, on the line:
n = sum(1 for _ in iterator)
# ^^^ - this is now pyspark.sql.functions.sum
You intended to call the Python builtin sum (builtins.sum in Python 3), but the import * shadowed it.
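You can confirm the shadowing directly (a minimal sketch; run it in the same session where the star import happened):

from pyspark.sql import functions
from pyspark.sql.functions import *
import builtins

print(sum is functions.sum)      # True: 'sum' now refers to the Spark column function
print(builtins.sum([1, 2, 3]))   # 6: the real builtin is still reachable via the builtins module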
Instead, do one of the following:
import pyspark.sql.functions as f
Or
from pyspark.sql.functions import sum as sum_
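With the aliased import, the corrected snippet from the question might look like this (a sketch; the JSON path is the one the question uses):

import pyspark.sql.functions as F  # namespaced, so the builtin sum is left alone
from pyspark.sql import SparkSession

def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)   # the Python builtin again
    yield (splitIndex, n)

spark = SparkSession.builder.appName("tmp").getOrCreate()
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(3)
print("df has partitions." + str(df.rdd.getNumPartitions()))
print("Elements across partitions is:"
      + str(df.rdd.mapPartitionsWithIndex(count_elements).take(3)))
# use F.col, F.sum, etc. when you need the Spark column functions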