I was trying to print the total number of elements in each partition of a DataFrame using Spark 2.2:
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)
    yield (splitIndex, n)
spark = SparkSession.builder.appName("tmp").getOrCreate()
num_parts = 3
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(num_parts)
print("df has partitions."+ str(df.rdd.getNumPartitions()))
print("Elements across partitions is:" + str(df.rdd.mapPartitionsWithIndex(lambda ind, x: count_elements(ind, x)).take(3)))
The code above kept failing with the following error:
    n = sum(1 for _ in iterator)
  File "/home/dev/wk/pyenv/py3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'
After removing the import below,
from pyspark.sql.functions import *
the code works fine and produces:
df has partitions.3
Elements across partitions is:[(0, 1), (1, 2), (2, 2)]
What is causing this error, and how can I fix it?
This is a great example of why you shouldn't use import *.
The line
from pyspark.sql.functions import *
will bring all the functions in the pyspark.sql.functions module into your namespace, including some that will shadow your builtins.
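You can see how many names collide by checking which entries in pyspark.sql.functions also exist as Python builtins (a quick diagnostic sketch, assuming pyspark is installed):

import builtins
import pyspark.sql.functions as F

# names defined in pyspark.sql.functions that also exist as builtins
shadowed = [name for name in dir(F)
            if not name.startswith('_') and hasattr(builtins, name)]
print(shadowed)  # includes 'sum', 'abs', 'min', 'max', 'round', 'filter', ...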
The specific issue is in the count_elements function, on the line:
n = sum(1 for _ in iterator)
# ^^^ - this is now pyspark.sql.functions.sum
You intended to call the Python builtin sum (builtins.sum in Python 3), but the import * shadowed it.
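You can confirm the shadowing directly (a minimal sketch; run it in the same session where the star import happened):

from pyspark.sql import functions
from pyspark.sql.functions import *
import builtins

print(sum is functions.sum)      # True: 'sum' now refers to the Spark column function
print(builtins.sum([1, 2, 3]))   # 6: the real builtin is still reachable via the builtins module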
Instead, do one of the following:
import pyspark.sql.functions as f
Or
from pyspark.sql.functions import sum as sum_
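With the aliased import, the corrected snippet from the question might look like this (a sketch; the JSON path is the one the question uses):

import pyspark.sql.functions as F  # namespaced, so the builtin sum is left alone
from pyspark.sql import SparkSession

def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)   # the Python builtin again
    yield (splitIndex, n)

spark = SparkSession.builder.appName("tmp").getOrCreate()
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(3)
print("df has partitions." + str(df.rdd.getNumPartitions()))
print("Elements across partitions is:"
      + str(df.rdd.mapPartitionsWithIndex(count_elements).take(3)))
# use F.col, F.sum, etc. when you need the Spark column functions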