AttributeError: 'DataFrame' object has no attribute '_data'

Azure Databricks raises an execution error while parallelizing over a pandas DataFrame. The code creates the RDD successfully but fails when `.collect()` is called.

setup:

import pandas as pd

# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
my_df = pd.DataFrame(data, columns=['Name', 'Age'])

def testfn(i):
    return my_df.iloc[i]

# `sc` is the SparkContext provided by the Databricks notebook
test_var = sc.parallelize([0, 1, 2], 50).map(testfn).collect()
print(test_var)

Error:

Py4JJavaError                             Traceback (most recent call last)
<command-2941072546245585> in <module>
      1 def testfn(i):
      2   return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
      4 print (test_var)

/databricks/spark/python/pyspark/rdd.py in collect(self)
    901         # Default path used in OSS Spark / for non-credential passthrough clusters:
    902         with SCCallSiteSync(self.context) as css:
--> 903             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    904         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    905 

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    125     def deco(*a, **kw):
    126         try:
--> 127             return f(*a, **kw)
    128         except py4j.protocol.Py4JJavaError as e:
    129             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 3845.0 failed 4 times, most recent failure: Lost task 16.3 in stage 3845.0 : org.apache.spark.api.python.PythonException: 'AttributeError: 'DataFrame' object has no attribute '_data'', from <command-2941072546245585>, line 2. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 654, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 646, in process
    serializer.dump_stream(out_iter, outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 279, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/util.py", line 109, in wrapper
    return f(*args, **kwargs)
  File "<command-2941072546245585>", line 2, in testfn
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 1767, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2060, in _validate_integer
    len_axis = len(self.obj._get_axis(axis))
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 424, in _get_axis
    return getattr(self, name)
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
  File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
  File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'

Version details:

spark: '3.0.0' python:3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]

Asked Dec 28 '20 by hashini paramasivam


2 Answers

I've seen this error when the driver and executors had different versions of pandas installed. In my case the driver had pandas 1.1.0 (via databricks-connect), while the executors were on Databricks Runtime 7.3 with pandas 1.0.1. Pandas 1.1.0 made a big change to its internals, so a DataFrame pickled into the closure by the driver breaks when it is deserialized on executors running the older version (which still expects the internal `_data` attribute). You need to check that your executors and driver have the same version of pandas (the versions shipped with each Databricks Runtime are listed in its release notes). You can use a small script to compare the versions of the Python libraries on the executors and the driver.
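The comparison script itself is not included above; a minimal sketch of such a check, assuming a live SparkContext `sc` as in the question (the function and parameter names here are illustrative, not from the original answer), could look like this:

```python
def pandas_version(_=None):
    # Report the pandas version of whichever Python interpreter runs this;
    # on an executor it reflects that executor's environment.
    try:
        import pandas
        return pandas.__version__
    except ImportError:
        return "not installed"

def compare_pandas_versions(sc, num_probes=8):
    """Return the driver's pandas version and the distinct versions seen on executors."""
    driver = pandas_version()
    executors = sorted(set(
        sc.parallelize(range(num_probes), num_probes).map(pandas_version).collect()
    ))
    return driver, executors
```

In a Databricks notebook, `compare_pandas_versions(sc)` would then return something like `('1.1.0', ['1.0.1'])` on a mismatched setup; any difference between the two values points at this bug.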

Answered Oct 24 '22 by Alex Ott

I came across the same problem.

I think it was due to a pandas version difference.

I fixed this bug by updating my pandas version from 1.0.1 to 1.0.5.
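On Databricks, one way to pin the version for a single notebook session (assuming a runtime that supports notebook-scoped libraries) is a `%pip` magic at the top of the notebook, which installs the package for both the driver and the executors of that session:

```shell
%pip install pandas==1.0.5
```

After it runs, restarting the Python process (which Databricks does automatically for `%pip`) picks up the pinned version.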

Answered Oct 24 '22 by xie bruce