Azure Databricks execution error while parallelizing over a pandas DataFrame. The code creates the RDD successfully but fails when .collect() is performed.
Setup:
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
my_df = pd.DataFrame(data, columns = ['Name', 'Age'])
def testfn(i):
    return my_df.iloc[i]

test_var = sc.parallelize([0, 1, 2], 50).map(testfn).collect()
print(test_var)
Error:
Py4JJavaError Traceback (most recent call last)
<command-2941072546245585> in <module>
1 def testfn(i):
2 return my_df.iloc[i]
----> 3 test_var=sc.parallelize([0,1,2],50).map(testfn).collect()
4 print (test_var)
/databricks/spark/python/pyspark/rdd.py in collect(self)
901 # Default path used in OSS Spark / for non-credential passthrough clusters:
902 with SCCallSiteSync(self.context) as css:
--> 903 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
904 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
905
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
125 def deco(*a, **kw):
126 try:
--> 127 return f(*a, **kw)
128 except py4j.protocol.Py4JJavaError as e:
129 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 3845.0 failed 4 times, most recent failure: Lost task 16.3 in stage 3845.0 : org.apache.spark.api.python.PythonException: 'AttributeError: 'DataFrame' object has no attribute '_data'', from <command-2941072546245585>, line 2. Full traceback below:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 654, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 646, in process
serializer.dump_stream(out_iter, outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 279, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/databricks/spark/python/pyspark/util.py", line 109, in wrapper
return f(*args, **kwargs)
File "<command-2941072546245585>", line 2, in testfn
File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 1767, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2137, in _getitem_axis
self._validate_integer(key, axis)
File "/databricks/python/lib/python3.7/site-packages/pandas/core/indexing.py", line 2060, in _validate_integer
len_axis = len(self.obj._get_axis(axis))
File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 424, in _get_axis
return getattr(self, name)
File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
File "/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'
Version details:
spark: '3.0.0' python:3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
I've seen this kind of error when the driver and the executors had different versions of pandas installed. In my case the driver had pandas 1.1.0 (via databricks-connect), while the executors were on Databricks Runtime 7.3 with pandas 1.0.1. Pandas 1.1.0 made a big change to its internals (the block-manager attribute _data was replaced by _mgr), so the pickled objects the driver ships to the executors break there. Check that your executors and driver have the same version of pandas (the pandas version bundled with each Databricks Runtime is listed in its release notes). You can use a script along the following lines to compare the versions of the Python libraries on the executors and the driver.
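A minimal sketch of such a comparison (not the original script from this answer), assuming sc is the active SparkContext: the trick is to import pandas inside the task, so the version lookup runs on the executors rather than on the driver.

import pandas as pd

# pandas version on the driver
print("driver pandas:", pd.__version__)

# pandas version(s) on the executors: the import inside the function
# happens on the workers; collect the distinct answers
def worker_pandas_version(_):
    import pandas
    return pandas.__version__

executor_versions = set(sc.parallelize(range(8), 8).map(worker_pandas_version).collect())
print("executor pandas:", executor_versions)

If the two printed versions differ, that mismatch is the likely cause of the error above.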
I came across the same problem. I think it is due to a pandas version difference; I fixed this bug by updating my pandas from 1.0.1 to 1.0.5.
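For reference, on Databricks Runtime 7.1 and later a notebook-scoped install is one way to get the same pandas onto both the driver and the executors (1.0.5 here is just the version that worked for this answer; pick whatever matches your runtime):

%pip install pandas==1.0.5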