I have a pandas DataFrame my_df, and my_df.dtypes gives:
ts int64
fieldA object
fieldB object
fieldC object
fieldD object
fieldE object
dtype: object
Then I try to convert the pandas DataFrame my_df to a Spark DataFrame as follows:
spark_my_df = sc.createDataFrame(my_df)
However, I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
2 spark_my_df.take(20)
/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
520 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
521 else:
--> 522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
523 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
524 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
384
385 if schema is None or isinstance(schema, (list, tuple)):
--> 386 struct = self._inferSchemaFromList(data)
387 if isinstance(schema, (list, tuple)):
388 for i, name in enumerate(schema):
/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
318 schema = reduce(_merge_type, map(_infer_schema, data))
319 if _has_nulltype(schema):
--> 320 raise ValueError("Some of types cannot be determined after inferring")
321 return schema
322
ValueError: Some of types cannot be determined after inferring
Does anyone know what the above error means? Thanks!
StructField defines the metadata of a DataFrame column. PySpark provides the pyspark.sql.types.StructField class to define a column: its name (String), its type (DataType), whether it is nullable (Boolean), and its metadata (MetaData).
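For illustration, a single field spelled out with all four arguments might look like the sketch below (the column name and metadata values here are made up):
from pyspark.sql.types import StructField, StringType

# StructField(name, dataType, nullable, metadata)
field = StructField("fieldA", StringType(), True, {"comment": "free-form text"})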
In order to infer a field's type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.
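As a minimal sketch (assuming an existing SparkSession named spark): a single non-None record in a field is enough for inference to succeed:
# "label" has one non-None value, so it is inferred as a string column
spark.createDataFrame([(1, None), (2, "a")], ["id", "label"])    # works

# "label" is None in every row, so its type cannot be determined
spark.createDataFrame([(1, None), (2, None)], ["id", "label"])   # raises ValueError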
Manually defining a schema will resolve the issue:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+
To fix this problem, you can provide your own defined schema.
For example:
To reproduce the error:
>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
To fix the error:
>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
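Applied to the my_df from the question, a hand-written schema might look like the sketch below (assuming the object columns hold plain strings and your session object is called spark; adjust the types if the columns hold something else):
from pyspark.sql.types import StructType, StructField, LongType, StringType

my_schema = StructType([
    StructField("ts", LongType(), True),        # pandas int64 maps to LongType
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
    StructField("fieldC", StringType(), True),
    StructField("fieldD", StringType(), True),
    StructField("fieldE", StringType(), True),
])
spark_my_df = spark.createDataFrame(my_df, schema=my_schema)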
If you are using the monkey-patched RDD[Row].toDF() method, you can increase the sample ratio to check more than 100 records when inferring types:
# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()
Assuming there are non-null rows in all fields in your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.
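As a contrived sketch of why this helps (assuming an existing SparkContext sc): with no sampleRatio, only the first 100 records are checked, so a value that appears later is never seen:
from pyspark.sql import Row

# "label" is None in the first 1000 rows; the only real value comes last
rows = [Row(id=i, label=None) for i in range(1000)] + [Row(id=1000, label="x")]
rdd = sc.parallelize(rows)

# rdd.toDF() would fail here: the first 100 rows give no type for "label"
df = rdd.toDF(sampleRatio=1.0)   # sample every row, so "x" is found
df.printSchema()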