I have a pandas DataFrame my_df, and my_df.dtypes gives:
ts int64
fieldA object
fieldB object
fieldC object
fieldD object
fieldE object
dtype: object
Then I try to convert the pandas DataFrame my_df to a Spark DataFrame as follows:
spark_my_df = sc.createDataFrame(my_df)
However, I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
2 spark_my_df.take(20)
/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
520 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
521 else:
--> 522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
523 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
524 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
384
385 if schema is None or isinstance(schema, (list, tuple)):
--> 386 struct = self._inferSchemaFromList(data)
387 if isinstance(schema, (list, tuple)):
388 for i, name in enumerate(schema):
/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
318 schema = reduce(_merge_type, map(_infer_schema, data))
319 if _has_nulltype(schema):
--> 320 raise ValueError("Some of types cannot be determined after inferring")
321 return schema
322
ValueError: Some of types cannot be determined after inferring
Does anyone know what the above error means? Thanks!
StructField defines the metadata of a DataFrame column. PySpark provides the pyspark.sql.types.StructField class to define a column: its name (String), its type (DataType), whether it is nullable (Boolean), and its metadata (MetaData).
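For illustration, a single field spelled out with all four arguments might look like the sketch below (the column name and metadata values here are made up):
from pyspark.sql.types import StructField, StringType

# StructField(name, dataType, nullable, metadata)
field = StructField("fieldA", StringType(), True, {"comment": "free-form text"})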
In order to infer a field's type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.
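As a minimal sketch (assuming an existing SparkSession named spark): a single non-None record in a field is enough for inference to succeed:
# "label" has one non-None value, so it is inferred as a string column
spark.createDataFrame([(1, None), (2, "a")], ["id", "label"])    # works

# "label" is None in every row, so its type cannot be determined
spark.createDataFrame([(1, None), (2, None)], ["id", "label"])   # raises ValueError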
Manually defining a schema will resolve the issue:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+
To fix this problem, you can provide your own defined schema.
For example:
To reproduce the error:
>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
To fix the error:
>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
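Applied to the my_df from the question, a hand-written schema might look like the sketch below (assuming the object columns hold plain strings and your session object is called spark; adjust the types if the columns hold something else):
from pyspark.sql.types import StructType, StructField, LongType, StringType

my_schema = StructType([
    StructField("ts", LongType(), True),        # pandas int64 maps to LongType
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
    StructField("fieldC", StringType(), True),
    StructField("fieldD", StringType(), True),
    StructField("fieldE", StringType(), True),
])
spark_my_df = spark.createDataFrame(my_df, schema=my_schema)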
If you are using the monkey-patched RDD[Row].toDF() method, you can increase the sample ratio to check more than 100 records when inferring types:
# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()
Assuming there are non-null rows in all fields in your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.
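As a contrived sketch of why this helps (assuming an existing SparkContext sc): with no sampleRatio, only the first 100 records are checked, so a value that appears later is never seen:
from pyspark.sql import Row

# "label" is None in the first 1000 rows; the only real value comes last
rows = [Row(id=i, label=None) for i in range(1000)] + [Row(id=1000, label="x")]
rdd = sc.parallelize(rows)

# rdd.toDF() would fail here: the first 100 rows give no type for "label"
df = rdd.toDF(sampleRatio=1.0)   # sample every row, so "x" is found
df.printSchema()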