Could someone help me solve this problem I have with Spark DataFrame?
When I do myFloatRdd.toDF()
I get an error:
TypeError: Can not infer schema for type: <type 'float'>
I don't understand why...
Example:
    myFloatRdd = sc.parallelize([1.0, 2.0, 3.0])
    df = myFloatRdd.toDF()
Thanks
inferSchema -> schema inference automatically guesses the data type of each field. If this option is set to true, the API reads a sample of records from the file to infer the schema; if it is set to false, you must specify a schema explicitly.
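For example, with the DataFrameReader (a minimal sketch; the file name data.csv, its header, and the column name "val" are assumptions, not from the question):

    # Let Spark sample the file and guess each column's type
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
    df.printSchema()

    # Or turn inference off and supply the schema explicitly
    from pyspark.sql.types import StructType, StructField, FloatType
    schema = StructType([StructField("val", FloatType(), True)])
    df = spark.read.option("header", "true").schema(schema).csv("data.csv")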
SparkSession.createDataFrame, which is used under the hood, requires an RDD / list of Row / tuple / list / dict* or pandas.DataFrame, unless a schema with DataType is provided. Try to convert the float to a tuple like this:
    myFloatRdd.map(lambda x: (x, )).toDF()
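If you also want a column name at this point, toDF accepts a list of names (a small sketch; "val" is an arbitrary name, and the printed schema is what I would expect, since Python floats are inferred as doubles):

    # Wrap each float in a 1-tuple so the schema can be inferred,
    # then name the single column
    df = myFloatRdd.map(lambda x: (x, )).toDF(["val"])
    df.printSchema()
    ## root
    ##  |-- val: double (nullable = true)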
or even better:
    from pyspark.sql import Row

    row = Row("val")  # Or some other column name
    myFloatRdd.map(row).toDF()
To create a DataFrame from a list of scalars you'll have to use SparkSession.createDataFrame directly and provide a schema***:
    from pyspark.sql.types import FloatType

    df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())
    df.show()
    ## +-----+
    ## |value|
    ## +-----+
    ## |  1.0|
    ## |  2.0|
    ## |  3.0|
    ## +-----+
but for a simple range it would be better to use SparkSession.range:
    from pyspark.sql.functions import col

    spark.range(1, 4).select(col("id").cast("double"))
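If you want the column named like the examples above, you can alias the cast (again, "val" is an arbitrary name used for illustration):

    # range() yields a long column named "id"; cast it and rename
    spark.range(1, 4).select(col("id").cast("double").alias("val")).show()
    ## +---+
    ## |val|
    ## +---+
    ## |1.0|
    ## |2.0|
    ## |3.0|
    ## +---+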
* No longer supported.
** Spark SQL also provides limited support for schema inference on Python objects exposing __dict__.
*** Supported only in Spark 2.0 or later.
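To illustrate the second footnote, a sketch of inference from __dict__ (the Point class is made up, and this behavior may vary between Spark versions):

    class Point:
        """Plain Python object; Spark samples its __dict__ to infer columns."""
        def __init__(self, x, y):
            self.x = x
            self.y = y

    spark.createDataFrame([Point(1.0, 2.0), Point(3.0, 4.0)]).show()
    ## +---+---+
    ## |  x|  y|
    ## +---+---+
    ## |1.0|2.0|
    ## |3.0|4.0|
    ## +---+---+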