 

Create Spark DataFrame. Can not infer schema for type: <type 'float'>

Could someone help me solve this problem I have with Spark DataFrame?

When I call myFloatRdd.toDF() I get an error:

TypeError: Can not infer schema for type: <type 'float'>

I don't understand why...

Example:

myFloatRdd = sc.parallelize([1.0, 2.0, 3.0])
df = myFloatRdd.toDF()

Thanks

Breach asked Sep 23 '15 14:09



1 Answer

SparkSession.createDataFrame, which toDF uses under the hood, requires an RDD or list of Row/tuple/list/dict* or a pandas.DataFrame, unless a schema with a DataType is provided. Try converting each float to a one-element tuple, like this:

myFloatRdd.map(lambda x: (x, )).toDF() 

or even better:

from pyspark.sql import Row

row = Row("val")  # Or some other column name
myFloatRdd.map(row).toDF()
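To see why a bare float is rejected while a one-element tuple is accepted, the rule above can be mimicked in plain Python. This is only a simplified sketch, not Spark's actual code: the function name infer_field_names is hypothetical, and it returns column names only, where Spark would also infer a type for each field.

```python
from collections import namedtuple


def infer_field_names(record):
    """Simplified stand-in for the check Spark applies to each record."""
    if isinstance(record, dict):
        # dict keys become column names
        return sorted(record)
    if isinstance(record, (tuple, list)):
        if hasattr(record, "_fields"):
            # namedtuple-style records carry their own field names
            return list(record._fields)
        # plain tuples and lists get positional names: _1, _2, ...
        return ["_%d" % i for i in range(1, len(record) + 1)]
    if hasattr(record, "__dict__"):
        # ordinary objects expose their attributes through __dict__
        return sorted(vars(record))
    # a bare scalar such as 1.0 has no fields to name
    raise TypeError("Can not infer schema for type: %s" % type(record))


Val = namedtuple("Val", ["val"])
print(infer_field_names((1.0,)))    # ['_1']
print(infer_field_names(Val(1.0)))  # ['val']
```

Calling infer_field_names(1.0) raises the TypeError, because a float is a value rather than a record with fields.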

To create a DataFrame from a list of scalars you'll have to use SparkSession.createDataFrame directly and provide a schema***:

from pyspark.sql.types import FloatType

df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())

df.show()

## +-----+
## |value|
## +-----+
## |  1.0|
## |  2.0|
## |  3.0|
## +-----+
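If you want to choose the column name yourself, one option is to combine both ideas: wrap the scalars in tuples and pass an explicit StructType. The sketch below assumes an existing SparkSession is passed in as spark; the helper names floats_to_rows and floats_to_df and the default column name "val" are made up for illustration. It uses DoubleType, which is what schema inference picks for Python floats.

```python
def floats_to_rows(values):
    """Wrap each scalar in a one-element tuple so it forms a single-column row."""
    return [(float(v),) for v in values]


def floats_to_df(spark, values, column="val"):
    """Build a one-column DataFrame of doubles with an explicit schema."""
    # Imported here so floats_to_rows can be used without pyspark installed.
    from pyspark.sql.types import DoubleType, StructField, StructType

    schema = StructType([StructField(column, DoubleType(), nullable=False)])
    return spark.createDataFrame(floats_to_rows(values), schema)
```

With a session available, floats_to_df(spark, [1.0, 2.0, 3.0]).show() should print a single column named val.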

but for a simple range it would be better to use SparkSession.range:

from pyspark.sql.functions import col

spark.range(1, 4).select(col("id").cast("double"))

* No longer supported.

** Spark SQL also provides limited support for schema inference on Python objects exposing __dict__.

*** Supported only in Spark 2.0 or later.

zero323 answered Sep 23 '22 04:09