Could someone help me solve this problem I have with Spark DataFrame?
When I do myFloatRdd.toDF()
I get an error:
TypeError: Can not infer schema for type: <type 'float'>
I don't understand why...
Example:
    myFloatRdd = sc.parallelize([1.0, 2.0, 3.0])
    df = myFloatRdd.toDF()
Thanks
inferSchema -> schema inference automatically guesses the data type of each field. If this option is set to true, the API reads a sample of records from the file to infer the schema; if it is set to false, you must specify a schema explicitly.
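For example, with the DataFrameReader (a minimal sketch; the file name data.csv, its header, and the column name "val" are assumptions, not from the question):

    # Let Spark sample the file and guess each column's type
    df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
    df.printSchema()

    # Or turn inference off and supply the schema explicitly
    from pyspark.sql.types import StructType, StructField, FloatType
    schema = StructType([StructField("val", FloatType(), True)])
    df = spark.read.option("header", "true").schema(schema).csv("data.csv")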
SparkSession.createDataFrame, which is used under the hood, requires an RDD / list of Row / tuple / list / dict* or pandas.DataFrame, unless a schema with DataType is provided. Try to convert the float to a tuple like this:
    myFloatRdd.map(lambda x: (x, )).toDF()
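If you also want a column name at this point, toDF accepts a list of names (a small sketch; "val" is an arbitrary name, and the printed schema is what I would expect, since Python floats are inferred as doubles):

    # Wrap each float in a 1-tuple so the schema can be inferred,
    # then name the single column
    df = myFloatRdd.map(lambda x: (x, )).toDF(["val"])
    df.printSchema()
    ## root
    ##  |-- val: double (nullable = true)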
or even better:
    from pyspark.sql import Row

    row = Row("val")  # Or some other column name
    myFloatRdd.map(row).toDF()
To create a DataFrame from a list of scalars you'll have to use SparkSession.createDataFrame directly and provide a schema***:
    from pyspark.sql.types import FloatType

    df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())
    df.show()
    ## +-----+
    ## |value|
    ## +-----+
    ## |  1.0|
    ## |  2.0|
    ## |  3.0|
    ## +-----+
but for a simple range it would be better to use SparkSession.range:
    from pyspark.sql.functions import col

    spark.range(1, 4).select(col("id").cast("double"))
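If you want the column named like the examples above, you can alias the cast (again, "val" is an arbitrary name used for illustration):

    # range() yields a long column named "id"; cast it and rename
    spark.range(1, 4).select(col("id").cast("double").alias("val")).show()
    ## +---+
    ## |val|
    ## +---+
    ## |1.0|
    ## |2.0|
    ## |3.0|
    ## +---+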
* No longer supported.
** Spark SQL also provides limited support for schema inference on Python objects exposing __dict__.
*** Supported only in Spark 2.0 or later.
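To illustrate the second footnote, a sketch of inference from __dict__ (the Point class is made up, and this behavior may vary between Spark versions):

    class Point:
        """Plain Python object; Spark samples its __dict__ to infer columns."""
        def __init__(self, x, y):
            self.x = x
            self.y = y

    spark.createDataFrame([Point(1.0, 2.0), Point(3.0, 4.0)]).show()
    ## +---+---+
    ## |  x|  y|
    ## +---+---+
    ## |1.0|2.0|
    ## |3.0|4.0|
    ## +---+---+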