
How to create a sample single-column Spark DataFrame in Python?

I want to create a sample single-column DataFrame, but the following code is not working:

df = spark.createDataFrame(["10","11","13"], ("age"))
## ValueError
## ...
## ValueError: Could not parse datatype: age

The expected result:

age
10
11
13
Ajish Kb asked Dec 06 '17 12:12



2 Answers

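"the following code is not working"

In Python, ("age") is just the string "age", not a tuple, and createDataFrame tries to parse a string schema as a data type, hence the error. When the elements are single values rather than tuples, pass the schema as a type string and rename the column afterwards: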

spark.createDataFrame(["10","11","13"], "string").toDF("age") 

or as a DataType:

from pyspark.sql.types import StringType

spark.createDataFrame(["10","11","13"], StringType()).toDF("age")

To pass the column name directly, each element must be a tuple and the schema a sequence of names:

spark.createDataFrame([("10",), ("11",), ("13",)], ["age"])
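The tuple point above can be checked in plain Python without a Spark session. This is a minimal sketch, not part of the original answer: the variable names and the `to_rows` helper are hypothetical, and the final Spark call is shown only as a comment that assumes a live SparkSession named `spark`.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
schema_from_question = ("age")   # just the string "age"
one_tuple = ("age",)             # a 1-tuple containing "age"

print(type(schema_from_question).__name__)  # str
print(type(one_tuple).__name__)             # tuple


def to_rows(values):
    """Hypothetical helper: wrap each bare value in a 1-tuple so the list
    can be paired with a list of column names such as ["age"]."""
    return [(v,) for v in values]


rows = to_rows(["10", "11", "13"])
print(rows)  # [('10',), ('11',), ('13',)]

# With a live SparkSession named `spark` (assumed, not created here),
# the rows could then be passed as:
#   spark.createDataFrame(rows, ["age"])
```

Wrapping the values this way is useful when the data arrives as a flat list and the column name should be supplied in the same call rather than via toDF.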
Alper t. Turker answered Sep 23 '22 19:09


There is a pretty easy way to create a sample DataFrame in PySpark:

>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

To create it with column names:

>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

This way there is no need to define a schema. I hope this is the simplest way.

Sarath Chandra Vema answered Sep 23 '22 19:09