 

Add new rows to a PySpark DataFrame

I am very new to PySpark but familiar with pandas. I have a PySpark DataFrame:

from pyspark.sql import SparkSession

# instantiate Spark
spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
     (1, 2, 0),
     (2, 0, 1)
]

# create DataFrame
df = spark.createDataFrame(vals, columns)

I want to add a new row (4, 5, 7) so that it outputs:

df.show()
+---+----+----+
| id|dogs|cats|
+---+----+----+
|  1|   2|   0|
|  2|   0|   1|
|  4|   5|   7|
+---+----+----+
asked Oct 07 '18 by Roushan


People also ask

How do I add rows to a PySpark DataFrame?

To append a row to a DataFrame you can also use the collect() method: collect() converts the DataFrame into a list of rows, you can append your data to that list, and then convert the list back into a DataFrame.
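For illustration, a minimal sketch of that approach, assuming the df and spark from the question above; note that collect() pulls every row to the driver, so this is only sensible for small data:

# collect the existing rows to the driver as plain tuples
rows = [tuple(r) for r in df.collect()]

# append the new row and rebuild the DataFrame with the original schema
rows.append((4, 5, 7))
df2 = spark.createDataFrame(rows, df.schema)
df2.show()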

How do I add a row to a Spark DataFrame?

You can add/insert a new row to a DataFrame from a dict: first create a Python dictionary, then pass it to the append() function with ignore_index=True; omitting ignore_index=True raises an error. (Note that append() belongs to the pandas API, not to Spark DataFrames.)
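A small pandas sketch of that pattern; DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, where pd.concat() is the replacement:

import pandas as pd

pdf = pd.DataFrame({'id': [1, 2], 'dogs': [2, 0], 'cats': [0, 1]})
new_row = {'id': 4, 'dogs': 5, 'cats': 7}

# older pandas: pdf = pdf.append(new_row, ignore_index=True)
pdf = pd.concat([pdf, pd.DataFrame([new_row])], ignore_index=True)
print(pdf)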

How do I create a row in Spark?

To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala. A Row object can be constructed by providing field values.
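In PySpark the equivalent is pyspark.sql.Row; a short sketch:

from pyspark.sql import Row

# a Row can be built positionally or with named fields
r1 = Row(4, 5, 7)
r2 = Row(id=4, dogs=5, cats=7)

print(r2.id, r2['dogs'], r2[2])  # fields are accessible by name or position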


2 Answers

As thebluephantom has already said, union is the way to go. I'm just answering your question to give you a PySpark example:

from pyspark.sql import SparkSession

# if not already created automatically, instantiate a SparkSession
spark = SparkSession.builder.getOrCreate()

columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]

df = spark.createDataFrame(vals, columns)

newRow = spark.createDataFrame([(4, 5, 7)], columns)
appended = df.union(newRow)
appended.show()

Please also have a look at the Databricks FAQ: https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html
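For reference, appended.show() prints the table the question asked for; and if you want the appended row to pick up exactly the same column types as df, a small variation (not from the original answer) is to reuse df.schema instead of the column list:

# variation: reuse the existing schema so the types are guaranteed to match
newRow = spark.createDataFrame([(4, 5, 7)], df.schema)
appended = df.union(newRow)
appended.show()
# +---+----+----+
# | id|dogs|cats|
# +---+----+----+
# |  1|   2|   0|
# |  2|   0|   1|
# |  4|   5|   7|
# +---+----+----+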

answered Oct 01 '22 by cronoik


From something I did, using union, showing a partial block of code; you will of course need to adapt it to your own situation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// start with an empty DataFrame that has the target schema
val dummySchema = StructType(StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)

// union the exploded contents of each n-gram column onto the accumulator
for (i <- i_grams_Cols) {
  val nameCol = col(i)
  dfPostsNGrams2 = dfPostsNGrams2.union(dfPostsNGrams.select(explode(nameCol).as("phrase")))
}

Union of one DataFrame with another is the way to go.
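A rough PySpark sketch of the same pattern, assuming a hypothetical list of array-typed n-gram columns i_grams_cols in a DataFrame df_posts_ngrams (names chosen here to mirror the Scala above):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

# start from an empty DataFrame with the target schema
result = spark.createDataFrame([], StructType([StructField('phrase', StringType(), True)]))

# union the exploded contents of each n-gram column onto the accumulator
for c in i_grams_cols:
    result = result.union(df_posts_ngrams.select(F.explode(F.col(c)).alias('phrase')))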

answered Oct 01 '22 by thebluephantom