I have a dataframe like below,
>>> df.show(10,False)
+-----+----+---+------+
|id |name|age|salary|
+-----+----+---+------+
|10001|alex|30 |75000 |
|10002|bob |31 |80000 |
|10003|deb |31 |80000 |
|10004|john|33 |85000 |
|10005|sam |30 |75000 |
+-----+----+---+------+
Converting the entire row of df into one new column "jsonCol",
>>> newDf1 = df.withColumn("jsonCol", to_json(struct([df[x] for x in df.columns])))
>>> newDf1.show(10,False)
+-----+----+---+------+--------------------------------------------------------+
|id |name|age|salary|jsonCol |
+-----+----+---+------+--------------------------------------------------------+
|10001|alex|30 |75000 |{"id":"10001","name":"alex","age":"30","salary":"75000"}|
|10002|bob |31 |80000 |{"id":"10002","name":"bob","age":"31","salary":"80000"} |
|10003|deb |31 |80000 |{"id":"10003","name":"deb","age":"31","salary":"80000"} |
|10004|john|33 |85000 |{"id":"10004","name":"john","age":"33","salary":"85000"}|
|10005|sam |30 |75000 |{"id":"10005","name":"sam","age":"30","salary":"75000"} |
+-----+----+---+------+--------------------------------------------------------+
Instead of converting the entire row into a JSON string as in the step above, I needed a way to select only a few columns based on the value of each field. I have provided a sample condition in the command below.
But when I started using the when function, the resulting JSON string's column names (keys) were gone; the keys were generated from the columns' positions (col1, col2, ...) instead of the actual column names.
>>> newDf2 = df.withColumn("jsonCol", to_json(struct([ when(col(x)!=" ",df[x]).otherwise(None) for x in df.columns])))
>>> newDf2.show(10,False)
+-----+----+---+------+---------------------------------------------------------+
|id |name|age|salary|jsonCol |
+-----+----+---+------+---------------------------------------------------------+
|10001|alex|30 |75000 |{"col1":"10001","col2":"alex","col3":"30","col4":"75000"}|
|10002|bob |31 |80000 |{"col1":"10002","col2":"bob","col3":"31","col4":"80000"} |
|10003|deb |31 |80000 |{"col1":"10003","col2":"deb","col3":"31","col4":"80000"} |
|10004|john|33 |85000 |{"col1":"10004","col2":"john","col3":"33","col4":"85000"}|
|10005|sam |30 |75000 |{"col1":"10005","col2":"sam","col3":"30","col4":"75000"} |
+-----+----+---+------+---------------------------------------------------------+
I need to use the when function but still get the results as in newDf1, with the actual column names as keys. Can someone help me out?
You have used conditions inside the struct function as columns, and such condition columns are auto-named col1, col2, and so on. That is why you need alias to restore the original column names:
from pyspark.sql import functions as F
newDf2 = df.withColumn("jsonCol", F.to_json(F.struct([F.when(F.col(x)!=" ",df[x]).otherwise(None).alias(x) for x in df.columns])))
newDf2.show(truncate=False)