I have JSON data in various JSON files, and the keys can differ between lines, for example:
{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}
{"a":1 , "b":"abc2", "d":"abc"}
{"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"}
I want to aggregate the data on columns 'b', 'c', 'd', and 'f'. Column 'f' is not present in the given JSON file but could be present in other files, so where it is missing we can use an empty string for that column.
I am reading the input file and aggregating the data like this:
import pyspark.sql.functions as f
df = spark.read.json(inputfile)
df2 = df.groupby("b", "c", "d", "f").agg(f.sum(df["a"]))
This is the final output I want:
{"a":2 , "b":"abc", "c":"abc2", "d":"abc3","f":"" }
{"a":1 , "b":"abc2", "c":"" ,"d":"abc","f":""}
Can anyone please help? Thanks in advance!
In PySpark, to add a new column to a DataFrame, use the lit() function (from pyspark.sql.functions import lit). lit() takes the constant value you want to add and returns a Column; to add a NULL/None value, use lit(None).
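For example, a minimal sketch (assuming the DataFrame df and column name 'f' from the question; 'g' is a hypothetical name used only for illustration):

from pyspark.sql.functions import lit

# lit('') wraps the constant in a Column; attach it as a new column 'f'
df = df.withColumn('f', lit(''))
# to add a NULL column instead, optionally with an explicit type:
df = df.withColumn('g', lit(None).cast('string'))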
To check whether a field exists in a DataFrame (including its data type), use the PySpark schema functions df.schema.fieldNames() or df.schema; for a simple name check, df.columns is enough.
You can check if a column is available in the DataFrame and modify df only if necessary:
if 'f' not in df.columns:
    df = df.withColumn('f', f.lit(''))
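Putting it together for the aggregation in the question, a minimal sketch (assuming spark and inputfile as defined there; missing grouping columns get an empty string before the groupby):

import pyspark.sql.functions as f

df = spark.read.json(inputfile)

# add any required grouping column that this file is missing
for col_name in ['b', 'c', 'd', 'f']:
    if col_name not in df.columns:
        df = df.withColumn(col_name, f.lit(''))

# nulls inside existing columns (e.g. 'c' in the second record) also become ''
df = df.fillna('', subset=['b', 'c', 'd', 'f'])

df2 = df.groupby('b', 'c', 'd', 'f').agg(f.sum('a').alias('a'))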
For nested schemas you may need to use df.schema, like below:
>>> df.printSchema()
root
|-- a: struct (nullable = true)
| |-- b: long (nullable = true)
>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False
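If the missing field is nested inside a struct, you have to rebuild the struct; a sketch assuming the schema above (struct column 'a', hypothetical new field 'x') that relies on star expansion inside struct():

import pyspark.sql.functions as f

if 'x' not in df.schema['a'].dataType.names:
    # recreate struct 'a' from its existing fields plus an empty string field 'x'
    df = df.withColumn('a', f.struct(f.col('a.*'), f.lit('').alias('x')))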
In case someone needs this in Scala:
import org.apache.spark.sql.functions.lit

// note: binding the result in an expression keeps newDf in scope
// (a val declared inside the if block would not be visible outside it)
val newDf = if (!df.columns.contains("f")) df.withColumn("f", lit("")) else df