I have a simple dataframe like this:
rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK"),
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])
df_data.show()

+---+----+----+------+----+
| id|type|cost|  date|ship|
+---+----+----+------+----+
|  0|   A| 223|201603|PORT|
|  0|   A|  22|201602|PORT|
|  0|   A| 422|201601|DOCK|
|  1|   B|3213|201602|DOCK|
|  1|   B|3213|201601|PORT|
|  2|   C|2321|201601|DOCK|
+---+----+----+------+----+
and I need to pivot it by date:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("cost").show()

+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|2321.0|  null|  null|
|  0|   A| 422.0|  22.0| 223.0|
|  1|   B|3213.0|3213.0|  null|
+---+----+------+------+------+
Everything works as expected. But now I need to pivot it and get a non-numeric column:
df_data.groupby(df_data.id, df_data.type).pivot("date").avg("ship").show()
and of course I would get an exception:
AnalysisException: u'"ship" is not a numeric column. Aggregation function can only be applied on a numeric column.;'
I would like to generate something along the lines of:
+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|  DOCK|  null|  null|
|  0|   A|  DOCK|  PORT|  DOCK|
|  1|   B|  DOCK|  PORT|  null|
+---+----+------+------+------+
Is that possible with pivot?
Assuming that (id, type, date) combinations are unique and your only goal is pivoting and not aggregation, you can use first (or any other function not restricted to numeric values):
from pyspark.sql.functions import first

(df_data
    .groupby(df_data.id, df_data.type)
    .pivot("date")
    .agg(first("ship"))
    .show())

## +---+----+------+------+------+
## | id|type|201601|201602|201603|
## +---+----+------+------+------+
## |  2|   C|  DOCK|  null|  null|
## |  0|   A|  DOCK|  PORT|  PORT|
## |  1|   B|  PORT|  DOCK|  null|
## +---+----+------+------+------+
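For what it's worth, first is just one choice; any aggregate that accepts non-numeric columns works here. A minimal sketch using last instead (with unique (id, type, date) combinations it yields the same result as the snippet above):

from pyspark.sql.functions import last

# last() also accepts string columns; with a single ship value per
# (id, type, date) the result matches the first() version above.
(df_data
    .groupby("id", "type")
    .pivot("date")
    .agg(last("ship"))
    .show())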
If these assumptions are not correct, you'll have to pre-aggregate your data. For example, to take the most common ship value:
from pyspark.sql.functions import max, struct

(df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship")))
    .show())

## +---+----+--------+--------+--------+
## | id|type|  201601|  201602|  201603|
## +---+----+--------+--------+--------+
## |  2|   C|[1,DOCK]|    null|    null|
## |  0|   A|[1,DOCK]|[1,PORT]|[1,PORT]|
## |  1|   B|[1,PORT]|[1,DOCK]|    null|
## +---+----+--------+--------+--------+
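If you only need the ship value in the final output (without the count), one follow-up option is to pull the ship field back out of each struct column. A minimal sketch, assuming the pivoted columns are exactly the date values from the example (accessing a field of a null struct simply returns null):

from pyspark.sql.functions import col, max, struct

pivoted = (df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship"))))

# Each pivoted column is a struct<count, ship>; keep only the ship field.
date_cols = [c for c in pivoted.columns if c not in ("id", "type")]
pivoted.select("id", "type", *[col(c)["ship"].alias(c) for c in date_cols]).show()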
In case someone is looking for a SQL-style approach:
rdd = spark.sparkContext.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK"),
    ]
)
df_data = spark.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])
df_data.createOrReplaceTempView("df")
df_data.show()

dt_vals = spark.sql("select collect_set(date) from df").collect()[0][0]
# ['201601', '201602', '201603']

dt_vals_colstr = ",".join(["'" + c + "'" for c in sorted(dt_vals)])
# "'201601','201602','201603'"
Part-1 (note the f string prefix, which interpolates dt_vals_colstr into the query)
spark.sql(f""" select * from (select id , type, date, ship from df) pivot ( first(ship) for date in ({dt_vals_colstr}) ) """).show(100,truncate=False) +---+----+------+------+------+ |id |type|201601|201602|201603| +---+----+------+------+------+ |1 |B |PORT |DOCK |null | |2 |C |DOCK |null |null | |0 |A |DOCK |PORT |PORT | +---+----+------+------+------+
Part-2
spark.sql(f""" select * from (select id , type, date, ship from df) pivot ( case when count(*)=0 then null else struct(count(*),first(ship)) end for date in ({dt_vals_colstr}) ) """).show(100,truncate=False) +---+----+---------+---------+---------+ |id |type|201601 |201602 |201603 | +---+----+---------+---------+---------+ |1 |B |[1, PORT]|[1, DOCK]|null | |2 |C |[1, DOCK]|null |null | |0 |A |[1, DOCK]|[1, PORT]|[1, PORT]| +---+----+---------+---------+---------+