 

PySpark converting a column of type 'map' to multiple columns in a dataframe

Input

I have a column Parameters of type map of the form:

>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
>>> df = sqlContext.createDataFrame(d)
>>> df.collect()
[Row(Parameters={'foo': '1', 'bar': '2', 'baz': 'aaa'})]

Output

I want to reshape it in pyspark so that all the keys (foo, bar, etc.) are columns, namely:

[Row(foo='1', bar='2', baz='aaa')]

Using withColumn works:

(df
 .withColumn('foo', df.Parameters['foo'])
 .withColumn('bar', df.Parameters['bar'])
 .withColumn('baz', df.Parameters['baz'])
 .drop('Parameters')
).collect()

But I need a solution that doesn't explicitly mention the column names, as I have dozens of them.

Schema

>>> df.printSchema()

root
 |-- Parameters: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
asked Apr 26 '16 by Kamil Sindi


People also ask

How do I change the data type of multiple columns in a PySpark DataFrame?

Method 1: Using DataFrame.withColumn(). Use the cast(dataType) method to cast a column to a different data type: select the column by name and call .cast() with the type you want to convert it to.
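
A minimal sketch of this, assuming a hypothetical DataFrame with two string columns (names and data are illustrative, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([('1', '2')], ['foo', 'bar'])  # hypothetical data
# Cast every string column to integer in a single select:
df2 = df2.select(*[col(c).cast('integer') for c in df2.columns])
df2.printSchema()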

How do I map a column in PySpark?

Solution: the PySpark SQL function create_map() is used to convert selected DataFrame columns to MapType. create_map() takes the columns you want to convert, as alternating key and value expressions, and returns a MapType column.
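
As a rough sketch, again with hypothetical column names not tied to the question's data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, create_map, lit

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([('1', '2')], ['foo', 'bar'])  # hypothetical data
# create_map() takes alternating key and value expressions:
df2 = df2.withColumn('Parameters', create_map(lit('foo'), col('foo'), lit('bar'), col('bar')))
df2.show(truncate=False)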

How do I change the data type of a column in PySpark?

withColumn() – Change Column Type: use withColumn() to convert the data type of a DataFrame column. This function takes the column name you want to convert as the first argument; for the second argument, apply the cast() method with the target DataType on the column.

How do you split a column in PySpark DataFrame?

PySpark SQL provides the split() function to convert a delimiter-separated string to an array (StringType to ArrayType) column on a DataFrame. It splits the string column on a delimiter such as a space, comma, or pipe, converting it into an ArrayType column.
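
A minimal sketch, again with hypothetical data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([('a,b,c',)], ['letters'])  # hypothetical data
# Split the comma-separated string into an ArrayType column:
df2 = df2.withColumn('letters', split(df2.letters, ','))
df2.show()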


2 Answers

Since the keys of a MapType column are not part of the schema, you'll have to collect them first, for example like this:

from pyspark.sql.functions import explode

keys = (df
    .select(explode("Parameters"))  # one row per (key, value) pair
    .select("key")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())  # pull the distinct keys back to the driver
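
With the sample data, the collected keys look something like:

keys
# e.g. ['bar', 'foo', 'baz'] (order is not deterministic)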

Once you have this, all that is left is a simple select:

from pyspark.sql.functions import col

exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
df.select(*exprs)
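
On the sample data this produces the reshaped rows the question asks for (column order follows the order in which the keys were collected):

df.select(*exprs).collect()
# e.g. [Row(foo='1', bar='2', baz='aaa')]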
answered by zero323


Performant solution

One of the question's constraints is to determine the column names dynamically, which is fine, but be warned that dynamically fetching the map keys requires an extra Spark job and can be really slow. Here's how you can avoid that and write code that'll execute quickly when you already know the keys.

import pyspark.sql.functions as F

cols = list(map(
    lambda f: F.col("Parameters").getItem(f).alias(str(f)),
    ["foo", "bar", "baz"]))
df.select(cols).show()
+---+---+---+
|foo|bar|baz|
+---+---+---+
|  1|  2|aaa|
+---+---+---+

Notice that this runs a single select operation. Don't call withColumn multiple times, because each call adds another projection to the query plan and slows things down.

The fast solution is only possible if you know all the map keys. You'll need to revert to the slower solution if you don't know all the distinct keys in advance.

Slower solution

The accepted answer is good. My solution is a bit more performant because it doesn't call .rdd or flatMap().

import pyspark.sql.functions as F

d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
df = spark.createDataFrame(d)

keys_df = df.select(F.explode(F.map_keys(F.col("Parameters")))).distinct()
keys = list(map(lambda row: row[0], keys_df.collect()))
key_cols = list(map(lambda f: F.col("Parameters").getItem(f).alias(str(f)), keys))
df.select(key_cols).show()
+---+---+---+
|bar|foo|baz|
+---+---+---+
|  2|  1|aaa|
+---+---+---+

Collecting results to the driver node can be a performance bottleneck. It's good to execute the list(map(lambda row: row[0], keys_df.collect())) step as a separate command, so you can make sure it isn't running too slowly.

answered by Powers