Is there an operation in pandas that does the same as flatMap in pyspark?
flatMap example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
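For reference, a sketch of the two-step approach mentioned in the question (apply to build an iterable per element, then itertools.chain to flatten), reproducing the RDD example above:

import itertools
import pandas as pd

s = pd.Series([2, 3, 4])

# Step 1: apply maps each element to an iterable (here a range).
# Step 2: itertools.chain.from_iterable flattens them into one sequence.
flat = pd.Series(itertools.chain.from_iterable(s.apply(lambda x: range(1, x))))
print(sorted(flat))
# [1, 1, 1, 2, 2, 3]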
flatMap() on a Spark DataFrame operates much like it does on an RDD: when applied, it executes the specified function on every element, splitting or merging elements along the way, so the result count of flatMap() can differ from the input count.
Both map() and flatMap() are transformation and mapping operations. map() produces exactly one output value for each input value, whereas flatMap() produces an arbitrary number of output values (zero or more) for each input value.
In PySpark, flatMap() is defined as the transformation operation that applies a function to every element of a Resilient Distributed Dataset or DataFrame (e.g. array/map DataFrame columns), flattens the results, and returns a new RDD or DataFrame.
The function we pass to flatMap() is called individually for each element of the input RDD; instead of contributing a single element, it returns an iterator whose values are flattened into the result.
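A minimal PySpark illustration of the difference (assuming an active SparkContext named sc, as in the question):

rdd = sc.parallelize([2, 3, 4])

# map: exactly one output per input; the nesting is preserved
print(rdd.map(lambda x: list(range(1, x))).collect())
# [[1], [1, 2], [1, 2, 3]]

# flatMap: zero or more outputs per input; the results are flattened
print(rdd.flatMap(lambda x: range(1, x)).collect())
# [1, 1, 2, 1, 2, 3]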
Since July 2019 (version 0.25), pandas offers pd.Series.explode to unnest frames. Here's a possible implementation of pd.Series.flatmap based on explode and map. Why map? Because flatmap operations should be a subset of map, not apply. Check this thread for the map/applymap/apply details: Difference between map, applymap and apply methods in Pandas.
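To make the map-versus-apply distinction concrete before the implementation, here is a quick sketch (mine, not part of the linked thread):

import pandas as pd

s = pd.Series([1, 2, 3])

# Series.map is strictly element-wise: the function sees one scalar at a time.
print(s.map(lambda x: x * 10).tolist())
# [10, 20, 30]

# DataFrame.apply is axis-wise: the function sees a whole column (or row) Series.
df0 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df0.apply(lambda col: col.sum()).tolist())
# [3, 7]

A flatmap transforms individual elements, so it belongs with map.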
import pandas as pd
from typing import Any, Callable, Iterable

def flatmap(self,
            func: Callable[[Any], Iterable],
            ignore_index: bool = False):
    # map each element to an iterable, then flatten it with explode
    return self.map(func).explode(ignore_index=ignore_index)

pd.Series.flatmap = flatmap
# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
#    A   B
# 0  1   6
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10
print(df.A.flatmap(range, False))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object
print(df.A.flatmap(range, True))
# 0     0
# 1     0
# 2     1
# 3     0
# 4     1
# 5     2
# 6     0
# 7     1
# 8     2
# 9     3
# 10    0
# 11    1
# 12    2
# 13    3
# 14    4
# Name: A, dtype: object
As you can see, the main issue is the indexing. You could ignore it and just reset, but then you're better off using NumPy or plain lists, as indexing is one of pandas' key selling points. If you do not care about indexing at all, you can reuse the idea of the solution above: change pd.Series.map to pd.DataFrame.applymap, change pd.Series.explode to pd.DataFrame.explode, and force ignore_index=True; a sketch of that variant follows.
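A minimal sketch of that DataFrame variant (the name flatmap_frame is mine; note the caveat that DataFrame.explode over several columns requires matching element counts within each row):

import pandas as pd
from typing import Any, Callable, Iterable

def flatmap_frame(self: pd.DataFrame,
                  func: Callable[[Any], Iterable]) -> pd.DataFrame:
    # Apply func element-wise, then explode every column at once.
    # Caveat: exploding several columns together requires that, within
    # each row, all cells hold the same number of elements.
    mapped = self.applymap(func)  # on pandas >= 2.1, DataFrame.map is preferred
    return mapped.explode(list(mapped.columns), ignore_index=True)

pd.DataFrame.flatmap = flatmap_frame

df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df2.flatmap(lambda x: [x, x * 10]))
#     A   B
# 0   1   3
# 1  10  30
# 2   2   4
# 3  20  40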
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0      1
1      3
2      2
3      4
4    NaN
5      5
dtype: float64
The introduction of NaN is because the intermediate object creates a MultiIndex, but for a lot of things you can just drop that:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64
This trick stays entirely within pandas, so I would expect it to be reasonably efficient, though it may not cope well with lists of very different sizes.
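A hedged variation on the same hack (my suggestion, not from the original answer): using stack() instead of unstack() keeps the values in row order rather than column order, and classic stack() drops the NaN padding by default:

import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# stack() walks row by row; .dropna() is a no-op on older pandas (where
# stack() already drops NaN) but keeps newer versions clean as well.
flat = df['x'].apply(pd.Series).stack().dropna().reset_index(drop=True)
print(flat)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    5.0
# dtype: float64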