
pyspark's flatMap in pandas




Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
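For reference, the two-step approach mentioned above might look like the following. This is only a sketch: the Series `s` and the names `nested` and `flat` are made up for illustration, and the lambda returns a list rather than a bare range to keep the intermediate values simple.

```python
import itertools
import pandas as pd

# Hypothetical input mirroring the Spark example.
s = pd.Series([2, 3, 4])

# Step 1: map each element to a list, like the function passed to flatMap.
nested = s.apply(lambda x: list(range(1, x)))

# Step 2: chain the lists together into one flat Series with a fresh index.
flat = pd.Series(list(itertools.chain.from_iterable(nested)))

print(sorted(flat.tolist()))  # [1, 1, 1, 2, 2, 3]
```

Note that the original index of `s` is lost in step 2, because the flattened values are wrapped in a brand-new Series.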

GeauxEric asked Jun 26 '15 18:06



2 Answers

Since July 2019 (pandas 0.25), pandas offers pd.Series.explode to unnest list-like values. Here's a possible implementation of pd.Series.flatmap based on explode and map. Why map?

  • flatmap operations should be a subset of map, not apply. For details on map/applymap/apply, see the thread "Difference between map, applymap and apply methods in Pandas".

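import pandas as pd
from typing import Callable

# Signature reconstructed from the names used in the body; the default
# ignore_index=False is an assumption that mirrors pd.Series.explode.
def flatmap(self, func: Callable, ignore_index: bool = False):
    # Map each element to a list-like, then unnest it into one row per item.
    # The ignore_index argument of explode needs pandas >= 1.1.
    return self.map(func).explode(ignore_index=ignore_index)

# Attach the function to Series so it can be called as a method.
pd.Series.flatmap = flatmap

# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df)
#    A   B
# 0  1   6
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10

# Original index kept: each label repeats once per generated item.
print(df['A'].flatmap(range))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object

# Index discarded and renumbered from 0.
print(df['A'].flatmap(range, ignore_index=True))
# 0     0
# 1     0
# 2     1
# 3     0
# 4     1
# 5     2
# 6     0
# 7     1
# 8     2
# 9     3
# 10    0
# 11    1
# 12    2
# 13    3
# 14    4
# Name: A, dtype: object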

As you can see, the main issue is the indexing. You could ignore it and just reset the index, but then you're better off using NumPy or standard lists, as indexing is one of pandas' key features. If you do not care about indexing at all, you could reuse the idea of the solution above: change pd.Series.map to pd.DataFrame.applymap and pd.Series.explode to pd.DataFrame.explode, and force ignore_index=True.
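As a rough illustration of the "ignore the index" route, here is a sketch that applies map plus explode to the question's example, followed by pd.DataFrame.explode on a single list-valued column. The frame `df`, its column names `key` and `vals`, and the variable names are invented for this example; ignore_index assumes pandas 1.1 or later.

```python
import pandas as pd

# The question's example: every x becomes the values of range(1, x).
s = pd.Series([2, 3, 4])
flat = s.map(lambda x: list(range(1, x))).explode(ignore_index=True)
print(flat.tolist())  # [1, 1, 2, 1, 2, 3]

# DataFrame variant (hypothetical data): explode one list-valued column,
# repeating the other columns and renumbering the rows from 0.
df = pd.DataFrame({"key": ["a", "b"], "vals": [[1, 2], [3, 4, 5]]})
long_df = df.explode("vals", ignore_index=True)
print(long_df)
```

Unlike pd.Series.explode, pd.DataFrame.explode needs to be told which column to unnest, so it is not a drop-in replacement for a whole-frame flatMap.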

Théophile Pace answered Nov 10 '22 03:11

There's a hack. I often do something like

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
0     1
1     3
2     2
3     4
4   NaN
5     5
dtype: float64

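The NaN appears because apply(pd.Series) builds a rectangular intermediate DataFrame, padding the shorter list with NaN, and unstack() then turns that frame into a Series with a MultiIndex. For a lot of things you can just drop the NaN: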

In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
0    1
1    3
2    2
3    4
5    5
dtype: float64

This trick uses only pandas operations, so I would expect it to be reasonably efficient, though it may cope poorly with lists of very different lengths, since every row is padded to the length of the longest list.
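One thing worth noticing in the output above is the order 1, 3, 2, 4, 5: unstack() walks the padded frame column by column, whereas flatMap would emit 1, 2, 3, 4, 5. The sketch below makes that visible. It is an illustration only: pd.DataFrame(lists) stands in for apply(pd.Series) as a way to build the same padded frame for this input, and the names `lists`, `wide`, `col_major` and `row_major` are made up.

```python
import pandas as pd

lists = [[1, 2], [3, 4, 5]]

# Rectangular intermediate: the shorter list is padded with NaN.
wide = pd.DataFrame(lists)

# unstack() reads the frame column by column (the order seen above).
col_major = wide.unstack().dropna().reset_index(drop=True)

# Transposing first reads it row by row, which matches flatMap's order.
row_major = wide.T.unstack().dropna().reset_index(drop=True)

print(col_major.tolist())  # [1.0, 3.0, 2.0, 4.0, 5.0]
print(row_major.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Either way the values come back as floats, because the NaN padding forces a float dtype before it is dropped.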

santon answered Nov 10 '22 03:11
