Is there an operation in pandas that does the same as flatMap in pyspark?
flatMap example:
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]
So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
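For reference, a sketch of the two-step approach mentioned in the question (apply to build an iterable per element, then itertools.chain to flatten), reproducing the RDD example above:

import itertools
import pandas as pd

s = pd.Series([2, 3, 4])

# Step 1: apply maps each element to an iterable (here a range).
# Step 2: itertools.chain.from_iterable flattens them into one sequence.
flat = pd.Series(itertools.chain.from_iterable(s.apply(lambda x: range(1, x))))
print(sorted(flat))
# [1, 1, 1, 2, 2, 3]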
flatMap() on a Spark DataFrame operates much like it does on an RDD: when applied, it executes the specified function on every element, splitting or merging elements along the way, so the result count of flatMap() can differ from the input count.
Both map() and flatMap() are transformation and mapping operations. map() produces exactly one output value for each input value, whereas flatMap() produces an arbitrary number of output values (zero or more) for each input value.
In PySpark, flatMap() is defined as the transformation operation that applies a function to every element of a Resilient Distributed Dataset or DataFrame (e.g. array/map DataFrame columns), flattens the results, and returns a new RDD or DataFrame.
The function we pass to flatMap() is called individually for each element of the input RDD; instead of contributing a single element, it returns an iterator whose values are flattened into the result.
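A minimal PySpark illustration of the difference (assuming an active SparkContext named sc, as in the question):

rdd = sc.parallelize([2, 3, 4])

# map: exactly one output per input; the nesting is preserved
print(rdd.map(lambda x: list(range(1, x))).collect())
# [[1], [1, 2], [1, 2, 3]]

# flatMap: zero or more outputs per input; the results are flattened
print(rdd.flatMap(lambda x: range(1, x)).collect())
# [1, 1, 2, 1, 2, 3]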
Since July 2019 (version 0.25), pandas offers pd.Series.explode to unnest frames. Here's a possible implementation of pd.Series.flatmap based on explode and map. Why map? Because flatmap operations should be a subset of map, not apply. Check this thread for the map/applymap/apply details: Difference between map, applymap and apply methods in Pandas.
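To make the map-versus-apply distinction concrete before the implementation, here is a quick sketch (mine, not part of the linked thread):

import pandas as pd

s = pd.Series([1, 2, 3])

# Series.map is strictly element-wise: the function sees one scalar at a time.
print(s.map(lambda x: x * 10).tolist())
# [10, 20, 30]

# DataFrame.apply is axis-wise: the function sees a whole column (or row) Series.
df0 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df0.apply(lambda col: col.sum()).tolist())
# [3, 7]

A flatmap transforms individual elements, so it belongs with map.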
import pandas as pd
from typing import Any, Callable, Iterable

def flatmap(self,
            func: Callable[[Any], Iterable],
            ignore_index: bool = False):
    # map each element to an iterable, then flatten it with explode
    return self.map(func).explode(ignore_index=ignore_index)

pd.Series.flatmap = flatmap
# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
#    A   B
# 0  1   6
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10
print(df.A.flatmap(range, False))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object
print(df.A.flatmap(range, True))
# 0     0
# 1     0
# 2     1
# 3     0
# 4     1
# 5     2
# 6     0
# 7     1
# 8     2
# 9     3
# 10    0
# 11    1
# 12    2
# 13    3
# 14    4
# Name: A, dtype: object
As you can see, the main issue is the indexing. You could ignore it and just reset, but then you're better off using NumPy or plain lists, as indexing is one of pandas' key selling points. If you do not care about indexing at all, you can reuse the idea of the solution above: change pd.Series.map to pd.DataFrame.applymap, change pd.Series.explode to pd.DataFrame.explode, and force ignore_index=True; a sketch of that variant follows.
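A minimal sketch of that DataFrame variant (the name flatmap_frame is mine; note the caveat that DataFrame.explode over several columns requires matching element counts within each row):

import pandas as pd
from typing import Any, Callable, Iterable

def flatmap_frame(self: pd.DataFrame,
                  func: Callable[[Any], Iterable]) -> pd.DataFrame:
    # Apply func element-wise, then explode every column at once.
    # Caveat: exploding several columns together requires that, within
    # each row, all cells hold the same number of elements.
    mapped = self.applymap(func)  # on pandas >= 2.1, DataFrame.map is preferred
    return mapped.explode(list(mapped.columns), ignore_index=True)

pd.DataFrame.flatmap = flatmap_frame

df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df2.flatmap(lambda x: [x, x * 10]))
#     A   B
# 0   1   3
# 1  10  30
# 2   2   4
# 3  20  40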
There's a hack. I often do something like
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0      1
1      3
2      2
3      4
4    NaN
5      5
dtype: float64
The introduction of NaN is because the intermediate object creates a MultiIndex, but for a lot of things you can just drop that:
In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64
This trick stays entirely within pandas, so I would expect it to be reasonably efficient, though it may not cope well with lists of very different sizes.
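A hedged variation on the same hack (my suggestion, not from the original answer): using stack() instead of unstack() keeps the values in row order rather than column order, and classic stack() drops the NaN padding by default:

import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# stack() walks row by row; .dropna() is a no-op on older pandas (where
# stack() already drops NaN) but keeps newer versions clean as well.
flat = df['x'].apply(pd.Series).stack().dropna().reset_index(drop=True)
print(flat)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    5.0
# dtype: float64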