This question is similar to "Split (explode) pandas dataframe string entry to separate rows", but adds a question about expanding ranges.
I have a DataFrame:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1,2,4-6 | [email protected] |
+------+---------+----------------+
| John | NaN | [email protected] |
+------+---------+----------------+
| Mary | 1,2 | [email protected] |
+------+---------+----------------+
| Jane | 1,3-5 | [email protected] |
+------+---------+----------------+
And I'd like the Options column to be split on the comma, with a row added for each value in a range:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1 | [email protected] |
+------+---------+----------------+
| Bob | 2 | [email protected] |
+------+---------+----------------+
| Bob | 4 | [email protected] |
+------+---------+----------------+
| Bob | 5 | [email protected] |
+------+---------+----------------+
| Bob | 6 | [email protected] |
+------+---------+----------------+
| John | NaN | [email protected] |
+------+---------+----------------+
| Mary | 1 | [email protected] |
+------+---------+----------------+
| Mary | 2 | [email protected] |
+------+---------+----------------+
| Jane | 1 | [email protected] |
+------+---------+----------------+
| Jane | 3 | [email protected] |
+------+---------+----------------+
| Jane | 4 | [email protected] |
+------+---------+----------------+
| Jane | 5 | [email protected] |
+------+---------+----------------+
How can I go beyond using concat and split as the referenced SO answer suggests to accomplish this? I need a way to expand a range.
That answer uses the following code to split comma-delimited values (1,2,3):
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Thanks in advance for your suggestions!
Update 2/14 Sample data was updated to match my current case.
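For reference, here is a minimal way to build the sample DataFrame above (the constructor call is my own sketch, not part of the original question; emails are shown redacted, as in the tables):

import numpy as np
import pandas as pd

# Sample data matching the table above; John's Options is genuinely missing (NaN).
df = pd.DataFrame({
    'Name': ['Bob', 'John', 'Mary', 'Jane'],
    'Options': ['1,2,4-6', np.nan, '1,2', '1,3-5'],
    'Email': ['[email protected]'] * 4,
})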
For most cases in current pandas, the simplest route is pandas.DataFrame.explode() (or pandas.Series.explode()): split the string with str.split so each cell holds a list, then call explode so each list element becomes its own row.
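A rough sketch of that idea applied to this question (the expand_range helper and its name are my own; it assumes Options only ever holds comma-separated integers and hyphen ranges, and that df is the frame built above):

def expand_range(s):
    # '1,2,4-6' -> [1, 2, 4, 5, 6]; NaN is passed through untouched.
    if pd.isna(s):
        return s
    out = []
    for part in s.split(','):
        lo, _, hi = part.partition('-')
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

df.assign(Options=df['Options'].apply(expand_range)).explode('Options')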
If I understand what you need:

def yourfunc(s):
    # Split on commas, then expand each piece: '4-6' splits into ['4', '6'],
    # while a single value like '1' splits into ['1'], so r[0] and r[-1]
    # cover both cases.
    # Note: assumes Options is a string; NaN rows would need handling first.
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

df.Options = df.Options.apply(yourfunc)
df
Out[114]:
Name Options Email
0 Bob [1, 2, 4, 5, 6] [email protected]
1 Jane [1, 3, 4, 5] [email protected]
df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
Out[116]:
Name Email 0
0 Bob [email protected] 1.0
1 Bob [email protected] 2.0
2 Bob [email protected] 4.0
3 Bob [email protected] 5.0
4 Bob [email protected] 6.0
5 Jane [email protected] 1.0
6 Jane [email protected] 3.0
7 Jane [email protected] 4.0
8 Jane [email protected] 5.0
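A small follow-up (my addition, not part of the answer): the stacked column comes back named 0 and as floats, so you may want to rename and cast it:

out = (df.set_index(['Name', 'Email']).Options
         .apply(pd.Series)
         .stack()
         .reset_index()
         .drop('level_2', axis=1)
         .rename(columns={0: 'Options'}))
out['Options'] = out['Options'].astype(int)  # values come out as floats after stack()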
I like using np.r_ and slice.
I know it looks like a mess, but beauty is in the eye of the beholder.
import numpy as np

def parse(o):
    # Each comma-separated piece becomes a slice: '4-6' -> slice(4, 7),
    # a lone '1' -> slice(1, 2); np.r_ then concatenates them into one array.
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))
# note: assumes rows with NaN Options have been dropped (see the NOTE below)
r = df.Options.apply(parse)        # one integer array per row
new = np.concatenate(r.values)     # all option values, flattened into one vector
lens = r.str.len()                 # how many rows each original row expands into
df.loc[df.index.repeat(lens)].assign(Options=new)
Name Options Email
0 Bob 1 [email protected]
0 Bob 2 [email protected]
0 Bob 4 [email protected]
0 Bob 5 [email protected]
0 Bob 6 [email protected]
2 Mary 1 [email protected]
2 Mary 2 [email protected]
3 Jane 1 [email protected]
3 Jane 3 [email protected]
3 Jane 4 [email protected]
3 Jane 5 [email protected]
Explanation
np.r_
takes different slicers and indexers and returns an array of the combination.
np.r_[1, 4:7]
array([1, 4, 5, 6])
or
np.r_[slice(1, 2), slice(4, 7)]
array([1, 4, 5, 6])
But if I need to pass an arbitrary bunch of them, I need to pass a tuple to np.r_'s __getitem__ method.
np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
array([ 1, 4, 5, 6, 10, 11, 12, 13])
So I iterate, parse, make slices, and pass the resulting tuple to np.r_.__getitem__.
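To make that concrete, here is what the parse function above returns for a single Options string (my own check, assuming the function exactly as written):

parse('1,2,4-6')
array([1, 2, 4, 5, 6])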
Use a combo of loc, pd.Index.repeat, and pd.Series.str.len after applying my cool parser, then pd.DataFrame.assign to overwrite the existing column.
NOTE
If you have bad characters in your Options column, I'd try to filter like this:
df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')
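As a quick illustration of what that filter does (my own example, not from the answer), the regex strips anything that isn't a digit, comma, or hyphen:

pd.Series([' 1, 2x, 4-6 ']).replace({r'[^0-9,\-]': ''}, regex=True)
0    1,2,4-6
dtype: object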