This question is similar to Split (explode) pandas dataframe string entry to separate rows, but it also asks how to expand ranges.
I have a DataFrame:
+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1,2,4-6 | [email protected]  |
+------+---------+----------------+
| John |   NaN   | [email protected] |
+------+---------+----------------+
| Mary |   1,2   | [email protected] |
+------+---------+----------------+
| Jane | 1,3-5   | [email protected] |
+------+---------+----------------+
And I'd like the Options column to be split on the commas, with a separate row added for each value in a range.
+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1       | [email protected]  |
+------+---------+----------------+
| Bob  | 2       | [email protected]  |
+------+---------+----------------+
| Bob  | 4       | [email protected]  |
+------+---------+----------------+
| Bob  | 5       | [email protected]  |
+------+---------+----------------+
| Bob  | 6       | [email protected]  |
+------+---------+----------------+
| John | NaN     | [email protected] |
+------+---------+----------------+
| Mary | 1       | [email protected] |
+------+---------+----------------+
| Mary | 2       | [email protected] |
+------+---------+----------------+
| Jane | 1       | [email protected] |
+------+---------+----------------+
| Jane | 3       | [email protected] |
+------+---------+----------------+
| Jane | 4       | [email protected] |
+------+---------+----------------+
| Jane | 5       | [email protected] |
+------+---------+----------------+
How can I go beyond the concat-and-split approach from the referenced SO answer to accomplish this? I need a way to expand a range.
That answer uses the following code to split comma-delimited values (1,2,3):
In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2
Thanks in advance for your suggestions!
Update 2/14: The sample data was updated to match my current case.
For most cases, the correct answer nowadays is to use pandas.DataFrame.explode() (or pandas.Series.explode()). To split a cell into multiple rows, first turn each string into a list, for example by calling .apply() with a function that does the splitting, and then call .explode() so each list element ends up in its own row.
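Here's a rough sketch of that explode-based approach (my own illustration, not from the original answers; expand is a hypothetical helper, and explode needs pandas 0.25+):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'John', 'Mary', 'Jane'],
    'Options': ['1,2,4-6', np.nan, '1,2', '1,3-5'],
    'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
})

def expand(s):
    # '1,2,4-6' -> [1, 2, 4, 5, 6]; NaN passes through untouched
    if pd.isna(s):
        return s
    out = []
    for token in s.split(','):
        lo, _, hi = token.partition('-')
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

result = df.assign(Options=df['Options'].apply(expand)).explode('Options')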
If I understand what you need:
def yourfunc(s):
    # '1,2,4-6' -> [1, 2, 4, 5, 6]: split on commas, then expand each dash range
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

# yourfunc expects a string, so rows with NaN Options (like John's) need to be
# dropped or filled before applying it
df.Options = df.Options.apply(yourfunc)
df
Out[114]: 
   Name          Options           Email
0   Bob  [1, 2, 4, 5, 6]   [email protected]
1  Jane     [1, 3, 4, 5]  [email protected]
df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
Out[116]: 
   Name           Email    0
0   Bob   [email protected]  1.0
1   Bob   [email protected]  2.0
2   Bob   [email protected]  4.0
3   Bob   [email protected]  5.0
4   Bob   [email protected]  6.0
5  Jane  [email protected]  1.0
6  Jane  [email protected]  3.0
7  Jane  [email protected]  4.0
8  Jane  [email protected]  5.0
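On pandas 0.25+, the apply(pd.Series).stack() reshaping above can be swapped for DataFrame.explode, which should produce the same long format (a sketch using the list-valued Options column built by yourfunc, not part of the original answer):
df.explode('Options').reset_index(drop=True)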
I like using np.r_ and slice. I know it looks like a mess, but beauty is in the eye of the beholder.
def parse(o):
    # build one slice per comma-separated token, covering its full dash range,
    # then let np.r_ concatenate all of the slices into a single flat array
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))

r = df.Options.apply(parse)        # one array of option numbers per row
                                   # (NaN Options must be filtered out first; see NOTE below)
new = np.concatenate(r.values)     # all option numbers, flattened
lens = r.str.len()                 # how many options each original row expands to
df.loc[df.index.repeat(lens)].assign(Options=new)
   Name  Options           Email
0   Bob        1   [email protected]
0   Bob        2   [email protected]
0   Bob        4   [email protected]
0   Bob        5   [email protected]
0   Bob        6   [email protected]
2  Mary        1  [email protected]
2  Mary        2  [email protected]
3  Jane        1  [email protected]
3  Jane        3  [email protected]
3  Jane        4  [email protected]
3  Jane        5  [email protected]
Explanation
np.r_ takes different slicers and indexers and returns an array of the combined results.
np.r_[1, 4:7]
array([1, 4, 5, 6])
or
np.r_[slice(1, 2), slice(4, 7)]
array([1, 4, 5, 6])
But if I need to pass an arbitrary bunch of them, I need to pass a tuple to np.r_'s __getitem__ method.
np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
array([ 1,  4,  5,  6, 10, 11, 12, 13])
So I iterate, parse, build the slices, and pass them to np.r_.__getitem__.
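As a quick sanity check on a single Options string (my own run, not output from the original post), parse should return something like:
parse('1,2,4-6')
array([1, 2, 4, 5, 6])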
Then I use a combo of loc, pd.Index.repeat, and pd.Series.str.len after applying my cool parser, plus pd.DataFrame.assign to overwrite the existing Options column.
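In isolation, the repeat/assign mechanics look roughly like this (a minimal sketch with a made-up two-row frame, not from the original answer):
tmp = pd.DataFrame({'Name': ['Bob', 'Mary']})
lens = [3, 2]                      # pretend parse() produced 3 and 2 options per row
tmp.loc[tmp.index.repeat(lens)]    # row 0 appears three times, row 1 twice
# the repeated frame then gets its Options column overwritten via .assign()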
NOTE: If you have bad characters in your Options column, I'd try to filter like this.
df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')
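After that filter, the same parse / repeat / assign steps apply unchanged (my own recap of the steps above, not additional code from the original answer):
r = df.Options.apply(parse)
df.loc[df.index.repeat(r.str.len())].assign(Options=np.concatenate(r.values))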