Split (explode) range in dataframe into multiple rows

This question is similar to Split (explode) pandas dataframe string entry to separate rows but includes a question about adding ranges.

I have a DataFrame:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1,2,4-6 | [email protected]  |
+------+---------+----------------+
| John |   NaN   | [email protected] |
+------+---------+----------------+
| Mary |   1,2   | [email protected] |
+------+---------+----------------+
| Jane | 1,3-5   | [email protected] |
+------+---------+----------------+

And I'd like the Options column to be split by the comma as well as rows added for a range.

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1       | [email protected]  |
+------+---------+----------------+
| Bob  | 2       | [email protected]  |
+------+---------+----------------+
| Bob  | 4       | [email protected]  |
+------+---------+----------------+
| Bob  | 5       | [email protected]  |
+------+---------+----------------+
| Bob  | 6       | [email protected]  |
+------+---------+----------------+
| John | NaN     | [email protected] |
+------+---------+----------------+
| Mary | 1       | [email protected] |
+------+---------+----------------+
| Mary | 2       | [email protected] |
+------+---------+----------------+
| Jane | 1       | [email protected] |
+------+---------+----------------+
| Jane | 3       | [email protected] |
+------+---------+----------------+
| Jane | 4       | [email protected] |
+------+---------+----------------+
| Jane | 5       | [email protected] |
+------+---------+----------------+

How can I go beyond using concat and split like the reference SO article says to accomplish this? I need a way to add a range.

That article uses the following code to split comma-delimited values (1,2,3):

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Thanks in advance for your suggestions!

Update 2/14 Sample data was updated to match my current case.

Asked by kabaname, Feb 13 '18 at 21:02


2 Answers

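If I understand what you need, first expand each Options string into a list of integers. This assumes every value is a string; NaN rows (like John's in the updated sample) have to be dropped or handled first, because NaN has no split method.

def yourfunc(s):
    # "1,2,4-6" -> ["1"], ["2"], ["4", "6"]
    ranges = (x.split("-") for x in s.split(","))
    # A single number has r[0] == r[-1], so it yields a one-element range.
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]


df.Options = df.Options.apply(yourfunc)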

df
Out[114]: 
   Name          Options           Email
0   Bob  [1, 2, 4, 5, 6]   [email protected]
1  Jane     [1, 3, 4, 5]  [email protected]


df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop(columns='level_2')
Out[116]: 
   Name           Email    0
0   Bob   [email protected]  1.0
1   Bob   [email protected]  2.0
2   Bob   [email protected]  4.0
3   Bob   [email protected]  5.0
4   Bob   [email protected]  6.0
5  Jane  [email protected]  1.0
6  Jane  [email protected]  3.0
7  Jane  [email protected]  4.0
8  Jane  [email protected]  5.0
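
As an aside that is not part of the original answer: on pandas 0.25 or later, DataFrame.explode can stand in for the set_index/apply/stack chain and keeps the values as integers. What follows is a minimal sketch under stated assumptions, not a definitive implementation. The helper name expand_options and the sample frame (with the Email column left out) are made up for illustration, and missing values are kept as a single NaN row, as in the desired output of the question.

```python
import numpy as np
import pandas as pd


def expand_options(s):
    """Turn '1,2,4-6' into [1, 2, 4, 5, 6]; keep a missing value as [NaN]."""
    if pd.isna(s):
        return [np.nan]
    out = []
    for part in str(s).split(","):
        bounds = part.strip().split("-")
        # A single number has bounds[0] == bounds[-1], giving a one-element range.
        out.extend(range(int(bounds[0]), int(bounds[-1]) + 1))
    return out


# Hypothetical sample frame mirroring the question (Email column omitted).
df = pd.DataFrame({
    "Name": ["Bob", "John", "Mary", "Jane"],
    "Options": ["1,2,4-6", np.nan, "1,2", "1,3-5"],
})

result = (
    df.assign(Options=df["Options"].apply(expand_options))
      .explode("Options")          # one row per list element
      .reset_index(drop=True)
)
print(result)
```

The exploded column comes back with object dtype, so convert it afterwards if a numeric dtype matters.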
Answered by BENY, Sep 28 '22 at 01:09


I like using np.r_ and slice
I know it looks like a mess but beauty is in the eye of the beholder.

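import numpy as np

def parse(o):
    # Turn [4, 6] (or a lone [1]) into the inclusive slice 4:7 (or 1:2).
    mm = lambda i: slice(min(i), max(i) + 1)
    # np.r_ concatenates all the slices into one integer array.
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))

r = df.Options.apply(parse)      # one array per row
new = np.concatenate(r.values)   # all values, flattened
lens = r.str.len()               # how many values each row produced

# Repeat each original row once per value, then overwrite Options.
df.loc[df.index.repeat(lens)].assign(Options=new)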

   Name  Options           Email
0   Bob        1   [email protected]
0   Bob        2   [email protected]
0   Bob        4   [email protected]
0   Bob        5   [email protected]
0   Bob        6   [email protected]
2  Mary        1  [email protected]
2  Mary        2  [email protected]
3  Jane        1  [email protected]
3  Jane        3  [email protected]
3  Jane        4  [email protected]
3  Jane        5  [email protected]

Explanation

  • np.r_ takes a mix of slices and scalar indexers and concatenates them into a single array.

    np.r_[1, 4:7]
    array([1, 4, 5, 6])
    

    or

    np.r_[slice(1, 2), slice(4, 7)]
    array([1, 4, 5, 6])
    

    But to pass an arbitrary number of them, I need to pass a tuple to np.r_'s __getitem__ method.

    np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
    array([ 1,  4,  5,  6, 10, 11, 12, 13])
    

    So I iterate, parse, make slices and pass to np.r_.__getitem__

  • Use a combo of loc, pd.Index.repeat, and pd.Series.str.len after applying my cool parser

  • Use pd.DataFrame.assign to overwrite existing column
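
The three steps above can be sketched end to end on a tiny frame. This is an illustrative rewrite rather than the exact code from the answer: the loop replaces the lambda for readability, np.r_[tuple(...)] is used as the indexing form of np.r_.__getitem__(tuple(...)), and the two-row sample frame is an assumption.

```python
import numpy as np
import pandas as pd


def parse(o):
    # Build one slice per comma-separated piece, e.g. "4-6" -> slice(4, 7).
    slices = []
    for piece in o.split(","):
        nums = [int(n) for n in piece.strip().split("-")]
        slices.append(slice(min(nums), max(nums) + 1))
    # Indexing np.r_ with a tuple of slices concatenates the ranges.
    return np.r_[tuple(slices)]


# Hypothetical sample frame with no missing values.
df = pd.DataFrame({"Name": ["Bob", "Jane"], "Options": ["1,2,4-6", "1,3-5"]})

r = df["Options"].apply(parse)      # Series of integer arrays
lens = r.str.len()                  # number of values per original row
new = np.concatenate(r.values)      # flat array of every value

# Repeat each row once per parsed value, then overwrite Options.
out = df.loc[df.index.repeat(lens)].assign(Options=new)
print(out)
```

The repeated rows keep their original index labels, so add reset_index(drop=True) if a fresh 0..n-1 index is wanted.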

Note
If you have bad characters in your Options column, I'd try to filter like this.

df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')
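
To see what that filter does, here is a small sketch on made-up data (the frame and its messy values are assumptions for illustration). One thing to watch: characters are deleted rather than treated as separators, so a value like "1;2" collapses to "12".

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: stray spaces, a wrong separator, text, and a NaN.
raw = pd.DataFrame({
    "Name": ["Bob", "John", "Mary", "Ann"],
    "Options": ["1, 2, 4-6", np.nan, "1;2", "n/a"],
})

clean = (
    raw.dropna(subset=["Options"])                             # drop John (NaN)
       .astype({"Options": str})
       .replace({"Options": {r"[^0-9,\-]": ""}}, regex=True)   # keep digits, commas, hyphens
       .query('Options != ""')                                 # drop Ann ("n/a" became "")
)
print(clean)
```

If rows with missing Options should survive (as John does in the question's desired output), they would need to be set aside before filtering and concatenated back afterwards.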
Answered by piRSquared, Sep 28 '22 at 00:09