This question is similar to "Split (explode) pandas dataframe string entry to separate rows", but adds a question about expanding ranges.
I have a DataFrame:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1,2,4-6 | [email protected] |
+------+---------+----------------+
| John | NaN | [email protected] |
+------+---------+----------------+
| Mary | 1,2 | [email protected] |
+------+---------+----------------+
| Jane | 1,3-5 | [email protected] |
+------+---------+----------------+
And I'd like the Options column to be split on the comma, with a row added for each value in a range:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1 | [email protected] |
+------+---------+----------------+
| Bob | 2 | [email protected] |
+------+---------+----------------+
| Bob | 4 | [email protected] |
+------+---------+----------------+
| Bob | 5 | [email protected] |
+------+---------+----------------+
| Bob | 6 | [email protected] |
+------+---------+----------------+
| John | NaN | [email protected] |
+------+---------+----------------+
| Mary | 1 | [email protected] |
+------+---------+----------------+
| Mary | 2 | [email protected] |
+------+---------+----------------+
| Jane | 1 | [email protected] |
+------+---------+----------------+
| Jane | 3 | [email protected] |
+------+---------+----------------+
| Jane | 4 | [email protected] |
+------+---------+----------------+
| Jane | 5 | [email protected] |
+------+---------+----------------+
How can I go beyond using concat and split as the referenced SO answer suggests to accomplish this? I need a way to expand a range.
That answer uses the following code to split comma-delimited values (1,2,3):
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Thanks in advance for your suggestions!
Update 2/14 Sample data was updated to match my current case.
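For reference, here is a minimal way to build the sample DataFrame above (the constructor call is my own sketch, not part of the original question; emails are shown redacted, as in the tables):

import numpy as np
import pandas as pd

# Sample data matching the table above; John's Options is genuinely missing (NaN).
df = pd.DataFrame({
    'Name': ['Bob', 'John', 'Mary', 'Jane'],
    'Options': ['1,2,4-6', np.nan, '1,2', '1,3-5'],
    'Email': ['[email protected]'] * 4,
})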
For most cases in current pandas, the simplest route is pandas.DataFrame.explode() (or pandas.Series.explode()): split the string with str.split so each cell holds a list, then call explode so each list element becomes its own row.
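A rough sketch of that idea applied to this question (the expand_range helper and its name are my own; it assumes Options only ever holds comma-separated integers and hyphen ranges, and that df is the frame built above):

def expand_range(s):
    # '1,2,4-6' -> [1, 2, 4, 5, 6]; NaN is passed through untouched.
    if pd.isna(s):
        return s
    out = []
    for part in s.split(','):
        lo, _, hi = part.partition('-')
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

df.assign(Options=df['Options'].apply(expand_range)).explode('Options')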
If I understand what you need:

def yourfunc(s):
    # Split on commas, then expand each piece: '4-6' splits into ['4', '6'],
    # while a single value like '1' splits into ['1'], so r[0] and r[-1]
    # cover both cases.
    # Note: assumes Options is a string; NaN rows would need handling first.
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

df.Options = df.Options.apply(yourfunc)
df
Out[114]:
Name Options Email
0 Bob [1, 2, 4, 5, 6] [email protected]
1 Jane [1, 3, 4, 5] [email protected]
df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
Out[116]:
Name Email 0
0 Bob [email protected] 1.0
1 Bob [email protected] 2.0
2 Bob [email protected] 4.0
3 Bob [email protected] 5.0
4 Bob [email protected] 6.0
5 Jane [email protected] 1.0
6 Jane [email protected] 3.0
7 Jane [email protected] 4.0
8 Jane [email protected] 5.0
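A small follow-up (my addition, not part of the answer): the stacked column comes back named 0 and as floats, so you may want to rename and cast it:

out = (df.set_index(['Name', 'Email']).Options
         .apply(pd.Series)
         .stack()
         .reset_index()
         .drop('level_2', axis=1)
         .rename(columns={0: 'Options'}))
out['Options'] = out['Options'].astype(int)  # values come out as floats after stack()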
I like using np.r_ and slice.
I know it looks like a mess, but beauty is in the eye of the beholder.
import numpy as np

def parse(o):
    # Each comma-separated piece becomes a slice: '4-6' -> slice(4, 7),
    # a lone '1' -> slice(1, 2); np.r_ then concatenates them into one array.
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))
# note: assumes rows with NaN Options have been dropped (see the NOTE below)
r = df.Options.apply(parse)        # one integer array per row
new = np.concatenate(r.values)     # all option values, flattened into one vector
lens = r.str.len()                 # how many rows each original row expands into
df.loc[df.index.repeat(lens)].assign(Options=new)
Name Options Email
0 Bob 1 [email protected]
0 Bob 2 [email protected]
0 Bob 4 [email protected]
0 Bob 5 [email protected]
0 Bob 6 [email protected]
2 Mary 1 [email protected]
2 Mary 2 [email protected]
3 Jane 1 [email protected]
3 Jane 3 [email protected]
3 Jane 4 [email protected]
3 Jane 5 [email protected]
Explanation
np.r_
takes different slicers and indexers and returns an array of the combination.
np.r_[1, 4:7]
array([1, 4, 5, 6])
or
np.r_[slice(1, 2), slice(4, 7)]
array([1, 4, 5, 6])
But if I need to pass an arbitrary bunch of them, I need to pass a tuple to np.r_'s __getitem__ method.
np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
array([ 1, 4, 5, 6, 10, 11, 12, 13])
So I iterate, parse, make slices, and pass the resulting tuple to np.r_.__getitem__.
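To make that concrete, here is what the parse function above returns for a single Options string (my own check, assuming the function exactly as written):

parse('1,2,4-6')
array([1, 2, 4, 5, 6])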
Use a combo of loc, pd.Index.repeat, and pd.Series.str.len after applying my cool parser, then pd.DataFrame.assign to overwrite the existing column.
NOTE
If you have bad characters in your Options column, I'd try to filter like this:
df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')
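As a quick illustration of what that filter does (my own example, not from the answer), the regex strips anything that isn't a digit, comma, or hyphen:

pd.Series([' 1, 2x, 4-6 ']).replace({r'[^0-9,\-]': ''}, regex=True)
0    1,2,4-6
dtype: object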