Structuring a 2D array from a pandas dataframe

Tags: python, list, pandas

I have a pandas dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['Text','Selection_Values'])
df["Text"] = ["Hi", "this is", "just", "a", "single", "sentence.", "This", np.nan, "is another one.","This is", "a", "third", "sentence","."]
df["Selection_Values"] = [0,0,0,0,0,1,0,0,1,0,0,0,0,0]
print(df)

Output:

               Text  Selection_Values
0                Hi                 0
1           this is                 0
2              just                 0
3                 a                 0
4            single                 0
5         sentence.                 1
6              This                 0
7               NaN                 0
8   is another one.                 1
9           This is                 0
10                a                 0
11            third                 0
12         sentence                 0
13                .                 0

Now, I want to regroup the Text column into a 2D array based on the Selection_Values column. All cells from a 0 (the first row, or the first row after a 1) up to and including the next 1 should be joined into one inner list. The last sentence of the dataset might have no closing 1. This can be done as explained in this question: Regroup pandas column into 2D list based on another column

[["Hi this is just a single sentence."], ["This is another one."], ["This is a third sentence ."]]

I would like to go a step further and add the following condition: if a list contains more than max_number_of_cells_per_list non-NaN cells, it should be divided into roughly equal parts, each holding max_number_of_cells_per_list cells, give or take one.

Say max_number_of_cells_per_list = 2; then the expected output should be:

[["Hi this is"], ["just a"], ["single sentence."], ["This is another one."], ["This is"], ["a third sentence ."]]

Example:

Based on the column 'Selection_Values' one can regroup the cells into the following 2D list, using:

[[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]

Output (original list):

[["Hi this is just a single sentence."], ["This is another one."], ["This is a third sentence ."]]
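For reference, the same regrouping can also be written with a groupby instead of np.split. This is only a sketch, not the method from the linked question: the name sentence_id is made up for illustration, and it assumes every 1 in Selection_Values closes a sentence.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# sentence_id (hypothetical name): how many closing 1s occur strictly before a row.
# Shifting by one keeps the row that carries the 1 inside the sentence it closes.
sentence_id = df["Selection_Values"].shift(fill_value=0).cumsum()

# str.cat skips NaN cells by default, so row 7 simply drops out of the join.
sentences = [[s.str.cat(sep=" ")] for _, s in df["Text"].groupby(sentence_id)]
print(sentences)
```

The grouping key is reusable, which helps if you later need per-sentence counts or per-sentence splitting.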

Let's have a look at the number of cells that are within those lists:

[image]

As you can see, list 1 has 6 cells, list 2 has 2 cells, and list 3 has 5 cells.
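Those counts can be checked in code. A minimal sketch, assuming the same dataframe as above; sentence_id and cell_counts are illustrative names, not part of the original post:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# One id per sentence: count the closing 1s that occur strictly before each row.
sentence_id = df["Selection_Values"].shift(fill_value=0).cumsum()

# GroupBy.count() ignores NaN, so only non-NaN cells are counted.
cell_counts = df["Text"].groupby(sentence_id).count().tolist()
print(cell_counts)  # [6, 2, 5]
```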

Now, what I would like to achieve is the following: if a list has more than a certain number of cells, it should be split up so that each resulting list holds the wanted number of cells, give or take one.

So, for example, with max_number_of_cells_per_list = 2:

Modified list: [image]

Do you see a way of doing this?

EDIT: Important note: Cells from different original lists should not end up in the same new list.

EDIT 2:

               Text  Selection_Values  New
0                Hi                 0  1.0
1           this is                 0  0.0
2              just                 0  1.0
3                 a                 0  0.0
4            single                 0  1.0
5         sentence.                 1  0.0
6              This                 0  1.0
7               NaN                 0  0.0
8   is another one.                 1  1.0
9           This is                 0  0.0
10                a                 0  1.0
11            third                 0  0.0
12         sentence                 0  0.0
13                .                 0  NaN


asked Jul 21 '19 11:07

henry

1 Answer

IIUC, you can do something like:

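n = 2  # target number of cells per chunk; change as you like
# Drop the NaN cells and renumber the remaining ones 0..len-1
s = df.Text.dropna().reset_index(drop=True)
# True at the first cell of every block of n cells; shift().shift(-1) blanks the
# last flag, so a trailing lone cell joins the previous chunk
c = s.groupby(s.index // n).cumcount().eq(0).shift().shift(-1).fillna(False)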

[[i] for i in s.groupby(c.cumsum()).apply(' '.join).tolist()]

[['Hi this is'], ['just a'], ['single sentence.'], 
    ['This is another one.'], ['This is a'], ['third sentence .']]

EDIT:

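# Map the original row label of each non-NaN cell to its position in s
d = dict(zip(df.loc[df.Text.notna(), 'Text'].index, c.index))
ser = pd.Series(d)
# Look up the chunk-start flag for every row; rows with NaN text get 0
df['new'] = (ser.reindex(range(ser.index.min(), ser.index.max() + 1))
                .map(c).fillna(False).astype(int))
print(df)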

               Text  Selection_Values  new
0                Hi                 0    1
1           this is                 0    0
2              just                 0    1
3                 a                 0    0
4            single                 0    1
5         sentence.                 1    0
6              This                 0    1
7               NaN                 0    0
8   is another one.                 1    0
9           This is                 0    1
10                a                 0    0
11            third                 0    1
12         sentence                 0    0
13                .                 0    0
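The same flag column can probably be produced more directly with a boolean mask. This is a sketch of an alternative, not the code above: mask, pos and starts are names I made up, and like the original it ignores Selection_Values and assumes there is at least one non-NaN cell.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

n = 2                                   # target number of cells per chunk
mask = df["Text"].notna().to_numpy()    # rows that actually hold text
pos = np.arange(mask.sum())             # position of each row among the non-NaN cells
starts = pos % n == 0                   # first cell of every block of n
starts[-1] = False                      # last cell never opens a chunk (same effect as the shift trick)

df["new"] = 0                           # NaN rows keep 0
df.loc[mask, "new"] = starts.astype(int)
print(df)
```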


answered Sep 19 '22 12:09

anky
