I have a pandas dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['Text','Selection_Values'])
df["Text"] = ["Hi", "this is", "just", "a", "single", "sentence.", "This", np.nan, "is another one.","This is", "a", "third", "sentence","."]
df["Selection_Values"] = [0,0,0,0,0,1,0,0,1,0,0,0,0,0]
print(df)
Output:
Text Selection_Values
0 Hi 0
1 this is 0
2 just 0
3 a 0
4 single 0
5 sentence. 1
6 This 0
7 NaN 0
8 is another one. 1
9 This is 0
10 a 0
11 third 0
12 sentence 0
13 . 0
Now, I want to regroup the Text
column into a 2D array based on the Selection Value
column. All words that appear between a 0
(first integer, or after a 1
) and a 1
(including) should be put into a 2D array. The last sentence of the dataset might have no closing 1
. This can be done as explained in this question: Regroup pandas column into 2D list based on another column
[["Hi this is just a single sentence."],["This is another one"], ["This is a third sentence ."]]
I would like to go a step further and place the following condition: If more than max_number_of_cells_per_list
of non-NaN cells are in a list, then this list should be divided into roughly equal parts which contain at most +/- 1 of max_number_of_cells_per_list
cell elements.
Let's say: max_number_of_cells_per_list
= 2, then the expected output should be:
[["Hi this is"], ["just a"], ["single sentence."],["This is another one"], ["This is"], ["a third sentence ."]]
Example:
Based on the column 'Selection_Values' one can regroup the cells into the following 2D list, using:
[[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]
Output (original list):
[["Hi this is just a single sentence."],["This is another one"], ["This is a third sentence ."]]
Let's have a look at the number of cells that are within those lists:
As you can see, list1 has 6 cells, list 2 has 2 cells, and list 3 has 5 cells.
Now, what I would like to achieve is the following: if there are more than a certain number of cells in a list, it should be split up, such that each resulting list has +/-1 the wanted number of cells.
So for example max_number_of_cells_per_list
= 2
Modified list:
Do you see a way of doing this ?
EDIT: Important note: Cells from the original lists should not be put into the same lists.
EDIT 2:
Text Selection_Values New
0 Hi 0 1.0
1 this is 0 0.0
2 just 0 1.0
3 a 0 0.0
4 single 0 1.0
5 sentence. 1 0.0
6 This 0 1.0
7 NaN 0 0.0
8 is another one. 1 1.0
9 This is 0 0.0
10 a 0 1.0
11 third 0 0.0
12 sentence 0 0.0
13 . 0 NaN
It is quite easy to transform a pandas dataframe into a numpy array. Simply using the to_numpy() function provided by Pandas will do the trick. This will return us a numpy 2D array of the same size as our dataframe (df), but with the column names discarded.
DataFrame. DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
IIUC, you can do something like:
n=2 #change this as you like for no. of splits
s=df.Text.dropna().reset_index(drop=True)
c=s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False)
[[i] for i in s.groupby(c.cumsum()).apply(' '.join).tolist()]
[['Hi this is'], ['just a'], ['single sentence.'],
['This is another one.'], ['This is a'], ['third sentence .']]
EDIT:
d=dict(zip(df.loc[df.Text.notna(),'Text'].index,c.index))
ser=pd.Series(d)
df['new']=ser.reindex(range(ser.index.min(),
ser.index.max()+1)).map(c).fillna(False).astype(int)
print(df)
Text Selection_Values new
0 Hi 0 1
1 this is 0 0
2 just 0 1
3 a 0 0
4 single 0 1
5 sentence. 1 0
6 This 0 1
7 NaN 0 0
8 is another one. 1 0
9 This is 0 1
10 a 0 0
11 third 0 1
12 sentence 0 0
13 . 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With