
Add column of empty lists to DataFrame

Tags: python, pandas

Similar to the question "How to add an empty column to a dataframe?", I am interested in knowing the best way to add a column of empty lists to a DataFrame.

What I am trying to do is initialize a column and then, as I iterate over the rows and process some of them, replace the initialized value in this new column with a filled list.

For example, if below is my initial DataFrame:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})  # Sample DataFrame

    >>> df
       a  b
    0  1  5
    1  2  6
    2  3  7

Then I want to ultimately end up with something like this, where each row has been processed separately (sample results shown):

    >>> df
       a  b          c
    0  1  5     [5, 6]
    1  2  6     [9, 0]
    2  3  7  [1, 2, 3]

Of course, if I try to initialize with df['e'] = [] as I would with any other constant, pandas thinks I am trying to add a sequence of items with length 0, and hence fails.

If I try initializing a new column as None or NaN, I run into the following issues when trying to assign a list to a location.
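To make the failure concrete, here is a minimal sketch using the sample frame from above. The exact error text differs between pandas versions, so the example only relies on a ValueError being raised.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

try:
    # pandas reads [] as "a column of 0 values", which cannot fit a 3-row frame
    df['e'] = []
    failed = False
except ValueError as exc:
    failed = True
    print(f"ValueError: {exc}")
```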

    df['d'] = None

    >>> df
       a  b     d
    0  1  5  None
    1  2  6  None
    2  3  7  None

Issue 1 (it would be perfect if I can get this approach to work! Maybe something trivial I am missing):

    >>> df.loc[0,'d'] = [1,3]
    ...
    ValueError: Must have equal len keys and value when setting with an iterable
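One possible workaround for Issue 1, offered as a sketch rather than a guaranteed answer: the scalar accessor .at addresses exactly one cell, so a list handed to it can be stored as a single object instead of being unpacked across several positions. This assumes the column has object dtype (which df['d'] = None produces); the behaviour of list assignment has shifted between pandas versions, so it is worth checking against the version in use.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
df['d'] = None  # object-dtype column, so a cell can hold any Python object

# .at targets a single cell, so the list is stored whole rather than unpacked
df.at[0, 'd'] = [1, 3]

print(df)
```

The other rows keep their None placeholder, so rows that are never processed remain distinguishable from rows that received an empty list.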

Issue 2 (this one works, but not without a warning because it is not guaranteed to work as intended):

    >>> df['d'][0] = [1,3]

    C:\Python27\Scripts\ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

Hence I resort to initializing with empty lists and extending them as needed. There are a couple of methods I can think of to initialize this way, but is there a more straightforward way?

Method 1:

    df['empty_lists1'] = [list() for x in range(len(df.index))]

    >>> df
       a  b   empty_lists1
    0  1  5             []
    1  2  6             []
    2  3  7             []

Method 2:

    df['empty_lists2'] = df.apply(lambda x: [], axis=1)

    >>> df
       a  b   empty_lists1   empty_lists2
    0  1  5             []             []
    1  2  6             []             []
    2  3  7             []             []
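Both methods build a separate list for every row, which matters if the lists are extended later. The sketch below contrasts that with the tempting shortcut [[]] * len(df), which repeats one list object in every row. The column names shared and separate are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

# Pitfall: [[]] * n repeats ONE list object n times
df['shared'] = [[]] * len(df)
# A comprehension builds a fresh list for each row
df['separate'] = [[] for _ in range(len(df))]

# Append to row 0 only, in both columns
df.at[0, 'shared'].append('x')
df.at[0, 'separate'].append('x')

print(df['shared'].tolist())    # [['x'], ['x'], ['x']]
print(df['separate'].tolist())  # [['x'], [], []]
```

With the shared list, an append made through row 0 shows up in every row, so the comprehension (or any approach that creates one list per row) is the safer starting point.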

Summary of questions:

Is there a minor syntax change to Issue 1 that would allow a list to be assigned to a None/NaN-initialized field?

If not, then what is the best way to initialize a new column with empty lists?

vk1011 asked Jul 17 '15 00:07



1 Answer

One more way is to use np.empty:

    df['empty_list'] = np.empty((len(df), 0)).tolist()

You could also drop .index in your "Method 1" when taking the length of df:

    df['empty_list'] = [[] for _ in range(len(df))]

It turns out that np.empty is faster:

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame(pd.np.random.rand(1000000, 5))

    In [3]: timeit df['empty1'] = pd.np.empty((len(df), 0)).tolist()
    10 loops, best of 3: 127 ms per loop

    In [4]: timeit df['empty2'] = [[] for _ in range(len(df))]
    10 loops, best of 3: 193 ms per loop

    In [5]: timeit df['empty3'] = df.apply(lambda x: [], axis=1)
    1 loops, best of 3: 5.89 s per loop
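For anyone who wants to rerun the comparison outside IPython, a rough standalone sketch follows. It imports numpy directly, since the pd.np alias is not available in recent pandas releases, and it times only the construction of the lists, not the column assignment. The row count and the helper names with_np_empty and with_comprehension are arbitrary choices for this example, and the numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # arbitrary; smaller than the 1,000,000 rows used above
df = pd.DataFrame(np.random.rand(n, 5))


def with_np_empty():
    # An (n, 0) array converts to a list of n empty lists
    return np.empty((len(df), 0)).tolist()


def with_comprehension():
    # One fresh empty list per row
    return [[] for _ in range(len(df))]


for fn in (with_np_empty, with_comprehension):
    # Best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```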
ComputerFellow answered Sep 24 '22 00:09