Modify a Data Frame column with list comprehension

Tags: python, pandas

I have a list of about 90k strings and a DataFrame with several columns. I want to check whether each string in the list appears in column_1 and, if it does, assign that same value to column_2.

I can do this:

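for i in range(len(my_list)):
    item = my_list[i]
    for j in range(len(df)):
        # Assumes the default integer index, so label j is also row j
        if item == df['column_1'][j]:
            # .loc avoids chained assignment, which may write to a copy
            df.loc[j, 'column_2'] = item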

But I would prefer to avoid the nested loops.

I tried this:

for item in my_list:
    if item in list(df['column_1']):
        # Index label of the first row whose column_1 equals item
        position = df[df['column_1'] == item].index.values[0]
        df.loc[position, 'column_2'] = item

But I think this solution is even slower and harder to read. Can this operation be done with a simple list comprehension?

Edit:

The second solution is considerably faster, by about an order of magnitude. Why is that? It seems that in that case it has to search for the match twice:

here:

if item in list(df['column_1']):

and here:

position = df[df['column_1'] == item].index.values[0]

Still, I would prefer a simpler solution.

Luis Ramon Ramirez Rodriguez asked Mar 07 '16 14:03



1 Answer

You can do this by splitting the filtering and the assignment you described into two distinct steps.

Pandas Series objects have an 'isin' method that identifies the rows whose column_1 values are in my_list and returns the result as a boolean Series. That Series can then be used with the .loc indexer to copy the values in the matching rows from column_1 to column_2:

# Identify the matching rows
matches = df['column_1'].isin(my_list)
# Set the column_2 entries to column_1 in the matching rows
df.loc[matches,'column_2'] = df.loc[matches,'column_1']
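
To make the two steps concrete, here is a minimal sketch on a small made-up frame. The fruit names and the "old_*" placeholder values are hypothetical sample data, not taken from the question; only isin and .loc come from the approach above.

```python
import pandas as pd

# Hypothetical sample data, standing in for the real 90k-item list and frame
my_list = ["apple", "cherry"]
df = pd.DataFrame({
    "column_1": ["apple", "banana", "cherry", "date"],
    "column_2": ["old_a", "old_b", "old_c", "old_d"],
})

# Step 1: boolean mask, True where the column_1 value appears in my_list
matches = df["column_1"].isin(my_list)

# Step 2: copy column_1 into column_2 on the matching rows only
df.loc[matches, "column_2"] = df.loc[matches, "column_1"]

print(matches.tolist())         # [True, False, True, False]
print(df["column_2"].tolist())  # ['apple', 'old_b', 'cherry', 'old_d']
```

Rows that do not match keep whatever column_2 already held, and every matching row is updated, not just the first one.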

If column_2 doesn't already exist, this approach creates it and sets the non-matching rows to NaN. The .loc indexer is used so that the assignment operates on the DataFrame itself rather than on a copy of the data.
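
The missing-column case can be checked with a short sketch. The sample values are again hypothetical. The Series.where call at the end is an alternative not mentioned in the answer: for a column that does not exist yet, it should give the same values in one step, since where keeps entries where the condition is True and fills the rest with NaN by default.

```python
import pandas as pd

# Hypothetical sample data; note that column_2 does not exist yet
my_list = ["apple", "cherry"]
df = pd.DataFrame({"column_1": ["apple", "banana", "cherry"]})

matches = df["column_1"].isin(my_list)

# Assigning through .loc creates column_2; unmatched rows are left as NaN
df.loc[matches, "column_2"] = df.loc[matches, "column_1"]
print(df["column_2"].isna().tolist())  # [False, True, False]

# Alternative (an assumption, not part of the answer): build the column directly
alt = df["column_1"].where(matches)
print(alt.isna().tolist())             # [False, True, False]
```

The where form only fits when column_2 is new, because it replaces the whole column instead of preserving existing values in the non-matching rows.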

res_edit answered Nov 03 '22 21:11