I'm a bit annoyed with myself because I can't understand why one solution to a problem worked but another didn't. As in, it points to a deficient understanding of (basic) pandas on my part, and that makes me mad!
Anyway, my problem was simple: I had a list of 'bad' values ('bad_index'); these corresponded to row indexes on a dataframe ('data_clean1') for which I wanted to delete the corresponding rows. However, as the values will change with each new dataset, I didn't want to plug the bad values directly into the code. Here's what I did first:
bad_index = [2, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 29]
for i in bad_index:
    dataclean2 = dataclean1.drop([i]).reset_index(level = 0, drop = True)
But this didn't work; the data_clean2 remained the exact same as data_clean1. My second idea was to use list comprehensions (as below); this worked out fine.
bad_index = [2, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 29]
data_clean2 = data_clean1.drop([x for x in bad_index]).reset_index(level = 0, drop = True)
Now, why did the list comprehension method work and not the 'for' loop? I've been coding for a few months, and I feel that I shouldn't be making these kinds of errors.
Thanks!
List comprehensions are often not only more readable but also faster than using "for loops." They can simplify your code, but if you put too much logic inside, they will instead become harder to read and understand.
In the comparison quoted here, the for loop was slower than the list comprehension (9.9 seconds vs. 8.2 seconds). That gap applies to building a list: the loop version appends a new element at each iteration, and the comprehension avoids the repeated append call.
The for loop is the general-purpose way to iterate through a list. A list comprehension is a more compact way to build a new list from one, since it needs fewer lines of code. Shorter code does not by itself mean less computation; the speed difference comes from how the list is built.
Because of differences in how Python implements for loops and list comprehensions, a list comprehension is usually faster than an equivalent for loop that builds a list by appending.
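If you want to check the speed claim on your own machine, here is a minimal sketch using the standard-library timeit module. The function names, the list size and the repeat count are arbitrary choices for illustration, and the timings will vary between machines and Python versions, so no expected numbers are given.

```python
import timeit


def with_loop(n):
    # Build the list by appending inside an explicit for loop.
    result = []
    for x in range(n):
        result.append(x * 2)
    return result


def with_comprehension(n):
    # Build the same list with a list comprehension.
    return [x * 2 for x in range(n)]


n = 100_000  # arbitrary size, chosen only for this sketch
loop_seconds = timeit.timeit(lambda: with_loop(n), number=20)
comp_seconds = timeit.timeit(lambda: with_comprehension(n), number=20)
print(f"for loop: {loop_seconds:.3f}s, comprehension: {comp_seconds:.3f}s")
```

Note that this only measures building a list. It says nothing about the pandas question above, where the difference between the two attempts is about correctness, not speed.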
data_clean1.drop([x for x in bad_index]).reset_index(level = 0, drop = True)
is equivalent to simply passing the bad_index list to drop:
data_clean1.drop(bad_index).reset_index(level = 0, drop = True)
drop accepts a list, and drops every index present in the list.
Your explicit for loop didn't work because in every iteration you simply dropped a different index from the dataclean1 dataframe without saving the intermediate dataframes, so by the last iteration dataclean2 was simply the result of executing dataclean2 = dataclean1.drop(29).reset_index(level = 0, drop = True)
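To make this concrete, here is a small sketch. The 30-row dataframe is a made-up stand-in for your data (I'm assuming a default 0..29 index), and the names loop_result, list_result and accumulated are only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data_clean1: 30 rows with a default 0..29 index.
data_clean1 = pd.DataFrame({"value": range(100, 130)})
bad_index = [2, 7, 8, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 29]

# Original loop: every pass starts again from data_clean1, so only the
# last drop (label 29) survives in the result.
for i in bad_index:
    loop_result = data_clean1.drop([i]).reset_index(drop=True)

# Passing the whole list drops every label in one call.
list_result = data_clean1.drop(bad_index).reset_index(drop=True)

# A loop that does accumulate: keep dropping from the running result and
# reset the index only once at the end, so the labels stay valid.
accumulated = data_clean1
for i in bad_index:
    accumulated = accumulated.drop([i])
accumulated = accumulated.reset_index(drop=True)

print(len(data_clean1), len(loop_result), len(list_result), len(accumulated))
# 30 29 14 14
```

The loop result is only one row shorter than the original, which is easy to mistake for "nothing changed" on a larger dataframe. Passing the list directly is still the simplest option.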
EDIT: it turns out this is not your problem ... but if you did not have the problem mentioned in the other answer by Deepspace, then you would have this problem instead:
for i in bad_index:
    dataclean2 = dataclean1.drop([i]).reset_index(level = 0, drop = True)
Imagine your bad index is [1,2,3]
and your dataclean is [4,5,6,7,8].
Now let's step through what actually happens if each result is saved back and the index is reset every time:
initial: dataclean == [4,5,6,7,8]
loop0: i == 1 => drop index 1 => dataclean == [4,6,7,8]
loop1: i == 2 => drop index 2 => dataclean == [4,6,8]
loop2: i == 3 => drop index 3 !!!! uh oh, there is no index 3
You could, I guess, do this instead:
for i in reversed(bad_index):
    ...
This way, if you remove index 3 first, it will not affect indexes 1 and 2.
But in general you should not mutate a list/dict as you iterate over it.
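Here is the same walkthrough as a runnable sketch. It uses a plain Python list instead of a dataframe to keep it short. The variable names are made up for the example, and sorting in descending order is my assumption for the case where bad_index is not already in ascending order.

```python
bad_index = [1, 2, 3]

# Forward order: each deletion shifts the later positions down by one.
data = [4, 5, 6, 7, 8]
forward_error = None
try:
    for i in bad_index:
        del data[i]
except IndexError as exc:
    forward_error = exc  # position 3 no longer exists on the third pass

# Highest position first: the earlier positions are untouched.
data = [4, 5, 6, 7, 8]
for i in sorted(bad_index, reverse=True):
    del data[i]
print(data)  # [4, 8]

# Safer still: build a new list and leave the original alone.
original = [4, 5, 6, 7, 8]
bad_positions = set(bad_index)
kept = [value for position, value in enumerate(original)
        if position not in bad_positions]
print(kept)  # [4, 8]
```

The last variant never modifies anything in place, which is the same idea as passing the whole list to drop in one call.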