I have a problem removing duplicates. My program is built around a loop that generates tuples (x, y), which are then used as nodes in a graph. The final array of nodes is:
[[ 1. 1. ]
[ 1.12273268 1.15322175]
[..........etc..........]
[ 0.94120695 0.77802849]
**[ 0.84301344 0.91660517]**
[ 0.93096269 1.21383287]
**[ 0.84301344 0.91660517]**
[ 0.75506418 1.0798641 ]]
The length of the array is 22. Now, I need to remove the duplicate entries (see **). So I used:
def urows(array):
    df = pandas.DataFrame(array)
    df.drop_duplicates(take_last=True)
    return df.drop_duplicates(take_last=True).values
Fantastic, but I still get:
0 1
0 1.000000 1.000000
....... etc...........
17 1.039400 1.030320
18 0.941207 0.778028
**19 0.843013 0.916605**
20 0.930963 1.213833
**21 0.843013 0.916605**
So drop_duplicates is not removing anything. I tested whether the nodes were actually the same, and I get:
print urows(total_nodes)[19,:]
---> [ 0.84301344 0.91660517]
print urows(total_nodes)[21,:]
---> [ 0.84301344 0.91660517]
print urows(total_nodes)[12,:] - urows(total_nodes)[13,:]
---> [ 0. 0.]
Why is it not working? How can I remove those duplicate values?
One more question:
Say two values, x1 and x2, are "nearly" equal. Is there any way to make them exactly equal? What I want is to replace x2 with x1 if they are "nearly" equal.
Similar to @Dougal's answer, but in a slightly different way:
In [20]: df.ix[~(df*1e6).astype('int64').duplicated(cols=[0])]
Out[20]:
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
6 0.755064 1.079864
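The one-liner above uses older pandas spellings (`.ix` and `cols=`). A rough equivalent for current pandas might look like the sketch below, assuming `.loc` in place of `.ix` and `subset=` in place of `cols=`. The sample frame is copied from the data shown in this thread, and the names `keys`, `mask` and `result` are only illustrative.

```python
import pandas as pd

# Sample rows taken from the data printed in this thread.
df = pd.DataFrame([[1.000000, 1.000000],
                   [1.122733, 1.153222],
                   [0.941207, 0.778028],
                   [0.843013, 0.916605],
                   [0.930963, 1.213833],
                   [0.843013, 0.916605],
                   [0.755064, 1.079864]])

# Scale to micro-units and cast to integers so tiny float noise is discarded.
# Note: astype('int64') truncates toward zero, it does not round.
keys = (df * 1e6).astype('int64')

# As in the original one-liner, only column 0 is checked for duplicates;
# drop the subset argument to compare whole rows instead.
mask = keys.duplicated(subset=[0])

# Keep the original (unscaled) rows that are not flagged as duplicates.
result = df.loc[~mask]
print(result)
```

Because the cast truncates, two values that sit on either side of an integer boundary after scaling can still end up with different keys, so this is a heuristic rather than a true tolerance test.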
If I copy-paste in your data, I get:
>>> df
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
5 0.843013 0.916605
6 0.755064 1.079864
>>> df.drop_duplicates()
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
6 0.755064 1.079864
so it is actually removed, and your problem is that the arrays aren't exactly equal (though their difference rounds to 0 for display).
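To illustrate how two rows can print identically and still not count as duplicates, here is a small sketch. The numbers and the 1e-12 offset are made up for the demonstration and are not the asker's real data. It also uses `keep='last'`, which is the current spelling of the older `take_last=True` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 differs from row 0 by 1e-12 in the first
# column, which is invisible at numpy's default print precision.
a = np.array([[0.84301344, 0.91660517],
              [0.93096269, 1.21383287],
              [0.84301344 + 1e-12, 0.91660517]])
df = pd.DataFrame(a)

# Both rows show the same digits when printed.
print(a[0], a[2])

# Exact comparison fails, tolerance-based comparison succeeds.
exactly_equal = bool((a[0] == a[2]).all())
nearly_equal = bool(np.allclose(a[0], a[2]))

# drop_duplicates compares exactly, so nothing is removed here.
n_after_drop = len(df.drop_duplicates(keep='last'))
print(exactly_equal, nearly_equal, n_after_drop)
```

Printing with `repr` or raising the print precision (for example `np.set_printoptions(precision=17)`) is one way to check whether real data has this kind of hidden difference.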
One workaround would be to round the data to however many decimal places are applicable with something like df.apply(np.round, args=[4]), then drop the duplicates. If you want to keep the original data but remove rows that are duplicate up to rounding, you can use something like:
df = df.ix[~df.apply(np.round, args=[4]).duplicated()]
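As a sketch of both variants with current pandas, assuming `DataFrame.round` and `.loc` in place of the `apply(np.round, ...)` and `.ix` spellings used above. The sample values and the 1e-9 offsets are invented for illustration.

```python
import pandas as pd

# Hypothetical data: row 3 is row 1 plus/minus a 1e-9 perturbation.
df = pd.DataFrame([[1.0, 1.0],
                   [0.84301344, 0.91660517],
                   [0.93096269, 1.21383287],
                   [0.84301344 + 1e-9, 0.91660517 - 1e-9]])

# Variant 1: round first, then drop duplicates (the result holds rounded values).
rounded_unique = df.round(4).drop_duplicates()

# Variant 2: use the rounded frame only to build the mask, so the
# surviving rows keep their original, unrounded values.
original_unique = df.loc[~df.round(4).duplicated()]

print(rounded_unique)
print(original_unique)
```

Rounding is a binning trick rather than a distance test: two values that are very close but fall on opposite sides of a rounding boundary will still be treated as different.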
Here's one really clumsy way to do what you're asking for with setting nearly-equal values to be actually equal:
grouped = df.groupby([df[i].round(4) for i in df.columns])
subbed = grouped.apply(lambda g: g.apply(lambda row: g.irow(0), axis=1))
subbed.reset_index(level=list(df.columns), drop=True, inplace=True)
This reorders the dataframe, but you can then call .sort_index() to get the rows back in the original order if you need that.
Explanation: the first line uses groupby to group the data frame by the rounded values. Unfortunately, if you give a function to groupby it applies it to the labels rather than the rows (so you could maybe do df.groupby(lambda k: np.round(df.ix[k], 4)), but that sucks too).
The second line uses the apply method on groupby to replace the dataframe of near-duplicate rows, g, with a new dataframe g.apply(lambda row: g.irow(0), axis=1). That uses the apply method on dataframes to replace each row with the first row of the group.
The result then looks like
0 1
0 1
0.7551 1.0799 6 0.755064 1.079864
0.8430 0.9166 3 0.843013 0.916605
5 0.843013 0.916605
0.9310 1.2138 4 0.930963 1.213833
0.9412 0.7780 2 0.941207 0.778028
1.0000 1.0000 0 1.000000 1.000000
1.1227 1.1532 1 1.122733 1.153222
where groupby has inserted the rounded values as an index. The reset_index line then drops those columns.
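With current pandas, one possibly simpler way to get the same effect is `groupby(...).transform('first')`, which keeps the original index and row order, so no extra index cleanup or re-sorting is needed. This is a sketch under the same rounding assumption (4 decimal places); the sample data and the names `keys` and `snapped` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: row 3 is a near-duplicate of row 1.
df = pd.DataFrame([[1.0, 1.0],
                   [0.84301344, 0.91660517],
                   [0.93096269, 1.21383287],
                   [0.84301344 + 1e-9, 0.91660517 - 1e-9]])

# Group keys: each column rounded to 4 decimals. The keys are renamed
# so they cannot be confused with the real columns 0 and 1.
keys = [df[c].round(4).rename(f"key_{c}") for c in df.columns]

# transform('first') replaces every row with the first row of its group
# and returns a frame aligned with the original index.
snapped = df.groupby(keys).transform('first')
print(snapped)
```

After this, row 3 holds exactly the same values as row 1, so a plain `snapped.drop_duplicates()` would remove it.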
Hopefully someone who knows pandas better than I do will drop by and show how to do this better.