set() not removing duplicates

I'm trying to find the unique IP addresses in a file using a regex. The matching itself works: I append each match to a list and later call set() on that list to remove duplicates. The list does contain duplicates, but the set never gets any smaller. Printing the set gives the same output as printing ips as a list; nothing is removed.

import re

ips = [] # make a list
count = 0
for line in f: # f is the already-open file; loop through it line by line
    match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line) # find IPs
    if match: # if there's a match, append and keep track of the total number of IPs
        ips.append(match) # append to list
        count = count + 1
ipset = set(ips)
print(ipset, count)

The string <_sre.SRE_Match object; span=(0, 13), match='137.43.92.119'> shows up 60+ times in the output, both before and after calling set() on the list.

asked Oct 21 '25 04:10 by BitFlow


1 Answer

You are not storing the matched strings. You are storing the re.Match objects. These don't compare equal even if they matched the same text, so they are all seen as unique by a set object:

>>> import re
>>> line = '137.43.92.119\n'
>>> match1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
>>> match1
<_sre.SRE_Match object; span=(0, 13), match='137.43.92.119'>
>>> match2 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
>>> match2
<_sre.SRE_Match object; span=(0, 13), match='137.43.92.119'>
>>> match1 == match2
False
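
The effect on a set can be checked directly. The following is a minimal sketch, not the asker's code: the name IP_PATTERN and the three repeated searches are made up for illustration. Match objects fall back to identity comparison, so a set keeps every one of them, while a set of the matched strings collapses to a single entry.

```python
import re

# Hypothetical name for the pattern used in the question.
IP_PATTERN = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
line = "137.43.92.119\n"

# Three separate searches over the same text give three distinct match objects.
matches = [re.search(IP_PATTERN, line) for _ in range(3)]

unique_matches = set(matches)                   # compared by object identity
unique_texts = set(m.group() for m in matches)  # compared by matched text

print(len(unique_matches))  # 3
print(len(unique_texts))    # 1
```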

Extract the matched text instead:

ips.append(match.group()) #append to list

matchobj.group() without arguments returns the part of the string that was matched (group 0):

>>> match1.group()
'137.43.92.119'
>>> match1.group() == match2.group()
True
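
Put together, the loop from the question might look like the sketch below. The function name unique_ips and the sample lines are assumptions for illustration; in the real script the lines would come from the open file f.

```python
import re

# Compile once; a raw string keeps the backslashes intact for the regex engine.
IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def unique_ips(lines):
    """Return (set of distinct IP strings, number of lines that had a match)."""
    ips = []
    for line in lines:
        match = IP_PATTERN.search(line)
        if match:
            ips.append(match.group())  # store the matched text, not the match object
    return set(ips), len(ips)

# Stand-in for the file contents (hypothetical data).
sample_lines = [
    "137.43.92.119 - GET /index.html\n",
    "137.43.92.119 - GET /about.html\n",
    "10.0.0.1 - GET /index.html\n",
    "no address on this line\n",
]

ipset, count = unique_ips(sample_lines)
print(sorted(ipset), count)  # ['10.0.0.1', '137.43.92.119'] 3
```

Because strings compare and hash by value, the two lines with 137.43.92.119 collapse into one set entry while count still reports all three matches.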

answered Oct 22 '25 20:10 by Martijn Pieters


