Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding duplicates in a collection

Tags:

java

guava

What is the best way to find and mark duplicate objects in a Collection? Let us say we have a List persons and our duplicate strategy is based on exact match of first name and last name.

  1. Identify all duplicates
  2. Mark each duplicate person indicating it is a duplicate
  3. For each duplicate person, identify the object it is the duplicate of

Is there a simple way of doing this with guava?

like image 824
Aravind Yarram Avatar asked Nov 12 '11 22:11

Aravind Yarram


4 Answers

You don't need Guava to do this:

List<Person> people = ...
Map<Name, Person> peopleByName = new HashMap<>();
for (Person person : people) {
  // Name is a simple value class with equality based on its fields
  Name name = new Name(person.getFirstName(), person.getLastName());
  Person firstPersonWithName = peopleByName.get(name);
  if (firstPersonWithName == null) {
    peopleByName.put(name, person);
  } else {
    // or whatever you do to mark a duplicate
    person.setDuplicateOf(firstPersonWithName);
  }
}

That said, you could use a Guava Table instead of a Map and avoid needing to create the Name... use first name as the row keys and last name as column keys, say.

Another choice would be to use Multimaps.index to index all the people in your list by name. Then for every list of people mapped to a particular name, the first person would be the first person with that name from your list and the others would be duplicates.

like image 153
ColinD Avatar answered Oct 21 '22 04:10

ColinD


Why not try overriding .equals() in the person object. Then add a new field to each person object 'duplicateOf' or something.

Then just loop over the array, checking each person against the others. If the persons 'duplicateOf' field is null skip it. If .equals() returns true you can set the 'duplicateOf' field.

like image 34
Theblacknight Avatar answered Oct 21 '22 04:10

Theblacknight


You can try to use Guava's TreeMultimap.

Create a new one TreeMultimap initializing it with a comparator for comparing you persons as you like: TreeMultimap.create(Comparator, Ordering.arbitrary())

Here is a unit test:

package org.test.guava;

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.junit.Test;

import com.google.common.collect.Multimap;
import com.google.common.collect.Ordering;
import com.google.common.collect.TreeMultimap;

public class GuavaTest {

    private static class Person {
        private String name;

        public Person(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }

        @Override
        public String toString() {
            return "Person [name=" + name + "]";
        }

    }

    @Test
    public void test() throws Exception {
        List<Person> persons = Arrays.asList(new Person("person1"), new Person("person2"), new Person("person1"));
        Comparator<Person> comparator = new Comparator<Person>() {
            public int compare(Person o1, Person o2) {
                return o1.getName().compareTo(o2.getName());
            }
        };

        Multimap<Person, Person> groups = TreeMultimap.create(comparator, Ordering.arbitrary());
        for(Person person : persons) {
            groups.put(person, person);
        }

        System.out.println(groups.asMap());
    }

}
like image 26
szhem Avatar answered Oct 21 '22 04:10

szhem


The class Person must implement the boolean equals(Object o).

Then you can find duplicates this way:

You have somewhere: Collection<Person> list;

Person[] persons = list.toArray();
Integer[] duplicateOf = new Integer[persons.length];
Arrays.fill(duplicateOf, -1);

// For all the values in the Collection
for (int i = 0; i < persons.length; i++) {

  // Find the duplicate
  for (int j = 0; j < persons.length; j++) {
    if (persons[i].equals(persons[j]) && i != j)
      duplicateOf[j] = i;
  }
}

Now you have the Array duplicateOf which you can read this way: The duplicate of element j is at the index duplicateOf[j].

like image 1
Dimme Avatar answered Oct 21 '22 04:10

Dimme