Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding a majority of unorderable items

I have this one problem with finding solution to this task.

You have N students and N courses. Student can attend only one course and one course can be attended by many students. Two students are classmates if they are attending same course. How to find out if there are N/2 classmates in N students with this?

conditions: You can take two students and ask if they are classmates and only answer you can get is "yes" or "no". And you need to do this in O(N*log(N)).

I need just some idea how to make it, pseudo code will be fine. I guess it will divide the list of students like merge sort, which gives me the logarithmic part of complexity. Any ideas will be great.

like image 700
DonTettino Avatar asked Nov 01 '22 07:11

DonTettino


2 Answers

First, pair off each student (1&2, 3&4, 5&6... etc), and you check and see which pairs are in the same class. The first student of the pair gets "promoted". If there an "oddball" student, they are in their own class, so they get promoted as well. If a single class contains >=50% of the students, then >=50% of the promoted students are also in this class. If no students are promoted, then if a single class contains >=50% of the students then either the first or the second student must be in the class, so simply promote both of them. This leads to the case where >=50% of the promotions are in the large class. This always takes ⌊N/2⌋ comparisons.

Now when we examine the promoted students, then if a class contains >=50% of the students, then >=50% of the promoted students are in this class. Therefore, we can simply recurse, until we reach a stop condition: there are less than three promoted students. At each step we promote <=50% of the students (plus one sometimes), so this step occurs at most ⌈log(N,2)⌉ times.

If there are less than three promoted students, then we know that if >=50% of the original students are in the class, then at least one of these remaining students is in that class. Therefore, we can simply compare each and every original student against these promoted students, which will reveal either (A) the class with >=50% of the students, or (B) that no class has >=50% of the students. This takes at most (N-1) comparisons, but only occurs once. Note that there is the possibility where all the original students match with one of the two remaining students evenly, and this detects that both classes have =50% of the students.

So the complexity is N/2 *~ log(N,2) + N-1. However, the *~ signifies that we don't iterate over all N/2 students at each of the log(N,2) iterations, only decreasing fractions N/2, N/4, N/8..., which sum to N. So the total complexity is N/2 + N/2 + N-1 = 2N-1, and when we remove the constants we get O(N). (I feel like I may have made a math error in this paragraph. If someone spots it, let me know)

See it in action here: http://coliru.stacked-crooked.com/a/144075406b7566c2 (The comparison counts may be slightly over the estimate due to simplifications I made in the implementation)


Key here is that if >50% of the students are in a class, then >=50% of the arbitrary pairs are in that class, assuming the oddball student matches himself. One trick is that if exactly 50% match, it's possible that they alternate in the original order perfectly and thus nobody gets promoted. Luckily, the only cases is the alternating, so by promoting the first and second students, then even in that edge case, >=50% of the promotions are in the large class.

It's complicated to prove that >=50% of the promotions are in the large class, and I'm not even certain I can articulate why this is. Confusingly, it also doesn't hold prettily for any other fractions. If the target is >=30% of the comparisons, it's entirely possible that none of the promoted students are in the target class(s). So >=50% is the magic number, it isn't arbitrary at all.

like image 104
Mooing Duck Avatar answered Nov 15 '22 10:11

Mooing Duck


If one can know the number of students for each course then it should suffice to know if there is a course with a number of students >= N/2. In this case you have a complexity of O(N) in the worst case.

If it is not possible to know the number of students for each course then you could use an altered quicksort. In each cycle you pick a random student and split the other students in classmates and non-classmates. If the number of classmates is >= N/2 you stop because you have the answer, else you analyze the non-classmates partition. If the number of students in that partition is < N/2 you stop because it is not possible to have a quantity of classmates >= N/2, else you pick another student from the non-classmates partition and repeat everything using only the non-classmates elements.

What we take from the quicksort algorithm is just the way we partition the students. The above algorithm has nothing to do with sorting. In pseudo-code it would look something like this (array indexing starts from 1 for the sake of clarity):

Student[] students = all_students;
int startIndex = 1;
int endIndex = N;    // number of students
int i;

while(startIndex <= N/2){
    endIndex = N;    // this index must point to the last position in each cycle
    students.swap(startIndex, start index +  random_int % (endIndex-startIndex));

    for(i = startIndex + 1; i < endIndex;){
        if(students[startIndex].isClassmatesWith(students[i])){
            i++;
        }else{
            students.swap(i,endIndex);
            endIndex--;
        }
    }

    if(i-startIndex >= N/2){
        return true;
    }

    startIndex = i;
}
return false;

The situation of the partition before the algorithm starts would be as simple as:

| all_students_that_must_be_analyzed |

during the first run the set of students would be partitioned this way:

| classmates | to_be_analyzed | not_classmates |

and during each run after it, the set of students would be partitioned as follows:

| to_ignore | classmates | to_be_analyzed | not_classmates |

In the end of each run the set of students would be partitioned in the following way:

| to_ignore | classmates | not_classmates |

At this moment we need to check if the classmates partition has more than N/2 elements. If it has, then we have a positive result, if not we need to check if the not_classmates partition has >= N/2 elements. If it has, then we need to proceed with another run, otherwise we have a negative result.

Regarding complexity

Thinking more in depth on the complexity of the above algorithm, there are two main factors that affect it, which are:

  1. The number of students in each course (it's not necessary to know this number for the algorithm to work).
  2. The average number of classmates found in each iteration of the algorithm.

An important part of the algorithm is the random choice of the student to be analyzed.

The worst case scenario would be when each course has 1 student. In this case (for obvious reasons I would say) the complexity would be O(N^2). If the number of students for the courses varies then this case won't happen.

An example of the worst case scenario would be when we have, let's say, 10 students, 10 courses, and 1 student for each course. We would check 10 students the first time, 9 students the second time, 8 students the third time, and so on. This brings a O(N^2) complexity.

The best case scenario would be when the first student you choose is in a course with a number of students >= N/2. In this case the complexity would be O(N) because it would stop in the first run.

An example of the best case scenario would be when we have 10 students, 5 (or more) of which are classmates, and in the first run we pick one of these 5 students. In this case we would check only 1 time for the classmates, find 5 classmates, and return true.

The average case scenario is the most interesting part (and more close to a real-world scenario). In this case there are some probabilistic calculations to make.

First of all, the chances of a student from a particular course to be picked are [number_of_students_in_the_course] / N. This means that, in the first runs it's more probable to pick a student with many classmates.

That being said, let's consider the case where the average number of classmates found in each iteration is a number smaller that N/2 (as is the length of each partition in the average case for quicksort). Let's say that the average amount of classmates found in each iteration is 10% (number taken for ease of calculations) of the remaining M students (that are not classmates of the previously picked students). In this case we would have these values of M for each iteration:

  1. M1 = N - 0.1*N = 0.9*N
  2. M2 = M1 - 0.1*M1 = 0.9*M1 = 0.9*0.9*N = 0.81*N
  3. M3 = M2 - 0.1*M2 = 0.9*M2 = 0.9*0.81*N = 0.729*N and I would round it to 0.73*N for ease of calculations
  4. M4 = 0.9*M3 = 0.9*0.73*N = 0.657*N ~= 0.66*N
  5. M5 = 0.9*M4 = 0.9*0.66*N = 0.594*N ~= 0.6*N
  6. M6 = 0.9*M5 = 0.9*0.6*N = 0.54*N
  7. M7 = 0.9*M6 = 0.9*0.54*N = 0.486*N ~= 0.49*N
  8. The algorithm stops because we have 49% of remaining students and we can't have more than N/2 classmates among them.

Obviously, in the case of a smaller percentage of average classmates the number of iterations will be greater but, combined with the first fact (the students in courses with many students have a higher probability to get picked in the early iterations), the complexity will tend towards O(N), the number of iterations (in the outer cycle in the pseudo-code) will be (more or less) constant and will not depend on N.

To better explain this scenario let's work with greater (but more realistic) numbers and more than 1 distribution. Let's say that we have 100 students (number taken for the sake of simplicity in calculations) and that these students are distributed among the courses in one of the following (hypothetical) ways (the numbers are sorted just for explanation purposes, they are not necessary for the algorithm to work):

  1. 50, 30, 10, 5, 1, 1, 1, 1, 1
  2. 35, 27, 25, 10, 5, 1, 1, 1
  3. 11, 9, 9, 8, 7, 7, 5, 5, 5, 5, 5, 5, 5, 5, 5, 3, 1

The numbers given are also (in this particular case) the probabilities that a student in a course (not a particular student, just a student of that course) is picked in the first run. The 1st case is when we have a course with half the students. The 2nd case is when we don't have a course with half the students but more than 1 course with many students. The 3rd case is when we have a similar distribution among the courses.

In the 1st case we would have a 50% probability that a student of the 1st course gets picked in the first run, 30% probability that a student of the 2nd course gets picked, 10% probability that a student of the 3rd course gets picked, 5% probability that a student of the 4th course gets picked, 1% that a student from the 5th course gets picked, and so on for the 6th, 7th, 8th, and 9th course. The probabilities are higher for a student of the 1st case to get picked early, and in the case a student from this course does not get picked in the first run the probabilities it gets picked in the second run only increase. For example, let's suppose that in the 1st run a student from the second course is picked. 30% of the students would be "removed" (as in "not considered anymore") and not be analyzed in the 2nd run. In the 2nd run we would have 70 students remaining. The probability to pick a student from the 1st course in the second run would be 5/7, more than 70%. Let's suppose that - out of bad luck - in the 2nd run a student from the 3rd course gets picked. In the 3rd run we would have 60 students left and the probability that a student from the first course gets picked in the 3rd run would be 5/6 (more than 80%). I would say that we can consider our bad luck to be over in the 3rd run, a student from the 1st course gets picked, and the method returns true :)

For the 2nd and 3rd case I would follow the probabilities for each run, just for the sake of simplicity of calculations.

In the 2nd case we would have a student from the 1st course picked in the 1st run. Being that the number of classmates is not <= N/2 the algorithm would go on with the 2nd run. In the end of the 2nd run we would have "removed" from the student set 35+27=62 students. In the 3rd run we would have 38 students left, and being that 38 < (N/2) = 50 the computation stops and returns false.

The same happens in the 3rd case (in which we "remove" an average of 10% of the remaining students in each run), but with more steps.

Final considerations

In any case, the complexity of the algorithm in the worst case scenario is O(N^2). The average case scenario is heavily based on probabilities and tends to pick early the students from courses with many attendees. This behaviour tends to bring the complexity down to O(N), complexity that we also have in the best case scenario.

Test of the algorithm

In order to test the theoretical complexity of the algorithm I wrote the following code in C#:

public class Course
{
    public int ID { get; set; }

    public Course() : this(0) { }

    public Course(int id)
    {
        ID = id;
    }

    public override bool Equals(object obj)
    {
        return (obj is Course) && this.Equals((Course)obj);
    }

    public bool Equals(Course other)
    {
        return ID == other.ID;
    }
}

public class Student
{
    public int ID { get; set; }
    public Course Class { get; set; }

    public Student(int id, Course course)
    {
        ID = id;
        Class = course;
    }

    public Student(int id) : this(id, null) { }

    public Student() : this(0) { }

    public bool IsClassmatesWith(Student other)
    {
        return Class == other.Class;
    }

    public override bool Equals(object obj)
    {
        return (obj is Student) && this.Equals((Student)obj);
    }

    public bool Equals(Student other)
    {
        return ID == other.ID && Class == other.Class;
    }
}

class Program
{
    static int[] Sizes { get; set; }
    static List<Student> Students { get; set; }
    static List<Course> Courses { get; set; }

    static void Initialize()
    {
        Sizes = new int[] { 2, 10, 100, 1000, 10000, 100000, 1000000 };
        Students = new List<Student>();
        Courses = new List<Course>();
    }

    static void PopulateCoursesList(int size)
    {
        for (int i = 1; i <= size; i++)
        {
            Courses.Add(new Course(i));
        }
    }

    static void PopulateStudentsList(int size)
    {
        Random ran = new Random();

        for (int i = 1; i <= size; i++)
        {
            Students.Add(new Student(i, Courses[ran.Next(Courses.Count)]));
        }
    }

    static void Swap<T>(List<T> list, int i, int j)
    {
        if (i < list.Count && j < list.Count)
        {
            T temp = list[i];
            list[i] = list[j];
            list[j] = temp;
        }
    }

    static bool AreHalfOfStudentsClassmates()
    {
        int startIndex = 0;
        int endIndex;
        int i;
        int numberOfStudentsToConsider = (Students.Count + 1) / 2;
        Random ran = new Random();

        while (startIndex <= numberOfStudentsToConsider)
        {
            endIndex = Students.Count - 1;
            Swap(Students, startIndex, startIndex + ran.Next(endIndex + 1 - startIndex));

            for (i = startIndex + 1; i <= endIndex; )
            {
                if (Students[startIndex].IsClassmatesWith(Students[i]))
                {
                    i++;
                }
                else
                {
                    Swap(Students, i, endIndex);
                    endIndex--;
                }
            }

            if (i - startIndex + 1 >= numberOfStudentsToConsider)
            {
                return true;
            }

            startIndex = i;
        }

        return false;
    }

    static void Main(string[] args)
    {
        Initialize();
        int studentsSize, coursesSize;
        Stopwatch stopwatch = new Stopwatch();
        TimeSpan duration;
        bool result;

        for (int i = 0; i < Sizes.Length; i++)
        {
            for (int j = 0; j < Sizes.Length; j++)
            {
                Courses.Clear();
                Students.Clear();
                studentsSize = Sizes[j];
                coursesSize = Sizes[i];
                PopulateCoursesList(coursesSize);
                PopulateStudentsList(studentsSize);

                Console.WriteLine("Test for {0} students and {1} courses.", studentsSize, coursesSize);
                stopwatch.Start();
                result = AreHalfOfStudentsClassmates();
                stopwatch.Stop();
                duration = stopwatch.Elapsed;
                var studentsGrouping = Students.GroupBy(s => s.Class);
                var classWithMoreThanHalfOfTheStudents = studentsGrouping.FirstOrDefault(g => g.Count() >= (studentsSize + 1) / 2);
                Console.WriteLine(result ? "At least half of the students are classmates." : "Less than half of the students are classmates");

                if ((result && classWithMoreThanHalfOfTheStudents == null)
                    || (!result && classWithMoreThanHalfOfTheStudents != null))
                {
                    Console.WriteLine("There is something wrong with the result");
                }

                Console.WriteLine("Test duration: {0}", duration);
                Console.WriteLine();
            }
        }

        Console.ReadKey();
    }
}

The execution time matched the expectations of the average case scenario. Feel free to play with the code, you just need to copy and paste it and it should work.

like image 20
Gentian Kasa Avatar answered Nov 15 '22 08:11

Gentian Kasa