Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing 2 huge lists using C# multiple times (with a twist)

Hey everyone, great community you got here. I'm an Electrical Engineer doing some "programming" work on the side to help pay for bills. I say this because I want you to take into consideration that I don't have proper Computer Science training, but I have been coding for the past 7 years.

I have several excel tables with information (all numeric), basically it is "dialed phone numbers" in one column and number of minutes to each of those numbers on another. Separately I have a list of "carrier prefix code numbers" for the different carriers in my country. What I want to do is separate all the "traffic" per carrier. Here is the scenario:

First dialed number row: 123456789ABCD,100 <-- That would be a 13 digit phone number and 100 minutes.

I have a list of 12,000+ prefix codes for carrier 1, these codes vary in length, and I need to check everyone of them:

Prefix Code 1: 1234567 <-- this code is 7 digits long.

I need to check the first 7 digits for the dialed number an compare it to the dialed number, if a match is found, I would add the number of minutes to a subtotal for later use. Please consider that not all prefix codes are the same length, some times they are shorter or longer.

Most of this should be a piece of cake, and I could should be able to do it, but I'm getting kind of scared with the massive amount of data; Some times the dialed number lists consists of up to 30,000 numbers, and the "carrier prefix code" lists around 13,000 rows long, and I usually check 3 carriers, that means I have to do a lot of "matches".

Does anyone have an idea of how to do this efficiently using C#? Or any other language to be kind honest. I need to do this quite often and designing a tool to do it would make much more sense. I need a good perspective from someone that does have that "Computer Scientist" background.

Lists don't need to be in excel worksheets, I can export to csv file and work from there, I don't need an "MS Office" interface.

Thanks for your help.

Update:

Thank you all for your time on answering my question. I guess in my ignorance I over exaggerated the word "efficient". I don't perform this task every few seconds. It's something I have to do once per day and I hate to do with with Excel and VLOOKUPs, etc.

I've learned about new concepts from you guys and I hope I can build a solution(s) using your ideas.

like image 400
gus Avatar asked Nov 29 '22 20:11

gus


2 Answers

UPDATE

You can do a simple trick - group the prefixes by their first digits into a dictionary and match the numbers only against the correct subset. I tested it with the following two LINQ statements assuming every prefix has at least three digis.

const Int32 minimumPrefixLength = 3;

var groupedPefixes = prefixes
    .GroupBy(p => p.Substring(0, minimumPrefixLength))
    .ToDictionary(g => g.Key, g => g);

var numberPrefixes = numbers
    .Select(n => groupedPefixes[n.Substring(0, minimumPrefixLength)]
        .First(n.StartsWith))
    .ToList();

So how fast is this? 15.000 prefixes and 50.000 numbers took less than 250 milliseconds. Fast enough for two lines of code?

Note that the performance heavily depends on the minimum prefix length (MPL), hence on the number of prefix groups you can construct.

     MPL    Runtime
    -----------------
     1     10.198 ms
     2      1.179 ms
     3        205 ms
     4        130 ms
     5        107 ms

Just to give an rough idea - I did just one run and have a lot of other stuff going on.

Original answer

I wouldn't care much about performance - an average desktop pc can quiete easily deal with database tables with 100 million rows. Maybe it takes five minutes but I assume you don't want to perform the task every other second.

I just made a test. I generated a list with 15.000 unique prefixes with 5 to 10 digits. From this prefixes I generated 50.000 numbers with a prefix and additional 5 to 10 digits.

List<String> prefixes = GeneratePrefixes();
List<String> numbers = GenerateNumbers(prefixes);

Then I used the following LINQ to Object query to find the prefix of each number.

var numberPrefixes = numbers.Select(n => prefixes.First(n.StartsWith)).ToList();

Well, it took about a minute on my Core 2 Duo laptop with 2.0 GHz. So if one minute processing time is acceptable, maybe two or three if you include aggregation, I would not try to optimize anything. Of course, it would be realy nice if the programm could do the task in a second or two, but this will add quite a bit of complexity and many things to get wrong. And it takes time to design, write, and test. The LINQ statement took my only seconds.

Test application

Note that generating many prefixes is really slow and might take a minute or two.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

namespace Test
{
    static class Program
    {
        static void Main()
        {
            // Set number of prefixes and calls to not more than 50 to get results
            // printed to the console.

            Console.Write("Generating prefixes");
            List<String> prefixes = Program.GeneratePrefixes(5, 10, 15);
            Console.WriteLine();

            Console.Write("Generating calls");
            List<Call> calls = Program.GenerateCalls(prefixes, 5, 10, 50);
            Console.WriteLine();

            Console.WriteLine("Processing started.");

            Stopwatch stopwatch = new Stopwatch();

            const Int32 minimumPrefixLength = 5;

            stopwatch.Start();

            var groupedPefixes = prefixes
                .GroupBy(p => p.Substring(0, minimumPrefixLength))
                .ToDictionary(g => g.Key, g => g);

            var result = calls
                .GroupBy(c => groupedPefixes[c.Number.Substring(0, minimumPrefixLength)]
                    .First(c.Number.StartsWith))
                .Select(g => new Call(g.Key, g.Sum(i => i.Duration)))
                .ToList();

            stopwatch.Stop();

            Console.WriteLine("Processing finished.");
            Console.WriteLine(stopwatch.Elapsed);

            if ((prefixes.Count <= 50) && (calls.Count <= 50))
            {
                Console.WriteLine("Prefixes");
                foreach (String prefix in prefixes.OrderBy(p => p))
                {
                    Console.WriteLine(String.Format("  prefix={0}", prefix));
                }

                Console.WriteLine("Calls");
                foreach (Call call in calls.OrderBy(c => c.Number).ThenBy(c => c.Duration))
                {
                    Console.WriteLine(String.Format("  number={0} duration={1}", call.Number, call.Duration));
                }

                Console.WriteLine("Result");
                foreach (Call call in result.OrderBy(c => c.Number))
                {
                    Console.WriteLine(String.Format("  prefix={0} accumulated duration={1}", call.Number, call.Duration));
                }
            }

            Console.ReadLine();
        }

        private static List<String> GeneratePrefixes(Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<String> prefixes = new List<String>(count);
            StringBuilder stringBuilder = new StringBuilder(maximumLength);

            while (prefixes.Count < count)
            {
                stringBuilder.Length = 0;

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                String prefix = stringBuilder.ToString();

                if (prefixes.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                if (prefixes.All(p => !p.StartsWith(prefix) && !prefix.StartsWith(p)))
                {
                    prefixes.Add(stringBuilder.ToString());
                }
            }

            return prefixes;
        }

        private static List<Call> GenerateCalls(List<String> prefixes, Int32 minimumLength, Int32 maximumLength, Int32 count)
        {
            Random random = new Random();
            List<Call> calls = new List<Call>(count);
            StringBuilder stringBuilder = new StringBuilder();

            while (calls.Count < count)
            {
                stringBuilder.Length = 0;

                stringBuilder.Append(prefixes[random.Next(prefixes.Count)]);

                for (int i = 0; i < random.Next(minimumLength, maximumLength + 1); i++)
                {
                    stringBuilder.Append(random.Next(10));
                }

                if (calls.Count % 1000 == 0)
                {
                    Console.Write(".");
                }

                calls.Add(new Call(stringBuilder.ToString(), random.Next(1000)));
            }

            return calls;
        }

        private class Call
        {
            public Call (String number, Decimal duration)
            {
                this.Number = number;
                this.Duration = duration;
            }

            public String Number { get; private set; }
            public Decimal Duration { get; private set; }
        }
    }
}
like image 130
Daniel Brückner Avatar answered Dec 06 '22 23:12

Daniel Brückner


It sounds to me like you need to build a trie from the carrier prefixes. You'll end up with a single trie, where the terminating nodes tell you the carrier for that prefix.

Then create a dictionary from carrier to an int or long (the total).

Then for each dialed number row, just work your way down the trie until you find the carrier. Find the total number of minutes so far for the carrier, and add the current row - then move on.

like image 21
Jon Skeet Avatar answered Dec 07 '22 00:12

Jon Skeet