I'm given a very large list of numbers one at a time, and I need to print the median.
To be more clear, there can be 125,000,000 numbers, and each number is guaranteed to be less than 1e18.
It's for a contest, so there is a memory limit (20 MB tops) and a time limit (5 seconds tops).
The median is the number in the middle of the sorted list.
For instance if this is the list of numbers:
23
8
16
42
15
4
108
After sorting the numbers:
1) 4
2) 8
3) 15
4) 16
5) 23
6) 42
7) 108
the median would be 16.
I searched the Internet for this, but I couldn't find any answer that passes these limits.
My approach was to read all the numbers, save them in a text file, sort the file, and then take the median.
So I want to either optimize one of these ideas so that it passes the limits, or find a new idea that does.
I prefer the second idea because, unlike the other two, it passes the limits, but I can't implement it: I don't know how to insert a line in the middle of a text file. If I learn that, the rest of the process will be easy.
This is a function that receives a number, reads through a file to find the best place for it, and inserts it there.
It is, in fact, my third idea.
It works (I tested it with lots of inputs), but as I mentioned before, the problem is the time limit.
void insertNewCombinedNumber ( long long combinedNumber ) // long long: values go up to 1e18, which overflows int
{
    char combinedNumberCharacterArray[ 20 ];
    bool isInserted = false;
    ofstream combinedNumbersOutputFile;
    ifstream combinedNumbersInputFile;

    // Alternate between the two files: read from one, write the merged result to the other
    const char * outputFileName = isFirstCombinedFileActive ? "Combined Numbers - File 01.txt"
                                                            : "Combined Numbers - File 02.txt";
    const char * inputFileName  = isFirstCombinedFileActive ? "Combined Numbers - File 02.txt"
                                                            : "Combined Numbers - File 01.txt";
    combinedNumbersOutputFile.open ( outputFileName );
    combinedNumbersInputFile.open ( inputFileName );

    // If the input file does not exist yet, create it empty and reopen it
    if ( !combinedNumbersInputFile )
    {
        combinedNumbersInputFile.close ();
        ofstream combinedNumbersInputCreateFile ( inputFileName );
        combinedNumbersInputCreateFile.close ();
        combinedNumbersInputFile.open ( inputFileName );
    }

    // Copy the existing numbers across, inserting the new number at its sorted position
    combinedNumbersInputFile.getline ( combinedNumberCharacterArray , 20 );
    while ( !combinedNumbersInputFile.eof () )
    {
        if ( !isInserted && combinedNumber <= characterArrayToDecimal ( combinedNumberCharacterArray ) )
        {
            combinedNumbersOutputFile << combinedNumber << endl;
            isInserted = true;
        }
        combinedNumbersOutputFile << combinedNumberCharacterArray << endl;
        combinedNumbersInputFile.getline ( combinedNumberCharacterArray , 20 );
    }

    // Largest number so far: append it at the end
    if ( !isInserted )
    {
        combinedNumbersOutputFile << combinedNumber << endl;
    }

    isFirstCombinedFileActive = !isFirstCombinedFileActive;
    combinedNumbersOutputFile.close ();
    combinedNumbersInputFile.close ();
}
To sort a file containing numeric data, use the -n flag with the sort command. By default, sort arranges the data in ascending order; if you want descending order, add the -r option alongside -n.
Assumptions:
I will assume the list of numbers is already in binary form, because we will need multiple passes through the data, and converting text to binary on each pass would take extra processing time. That is a file of about 1 GB (125M × 64-bit values).
It is also not clear whether the OS disk cache for that file counts toward the memory limit. I will assume it does not, because reading a 1 GB file cold from disk multiple times would already take more than 5 seconds.
Solution:
So let's start with a simple example of how this could be done (we will optimize and adjust it later):
Make a uint32 array of size (max value / 1 million), where we put the count of each bucket (0-999999, 1000000-1999999, and so on).
Of course, we need to adjust the above a bit.
First, instead of using ranges of 1 million, it's better to use a power of two. That way we can get the position in the bucket/count list with a simple AND against a mask, instead of a more expensive division.
Second, using bucket ranges of 1 million from the start would require an array with about 10^12 entries, which is way too big.
So the best option is to do 3 passes: a first pass with ranges of say 1e12 (1e18 / 1e12 = 1 million buckets, i.e. a 4 MB uint32 array, which fits in the 20 MB limit), and then, for the one range the median falls in, a second pass with ranges of 1e6 (in practice, powers of two near those values).
This way you only have to sort the numbers belonging to one small bucket instead of the whole set of 125 million. Sorting that small bucket takes O(n log n) on a tiny n.
An example with the numbers given in the question:
23
8
16
42
15
4
108
Use buckets/ranges of 16 - first pass:
array_pos count
0 (0-15) 3
1 (16-31) 2
2 (32-47) 1
3 (48-63) 0
4 (64-79) 0
5 (80-95) 0
6 (96-111) 1
We can now calculate that the median must be in the bucket at array_pos 1.
Remember/store these values:
Count before bucket 16-31: 3
Count after bucket 16-31: 2
Second pass - read values for bucket (16-31) - (again, if the bucket sizes are a power of two we can use some bitmasking to quickly check if the number is in the range):
23
16
Sort this small array and calculate the position of the median using the two counts (before and after):
count before: 3
16   <- median (4th of 7 values)
23
count after: 2
What you really need is a divide-and-conquer algorithm for this kind of problem. Have a look at the external merge sort and distribution sort sections in External Sorting.
The idea is to sort the data in multiple chunks and then merge those chunks back together using a divide-and-conquer approach.
It has a time complexity of O(n log n), which I think will pass the time limit.
These algorithms are quite famous and you can just google for the implementation details.