
Memory-efficient distributed approach to determining unique values?

Problem

I'm trying to normalize columns in very large, raw, de-normalized CSV tables. Column values are short strings (10-100 bytes). I'm trying to find a faster solution than my current approaches.

Example 

input.csv

john,london
jean,paris
bill,london

This is converted to the following files:

input.normalized.csv

1,1
2,2
3,1

input.col1.csv

1,john
2,jean
3,bill

input.col2.csv

1,london
2,paris

I've currently got two approaches to normalizing these datasets.

Current Approaches

Single pass in-memory

A single-pass approach that stores a column value -> normalized_id mapping in an associative array (a Java HashMap in my case). This will run out of memory at some point, but it's fast when it can store everything in memory. A simple way of lowering memory usage would be to do a single pass per column.
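Roughly, the in-memory pass for a single column looks like this (a minimal sketch; class and method names are illustrative, not my actual code):

import java.util.HashMap;
import java.util.Map;

// Single-pass in-memory normalization of one column: every distinct
// value gets the next consecutive id. Memory grows with the number of
// distinct values, which is what eventually blows up.
class ColumnNormalizer {
    private final Map<String, Integer> ids = new HashMap<>();
    private int nextId = 1;

    int idFor(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = nextId++;
            ids.put(value, id);
        }
        return id;
    }
}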

Multipass sorting

A multipass approach based on sorting. Column values get their line number attached, and are then sorted (in a memory-efficient merge-sort manner). For example, the column values london,paris,london have line numbers attached and are then sorted: london;1,london;3,paris;2.

I can now keep a single "unique value counter", and simply compare each value with the previous value (e.g. london == london, so do not increment the unique value counter). At the end, I have unique_id,linenum pairs that I can sort by line number to reconstruct the normalized column. Columns can then be merged in a single pass.

This approach can be done in very limited memory, depending on the memory usage of the sorting algorithm applied. The good news is that this approach is easy to implement in something like Hadoop, utilising its distributed sorting step.
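The counting step after the sort is essentially the following (a minimal sketch; it assumes the value;linenum pairs have already been sorted by value, e.g. by an external merge sort or by Hadoop's shuffle):

import java.io.PrintWriter;
import java.util.Iterator;

// Compare-with-previous pass over (value, lineNum) pairs sorted by value.
// Emits (uniqueId, lineNum) pairs, which are later re-sorted by lineNum
// to reconstruct the normalized column.
class UniqueCounter {
    static void assignIds(Iterator<String[]> sortedPairs, PrintWriter out) {
        String previous = null;
        int uniqueId = 0;
        while (sortedPairs.hasNext()) {
            String[] pair = sortedPairs.next();   // pair[0] = value, pair[1] = line number
            if (!pair[0].equals(previous)) {
                uniqueId++;                       // new distinct value
                previous = pair[0];
            }
            out.println(uniqueId + "," + pair[1]);
        }
    }
}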

MY QUESTION

The multipass approach is painfully slow compared to a single-pass approach (or a single-pass-per-column approach). So I'm wondering what the best way to optimize that approach would be, or if someone could suggest alternative approaches?

I reckon I'm looking for a (distributed) key-value store of some kind, with as low memory usage as possible.

It seems to me that using Trove would be a good, simple alternative to using Java HashMaps, but I'd like something that can handle the distribution of keys for me.

Redis would probably be a good bet, but I'm not impressed by its memory usage per key-value pair.

asked Jan 29 '26 by jkgeyti


1 Answer

Do you know the rough order of magnitude of the input columns? If so, and if you don't need to preserve the original input file order, then you can just use a sufficiently large hash function to avoid collisions for the input keys.
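For example (a minimal sketch; SHA-256 truncated to 64 bits is just one illustration of a "sufficiently large" hash, and whether 64 bits is enough depends on your key count):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Derive a fixed-size id directly from the value, so no shared counter
// or lookup table is needed. Collisions are improbable, not impossible,
// so the hash has to be wide enough for the number of distinct values.
class HashIds {
    static long idFor(String value) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        long id = 0;
        for (int i = 0; i < 8; i++) {   // take the first 64 bits of the digest
            id = (id << 8) | (digest[i] & 0xFF);
        }
        return id;
    }
}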

If you insist on having a dense consecutive key space, then you've already covered the two primary choices. You could certainly try Redis; I've seen it used for tens of millions of key-value pairs, but it is probably not going to scale beyond that. You could also try memcached. It might have a slightly lower memory overhead than Redis, but I would definitely experiment with both, since they are fairly similar for this particular usage. You don't actually need Redis's advanced data structures.
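If you do go the Redis route, the id assignment only needs plain string keys and a counter, roughly like this (a sketch using the Jedis client; key names are made up, and a lost race simply wastes one counter increment, leaving a gap):

import redis.clients.jedis.Jedis;

// Assign an id per distinct value using only plain Redis strings:
// a per-column counter plus one key per value.
class RedisIds {
    static String idFor(Jedis jedis, String column, String value) {
        String key = column + ":" + value;            // e.g. "col1:london"
        String id = jedis.get(key);
        if (id == null) {
            long candidate = jedis.incr(column + ":counter");
            if (jedis.setnx(key, Long.toString(candidate)) == 1L) {
                id = Long.toString(candidate);        // we won the race
            } else {
                id = jedis.get(key);                  // another worker set it first
            }
        }
        return id;
    }
}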

If you need more key-values than you can store in memory on a single machine, you could fall back to something like BDB or Kyoto Cabinet, but eventually this step is going to become the bottleneck of your processing. The other red flag: if you can fit an entire column in memory on a single machine, then why are you using Hadoop?

Honestly, relying on a dense ordered primary key is one of the first things that gets thrown out in a NoSQL DB as it assumes a single coordinated master. If you can allow for even some gaps, then you can do something similar to a vector clock.
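If gaps are acceptable, one simple variant is to give each worker its own disjoint slice of the id space (a sketch; how workerId gets assigned is left out and assumed to happen out of band):

// Gap-tolerant id assignment with no central coordinator: the worker id
// occupies the high bits, so ids are globally unique but the id space
// ends up sparse rather than dense and consecutive.
class PartitionedIds {
    private final long workerId;    // assumed assigned out of band, fits in 16 bits
    private long localCounter = 0;

    PartitionedIds(long workerId) {
        this.workerId = workerId;
    }

    long nextId() {
        return (workerId << 48) | ++localCounter;   // 16 bits worker, 48 bits counter
    }
}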

One final alternative would be to use a map-reduce job to collect all the duplicate values up by key and then assign a unique value using some external transactional DB counter. However, the map-reduce job is essentially a multi-pass approach, so it may be worse. The main advantage is that you will be getting some IO parallelism. (Although the id assignment is still a serial transaction.)
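Roughly, such a job could look like the following (a sketch against the Hadoop mapreduce API; it assumes each input line is already a single column value, and ExternalCounter is a purely hypothetical stand-in for whatever transactional counter you use):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class NormalizeJob {

    // Map: emit (value, offset) so the shuffle groups all duplicates of a
    // value together. With TextInputFormat the key is the byte offset of
    // the line, which serves as a line identifier here.
    static class ValueMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(value, new LongWritable(offset.get()));
        }
    }

    // Reduce: one call per distinct value; fetch one id and emit it for
    // every line that carried the value.
    static class IdReducer extends Reducer<Text, LongWritable, LongWritable, LongWritable> {
        @Override
        protected void reduce(Text value, Iterable<LongWritable> offsets, Context ctx)
                throws IOException, InterruptedException {
            long id = ExternalCounter.next();   // the serial transaction mentioned above
            for (LongWritable offset : offsets) {
                ctx.write(new LongWritable(id), offset);
            }
        }
    }

    // Local stand-in so the sketch compiles; the real thing would be a
    // shared transactional counter (e.g. a DB sequence), not per-JVM state.
    static class ExternalCounter {
        private static long counter = 0;
        static synchronized long next() { return ++counter; }
    }
}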

answered Jan 31 '26 by b4hand


