I'm working with some binary data that I have stored in arbitrarily long arrays of unsigned ints. I've found that I have some duplication of data, and am looking to ignore duplicates in the short term and remove whatever bug is causing them in the long term.
I'm looking at inserting each dataset into a map before storing it, but only if it was not found in the map to start with. My initial thought was to have a map of strings and use memcpy as a hammer to force the ints into a character array, and then copy that into a string and store the string. This failed because a good deal of my data contains multiple bytes of 0 (aka NULL) at the front of the relevant data, so a majority of very real data got thrown out.
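For reference, a minimal sketch of how the string idea could keep zero bytes, assuming the data was lost because the string was built from a bare char pointer (which stops at the first 0 byte). The helper name to_byte_string is hypothetical and not part of the original code; it relies on the std::string constructor that takes a pointer plus an explicit length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copies the raw bytes of `words` into a std::string.
// The (pointer, length) constructor copies exactly `length` bytes, so
// embedded or leading 0 bytes are preserved instead of ending the string.
std::string to_byte_string(const std::vector<unsigned int>& words) {
    if (words.empty()) {
        return std::string();  // avoid passing a possibly null data() pointer
    }
    const char* bytes = reinterpret_cast<const char*>(words.data());
    const std::size_t length = words.size() * sizeof(unsigned int);
    return std::string(bytes, length);
}
```

With this, two datasets that differ only after a run of zero bytes still produce different strings, so they would not collide as map keys.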
My next attempt is planned to be std::map<std::vector<unsigned char>,int>, but I'm realizing that I don't know if the map insert function will work.
Is this doable, even if ill advised, or is there a better way to approach this problem?
Edit
So it's been remarked that I didn't make clear what I'm doing, so here's a hopefully better description.
I'm working on generating a minimum spanning tree, given that I have a number of trees containing the actual end nodes I'm working with. The goal is to come up with the selection of trees that has the shortest length and that covers all of the end nodes, where the chosen trees share at most one node with each other and are all connected. I'm basing my approach off of a binary decision tree, but making a few changes to hopefully allow for greater parallelism.
Rather than taking the binary tree approach I've opted to make a bit vector out of unsigned integers for each dataset, where a 1 in a bit position indicates the inclusion of the corresponding tree.
For example if just tree 0 were included in a 5 tree dataset I would start with
00001
From here I can generate:
00011
00101
01001
10001
Each of these can then be processed in parallel, since none of them depend on each other. I do this for all of the single trees (00010, 00100, etc.) and, although I haven't taken the time to formally prove it, I should be able to generate all values in the range (0,2^n) once and only once.
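The single expansion step in the example above can be sketched roughly as follows. This assumes at most 32 trees so that one 32-bit word is enough, whereas the real data uses arbitrarily long arrays of unsigned ints. The name expand is hypothetical, and the sketch only reproduces the step shown (turning on one unset bit at a time); whatever rule is meant to keep results unique across starting points is not described here and is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns every mask obtained from `mask` by setting
// exactly one bit that is currently 0 among the lowest `tree_count` bits.
// Assumes tree_count <= 32 so a single 32-bit word holds the whole dataset.
std::vector<std::uint32_t> expand(std::uint32_t mask, unsigned tree_count) {
    assert(tree_count <= 32);
    std::vector<std::uint32_t> children;
    for (unsigned bit = 0; bit < tree_count; ++bit) {
        const std::uint32_t flag = std::uint32_t{1} << bit;
        if ((mask & flag) == 0) {
            children.push_back(mask | flag);  // include the tree at `bit`
        }
    }
    return children;
}
```

Called with the mask 00001 and a tree count of 5, this yields 00011, 00101, 01001 and 10001, in that order.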
I started to notice that many datasets were taking far longer to complete than I thought they should, so I enabled debugging output to look at all of the generated results. A quick Perl script later, it was confirmed that I had multiple processes generating the same output. Since then I've been trying to work out where the duplicates are coming from, with very little success, and I'm hoping that this will work well enough to let me verify the results that are being generated without the sometimes 3-day wait on computations.
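You will not have problems with that: std::vector provides the "==", "<" and ">" operators, and the default ordering of std::map (std::less) only needs "<", which compares vectors lexicographically: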
http://en.cppreference.com/w/cpp/container/vector/operator_cmp
The requirements for being a key in std::map are satisfied by std::vector, so yes you can do that. Sounds like a good temporary solution (easy to code, minimum of hassle), but you know what they say: "there is nothing more permanent than the temporary".
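A minimal sketch of the duplicate check, assuming the key is a std::vector<unsigned int> (the same idea applies to std::vector<unsigned char>). The names Dataset, SeenMap, record_if_new and demo are made up for illustration. std::map::insert leaves the map untouched when an equal key already exists and reports that through the bool in its return value, so no separate find is needed.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical names, for illustration only.
using Dataset = std::vector<unsigned int>;
using SeenMap = std::map<Dataset, int>;

// Returns true if `data` was not in the map and has now been stored,
// false if an equal dataset was already present (the map is unchanged).
bool record_if_new(SeenMap& seen, const Dataset& data, int tag) {
    // insert() returns a pair<iterator, bool>; the bool is false when a
    // key comparing equal to `data` already exists.
    auto result = seen.insert({data, tag});
    return result.second;
}

// Small self-check of the intended behaviour.
void demo() {
    SeenMap seen;
    assert(record_if_new(seen, Dataset{0u, 0u, 5u}, 1));   // first time: stored
    assert(!record_if_new(seen, Dataset{0u, 0u, 5u}, 2));  // duplicate: ignored
    assert(record_if_new(seen, Dataset{0u, 5u}, 3));       // leading zeros matter
    assert(seen.size() == 2);
}
```

Because the comparison is element by element, leading zero words are treated like any other value. If the int payload is never used, a std::set<Dataset> would express the same check with less baggage.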