I've been using C/C++/CUDA for less than a week and am not familiar with all the options available in terms of libraries (sorry if my question is too wacky or impossible). Here's my problem: I have a process that takes data, analyzes it, and then does one of three things: (1) saves the results, (2) discards the results, or (3) breaks the data down and sends it back to be processed.
Option (3) often creates a lot of data, and I very quickly exceed the memory available to me (my server has 16 GB). I got around that by setting up a queue server (RabbitMQ) that I send work to and receive work from (it swaps the queue to disk once it reaches a certain size in memory). This worked perfectly when I used small servers with faster NICs to transfer the data, but lately I have been learning C/C++ and converting my code from Java to run on a GPU, which has made the queues a big bottleneck. The bottleneck was obviously network I/O: profiling on cheap systems and old GPUs showed high CPU usage, but the newer, faster CPUs/GPUs are not being utilized as much while network I/O stays steady at 300-400 mb/s. So I decided to eliminate the network entirely and run the queue server locally on the same machine. That made it faster, but I suspect it could be faster still with a solution that doesn't rely on external network services (even ones running locally). It may not work, but I want to experiment.
So my question is: is there anything I can use like a queue, where I can remove entries as I read them, that also swaps the queue to disk once it reaches a certain size (but keeps the in-memory queue always full so I don't have to wait on disk reads)? When learning about CUDA, I see many examples of researchers running analysis on huge datasets. Any ideas how they keep data coming in at the fastest rate the system can process? (I imagine they aren't bound by disk/network, otherwise faster GPUs wouldn't really give them orders-of-magnitude increases in performance.)
Does anything like this exist?
P.S. If it helps, so far I have experimented with RabbitMQ (too slow for my situation), Apollo MQ (good but still network based), Redis (really liked it but it cannot exceed physical memory), and mmap(), and I've also compressed my data to get better throughput. I know the general solutions, but I'm wondering if there's something native to C/C++ or CUDA, or a library I can use. Ideally, I would have a queue in CUDA global memory that swapped to host memory, which in turn swapped to disk, so the GPUs would always be at full speed, but that may be wishful thinking. If there's anything else you can think of, let me know and I'd enjoy experimenting with it (if it helps, I develop on a Mac and run on Linux).
Let me suggest something quite different.
Building a custom solution would not be excessively hard for an experienced programmer, but it is probably impossible for an inexperienced or even intermediate programmer to produce something robust and reliable.
Have you considered a DBMS?
For small data sets it will all be cached in memory. As it grows, the DBMS will have some very sophisticated caching/paging techniques. You get goodies like sorting/prioritisation, synchronisation/sharing for free.
A really well-written custom solution will be much faster than a DBMS, but developing and maintaining it will be very costly. Spend a bit of time optimising and tuning the DBMS and it starts looking pretty fast, and it will be very robust.
It may not fit your needs, but I'd suggest having a long hard look at a DBMS before you reject it.
There's an open source implementation of the Standard Template Library containers that's created to address exactly this problem.
STXXL nearly transparently swaps data to disk for its counterparts of the standard STL containers (such as vector, queue, and priority_queue). It's very well-written and well-maintained, and it is very easy to adapt or migrate your code to, given its similarity to the STL.
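To make the idea concrete, here is a rough sketch of what a disk-spilling FIFO does under the hood. This is not STXXL's API or implementation; `SpillQueue` and all of its members are hypothetical names made up for illustration. It assumes fixed-size, trivially copyable records and a single thread, and it refills one record at a time, whereas a serious implementation would read in large blocks and prefetch in the background.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <stdexcept>
#include <type_traits>

// Hypothetical sketch: a FIFO that keeps at most `mem_capacity` items in RAM
// and appends the overflow to a temporary file.
template <class T>
class SpillQueue {
    static_assert(std::is_trivially_copyable<T>::value,
                  "records are written to disk as raw bytes");
public:
    explicit SpillQueue(std::size_t mem_capacity)
        : cap_(mem_capacity), file_(std::tmpfile()) {
        if (cap_ == 0) throw std::invalid_argument("capacity must be >= 1");
        if (!file_) throw std::runtime_error("tmpfile failed");
    }
    ~SpillQueue() { std::fclose(file_); }
    SpillQueue(const SpillQueue&) = delete;
    SpillQueue& operator=(const SpillQueue&) = delete;

    void push(const T& v) {
        // Items on disk are newer than items in memory, so to preserve FIFO
        // order we may only push straight to memory when the disk is empty.
        if (on_disk() == 0 && mem_.size() < cap_) {
            mem_.push_back(v);
            return;
        }
        seek(write_);
        if (std::fwrite(&v, sizeof(T), 1, file_) != 1)
            throw std::runtime_error("spill write failed");
        ++write_;
    }

    T pop() {
        assert(!empty());
        T v = mem_.front();
        mem_.pop_front();
        refill();  // top the in-memory part back up from disk
        return v;
    }

    bool empty() const { return mem_.empty(); }
    std::size_t in_memory() const { return mem_.size(); }
    std::size_t on_disk() const { return write_ - read_; }
    std::size_t size() const { return in_memory() + on_disk(); }

private:
    void seek(std::size_t record) {
        // long offsets limit the file size on platforms where long is 32-bit.
        std::fseek(file_, static_cast<long>(record * sizeof(T)), SEEK_SET);
    }
    void refill() {
        while (mem_.size() < cap_ && read_ < write_) {
            T v;
            seek(read_);
            if (std::fread(&v, sizeof(T), 1, file_) != 1)
                throw std::runtime_error("spill read failed");
            mem_.push_back(v);
            ++read_;
        }
        if (read_ == write_) read_ = write_ = 0;  // disk drained: reuse the file
    }

    std::size_t cap_;
    std::FILE* file_;
    std::deque<T> mem_;
    std::size_t read_ = 0;   // index of next record to read from disk
    std::size_t write_ = 0;  // index of next record to write to disk
};
```

With something like `SpillQueue<int> q(4);`, the first four pushes stay in memory, later pushes go to the temporary file, and each `pop()` pulls the oldest spilled record back in so the in-memory part stays full.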
Another option is to use the existing STL containers but specify a disk-backed allocator. All the STL containers have a template parameter for the STL allocator, which specifies how the memory for entries is stored. There's a good disk-backed STL allocator that's on the tip of my tongue, but I can't seem to find it via Google (I'll update this if/when I do).
Edit: I see Roger had actually already mentioned this in the comments.
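For a feel of what a disk-backed allocator could look like, here is a minimal sketch built on `mmap()` over a temporary file. It is not the allocator alluded to above (that one is never named); `MmapFileAllocator` is a hypothetical name, the `/tmp` location is an assumption, and it targets POSIX systems such as Linux and macOS. It creates one file per allocation, so it only makes sense for containers that make a few large allocations, like `std::vector`, not node-based ones like `std::list`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <new>
#include <vector>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical sketch: each allocation is a memory-mapped temporary file, so
// the kernel can write dirty pages back to that file under memory pressure.
template <class T>
struct MmapFileAllocator {
    using value_type = T;

    MmapFileAllocator() = default;
    template <class U>
    MmapFileAllocator(const MmapFileAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 0 || n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_alloc();
        const std::size_t bytes = n * sizeof(T);

        char path[] = "/tmp/mmap_alloc_XXXXXX";  // assumed scratch location
        int fd = mkstemp(path);
        if (fd < 0) throw std::bad_alloc();
        unlink(path);  // the file disappears once it is unmapped

        if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
            close(fd);
            throw std::bad_alloc();
        }
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t n) noexcept {
        int rc = munmap(p, n * sizeof(T));
        assert(rc == 0);
        (void)rc;
    }
};

// The allocator is stateless, so any two instances are interchangeable.
template <class T, class U>
bool operator==(const MmapFileAllocator<T>&, const MmapFileAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const MmapFileAllocator<T>&, const MmapFileAllocator<U>&) noexcept {
    return false;
}
```

It plugs in through the container's allocator parameter, for example `std::vector<int, MmapFileAllocator<int>> v;`. Whether this beats a hand-managed queue depends heavily on the access pattern, since the kernel decides which pages stay resident.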