In a nutshell, I have a customer who wants the data contained in a bunch of ASCII text files (a.k.a. "input files") ingested into Accumulo.
These files are output from diverse data feed devices and will be generated continuously on non-Hadoop/non-Accumulo node(s) (a.k.a. "feed nodes"). The overall data throughput rate across all feeds is expected to be very high.
For the sake of simplicity, assume that all the data will end up in one forward index table and one inverted (reverse) index table in Accumulo.
I've already written an Accumulo client module using pyaccumulo that can establish a connection to Accumulo through the Thrift Proxy, read and parse the input files from a local filesystem (not HDFS), create the appropriate forward and reverse index mutations in code, and use a BatchWriter to write the mutations to the forward and reverse index tables. So far, so good. But there's more to it.
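For concreteness, the client does roughly the following (a minimal sketch; the table names, input record format, and parse_line() helper below are placeholders for illustration, not my real ones):

```python
# Minimal sketch of the existing client: pyaccumulo through the Thrift Proxy,
# parsing local input files and writing forward + reverse index mutations.
from pyaccumulo import Accumulo, Mutation

FORWARD_TABLE = "fwd_index"    # placeholder table names
REVERSE_TABLE = "rev_index"

conn = Accumulo(host="proxy-host", port=42424, user="user", password="secret")

fwd_writer = conn.create_batch_writer(FORWARD_TABLE)
rev_writer = conn.create_batch_writer(REVERSE_TABLE)

def parse_line(line):
    """Placeholder parser: returns (doc_id, field, term) for one input record."""
    doc_id, field, term = line.rstrip("\n").split("\t")
    return doc_id, field, term

with open("/data/feed/input_0001.txt") as f:   # local file, not HDFS
    for line in f:
        doc_id, field, term = parse_line(line)

        # Forward index: row = document id, column qualifier = term
        fwd = Mutation(doc_id)
        fwd.put(cf=field, cq=term, val="")
        fwd_writer.add_mutation(fwd)

        # Reverse (inverted) index: row = term, column qualifier = document id
        rev = Mutation(term)
        rev.put(cf=field, cq=doc_id, val="")
        rev_writer.add_mutation(rev)

fwd_writer.close()
rev_writer.close()
conn.close()
```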
From various sources, I've learned that there are at least a few standard approaches to high-speed ingest in Accumulo that might apply to my scenario, and I'm asking for advice about which options make the most sense in terms of resource usage and ease of implementation and maintenance. Here are the options:
1. BatchWriter clients on feed nodes: Run my Accumulo client on the feed nodes. This option has the disadvantage of sending both the forward and reverse index mutations across the network. Also, the Accumulo/Thrift libraries need to be available on the feed nodes to support the Accumulo client. However, it has the advantage that it parallelizes the work of parsing the input files and creating the mutations, and it seems to minimize disk I/O on the Hadoop cluster compared with the options below.
2. BatchWriter client on the Accumulo master node: scp/sftp the input files from the feed nodes to the Accumulo master node, into a directory on its local filesystem, and run my Accumulo client on the master node only. This option has the advantage that only the raw input files cross the network from the feed nodes, not the forward and reverse index mutations, and it doesn't require the Accumulo/Thrift libraries on the feed nodes. However, it has the disadvantage that the master node does all the work of parsing the input files and creating the mutations, and it uses the master's local disk as a waypoint for the input files.
3. MapReduce with AccumuloOutputFormat: scp/sftp the input files from the feed nodes to the Accumulo master node, periodically copy them to HDFS, and run a MapReduce job that reads and parses the input files from HDFS, creates the mutations, and uses AccumuloOutputFormat to write them. This option has the advantages of #2 above, plus it parallelizes the work of parsing the input files and creating the mutations. However, it has the disadvantage of constantly spinning up and tearing down MapReduce jobs, with all the overhead those processes involve. It also has the disadvantage of using two disk waypoints (local and HDFS) with the associated disk I/O. It sounds somewhat painful to implement and maintain for continuous ingest.
4. MapReduce with AccumuloFileOutputFormat (RFiles): scp/sftp the input files from the feed nodes to the Accumulo master node, periodically copy them to HDFS, and run a MapReduce job that reads and parses the input files from HDFS, creates the key/value pairs, and uses AccumuloFileOutputFormat to write RFiles. Then use the Accumulo shell's importdirectory command to bulk-import the RFiles (a rough sketch of triggering that step programmatically follows this list). This option has all the advantages of #3 above, but I don't know whether it has others (does it? The Accumulo manual states about bulk ingest: "In some cases it may be faster to load data this way rather than via ingesting through clients using BatchWriters." What cases?). It also has all the disadvantages of #3 above, except that it uses three disk waypoints (one local, two in HDFS) with the associated disk I/O. It sounds painful to implement and maintain for continuous ingest.
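For what it's worth, the bulk-import step at the end of option 4 wouldn't have to go through the interactive shell. A minimal sketch, assuming the Thrift Proxy exposes an importDirectory call and that pyaccumulo leaves the raw proxy client reachable as conn.client with the login token as conn.login (the table name and HDFS paths are placeholders; verify the attribute names and signature against your versions):

```python
# Rough sketch only: pyaccumulo itself does not (to my knowledge) wrap bulk
# import, so this drops down to the raw proxy client. It assumes the proxy
# API exposes importDirectory(login, table, importDir, failureDir, setTime)
# and that pyaccumulo stores the proxy client and login token as conn.client
# and conn.login.
from pyaccumulo import Accumulo

conn = Accumulo(host="proxy-host", port=42424, user="user", password="secret")

table = "fwd_index"                       # placeholder table name
rfile_dir = "/bulk/fwd_index/files"       # HDFS dir holding the job's RFiles
failure_dir = "/bulk/fwd_index/failures"  # empty HDFS dir for files that fail to import

# Roughly equivalent to running "importdirectory <dir> <failDir> <setTime>"
# in the shell with fwd_index as the current table.
conn.client.importDirectory(conn.login, table, rfile_dir, failure_dir, False)
conn.close()
```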
Personally, I like option #2 the most, as long as the Accumulo master node can handle the processing load on its own (the input file parsing isn't parallelized). The variant of #2 where I run my Accumulo client on each Accumulo node, sending the output of different feed nodes to different Accumulo nodes (or round-robin), still has the disadvantage of sending the forward and reverse index mutations across the cluster network to the tablet servers, but it has the advantage of performing the input file parsing more in parallel.
What I need to know is: Have I missed any viable options? Have I missed any advantages or disadvantages of each option? Are any of the advantages/disadvantages trivial or highly important regardless of my problem context, especially the network bandwidth / CPU cycle / disk I/O tradeoffs? Is MapReduce, with or without RFiles, worth the trouble compared to the BatchWriter? Does anyone have "war stories"?
Thanks!
Even with the same use case, people have personal preferences about how they would implement a solution. I would actually run Flume agents on the feed nodes to collect the data in HDFS, and then periodically run a MapReduce job over the newly arrived data in HDFS using the RFile (bulk import) approach.