I want to test and configure Impala with my own Hadoop 2.2.0 distribution, not a Cloudera one.
I want to know if it's possible to use Impala without CDH, because everything I have read says that Impala depends on CDH.
I'm trying to follow the guide in the Impala GitHub repository (https://github.com/cloudera/impala) and I'll make whatever changes are needed to get it working.
Has anyone already done this? Or is it really impossible?
Impala uses the distributed filesystem HDFS as its primary data storage medium. Impala relies on the redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table data is physically represented as data files in HDFS, using familiar HDFS file formats and compression codecs.
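To make the "table data is just files in HDFS" point concrete, here is a minimal sketch that builds the kind of `CREATE EXTERNAL TABLE` statement you would run in Impala to map a table onto files already sitting in an HDFS directory. The helper name `external_table_ddl`, the table name, the columns, and the `hdfs://namenode:8020/...` path are all made up for illustration; adjust them to your cluster.

```python
def external_table_ddl(table, columns, hdfs_path, file_format="PARQUET"):
    """Build an Impala CREATE EXTERNAL TABLE statement over existing HDFS files.

    Hypothetical helper: it only formats the SQL text and does not talk to Impala.
    """
    # Render each (name, type) pair as "name TYPE" and join them with commas.
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"STORED AS {file_format} "
        f"LOCATION '{hdfs_path}'"
    )


# Assumed example values; the namenode host, port, and path are placeholders.
ddl = external_table_ddl(
    "web_logs",
    [("ts", "TIMESTAMP"), ("url", "STRING"), ("status", "INT")],
    "hdfs://namenode:8020/data/web_logs",
)
print(ddl)
```

You would paste the printed statement into `impala-shell` (or send it through whatever client you use). Because the table is external, dropping it later leaves the underlying HDFS files in place.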
Impala does not replace the batch processing frameworks built on MapReduce, such as Hive. Hive and other frameworks built on MapReduce are best suited for long-running batch jobs, such as Extract, Transform, and Load (ETL) jobs.
Impala is faster than Hive because it is an entirely different engine: Hive runs on top of MapReduce, which is slow because it performs a large number of disk I/O operations.
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
I think there are two things here that should be addressed separately:
So yeah, it's possible, though it probably won't be a smooth installation and there isn't a lot of help for this use case. Good luck!