
Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.

What's the best way to achieve this on Hadoop?

asked Dec 12 '14 by Venkat Ankam

People also ask

How you ingest the data in Hadoop?

You can use various methods to ingest data into Big SQL, including adding files directly to HDFS, using Big SQL EXTERNAL HADOOP tables, using Big SQL LOAD HADOOP, and using INSERT…SELECT/CTAS from Big SQL and Hive.

Which tool is used for data ingestion in HDFS?

Apache Flume. It is mainly designed for ingesting data into the Hadoop Distributed File System (HDFS). The tool extracts, aggregates, and loads high volumes of streaming data from different data sources onto HDFS.

Can Oracle database process big data?

Oracle big data services help data professionals manage, catalog, and process raw data. Oracle offers object storage and Hadoop-based data lakes for persistence, Spark for processing, and analysis through Oracle Cloud SQL or the customer's analytical tool of choice.

What is Oracle ingestion?

Data ingest is the process of collecting and classifying user data in the Oracle Data Cloud platform. The data ingest process entails extracting users' attributes from your online, offline, and mobile sources and then mapping the collected attributes into categories in your taxonomy via classification rules.

What is data ingestion in Hadoop?

Hadoop data ingestion is the beginning of your data pipeline in a data lake. It means taking data from various siloed databases and files and putting it into Hadoop. Sounds arduous? For many companies, it does turn out to be an intricate task. That is why they take more than a year to ingest all their data into a Hadoop data lake.
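To make that concrete, here is a minimal sketch (not part of the original answer) of landing data in HDFS from Python over WebHDFS using the third-party hdfs package; the NameNode URL, user, and paths are placeholders only.

    # Minimal sketch: write an extract into HDFS over WebHDFS with the
    # third-party `hdfs` Python package. URL, user, and paths are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient('http://namenode.example.com:9870', user='hdfs')

    # Upload a local extract file into the data lake's raw zone.
    client.upload('/data/raw/orders/orders_20141212.csv',
                  'orders_extract.csv', overwrite=True)

    # Or stream records directly, without an intermediate local file.
    with client.write('/data/raw/orders/part-0001.csv',
                      encoding='utf-8', overwrite=True) as writer:
        writer.write('order_id,amount,updated_at\n')
        writer.write('1001,99.50,2014-12-12 00:10:00\n')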

What is the near real time ingestion API?

The near real-time ingestion API enables you to ingest data directly into your Oracle Audience Segmentation data objects. Unlike the Stream API, you do not need to run a data warehouse job after ingesting data via the API for it to be available in Oracle Audience Segmentation. Data is ingested directly into your data objects in near real time.

Why Hadoop is the best for big data?

Because Hadoop is open source, there are a variety of ways you can ingest data into it. This gives every developer the choice of using his or her favorite tool or language to ingest data into Hadoop. When choosing a tool or technology, developers tend to focus on performance, but this variety makes governance very complicated.

What is Sqoop in Hadoop?

Sqoop is a command-line application that helps us transfer data from a relational database to HDFS. Internally, Sqoop uses MapReduce mappers that connect to the database over JDBC, select the data, and write it into HDFS.
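As an illustration only, a Sqoop import of one Oracle table can be scripted, for example from Python. The host, credentials, table, and directories below are placeholders, and the sketch assumes the sqoop binary is on the PATH with the Oracle JDBC driver in Sqoop's lib directory.

    # Sketch: drive `sqoop import` from Python. All connection details,
    # table names, and paths are placeholders for illustration.
    import subprocess

    cmd = [
        'sqoop', 'import',
        '--connect', 'jdbc:oracle:thin:@db-host.example.com:1521/ORCL',
        '--username', 'scott',
        '--password-file', '/user/etl/.oracle_password',  # avoid inline passwords
        '--table', 'ORDERS',
        '--target-dir', '/data/raw/orders',
        '--num-mappers', '4',
        # Incremental mode pulls only rows changed since the last run,
        # i.e. periodic batches rather than true real time.
        '--incremental', 'lastmodified',
        '--check-column', 'UPDATED_AT',
        '--last-value', '2014-12-12 00:00:00',
    ]
    subprocess.run(cmd, check=True)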


2 Answers

The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.

Other things that matter for this answer are:

  • What is the target for the data and what are you going to do with it?
    • just store plain HDFS files and access them for ad-hoc queries with something like Impala?
    • store in HBase for use in other apps?
    • use in a CEP solution like Storm?
    • ...
  • What tools is your team familiar with?
    • Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
    • or do you prefer a Data integration tool like Informatica?

Coming back to CDC, there are three different approaches to it:

  • Easy: if you don't need true real time and can identify new data with an SQL query that executes fast enough for the required data latency, you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools); see the sketch after this list.
  • Complicated: roll your own CDC solution: download the database logs, parse them into a series of inserts/updates/deletes, and ingest these into Hadoop.
  • Expensive: buy a CDC solution that does this for you (like GoldenGate or Attunity).
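To make the "easy" option above concrete, here is a rough Python sketch of that polling pattern. It assumes a hypothetical ORDERS table with an UPDATED_AT column, the cx_Oracle driver, and a WebHDFS endpoint; every name and connection detail is illustrative, not prescriptive.

    # Sketch of the "easy" approach: poll Oracle for rows changed since a
    # watermark and land each batch as a new file in HDFS. All table, column,
    # and connection details below are hypothetical.
    import csv
    import io
    import time

    import cx_Oracle                 # Oracle driver (pip install cx_Oracle)
    from hdfs import InsecureClient  # WebHDFS client (pip install hdfs)

    dsn = cx_Oracle.makedsn('db-host.example.com', 1521, service_name='ORCL')
    hdfs_client = InsecureClient('http://namenode.example.com:9870', user='etl')

    last_value = '2014-12-12 00:00:00'   # initial watermark

    while True:
        with cx_Oracle.connect('scott', 'tiger', dsn) as conn:
            cur = conn.cursor()
            cur.execute(
                "SELECT order_id, amount, "
                "       TO_CHAR(updated_at, 'YYYY-MM-DD HH24:MI:SS') "
                "FROM orders "
                "WHERE updated_at > TO_DATE(:1, 'YYYY-MM-DD HH24:MI:SS') "
                "ORDER BY updated_at",
                [last_value])
            rows = cur.fetchall()

        if rows:
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            path = '/data/raw/orders/batch_%d.csv' % int(time.time())
            hdfs_client.write(path, data=buf.getvalue(), encoding='utf-8')
            last_value = rows[-1][2]     # advance the watermark

        time.sleep(30)                   # poll interval: latency vs. DB load

Note that this pattern only catches inserts and updates (not deletes) and puts query load on the source database; those limitations are exactly what the "complicated" and "expensive" options address.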
answered Oct 26 '22 by Nickolay


Expanding a bit on what @Nickolay mentioned, there are a few options, but picking the best one would be too opinion-based to state.

Tungsten (open source)

Tungsten Replicator is an open-source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle, and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, Hadoop, and Amazon RDS.

Oracle GoldenGate

Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.

Dell Shareplex

SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of source tables.

answered Oct 27 '22 by ethesx