I have a requirement to ingest data from an Oracle database into Hadoop in real time.
What's the best way to achieve this on Hadoop?
You can use various methods to ingest data into Big SQL, including adding files directly to HDFS, creating Big SQL EXTERNAL HADOOP tables, running Big SQL LOAD HADOOP, and using INSERT… SELECT/CTAS from Big SQL and Hive.
Apache Flume is mainly designed for ingesting data into the Hadoop Distributed File System (HDFS). The tool extracts, aggregates, and loads high volumes of streaming data from different data sources onto HDFS.
Oracle big data services help data professionals manage, catalog, and process raw data. Oracle offers object storage and Hadoop-based data lakes for persistence, Spark for processing, and analysis through Oracle Cloud SQL or the customer's analytical tool of choice.
Data ingest is the process of collecting and classifying user data in the Oracle Data Cloud platform. The data ingest process entails extracting users' attributes from your online, offline, and mobile sources and then mapping the collected attributes into categories in your taxonomy via classification rules.
Hadoop data ingestion is the beginning of your data pipeline in a data lake: it means taking data from various siloed databases and files and putting it into Hadoop. For many companies this turns out to be an intricate task, which is why it can take them more than a year to ingest all their data into a Hadoop data lake.
The near real time ingestion API enables you to ingest data directly into your Oracle Audience Segmentation data objects. Unlike the Stream API, you do not need to run a data warehouse job after ingesting data via the API for it to be available in Oracle Audience Segmentation; data is ingested directly into your data objects in near real time.
Because Hadoop is open source, there are a variety of ways to ingest data into it, and every developer can choose their favorite tool or language. Developers tend to focus on performance when choosing a tool or technology, but this variety makes governance very complicated.
Sqoop is a command-line application that transfers data from a relational database to HDFS. Internally, Sqoop uses MapReduce mappers that connect to the database over JDBC, select the data, and write it into HDFS; a minimal invocation is sketched below.
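As a rough illustration, the sketch below drives a batch `sqoop import` of one Oracle table from Python. The JDBC URL, credentials, table name, and HDFS target directory are placeholders rather than values from the question.

```python
# Minimal sketch: batch-importing one Oracle table into HDFS with Sqoop.
# All connection details below are placeholders for illustration only.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//db-host:1521/ORCL",  # placeholder Oracle JDBC URL
    "--username", "etl_user",
    "--password-file", "/user/etl/.oracle.password",       # keeps the password off the command line
    "--table", "SALES",                                     # placeholder source table
    "--target-dir", "/data/raw/sales",                      # placeholder HDFS destination
    "--num-mappers", "4",                                   # parallel mappers doing the JDBC reads
]

# Sqoop launches a MapReduce job and blocks until it completes.
subprocess.run(sqoop_cmd, check=True)
```

Note that Sqoop is batch-oriented: it can be scheduled frequently, but on its own it does not give you real-time capture, which is why the CDC discussion below matters.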
The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are where the data needs to land in Hadoop and what you plan to do with it once it is there.
Coming back to CDC, there are three different approaches to it: query-based capture (polling tables on a timestamp or change-flag column, sketched below), trigger-based capture (database triggers that write changes to shadow tables), and log-based capture (mining the Oracle redo/archive logs, which is the approach the replication products discussed further down take).
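To make the first of these approaches concrete, here is a minimal, hypothetical sketch of query-based capture using the python-oracledb driver. The SALES table, its LAST_MODIFIED column, and the connection details are assumptions for illustration only; a real pipeline would persist the watermark, handle deletes, and write the changes into HDFS rather than printing them.

```python
# Minimal sketch of query-based CDC: poll Oracle for rows changed since the last run.
# The table, columns, and connection details are illustrative assumptions.
import datetime
import time

import oracledb  # python-oracledb driver


def fetch_changes(conn, last_watermark):
    """Return rows from the (hypothetical) SALES table modified after last_watermark."""
    sql = """
        SELECT id, amount, last_modified
          FROM sales
         WHERE last_modified > :watermark
         ORDER BY last_modified
    """
    with conn.cursor() as cur:
        cur.execute(sql, watermark=last_watermark)
        return cur.fetchall()


if __name__ == "__main__":
    conn = oracledb.connect(user="etl_user", password="secret",
                            dsn="db-host:1521/ORCL")
    watermark = datetime.datetime(1970, 1, 1)  # a real job would persist this between runs
    while True:
        rows = fetch_changes(conn, watermark)
        for row in rows:
            # Hand each change off to HDFS (e.g. via WebHDFS) or to a message queue here.
            print(row)
        if rows:
            watermark = rows[-1][2]  # advance to the newest LAST_MODIFIED value seen
        time.sleep(30)  # polling interval; this is what limits how "real-time" polling can be
```

The polling interval is also why this approach is near real time at best; log-based tools such as the ones below avoid that latency by reading changes from the redo logs as they are written.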
Expanding a bit on what @Nickolay mentioned, there are a few options, but recommending the best one would be too opinion-based.
Tungsten (open source)
Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle, and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, Hadoop, and Amazon Redshift.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of source tables.