
Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.

What's the best way to achieve this on Hadoop?

asked Dec 12 '14 by Venkat Ankam

People also ask

How you ingest the data in Hadoop?

You can use various methods to ingest data into Big SQL, including adding files directly to HDFS, using Big SQL EXTERNAL HADOOP tables, using Big SQL LOAD HADOOP, and using INSERT…SELECT/CTAS from Big SQL and Hive.

Which tool is used for data ingestion in HDFS?

Apache Flume. It is mainly designed for ingesting data into the Hadoop Distributed File System (HDFS). The tool extracts, aggregates, and loads high volumes of streaming data from different data sources onto HDFS.

Can Oracle database process big data?

Oracle big data services help data professionals manage, catalog, and process raw data. Oracle offers object storage and Hadoop-based data lakes for persistence, Spark for processing, and analysis through Oracle Cloud SQL or the customer's analytical tool of choice.

What is Oracle ingestion?

Data ingest is the process of collecting and classifying user data in the Oracle Data Cloud platform. The data ingest process entails extracting users' attributes from your online, offline, and mobile sources and then mapping the collected attributes into categories in your taxonomy via classification rules.

What is data ingestion in Hadoop?

Hadoop data ingestion is the beginning of your data pipeline in a data lake. It means taking data from various siloed databases and files and putting it into Hadoop. Sounds arduous? For many companies, it does turn out to be an intricate task. That is why they take more than a year to ingest all their data into a Hadoop data lake.
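To make that concrete, here is a minimal sketch (not part of the original answer) of landing data in HDFS from Python over WebHDFS using the third-party hdfs package; the NameNode URL, user, and paths are placeholders only.

    # Minimal sketch: write an extract into HDFS over WebHDFS with the
    # third-party `hdfs` Python package. URL, user, and paths are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient('http://namenode.example.com:9870', user='hdfs')

    # Upload a local extract file into the data lake's raw zone.
    client.upload('/data/raw/orders/orders_20141212.csv',
                  'orders_extract.csv', overwrite=True)

    # Or stream records directly, without an intermediate local file.
    with client.write('/data/raw/orders/part-0001.csv',
                      encoding='utf-8', overwrite=True) as writer:
        writer.write('order_id,amount,updated_at\n')
        writer.write('1001,99.50,2014-12-12 00:10:00\n')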

What is the near real time ingestion API?

The near real-time ingestion API enables you to ingest data directly into your Oracle Audience Segmentation data objects. Unlike the Stream API, you do not need to run a data warehouse job after ingesting data via the API for it to be available in Oracle Audience Segmentation. Data is ingested directly into your data objects in near real time.

Why Hadoop is the best for big data?

Because Hadoop is open source, there are a variety of ways you can ingest data into it. This gives every developer the choice of using his or her favorite tool or language to ingest data into Hadoop. When choosing a tool or technology, developers tend to focus on performance, but this variety makes governance very complicated.

What is Sqoop in Hadoop?

Sqoop is a command-line application that helps us transfer data from a relational database to HDFS. Internally, Sqoop uses MapReduce mappers that connect to the database over JDBC, select the data, and write it into HDFS.
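As an illustration only, a Sqoop import of one Oracle table can be scripted, for example from Python. The host, credentials, table, and directories below are placeholders, and the sketch assumes the sqoop binary is on the PATH with the Oracle JDBC driver in Sqoop's lib directory.

    # Sketch: drive `sqoop import` from Python. All connection details,
    # table names, and paths are placeholders for illustration.
    import subprocess

    cmd = [
        'sqoop', 'import',
        '--connect', 'jdbc:oracle:thin:@db-host.example.com:1521/ORCL',
        '--username', 'scott',
        '--password-file', '/user/etl/.oracle_password',  # avoid inline passwords
        '--table', 'ORDERS',
        '--target-dir', '/data/raw/orders',
        '--num-mappers', '4',
        # Incremental mode pulls only rows changed since the last run,
        # i.e. periodic batches rather than true real time.
        '--incremental', 'lastmodified',
        '--check-column', 'UPDATED_AT',
        '--last-value', '2014-12-12 00:00:00',
    ]
    subprocess.run(cmd, check=True)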


2 Answers

The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.

Other things that matter for this answer are:

  • What is the target for the data and what are you going to do with it?
    • just store plain HDFS files and access them for ad-hoc queries with something like Impala?
    • store in HBase for use in other apps?
    • use in a CEP solution like Storm?
    • ...
  • What tools is your team familiar with?
    • Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
    • or do you prefer a Data integration tool like Informatica?

Coming back to CDC, there are three different approaches to it:

  • Easy: if you don't need true real time and can identify new data with an SQL query that executes fast enough for the required data latency, you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools); see the sketch after this list.
  • Complicated: roll your own CDC solution: download the database logs, parse them into a series of inserts/updates/deletes, and ingest these into Hadoop.
  • Expensive: buy a CDC solution that does this for you (like GoldenGate or Attunity).
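To make the "easy" option above concrete, here is a rough Python sketch of that polling pattern. It assumes a hypothetical ORDERS table with an UPDATED_AT column, the cx_Oracle driver, and a WebHDFS endpoint; every name and connection detail is illustrative, not prescriptive.

    # Sketch of the "easy" approach: poll Oracle for rows changed since a
    # watermark and land each batch as a new file in HDFS. All table, column,
    # and connection details below are hypothetical.
    import csv
    import io
    import time

    import cx_Oracle                 # Oracle driver (pip install cx_Oracle)
    from hdfs import InsecureClient  # WebHDFS client (pip install hdfs)

    dsn = cx_Oracle.makedsn('db-host.example.com', 1521, service_name='ORCL')
    hdfs_client = InsecureClient('http://namenode.example.com:9870', user='etl')

    last_value = '2014-12-12 00:00:00'   # initial watermark

    while True:
        with cx_Oracle.connect('scott', 'tiger', dsn) as conn:
            cur = conn.cursor()
            cur.execute(
                "SELECT order_id, amount, "
                "       TO_CHAR(updated_at, 'YYYY-MM-DD HH24:MI:SS') "
                "FROM orders "
                "WHERE updated_at > TO_DATE(:1, 'YYYY-MM-DD HH24:MI:SS') "
                "ORDER BY updated_at",
                [last_value])
            rows = cur.fetchall()

        if rows:
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            path = '/data/raw/orders/batch_%d.csv' % int(time.time())
            hdfs_client.write(path, data=buf.getvalue(), encoding='utf-8')
            last_value = rows[-1][2]     # advance the watermark

        time.sleep(30)                   # poll interval: latency vs. DB load

Note that this pattern only catches inserts and updates (not deletes) and puts query load on the source database; those limitations are exactly what the "complicated" and "expensive" options address.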
answered Oct 26 '22 by Nickolay


Expanding a bit on what @Nickolay mentioned, there are a few options, but picking the best one would be too opinion-based to state.

Tungsten (open source)

Tungsten Replicator is an open-source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle, and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, Hadoop, and Amazon RDS.

Oracle GoldenGate

Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.

Dell Shareplex

SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of source tables.

answered Oct 27 '22 by ethesx