 

Incremental data load using sqoop without primary key or timestamp

Tags: hadoop, hdfs, sqoop
I have a table that has neither a primary key nor a datemodified/timestamp column. It works like a transaction table that keeps appending data (no deletes or updates).

My problem is that I want to ingest the data into HDFS without reloading the whole table every time I run the incremental load.

The job below imports only the latest rows into HDFS when the table has a primary key:

sqoop job \
--create tb_w_PK_DT_append \
-- \
import \
--connect jdbc:mysql://10.217.55.176:3306/SQOOP_Test \
--username root \
--incremental append \
--check-column P_id \
--last-value 0 \
--target-dir /data \
--query "SELECT * FROM tb_w_PK_DT WHERE \$CONDITIONS" \
-m 1;

Is there any way to import only the latest data without a primary key or date-modified column?

asked Oct 18 '22 by MMakati

2 Answers

I know I am a bit late to answer this, but I wanted to share it for reference: the scenario where your source table has no primary key column and no date column, and you want to sqoop only the incremental data to HDFS.

Let's say there is a table that holds a history of data, new rows are inserted into it daily, and you only need the newly inserted rows in HDFS. If your source is SQL Server, you can create an INSERT or UPDATE trigger on your history table.


You can create an INSERT trigger as shown below:

CREATE TRIGGER transactionInsertTrigger
ON [dbo].[TransactionHistoryTable]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy each newly inserted row into the staging table,
    -- stamping it with the current date for incremental imports.
    INSERT INTO [dbo].[TriggerHistoryTable]
    (
        product, price, payment_type, name, city, state, country, Last_Modified_Date
    )
    SELECT
        product, price, payment_type, name, city, state, country,
        GETDATE() AS Last_Modified_Date
    FROM inserted;
END

Create a table to hold the records whenever an insert event occurs on your main table (this table must exist before you create the trigger above). Keep the schema the same as your main table, though you can add extra columns; here we add a Last_Modified_Date column. The trigger above inserts a row into this table whenever a new row is inserted into your main TransactionHistoryTable.

CREATE TABLE [dbo].[TriggerHistoryTable](
    [product] [varchar](20) NULL,
    [price] [int] NULL,
    [payment_type] [varchar](20) NULL,
    [name] [varchar](20) NULL,
    [city] [varchar](20) NULL,
    [state] [varchar](20) NULL,
    [country] [varchar](20) NULL,
    [Last_Modified_Date] [date] NULL
) ON [PRIMARY]

Now if we insert two new rows into the main TransactionHistoryTable, the trigger fires on the insert event and copies both rows into TriggerHistoryTable as well as the main TransactionHistoryTable:

insert into [Transaction_db].[dbo].[TransactionHistoryTable]
values
('Product3',2100,'Visa','Cindy' ,'Kemble','England','United Kingdom')
,('Product4',50000,'Mastercard','Tamar','Headley','England','United Kingdom')
;

select * from TriggerHistoryTable;


Now you can sqoop from TriggerHistoryTable, which will hold each day's inserted or updated records. Since we added a date column, you can also use incremental sqoop. Once the data has been imported to HDFS, you can clear this table daily or weekly. This is just an example with SQL Server; you can create triggers in Teradata, Oracle, and other databases as well, and you can also set up UPDATE/DELETE triggers.
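A minimal sketch of such an incremental job, assuming a SQL Server source; the host, database name, credentials, and target directory below are illustrative:

sqoop job \
--create tb_trigger_hist_inc \
-- \
import \
--connect "jdbc:sqlserver://<host>:1433;databaseName=Transaction_db" \
--username root \
--table TriggerHistoryTable \
--incremental lastmodified \
--check-column Last_Modified_Date \
--last-value "1900-01-01" \
--append \
--target-dir /data/trigger_history \
-m 1;

With --incremental lastmodified, the saved job remembers the greatest Last_Modified_Date it has imported and pulls only newer rows on each run; --append adds the new files to the existing target directory.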

answered Nov 01 '22 by vikrant rana

You can follow these steps:

1) The initial load (the previous day's data) is already in HDFS: relation A.
2) Import the current data into HDFS using sqoop: relation B.
3) In Pig, LOAD the two HDFS directories into relations A and B, defining the schema.
4) Join the two relations by all columns.
5) Each row of the join result holds a pair of tuples, one from each side; keep the rows where the A-side tuple is null, since those rows exist only in the current load.
6) FLATTEN (or simply project) the B-side tuples to get the new/updated records; see the Pig sketch below.
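A minimal Pig sketch of steps 3 through 6, assuming comma-delimited files; the paths and three-column schema are illustrative placeholders:

A = LOAD '/data/previous' USING PigStorage(',')
    AS (product:chararray, price:int, name:chararray);
B = LOAD '/data/current' USING PigStorage(',')
    AS (product:chararray, price:int, name:chararray);

-- join the current load (B) against the previous load (A) on all columns
J = JOIN B BY (product, price, name) LEFT OUTER, A BY (product, price, name);

-- rows with no match on the A side exist only in the current load
D = FILTER J BY A::product IS NULL;

-- project the B-side columns: these are the new records
NEW_ROWS = FOREACH D GENERATE B::product, B::price, B::name;
STORE NEW_ROWS INTO '/data/delta' USING PigStorage(',');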
answered Nov 01 '22 by vinayak_narune