I have a table that doesn't have a primary key or a datemodified/timestamp column. It is essentially a transaction table that keeps accumulating data (no deletes or updates).
My problem is that I want to ingest the data into HDFS without reloading the whole table every time I run the incremental load.
The code below imports the latest rows to HDFS when the table has a primary key:
sqoop job \
--create tb_w_PK_DT_append \
-- \
import \
--connect jdbc:mysql://10.217.55.176:3306/SQOOP_Test \
--username root \
--incremental append \
--check-column P_id \
--last-value 0 \
--target-dir /data \
--query "SELECT * FROM tb_w_PK_DT WHERE \$CONDITIONS" \
-m 1;
Is there any solution to import only the latest data when the table has neither a primary key nor a date-modified column?
I know I am a bit late to answer this, but I just wanted to share for reference. There are scenarios where you have neither a primary key column nor a date column on your source table, yet you want to sqoop only the incremental data to HDFS.
Let's say there is a table that holds a history of data, new rows are inserted into it daily, and you only need the newly inserted rows in HDFS. If your source is SQL Server, you can create an INSERT or UPDATE trigger on your history table.
You can create an INSERT trigger as shown below:
CREATE TRIGGER transactionInsertTrigger
ON [dbo].[TransactionHistoryTable]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy every newly inserted row into the staging table,
    -- stamping it with the time of the insert.
    INSERT INTO [dbo].[TriggerHistoryTable]
    (
        product, price, payment_type, name, city, state, country, Last_Modified_Date
    )
    SELECT
        product, price, payment_type, name, city, state, country, GETDATE() AS Last_Modified_Date
    FROM
        inserted i
END
Create a table to hold the records whenever an insert event occurs on your main table. Keep the schema the same as your main table; you can also add extra columns, such as the Last_Modified_Date used above. The trigger above inserts a row into this table whenever a new row is inserted into the main TransactionHistoryTable.
CREATE TABLE [dbo].[TriggerHistoryTable](
[product] [varchar](20) NULL,
[price] [int] NULL,
[payment_type] [varchar](20) NULL,
[name] [varchar](20) NULL,
[city] [varchar](20) NULL,
[state] [varchar](20) NULL,
[country] [varchar](20) NULL,
[Last_Modified_Date] [date] NULL
) ON [PRIMARY]
Now, if we insert two new rows into the main TransactionHistoryTable, the trigger fires on the insert event and copies the same two rows into TriggerHistoryTable as well:
insert into [Transaction_db].[dbo].[TransactionHistoryTable]
values
('Product3',2100,'Visa','Cindy' ,'Kemble','England','United Kingdom')
,('Product4',50000,'Mastercard','Tamar','Headley','England','United Kingdom')
;
select * from TriggerHistoryTable;
Now you can sqoop from TriggerHistoryTable, which holds only the daily inserted/updated records. You can also use incremental Sqoop, since we added a date column to this table; see the sketch below. Once you have imported the data to HDFS, you can clear this table daily or weekly. This is just an example with SQL Server; you can create triggers in Teradata, Oracle, and other databases as well, and you can also set up UPDATE/DELETE triggers.
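For instance, a minimal sketch of a saved incremental job against the staging table could look like the one below. The host, username, and target directory are placeholders, and it assumes the SQL Server JDBC driver is on Sqoop's classpath:
sqoop job \
--create tb_trigger_hist_lastmod \
-- \
import \
--connect "jdbc:sqlserver://<host>:1433;databaseName=Transaction_db" \
--username <user> \
--table TriggerHistoryTable \
--incremental lastmodified \
--check-column Last_Modified_Date \
--last-value "1970-01-01" \
--append \
--target-dir /data/trigger_history \
-m 1
With lastmodified mode, Sqoop remembers the newest Last_Modified_Date after each run and only pulls rows stamped later than that on the next run; --append is used here because the staging table only ever receives inserts.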
You can follow these steps:
1) The initial load data (previous day's data) is already in HDFS -- relation A.
2) Import the current data into HDFS using Sqoop -- relation B.
3) Using Pig, load the two HDFS directories into relations A and B and define their schemas.
4) Join the two relations by all columns.
5) Each row of the join result holds two tuples ((A-tuple),(B-tuple)); keep the rows where the A-side tuple is null, i.e. rows that exist only in the current data B.
6) Flatten the remaining B-side tuples and you have the new records; see the Pig sketch after this list.
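A minimal Pig sketch of this diff, assuming both directories hold comma-delimited files with the same seven columns as the example table (the paths and schema are placeholders):
-- Load yesterday's snapshot (A) and today's full import (B).
A = LOAD '/data/previous' USING PigStorage(',')
    AS (product:chararray, price:int, payment_type:chararray,
        name:chararray, city:chararray, state:chararray, country:chararray);
B = LOAD '/data/current' USING PigStorage(',')
    AS (product:chararray, price:int, payment_type:chararray,
        name:chararray, city:chararray, state:chararray, country:chararray);

-- Left outer join of the current data against the previous snapshot on all
-- columns; rows with no match on the A side exist only in the current data.
J = JOIN B BY (product, price, payment_type, name, city, state, country) LEFT OUTER,
    A BY (product, price, payment_type, name, city, state, country);

-- Keep only the newly inserted rows and project back to B's columns.
NEW_ROWS = FILTER J BY A::product IS NULL;
RESULT = FOREACH NEW_ROWS GENERATE B::product .. B::country;

STORE RESULT INTO '/data/delta' USING PigStorage(',');
Note that this compares whole rows, so it only works cleanly when the source never updates or deletes rows, which matches the scenario in the question.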