How to transform and extract fields in Kafka sink JDBC connector

I am using a 3rd party CDC tool that replicates data from a source database into Kafka topics. An example row is shown below:

{  
   "data":{  
      "USER_ID":{  
         "string":"1"
      },
      "USER_CATEGORY":{  
         "string":"A"
      }
   },
   "beforeData":{  
      "Data":{  
         "USER_ID":{  
            "string":"1"
         },
         "USER_CATEGORY":{  
            "string":"B"
         }
      }
   },
   "headers":{  
      "operation":"UPDATE",
      "timestamp":"2018-05-03T13:53:43.000"
   }
}

What configuration is needed in the sink connector file in order to extract all the (sub)fields under data and headers, ignore those under beforeData, and have the target table that the Kafka sink writes to contain the following fields:

USER_ID, USER_CATEGORY, operation, timestamp

I went through the transformation list in Confluent's docs, but I was not able to work out how to use them to achieve this.

Giorgos Myrianthous asked May 10 '18


2 Answers

I think you want ExtractField, but unfortunately it behaves like a Map.get operation, which means (1) nested fields cannot be extracted in one pass and (2) multiple fields require multiple transforms.

That being said, you might want to attempt this (untested):

transforms=ExtractData,ExtractHeaders
transforms.ExtractData.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractData.field=data
transforms.ExtractHeaders.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractHeaders.field=headers

If that doesn't work, you might be better off implementing your own Transformation that can at least drop values from the Struct / Map; a rough sketch of such a transform is shown below.
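
For what it's worth, here is a rough, untested sketch of what such a custom transform could look like, assuming the record values are Structs with a schema (for example when the Avro converter is used); the package and class names are made up purely for illustration:

package com.example.smt;

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical custom SMT: copies the sub-fields of "data" and "headers" into a
// new flat value and drops everything else (e.g. "beforeData").
public class FlattenDataAndHeaders<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        // This sketch only handles Struct values, i.e. records that carry a schema.
        Struct value = (Struct) record.value();
        Struct data = value.getStruct("data");
        Struct headers = value.getStruct("headers");

        // Build a flat schema from the sub-fields of "data" and "headers".
        SchemaBuilder builder = SchemaBuilder.struct().name("FlattenedValue");
        for (Field f : data.schema().fields()) {
            builder.field(f.name(), f.schema());
        }
        for (Field f : headers.schema().fields()) {
            builder.field(f.name(), f.schema());
        }
        Schema newSchema = builder.build();

        // Copy the values into the new flat Struct.
        Struct newValue = new Struct(newSchema);
        for (Field f : data.schema().fields()) {
            newValue.put(f.name(), data.get(f));
        }
        for (Field f : headers.schema().fields()) {
            newValue.put(f.name(), headers.get(f));
        }

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // no configuration options in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // nothing to configure
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}

The compiled class would then go on the Connect worker's plugin.path (or classpath) and be referenced from the sink configuration with something like transforms=FlattenDataAndHeaders and transforms.FlattenDataAndHeaders.type=com.example.smt.FlattenDataAndHeaders.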

OneCricketeer answered Oct 16 '22


If you're willing to list specific field names, you can solve this by:

  1. Using the Flatten transform to collapse the nesting (which converts the original structure's paths into dot-delimited names)
  2. Using a ReplaceField transform with renames to rename the fields to what you want the sink to emit
  3. Using another ReplaceField transform with whitelist to limit the emitted fields to those you selected

For your case it might look like:

  "transforms": "t1,t2,t3",
  "transforms.t1.type": "org.apache.kafka.connect.transforms.Flatten$Value",
  "transforms.t2.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.t2.renames": "data.USER_ID:USER_ID,data.USER_CATEGORY:USER_CATEGORY,headers.operation:operation,headers.timestamp:timestamp",
  "transforms.t3.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.t3.whitelist": "USER_ID,USER_CATEGORY,operation,timestamp",
Marty Woodlee answered Oct 16 '22