In many Big Data situations it is preferable to work with a small buffer of records at a time, rather than processing one record at a time.
The natural example is calling some external API that supports batching for efficiency.
How can we do this in Kafka Streams? I cannot find anything in the API that looks like what I want.
So far I have:
builder.stream[String, String]("my-input-topic")
  .mapValues(externalApiCall)
  .to("my-output-topic")
What I want is:
builder.stream[String, String]("my-input-topic")
  .batched(chunkSize = 2000)
  .map(externalBatchedApiCall)
  .to("my-output-topic")
In Scala and Akka Streams the function is called grouped or batch. In Spark Structured Streaming we can do mapPartitions.map(_.grouped(2000).map(externalBatchedApiCall)).
With such an operator, batch processing could be implemented directly on top of Apache Kafka, keeping Kafka's advantages while making the external calls efficient.
Batching messages enables a Kafka producer to increase its throughput: reducing the number of network requests the producer makes to send data improves the performance of the system. The cost of that increased throughput is increased latency.
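On the producer side that trade-off is exposed as configuration rather than code. A minimal sketch of tuning it, assuming a local broker; the batch size and linger time are illustrative values, not recommendations:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Accumulate up to 64 KiB per partition before sending a request ...
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        // ... and wait up to 50 ms for a batch to fill: throughput bought with latency.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10_000; i++) {
                producer.send(new ProducerRecord<>("my-output-topic", "key-" + i, "value-" + i));
            }
        }
    }
}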
Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.
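For reference, here is the question's per-record pipeline as a complete Java application; externalApiCall is a placeholder standing in for the external service, not part of the original post:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class SingleRecordApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "single-record-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // The per-record version from the question: one external call per message.
        builder.<String, String>stream("my-input-topic")
               .mapValues(SingleRecordApp::externalApiCall)
               .to("my-output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Hypothetical stand-in for the external API call.
    static String externalApiCall(String value) {
        return value;
    }
}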
This doesn't seem to exist yet. Watch this space: https://issues.apache.org/jira/browse/KAFKA-7432
You could use a queue, something like the code below:
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;

import lombok.extern.slf4j.Slf4j;
import org.springframework.kafka.config.KafkaStreamsConfiguration;
import org.springframework.stereotype.Component;

import static org.apache.kafka.streams.processor.PunctuationType.WALL_CLOCK_TIME;

@Component
@Slf4j
public class NormalTopic1StreamProcessor extends AbstractStreamProcessor<String> {

    // AbstractStreamProcessor is this project's own base class exposing streamsBuilder.
    public NormalTopic1StreamProcessor(KafkaStreamsConfiguration configuration) {
        super(configuration);
    }

    @Override
    Topology buildTopology() {
        KStream<String, String> kStream =
                streamsBuilder.stream("normalTopic", Consumed.with(Serdes.String(), Serdes.String()));
        // .peek((key, value) -> log.info("message received by stream 0"));
        kStream.process(() -> new AbstractProcessor<String, String>() {
            // Buffer up to 100 values; the queue capacity is the batch size.
            final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
            final List<String> collection = new ArrayList<>();

            @Override
            public void init(ProcessorContext context) {
                super.init(context);
                // Also flush once a minute (wall-clock time) so a quiet topic
                // does not leave a partial batch stuck in the queue.
                context.schedule(Duration.ofMinutes(1), WALL_CLOCK_TIME, timestamp -> {
                    processQueue();
                    context.commit();
                });
            }

            @Override
            public void process(String key, String value) {
                queue.add(value);
                // Flush as soon as the buffer is full.
                if (queue.remainingCapacity() == 0) {
                    processQueue();
                }
            }

            public void processQueue() {
                queue.drainTo(collection);
                long count = collection.stream().peek(System.out::println).count();
                if (count > 0) {
                    // A batched external API call would go here.
                    System.out.println("count is " + count);
                    collection.clear();
                }
            }
        });
        // Note: this writes the original records through unchanged; the
        // processor above only consumes and prints the batches.
        kStream.to("normalTopic1");
        return streamsBuilder.build();
    }
}
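One caveat with the snippet above: process() is a terminal operation, so nothing from the batch ever reaches normalTopic1; kStream.to() writes the original records through unchanged. If the goal from the question is to put the results of externalBatchedApiCall on the output topic, a flatTransform-based sketch along the following lines should work. The batched call and the flush interval are assumptions, and the buffer lives in memory only, so it is not fault-tolerant across restarts:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class BatchingTopology {

    static final int CHUNK_SIZE = 2000;

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("my-input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .flatTransform(BatchingTransformer::new)
               .to("my-output-topic");
        return builder;
    }

    // Buffers values and emits the external API's results one full chunk at a time.
    static class BatchingTransformer
            implements Transformer<String, String, Iterable<KeyValue<String, String>>> {

        private final List<String> buffer = new ArrayList<>(CHUNK_SIZE);
        private ProcessorContext context;

        @Override
        public void init(ProcessorContext context) {
            this.context = context;
            // Also flush on a timer so a quiet topic doesn't strand a partial batch.
            context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, ts -> {
                for (KeyValue<String, String> kv : flush()) {
                    context.forward(kv.key, kv.value);
                }
            });
        }

        @Override
        public Iterable<KeyValue<String, String>> transform(String key, String value) {
            buffer.add(value);
            return buffer.size() >= CHUNK_SIZE ? flush() : Collections.emptyList();
        }

        private List<KeyValue<String, String>> flush() {
            List<KeyValue<String, String>> out = new ArrayList<>();
            for (String result : externalBatchedApiCall(buffer)) {
                out.add(KeyValue.pair(null, result));
            }
            buffer.clear();
            return out;
        }

        @Override
        public void close() { }
    }

    // Hypothetical batched call from the question; replace with the real API client.
    static List<String> externalBatchedApiCall(List<String> values) {
        return new ArrayList<>(values);
    }
}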