I have a small use case in Apache Flink, which is a batch processing system. I need to process a collection of files, and the processing of each file must be handled by one machine. I have the code below. Only one task slot is ever occupied, and the files are processed one after the other. I have 6 nodes (so 6 task managers) and have configured 4 task slots on each node, so I expect 24 files to be processed at a time.
import java.io.File
import scala.sys.process._
import org.apache.flink.api.common.functions.RichMapPartitionFunction
import org.apache.flink.util.Collector

class MyMapPartitionFunction extends RichMapPartitionFunction[File, Int] {
  override def mapPartition(
      myfiles: java.lang.Iterable[File],
      out: Collector[Int]): Unit = {
    val temp = myfiles.iterator()
    while (temp.hasNext()) {
      val fp1 = getRuntimeContext.getDistributedCache.getFile("hadoopRun.sh")
      val file = new File(temp.next().toURI)
      Process(
        "/bin/bash ./run.sh " + argumentsList(3) + "/" + file.getName + " " +
          argumentsList(7) + "/" + file.getName + ".csv",
        new File(fp1.getAbsoluteFile.getParent))
        .lines
        .foreach(println)
      out.collect(1)
    }
  }
}
I launched Flink with the ./bin/start-cluster.sh command, and the web user interface shows 6 task managers and 24 task slots.
The folder contains about 49 files. When I run mapPartition on this collection, I expect 49 parallel processes to be spawned. But in my infrastructure they are all processed one after the other, which means that only one machine (one task manager) handles all 49 files. Since I have 24 task slots in total, I expect 24 files to be processed simultaneously.
Any pointers will surely help here. I have these parameters in the flink-conf.yaml file:
jobmanager.heap.mb: 2048
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
parallelism.default: 24
Thanks in advance. Can someone shed some light on where I am going wrong?
A Flink program consists of multiple tasks (transformations/operators, data sources, and sinks). A task is split into several parallel instances for execution and each parallel instance processes a subset of the task's input data. The number of parallel instances of a task is called its parallelism.
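To make the parallelism notion concrete, here is a minimal sketch (object name and values are hypothetical, assuming the Flink DataSet Scala API) of how a job-wide default and a per-operator override are set:

```scala
import org.apache.flink.api.scala._

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(24)                 // job-wide default parallelism
    env.fromElements(1, 2, 3, 4, 5)
      .map(_ * 2).setParallelism(4)        // override for this single operator
      .print()
  }
}
```

Operators that do not set an explicit value inherit the job-wide default, which in turn falls back to parallelism.default from flink-conf.yaml.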
As David described, the problem is that env.fromCollection(Iterable[T]) creates a DataSource with a non-parallel InputFormat. Therefore, the DataSource is executed with a parallelism of 1. The subsequent operators (mapPartition) inherit this parallelism from the source so that they can be chained (this saves us one network shuffle).

The way to solve this problem is to either explicitly rebalance the source DataSet via

env.fromCollection(folders).rebalance()

or to explicitly set the desired parallelism at the subsequent operator (mapPartition):

env.fromCollection(folders).mapPartition(...).setParallelism(49)
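Putting the pieces together, the fixed job could look like the sketch below (the input directory and the per-file work are placeholders; the real job would invoke the shell script from the question):

```scala
import java.io.File
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

object DistributedFileJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // hypothetical input directory; replace with the real folder of 49 files
    val folders: Seq[File] = new File("/data/input").listFiles().toSeq
    env.fromCollection(folders)
      .rebalance()              // redistribute the single source split across all slots
      .mapPartition { (files: Iterator[File], out: Collector[Int]) =>
        files.foreach { file =>
          // launch the per-file shell script here, as in the question
          out.collect(1)
        }
      }
      .setParallelism(24)       // one subtask per available task slot
      .print()
  }
}
```

With rebalance() in place, the file names are round-robin distributed to the 24 mapPartition subtasks, so up to 24 files are processed at the same time.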