
Java8 Stream batch processing to avoid OutOfMemory

I'm having something like:

    List<Data> dataList = steps.stream()
        .flatMap(step -> step.getPartialDataList().stream())
        .collect(Collectors.toList());

So I'm combining into dataList multiple lists from every step.

My problem is that dataList might cause an OutOfMemoryError. Any suggestions on how I can batch dataList and save the batches into the db?

My primitive idea is to:

    List<Data> dataList = new ArrayList<>();
    for (Step step : steps) {
        List<Data> partialDataList = step.getPartialDataList();

        // flush the current batch before it would exceed the limit
        if (dataList.size() + partialDataList.size() > MAXIMUM_SIZE) {
            saveIntoDb(dataList);
            dataList = new ArrayList<>();
        }
        dataList.addAll(partialDataList);
    }
    // save the last, possibly partial batch
    if (!dataList.isEmpty()) {
        saveIntoDb(dataList);
    }

PS: I know there is this post, but the difference is that I might not be able to store the whole data set in memory.

LE: the getPartialDataList method is more like createPartialDataList(), i.e. the data is created on demand rather than already held in memory.

asked Nov 19 '19 by UnguruBulan

People also ask

Is Java 8 stream faster than for loop?

Yes, streams are sometimes slower than loops, but they can also be equally fast; it depends on the circumstances. The point to take home is that sequential streams are no faster than loops.

Does Java stream save memory?

No storage. Streams don't have storage for values; they carry values from a source (which could be a data structure, a generating function, an I/O channel, etc.) through a pipeline of computational steps.
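
A minimal sketch of that idea (class and variable names are my own): values flow from the source through the pipeline and are consumed by the terminal operation without the stream itself holding them:

    import java.util.Arrays;

    public class NoStorageDemo {
        public static void main(String[] args) {
            // source: an existing data structure
            int sum = Arrays.asList(1, 2, 3).stream()
                .mapToInt(n -> n * 2) // values flow through; the stream stores nothing
                .sum();               // terminal operation consumes them
            System.out.println(sum);  // prints 12
        }
    }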

What is the disadvantage of parallel stream in Java 8?

Parallel streams can actually slow you down. The framework breaks a task into subproblems, which then run on separate threads for processing; these can go to different cores and their results are combined when they're done. This all happens under the hood using the fork/join framework.
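
A minimal sketch of that fan-out (the thread names printed depend on the machine and the common fork/join pool):

    import java.util.Arrays;

    public class ParallelDemo {
        public static void main(String[] args) {
            // each element may be processed on a different fork/join worker thread
            Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8).parallelStream()
                .forEach(n -> System.out.println(
                    n + " handled on " + Thread.currentThread().getName()));
        }
    }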

Why streams in Java 8 are lazy?

Streams are lazy because intermediate operations are not evaluated until a terminal operation is invoked. Each intermediate operation creates a new stream, stores the provided operation/function, and returns the new stream. The pipeline accumulates these newly created streams.
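
A minimal sketch of this laziness: the peek callback only runs once the terminal forEach is invoked:

    import java.util.stream.Stream;

    public class LazyDemo {
        public static void main(String[] args) {
            Stream<String> s = Stream.of("a", "b", "c")
                .peek(x -> System.out.println("intermediate: " + x)); // prints nothing yet

            System.out.println("before terminal operation");
            s.forEach(x -> System.out.println("terminal: " + x)); // now peek fires too
        }
    }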


1 Answer

If your concern is OutOfMemoryError you probably shouldn't create additional intermediate data structures like lists or streams before saving to the database.

Since Step.getPartialDataList() already returns a List<Data>, the data is already in memory, unless you have your own List implementation. You just need to use a JDBC batch insert:

    // assuming c is an open java.sql.Connection
    try (PreparedStatement ps = c.prepareStatement("INSERT INTO data VALUES (?, ?, ...)")) {
        for (Step step : steps) {
            for (Data data : step.getPartialDataList()) {
                ps.setString(1, ...);
                ps.setString(2, ...);
                ...
                ps.addBatch(); // queue the row instead of executing it immediately
            }
        }
        ps.executeBatch(); // execute every queued row
    }

There is no need to chunk into smaller batches prematurely with dataList. First see what your database and JDBC driver support before optimizing.

Do note that for most databases the right way to insert a large amount of data is a bulk-loading utility rather than JDBC, e.g. PostgreSQL has COPY.
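
For example, a minimal sketch using the PostgreSQL JDBC driver's CopyManager; the connection URL, table name, and CSV layout here are assumptions for illustration:

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;

    public class CopyDemo {
        public static void main(String[] args) throws Exception {
            // hypothetical database and table
            try (Connection c = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "pass")) {
                CopyManager copy = new CopyManager((BaseConnection) c);
                // stream CSV rows straight into the table, bypassing per-row INSERTs
                long rows = copy.copyIn(
                    "COPY data (col1, col2) FROM STDIN WITH (FORMAT csv)",
                    new StringReader("a,1\nb,2\n"));
                System.out.println(rows + " rows loaded");
            }
        }
    }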

answered Sep 18 '22 by Karol Dowbecki