Optimizing data fetching and insertion in Spring Boot

I have 270,000 records in a CSV file with the columns user_id, book_ISBN, and book_rating, and I need to insert them into a many-to-many table. I parsed the data with the openCSV library, and the result is a list.

public List<UserRatingDto> uploadRatings(MultipartFile file) throws IOException {
        BufferedReader fileReader = new BufferedReader(
                new InputStreamReader(file.getInputStream(), StandardCharsets.UTF_8));

        return new CsvToBeanBuilder<UserRatingDto>(fileReader)
                .withType(UserRatingDto.class)
                .withSeparator(';')
                .withIgnoreEmptyLine(true)
                .withSkipLines(1)
                .build()
                .parse();
    }

There are no performance issues with the parsing itself; it takes approximately 1 minute. However, to insert these records into a table, I need to fetch the books and users from the DB to form the relationships. I tried making the method async with the @Async annotation, I tried a parallel stream, and I tried collecting the objects and using saveAll() to bulk insert, but it still takes far too long.

public void saveRatings(final MultipartFile file) throws IOException {
        List<UserRatingDto> userRatingDtos = uploadRatings(file);

        userRatingDtos.parallelStream().forEach(bookRating -> {
            UserEntity user = userRepository.findByUserId(bookRating.getUserId());
            bookRepository.findByISBN(bookRating.getBookISBN()).ifPresent(book -> {
                BookRating bookRating1 = new BookRating();
                bookRating1.setRating(bookRating.getBookRating());
                bookRating1.setUser(user);
                bookRating1.setBook(book);
                book.getRatings().add(bookRating1);
                user.getRatings().add(bookRating1);
                bookRatingRepository.save(bookRating1);
            });
        });
}

This is what I have now. Is there anything I can change to make this faster?

Elen Mouradian asked Sep 03 '25

1 Answer

The problem is that the data is being fetched and persisted one record at a time. The most performant way to access data is usually in well-defined batches, following this pattern:

  • fetch data required for processing the batch
  • process the batch in memory
  • persist processing results before fetching the next batch

For your specific use case, you can do something like:

    public void saveRatings(final MultipartFile file) throws IOException {
        List<UserRatingDto> userRatingDtos = uploadRatings(file);

        // Split the list into batches
        getBatches(userRatingDtos, 100).forEach(this::processBatch);
    }

    private void processBatch(List<UserRatingDto> userRatingBatch) {
        
        // Retrieve all data required to process a batch
        Map<String, UserEntity> users = userRepository
                .findAllById(userRatingBatch.stream().map(UserRatingDto::getUserId).toList())
                .stream()
                .collect(toMap(UserEntity::getId, user -> user));
        Map<String, Book> books = bookRepository.findAllByIsbn(userRatingBatch.stream().map(UserRatingDto::getBookISBN).toList())
                .stream()
                .collect(toMap(Book::getIsbn, book -> book));

        // Process each rating in memory
        List<BookRating> ratingsToSave = userRatingBatch.stream().map(bookRatingDto -> {
            Book book = books.get(bookRatingDto.getBookISBN());
            if (book == null) {
                return null;
            }
            UserEntity user = users.get(bookRatingDto.getUserId());
            BookRating bookRating = new BookRating();
            bookRating.setRating(bookRatingDto.getBookRating());
            bookRating.setUser(user);
            bookRating.setBook(book);
            book.getRatings().add(bookRating);
            user.getRatings().add(bookRating);
            return bookRating;
        }).filter(Objects::nonNull).toList();

        // Save data in batches
        bookRatingRepository.saveAll(ratingsToSave);
        bookRepository.saveAll(books.values());
        userRepository.saveAll(users.values());

    }

    public <T> List<List<T>> getBatches(List<T> collection, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < collection.size(); i += batchSize) {
            batches.add(collection.subList(i, Math.min(i + batchSize, collection.size())));
        }
        return batches;
    }

Note that all I/O should always be done in batches. If there is a single per-record DB lookup or save left in the inner processing loop, this approach will not help at all.
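One thing to watch out for: with Spring Data JPA, saveAll() only turns into batched JDBC INSERTs if Hibernate batching is enabled; by default each entity still becomes its own statement. A minimal configuration sketch (the property names are standard Hibernate settings; the values are illustrative):

```properties
# Enable JDBC statement batching in Hibernate (illustrative size)
spring.jpa.properties.hibernate.jdbc.batch_size=100
# Group inserts/updates by entity type so they can actually be batched
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
```

Also note that Hibernate disables insert batching for entities with IDENTITY-generated IDs; SEQUENCE-based generation batches fine.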

You can try different batch sizes to see what performs best. Keep in mind that the bigger the batch, the longer transactions remain open, and bigger batches do not always yield better overall performance.

Also, make sure you handle errors gracefully. For example:

  • If a batch throws an error, split it in two and retry each half, recursing until only the single failing rating remains.
  • Retry a failing batch with backoff if, for example, there is a transient DB access problem.
  • Discard a rating if, for example, a required field is null.
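The first bullet, splitting a failing batch until the bad record is isolated, can be sketched generically like this (the class and method names are my own, not from any library; the saver and failure handler are supplied by the caller):

```java
import java.util.List;
import java.util.function.Consumer;

public class BatchRetry {

    /**
     * Try to save the whole batch; on failure, split it in half and
     * recurse. A batch of one that still fails is handed to onFailure
     * and skipped, so one bad record never sinks the whole batch.
     */
    public static <T> void saveWithSplit(List<T> batch,
                                         Consumer<List<T>> saver,
                                         Consumer<T> onFailure) {
        if (batch.isEmpty()) {
            return;
        }
        try {
            saver.accept(batch);
        } catch (RuntimeException e) {
            if (batch.size() == 1) {
                onFailure.accept(batch.get(0)); // isolated the bad record
                return;
            }
            int mid = batch.size() / 2;
            saveWithSplit(batch.subList(0, mid), saver, onFailure);
            saveWithSplit(batch.subList(mid, batch.size()), saver, onFailure);
        }
    }
}
```

In your case the saver would be a call to bookRatingRepository.saveAll(...) inside its own transaction, so a rollback only discards the half-batch being retried.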

EDIT: As per OP's comment, this change improved performance more than 10x. Also, if ordering is not important, performance can be improved further by processing the batches in parallel.

EDIT 2: As a general pattern, ideally we would not hold all the records in memory to begin with; the input would also be read in batches. This further improves performance and avoids OOM errors.
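A generic way to sketch that: drain an iterator in fixed-size chunks so only one batch is ever in memory (openCSV's CsvToBean is Iterable, so build() without parse() gives you such an iterator; the utility class below is my own naming):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class StreamingBatches {

    /**
     * Pull fixed-size chunks from the iterator and hand each to the
     * processor before reading the next, so memory use stays bounded
     * by the batch size rather than the file size.
     */
    public static <T> void forEachBatch(Iterator<T> source, int batchSize,
                                        Consumer<List<T>> processor) {
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                processor.accept(batch);
                batch = new ArrayList<>(batchSize); // start a fresh chunk
            }
        }
        if (!batch.isEmpty()) {
            processor.accept(batch); // trailing partial batch
        }
    }
}
```

With this, saveRatings becomes forEachBatch(csvToBean.iterator(), 100, this::processBatch) and the full 270,000-element list never exists.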

Also, this can be implemented with many concurrency patterns, for example dedicated threads to fetch data, worker threads to process it, and another set of threads to persist the results.
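A minimal sketch of that staged pattern, assuming a pool of workers that process batches and a single persister draining their results through a bounded queue (all names here are illustrative, not from any framework):

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Function;

public class BatchPipeline {

    /**
     * Workers process batches in parallel and put results on a bounded
     * queue; a single persister loop drains exactly one result per
     * submitted batch, so writes go through one sink and termination
     * is deterministic.
     */
    public static <I, O> void run(List<I> batches,
                                  Function<I, O> process,
                                  Consumer<O> persist,
                                  int workers) throws InterruptedException {
        BlockingQueue<O> results = new ArrayBlockingQueue<>(workers * 2);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (I batch : batches) {
                pool.submit(() -> {
                    try {
                        results.put(process.apply(batch)); // blocks if queue full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Persister: one take() per submitted batch, then we are done.
            for (int i = 0; i < batches.size(); i++) {
                persist.accept(results.take());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The bounded queue gives natural backpressure: workers stall when the persister falls behind, instead of piling results up in memory.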

The easiest pattern is making each unit of work independent: it is given what to process (e.g. a set of IDs to fetch from the DB), retrieves the data it needs, processes it in memory, and persists the results.

Tomaz Fernandes answered Sep 04 '25