 

Best way to export a million records formatted as JSON?

I have a request to export roughly a million records for a one-time migration. The records need to be exported in JSON format, adhering to the same API contract object structure we normally expose to another team.

The object I need to expose is composed of data from two tables, and I have a conditional in place to limit the result set from several million rows to 1 million.

The batching logic below does an initial fetch, then pulls the records in batches of 1000 and adds them to a list that I eventually want to expose.

Service method below:

public List<NewEntity> getNewEntities() {
    int i = 0;
    int BATCH_SIZE = 1000;

    List<Entity> totalEntityList = new ArrayList<>();

    List<Entity> entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
        .setFirstResult(i)
        .setMaxResults(BATCH_SIZE)
        .getResultList();
    // keep the first batch too; the loop below only adds subsequent batches
    totalEntityList.addAll(entityList);

    while (entityList.size() == BATCH_SIZE) {
        i += BATCH_SIZE;
        entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
            .setFirstResult(i)
            .setMaxResults(BATCH_SIZE)
            .getResultList();
        totalEntityList.addAll(entityList);
    }

    return totalEntityList.stream()
        .map(entity -> entity.toNewEntity())
        .collect(toList());
}

This process was taking 2.5 hours (with a batch size of 1000) and 3.5 hours (with a batch size of 2000). The first time I ran it, I had a bug where I didn't save all of the records to the list, so I ran it again overnight. When I checked in the morning it had taken even longer, and my IDE froze, so I don't know why the request eventually failed with a 500 error. I force quit my IDE and am trying again now, but I won't know if it works for another 2 to 3 hours. This is very slow.

Is there a way to improve how I'm doing this? Any other approaches here?

I'm using Spring Boot, Java, JPA.

Thanks!

UPDATE 2: after adjusting per recommendations from the comments:

public void getNewEntities() {
    int i = 0;
    int BATCH_SIZE = 1000;
    Gson gson = new Gson();

    // try-with-resources closes the writer even if a batch fails;
    // com.google.gson.stream.JsonWriter streams one well-formed JSON array
    // instead of concatenating a separate array per batch
    try (JsonWriter writer = new JsonWriter(new FileWriter("new_entities.json"))) {
        writer.beginArray();
        List<Entity> entityList;
        do {
            entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                .setFirstResult(i)
                .setMaxResults(BATCH_SIZE)
                .getResultList();
            for (Entity entity : entityList) {
                gson.toJson(entity, Entity.class, writer);
            }
            i += BATCH_SIZE;
        } while (entityList.size() == BATCH_SIZE);
        writer.endArray();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

Is this a better (correct?) approach? I'm letting it run right now, and I'm not sure how long it will take to write 1,000,000 records to a JSON file, but let me know if this is on the right track. Currently it's been two hours and it's only about 450,000 records in!

UPDATE 3: adding the Entity for visibility. I'm now assuming the issue could be the mapping of locations, where I convert to a list at the end. Should I just keep it as a stream instead? I obviously have a lot more fields, but I removed the ones that are just simple columns with no complexity to them.

@Builder
@Getter
@Entity
@AllArgsConstructor
@NoArgsConstructor
@Table(name = "TABLE_NAME", schema = "SCHEMA")
@NamedQuery(name = SELECT_NEW_ENTITIES, query = FIND_NEW_ENTITIES)
public class Entity { 

    public static final String SELECT_NEW_ENTITIES = "SELECT_NEW_ENTITIES";

    protected static final String FIND_NEW_ENTITIES = "" + 
        "SELECT a FROM Entity a " + 
        "WHERE a.code IS NOT NULL " + 
        "AND a.someId IS NOT NULL " +
        "ORDER BY a.id ";
    
    @Id 
    @Column(name = "NEW_ENTITY_ID") 
    private Long id;

    @Column(name = "SHARED_ID") 
    private Long sharedId;

    @OneToMany(cascade = CascadeType.ALL) 
    @JoinColumn(name = "NEW_ENTITY_ID")
    @Builder.Default 
    private List<Location> locations = new ArrayList<>();
    
    @Formula(value = "(SELECT eng.SOME_ID from SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)") 
    private String someId;
    
    @Formula(value = "(SELECT eng.CODE from SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)") 
    private String code;

    public NewEntity toNewEntity() {
        return NewEntity.builder()
            .newEntityId(this.getId())
            .code(code)
            .locations(mapLocations())
            .build();
    }

    private List<LocationModel> mapLocations() {
        return this.getLocations()
            .stream()
            .map(location -> LocationModel.builder()
                .id(location.getId())
                .newEntityId(location.getNewEntityId())
                .city(location.getCity())
                .state(location.getState())
                .country(location.getCountry())
                .build())
            .collect(toList());
    }
}
1 Answer

The way I see it, you are fetching a million records as entities. After the query executes, the result set is mapped onto the Entity class. In your case, two @Formula subqueries are executed for each record, and the list of locations is populated for each record as well. I suggest fetching the data directly via native queries.

Here is some pseudocode:

@Transactional(readOnly=true)
public void writeMyEntitiesToJsonFile() {
        Gson gson = new Gson();
        //try-with-resources ensures the writer is closed and flushed even on failure
        try (FileWriter file = new FileWriter("new_entities.json")) {

            //this query will fetch all the records from your table name
            String rawQuery = "SELECT \n"
                    + "t.NEW_ENTITY_ID AS id,\n"
                    + "t.SHARED_ID AS sharedId,\n"
                    + "a.SOME_ID AS someId,\n"
                    + "a.CODE AS code\n"
                    + "FROM TABLE_NAME t\n"
                    + "JOIN ANOTHER_TABLE a \n"
                    + "ON t.SHARED_ID=a.SHARED_ID";

            List<Object[]> newEntityRawList = em.createNativeQuery(rawQuery)
                    .getResultList();

            //mapping all those Object[] into a DTO> EntityDTO
            List<EntityDTO> newEntityDTOList = newEntityRawList
                    .stream()
                    .map(obj -> new EntityDTO((Number) obj[0], (Number) obj[1], (String) obj[2], (String) obj[3]))
                    .collect(Collectors.toList());

            //this query will fetch all the location from your table
            String locationQuery = "SELECT \n"
                    + "l.newEntityId,\n"
                    + "l.id,\n"
                    + "l.city,\n"
                    + "l.state,\n"
                    + "l.country \n"
                    + "FROM location l";
            List<Object[]> locationRawList = em.createNativeQuery(locationQuery)
                    .getResultList();

            //mapping all those Object[] into a DTO> LocationDTO
            List<LocationDTO> locationDTOList = locationRawList
                    .stream()
                    .map(ob -> new LocationDTO((Number) ob[0], (Number) ob[1], (String) ob[2], (String) ob[3], (String) ob[4]))
                    .collect(Collectors.toList());

            //Using lambda to get a map with key> new entity id. and values> List of Location 
            Map<Long, List<LocationDTO>> mapOfNewEntityIdAndLocationList = locationDTOList
                    .stream()
                    .collect(Collectors.groupingBy(LocationDTO::getNewEntityId));

            //setting the list of locations on each EntityDTO;
            //getOrDefault avoids a null list for entities that have no locations
            for (EntityDTO ne : newEntityDTOList) {
                ne.setLocations(mapOfNewEntityIdAndLocationList.getOrDefault(ne.getId(), new ArrayList<>()));
            }
            
            //writing data to file
            file.write(gson.toJson(newEntityDTOList));
        } catch (IOException io) {
            io.printStackTrace();
        }
    }

    @Getter
    public class EntityDTO {

        private final long id;
        private final long sharedId;
        private final String someId;
        private final String code;

        @Setter
        private List<LocationDTO> locations;

        public EntityDTO(Number id,
                Number sharedId,
                String someId,
                String code) {
            this.id = id.longValue();
            this.sharedId = sharedId.longValue();
            this.someId = someId;
            this.code = code;
        }
    }

    @Getter
    public class LocationDTO {

        private final long id;
        private final long newEntityId;
        private final String city;
        private final String state;
        private final String country;

        public LocationDTO(Number id,
                Number newEntityId,
                String city,
                String state,
                String country) {
            this.id = id.longValue();
            this.newEntityId = newEntityId.longValue();
            this.city = city;
            this.state = state;
            this.country = country;
        }

    }
    

Instead of using the constructors, you can also use @SqlResultSetMapping; a sketch is below.
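
A minimal sketch of what that could look like, assuming the column aliases from the native query above. The mapping name "EntityDTOMapping" is made up, the annotation has to sit on an @Entity class (not on the DTO), and EntityDTO may need a constructor whose parameter types match the @ColumnResult types exactly:

//declared on some @Entity class; "EntityDTOMapping" is an assumed name
@SqlResultSetMapping(
    name = "EntityDTOMapping",
    classes = @ConstructorResult(
        targetClass = EntityDTO.class,
        columns = {
            @ColumnResult(name = "id", type = Long.class),
            @ColumnResult(name = "sharedId", type = Long.class),
            @ColumnResult(name = "someId", type = String.class),
            @ColumnResult(name = "code", type = String.class)
        }))

With that in place, em.createNativeQuery(rawQuery, "EntityDTOMapping").getResultList() returns EntityDTO instances directly, so the manual Object[] casting goes away (depending on your database, you may need to match the case of the column aliases).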

Of course, this will still be an expensive operation, and you have already worked out the batching yourself from the other answers.

This approach should eliminate the additional expensive per-row queries.

I did not run this code, as I don't have any data to work with.

P.S. Please do try the batching too. Setting maxResults for the first native query is easy; fetching the corresponding data with the second query will be a little trickier, as you will need a WHERE NEW_ENTITY_ID BETWEEN min AND max clause using the ids obtained from the first result (this should not be very hard, since you know how the data sits in your database tables). A rough sketch follows.
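
For illustration only, a sketch of one such id-range batch; em and rawQuery are the ones from the pseudocode above, and the LOCATION table and column names are assumptions to adjust to your schema:

//sketch: fetch one page of entities, then only the locations whose
//NEW_ENTITY_ID falls inside that page's id range
private void processOneBatch(EntityManager em, String rawQuery, int offset, int batchSize) {
    List<Object[]> entityBatch = em.createNativeQuery(rawQuery + " ORDER BY t.NEW_ENTITY_ID")
            .setFirstResult(offset)
            .setMaxResults(batchSize)
            .getResultList();
    if (entityBatch.isEmpty()) {
        return; //no more pages
    }

    //the page is ordered by id, so its min and max bound the locations we need
    long minId = ((Number) entityBatch.get(0)[0]).longValue();
    long maxId = ((Number) entityBatch.get(entityBatch.size() - 1)[0]).longValue();

    List<Object[]> locationBatch = em.createNativeQuery(
            "SELECT l.NEW_ENTITY_ID, l.ID, l.CITY, l.STATE, l.COUNTRY "
                    + "FROM LOCATION l WHERE l.NEW_ENTITY_ID BETWEEN ?1 AND ?2")
            .setParameter(1, minId)
            .setParameter(2, maxId)
            .getResultList();

    //map both lists to DTOs, group the locations by NEW_ENTITY_ID, and
    //append the batch to the JSON file exactly as in the code above
}

Once each batch is mapped and appended to the file, the intermediate lists can be discarded so memory stays flat. Keyset pagination (WHERE t.NEW_ENTITY_ID > :lastSeenId ORDER BY t.NEW_ENTITY_ID) is also worth a try, since it avoids the growing offset cost that makes setFirstResult slower and slower on later pages.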

Good luck!
