I have a request to export roughly a million records for a one-time migration. The records need to be exported in JSON format, adhering to the same API contract object structure we normally expose to another team.
The object I need to expose is composed of data from two tables, and I have a condition in place that limits the result set from several million rows down to about one million.
The batching logic below does an initial fetch, then pulls the remaining records in batches of 1,000 and adds them to a list that I eventually return.
Service method below:
public List<NewEntity> getNewEntities() {
    int i = 0;
    int BATCH_SIZE = 1000;
    List<Entity> totalEntityList = new ArrayList<>();
    List<Entity> entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
            .setFirstResult(i)
            .setMaxResults(BATCH_SIZE)
            .getResultList();
    totalEntityList.addAll(entityList); // don't drop the first batch
    while (entityList.size() == BATCH_SIZE) {
        i += BATCH_SIZE;
        entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                .setFirstResult(i)
                .setMaxResults(BATCH_SIZE)
                .getResultList();
        totalEntityList.addAll(entityList);
    }
    return totalEntityList.stream()
            .map(Entity::toNewEntity)
            .collect(toList());
}
This process was taking 2.5 hours with a batch size of 1,000 and 3.5 hours with a batch size of 2,000. The first time I ran it, I had a bug where I didn't save all of the batches to the list, so I ran it again overnight. When I checked in the morning it had taken even longer, my IDE froze, and the request eventually failed with a 500 error; I don't know why. I force-quit my IDE and am trying again now, but I won't know whether it worked for another 2 to 3 hours. This is very slow.
Is there a way to improve how I'm doing this? Any other approaches here?
I'm using Spring Boot, Java, JPA.
Thanks!
EDIT: UPDATE 2 after adjusting per recommendations from the comments:
public void getNewEntities() {
    int i = 0;
    int BATCH_SIZE = 1000;
    Gson gson = new Gson(); // reuse one instance instead of creating one per batch
    List<Entity> entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
            .setFirstResult(i)
            .setMaxResults(BATCH_SIZE)
            .getResultList();
    try (FileWriter file = new FileWriter("new_entities.json")) { // try-with-resources closes the file even on failure
        file.write(gson.toJson(entityList)); // write the first batch too
        while (entityList.size() == BATCH_SIZE) {
            i += BATCH_SIZE;
            entityList = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                    .setFirstResult(i)
                    .setMaxResults(BATCH_SIZE)
                    .getResultList();
            // note: this appends one JSON array per batch, so the file is not a single valid JSON document
            file.write(gson.toJson(entityList));
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
Is this a better (correct?) approach? I'm letting it run right now, and I'm not sure how long it will take to write 1,000,000 records to a JSON file, but let me know if this is on the right track. Currently it's been two hours and it's only about 450,000 records in!
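For what it's worth, one idea I'm considering (untested, so treat it as a sketch; the method name is just for illustration) is streaming each batch through Gson's JsonWriter so the output file ends up as one valid JSON array instead of many concatenated arrays:
// needs com.google.gson.stream.JsonWriter in addition to Gson
public void exportNewEntities() {
    int BATCH_SIZE = 1000;
    Gson gson = new Gson();
    try (JsonWriter writer = new JsonWriter(new FileWriter("new_entities.json"))) {
        writer.beginArray();
        int offset = 0;
        List<Entity> batch;
        do {
            batch = entityManager.createNamedQuery(Entity.SELECT_NEW_ENTITIES, Entity.class)
                    .setFirstResult(offset)
                    .setMaxResults(BATCH_SIZE)
                    .getResultList();
            for (Entity entity : batch) {
                // write one element at a time instead of buffering everything in memory
                gson.toJson(entity.toNewEntity(), NewEntity.class, writer);
            }
            entityManager.clear(); // assumption: detaching entities keeps the persistence context small
            offset += BATCH_SIZE;
        } while (batch.size() == BATCH_SIZE);
        writer.endArray();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
I realize this would only fix the output format, not the speed.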
UPDATE 3: adding the Entity for visibility. I'm now assuming the issue could be the mapping of locations, where I convert them to a list at the end. Should I just keep it as a stream instead? I obviously have a lot more fields, but I removed the ones that were just simple column fields with no complexity to them.
@Builder
@Getter
@Entity
@AllArgsConstructor
@NoArgsConstructor
@Table(name = "TABLE_NAME", schema = "SCHEMA")
@NamedQuery(name = SELECT_NEW_ENTITIES, query = FIND_NEW_ENTITIES)
public class Entity {

    public static final String SELECT_NEW_ENTITIES = "SELECT_NEW_ENTITIES";

    protected static final String FIND_NEW_ENTITIES = "" +
            "SELECT a FROM Entity a " +
            "WHERE a.code IS NOT NULL " +
            "AND a.someId IS NOT NULL " +
            "ORDER BY a.id ";

    @Id
    @Column(name = "NEW_ENTITY_ID")
    private Long id;

    @Column(name = "SHARED_ID")
    private Long sharedId;

    @OneToMany(cascade = CascadeType.ALL)
    @JoinColumn(name = "NEW_ENTITY_ID")
    @Builder.Default
    private List<Location> locations = new ArrayList<>();

    @Formula(value = "(SELECT eng.SOME_ID FROM SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)")
    private String someId;

    @Formula(value = "(SELECT eng.CODE FROM SCHEMA.ANOTHER_TABLE eng WHERE eng.SHARED_ID = SHARED_ID)")
    private String code;

    public NewEntity toNewEntity() {
        return NewEntity
                .builder()
                .newEntityId(this.getId())
                .code(code)
                .locations(mapLocations())
                .build();
    }

    private List<LocationModel> mapLocations() {
        return this.getLocations()
                .stream()
                .map(location -> LocationModel.builder()
                        .id(location.getId())
                        .newEntityId(location.getNewEntityId())
                        .city(location.getCity())
                        .state(location.getState())
                        .country(location.getCountry())
                        .build())
                .collect(toList());
    }
}
The way I see this: you are fetching a million records as entities. After the query executes, the result set is mapped onto the Entity class; in your case, the two @Formula subqueries run for every single record, and the list of locations is populated for every record too. I suggest fetching the data directly via native queries.
Here is pseudo-code:
@Transactional(readOnly = true)
public void writeMyEntitiesToJsonFile() {
    Gson gson = new Gson();
    // try-with-resources so the file is closed even if something fails mid-write
    try (FileWriter file = new FileWriter("new_entities.json")) {
        // this query fetches all the records from your table in one round trip
        String rawQuery = "SELECT \n"
                + "t.NEW_ENTITY_ID AS id,\n"
                + "t.SHARED_ID AS sharedId,\n"
                + "a.SOME_ID AS someId,\n"
                + "a.CODE AS code\n"
                + "FROM TABLE_NAME t\n"
                + "JOIN ANOTHER_TABLE a \n"
                + "ON t.SHARED_ID=a.SHARED_ID";
        List<Object[]> newEntityRawList = em.createNativeQuery(rawQuery)
                .getResultList();
        // map each Object[] row onto a DTO: EntityDTO
        List<EntityDTO> newEntityDTOList = newEntityRawList
                .stream()
                .map(obj -> new EntityDTO((Number) obj[0], (Number) obj[1], (String) obj[2], (String) obj[3]))
                .collect(Collectors.toList());
        // this query fetches all the locations from your location table
        String locationQuery = "SELECT \n"
                + "l.newEntityId,\n"
                + "l.id,\n"
                + "l.city,\n"
                + "l.state,\n"
                + "l.country \n"
                + "FROM location l";
        List<Object[]> locationRawList = em.createNativeQuery(locationQuery)
                .getResultList();
        // map each Object[] row onto a DTO: LocationDTO
        List<LocationDTO> locationDTOList = locationRawList
                .stream()
                .map(ob -> new LocationDTO((Number) ob[0], (Number) ob[1], (String) ob[2], (String) ob[3], (String) ob[4]))
                .collect(Collectors.toList());
        // group the locations: key = new entity id, value = that entity's list of locations
        Map<Long, List<LocationDTO>> mapOfNewEntityIdAndLocationList = locationDTOList
                .stream()
                .collect(Collectors.groupingBy(LocationDTO::getNewEntityId));
        // attach the list of locations to each EntityDTO (empty list if an entity has none)
        for (EntityDTO ne : newEntityDTOList) {
            ne.setLocations(mapOfNewEntityIdAndLocationList.getOrDefault(ne.getId(), new ArrayList<>()));
        }
        // write the data to the file
        file.write(gson.toJson(newEntityDTOList));
    } catch (IOException io) {
        io.printStackTrace();
    }
}
@Getter
public class EntityDTO {

    private final long id;
    private final long sharedId;
    private final String someId;
    private final String code;

    @Setter
    private List<LocationDTO> locations;

    public EntityDTO(Number id,
                     Number sharedId,
                     String someId,
                     String code) {
        this.id = id.longValue();
        this.sharedId = sharedId.longValue();
        this.someId = someId;
        this.code = code;
    }
}

@Getter
public class LocationDTO {

    private final long id;
    private final long newEntityId;
    private final String city;
    private final String state;
    private final String country;

    public LocationDTO(Number id,
                       Number newEntityId,
                       String city,
                       String state,
                       String country) {
        this.id = id.longValue();
        this.newEntityId = newEntityId.longValue();
        this.city = city;
        this.state = state;
        this.country = country;
    }
}
Instead of using the constructors, you can also use SqlResultSetMapping.
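A sketch of what that could look like (untested; the mapping name is illustrative, the annotation has to sit on an @Entity class or in orm.xml, and it assumes the DTO constructor parameter types line up with the column types):
// javax.persistence.SqlResultSetMapping, ConstructorResult, ColumnResult
@SqlResultSetMapping(
        name = "EntityDTOMapping",
        classes = @ConstructorResult(
                targetClass = EntityDTO.class,
                columns = {
                        @ColumnResult(name = "id", type = Long.class),
                        @ColumnResult(name = "sharedId", type = Long.class),
                        @ColumnResult(name = "someId", type = String.class),
                        @ColumnResult(name = "code", type = String.class)
                }))

// then the native query can return DTOs directly, with no manual Object[] casting:
List<EntityDTO> newEntityDTOList =
        em.createNativeQuery(rawQuery, "EntityDTOMapping").getResultList();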
Of course, this will still be an expensive operation, and you have already figured out the batching yourself from the other answers here. This approach should eliminate the additional expensive per-row queries (the @Formula subqueries and the per-entity locations fetch).
I did not run this code; I don't have any data to work with.
P.S. Please do try the batching too. Applying setMaxResults to the first native query is easy; fetching the corresponding data with the second query is a little trickier, as you will need a WHERE NEW_ENTITY_ID BETWEEN the minimum and maximum ids obtained from the first result (which should not be very hard, since you know how the data sits in your database tables).
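Something along these lines, reusing rawQuery and locationQuery from above and a BATCH_SIZE constant like the one in your code (untested pseudo-code; adjust names to your schema):
int offset = 0;
List<Object[]> page;
do {
    // page the first query; add an ORDER BY so the paging is stable
    page = em.createNativeQuery(rawQuery + " ORDER BY t.NEW_ENTITY_ID")
            .setFirstResult(offset)
            .setMaxResults(BATCH_SIZE)
            .getResultList();
    if (page.isEmpty()) {
        break;
    }
    long minId = ((Number) page.get(0)[0]).longValue();               // smallest id in this page
    long maxId = ((Number) page.get(page.size() - 1)[0]).longValue(); // largest id in this page
    // fetch only the locations belonging to this page of entities
    List<Object[]> locationPage = em.createNativeQuery(
            locationQuery + " WHERE l.newEntityId BETWEEN :minId AND :maxId")
            .setParameter("minId", minId)
            .setParameter("maxId", maxId)
            .getResultList();
    // map page + locationPage to DTOs and append them to the JSON file as shown above
    offset += BATCH_SIZE;
} while (page.size() == BATCH_SIZE);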
Good luck!