I have to process a huge amount of data distributed over 20 tables (~5 million records in total) and I need to load them efficiently.
I'm using Wildfly 14 and JPA/Hibernate.
Since in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory, simply via:
em.createQuery("SELECT e FROM Entity e").getResultList().size();
After that, every object should be available in the transaction and thus be available via:
em.find(Entity.class, id);
But somehow this doesn't work; there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables including the relationships and make sure I got everything / there will be no further DB calls?
What I already tried: em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
My plan is to load and manage the entire data in a @Singleton bean. But I want to make sure I'm loading it in the most efficient way and be sure the entire data is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (EJB timer), I'm going to discard the entire data and reload the current state from the DB (always whole tables). A rough sketch of the plan is shown below.
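Roughly, what I have in mind is something like this (MyEntity stands for each of the 20 entities; DataHolder and the schedule are placeholders, not actual code from my project):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Singleton
public class DataHolder {

    @PersistenceContext
    private EntityManager em;

    // immutable snapshot of one table, keyed by primary key
    private volatile Map<Long, MyEntity> entitiesById = new ConcurrentHashMap<>();

    @Schedule(hour = "*", minute = "0", persistent = false) // reload every hour
    public void reload() {
        Map<Long, MyEntity> fresh = new ConcurrentHashMap<>();
        for (MyEntity e : em.createQuery("SELECT e FROM MyEntity e", MyEntity.class).getResultList()) {
            fresh.put(e.getId(), e);
        }
        entitiesById = fresh; // swap the snapshot, old data gets discarded
    }

    public MyEntity find(long id) {
        return entitiesById.get(id);
    }
}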
In JPA we use the @ManyToMany annotation to model many-to-many relationships. This type of relationship can be unidirectional or bidirectional: in a unidirectional relationship only one entity in the relationship points to the other; in a bidirectional relationship both entities point to each other.
Implementation in JPA: modeling a many-to-many relationship with POJOs is easy. We should include a Collection in both classes, each containing the elements of the other. After that, we mark the classes with @Entity and the primary keys with @Id to make them proper JPA entities.
The purpose of the mappedBy parameter is to instruct JPA: do NOT create another join table, as the relationship is already being mapped by the opposite entity of this relationship. For example:
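A minimal sketch, with two illustrative entities Student and Course (names not taken from the question):

import java.util.Set;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.JoinTable;
import javax.persistence.ManyToMany;

@Entity
public class Student {
    @Id
    private long id;

    // owning side: defines the join table once
    @ManyToMany
    @JoinTable(name = "STUDENT_COURSE",
            joinColumns = @JoinColumn(name = "STUDENT_ID"),
            inverseJoinColumns = @JoinColumn(name = "COURSE_ID"))
    private Set<Course> courses;
}

@Entity
public class Course {
    @Id
    private long id;

    // inverse side: mappedBy reuses Student.courses, no second join table
    @ManyToMany(mappedBy = "courses")
    private Set<Student> students;
}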
Keep in mind that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at the Hibernate 2nd level cache. One thing to check for, since we don't have your code: the @Cacheable annotation will clue Hibernate in so that the entity is cacheable.
If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, to not use Hibernate/JPA, or to not use an app server. This will give you more control over how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say which direction would be best for you.
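To illustrate the @Cacheable point, a minimal sketch (the READ_ONLY strategy is an assumption based on the immutable data described in the question):

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// requires <shared-cache-mode>ENABLE_SELECTIVE</shared-cache-mode>
// (or ALL) in persistence.xml so the 2nd level cache is active
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY) // data is immutable for a while
public class Employee {

    @Id
    private long id;

    // ...
}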
I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? 100 bytes gives 500 megabytes of memory that'll just crash your untweaked JVM. It's probably more like 5000 bytes on average, and that's 25 GB of memory. You need to think about what you're asking for.
If you want it cached, you should do that yourself, or better yet just use the results when you have them. If you want memory-based data access, you should look at a technology made specifically for that. http://www.ehcache.org/ seems popular, but it's up to you, and you should be sure you understand your use case first.
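For illustration, a sketch assuming the Ehcache 3 API (cache name, types and sizing are made up):

import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;

Cache<Long, Employee> buildEmployeeCache() {
    // heap-limited cache for up to 5 million employee records
    CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
            .withCache("employees",
                    CacheConfigurationBuilder.newCacheConfigurationBuilder(
                            Long.class, Employee.class,
                            ResourcePoolsBuilder.heap(5_000_000)))
            .build(true); // true = initialize the manager right away
    return cacheManager.getCache("employees", Long.class, Employee.class);
}

// usage: cache.put(employee.getId(), employee); then cache.get(id) later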
If you are trying to be database-efficient, then you should just understand what you're doing, and design and test carefully.
Basically it should be a pretty easy task to load entire tables with one query per table and link the objects, but JPA works differently, as shown in this example.
The biggest problems are @OneToMany/@ManyToMany relations:
@Entity
public class Employee {
    @Id
    @Column(name = "EMP_ID")
    private long id;
    ...
    // inverse side: the FK column lives in the PHONE table
    @OneToMany(mappedBy = "owner")
    private List<Phone> phones;
    ...
}

@Entity
public class Phone {
    @Id
    private long id;
    ...
    // owning side: maps the OWNER_ID FK column
    @ManyToOne
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
    ...
}
FetchType.EAGER
If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and, right after it, SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the 1+n problem.
I could avoid the 1+n problem by using the JPQL query SELECT e FROM Employee e LEFT JOIN FETCH e.phones, which will result in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID. (A plain JOIN FETCH would translate to an inner join and drop Employees without Phones.)
The problem is, this won't work for a complex data model with ~20 tables involved.
FetchType.LAZY
If the relation is defined as FetchType.LAZY, the query SELECT e FROM Employee e will just load all Employees with proxied phones collections, which are populated only when phones is accessed; in the end this leads to the 1+n problem as well.
To avoid this, it is pretty obvious to just load all the Phones into the same session via SELECT p FROM Phone p. But when accessing phones, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all Phones are already in its current session.
Even when using the 2nd level cache, the statement will be executed against the DB, because Phone is indexed by its primary key in the 2nd level cache, not by OWNER_ID.
Conclusion
There is no mechanism like "just load all data" in Hibernate.
It seems there is no other way than to keep the relationships transient and connect them manually, or even just use plain old JDBC; a manual linking step could look like the sketch below.
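A hypothetical sketch of such manual linking, assuming the relations are kept transient and the FK is mapped as a plain column (getOwnerId(), getPhones() and setOwner() are assumed accessors):

import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import javax.persistence.EntityManager;

// one query per table, then wire the objects up in memory
void loadAndLink(EntityManager em) {
    Map<Long, Employee> employeesById =
            em.createQuery("SELECT e FROM Employee e", Employee.class)
              .getResultList().stream()
              .collect(Collectors.toMap(Employee::getId, Function.identity()));

    for (Phone phone : em.createQuery("SELECT p FROM Phone p", Phone.class).getResultList()) {
        // OWNER_ID mapped as a basic column here, the relation itself is transient
        Employee owner = employeesById.get(phone.getOwnerId());
        owner.getPhones().add(phone);  // fill the transient collection manually
        phone.setOwner(owner);         // set the transient back-reference
    }
}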
EDIT:
I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany relations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne relations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. Besides adding javax.persistence.Cacheable(true) to all entities, I added org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd level cache. I disabled 2nd level cache timeout eviction and "warm up" the 2nd level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache: there are no further DB calls until I manually clear it.
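A condensed sketch of that setup, using the Employee/Phone example from above (READ_ONLY reflects my immutable data; the eviction settings live in the cache provider's configuration and are not shown):

import java.util.List;
import javax.annotation.PostConstruct;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.persistence.*;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

@Entity
@Cacheable(true)
public class Employee {

    @Id
    @Column(name = "EMP_ID")
    private long id;

    // EAGER + SUBSELECT: one extra query loads the phones of ALL Employees
    @OneToMany(mappedBy = "owner", fetch = FetchType.EAGER)
    @Fetch(FetchMode.SUBSELECT)
    @Cache(usage = CacheConcurrencyStrategy.READ_ONLY) // collection cache
    private List<Phone> phones;
}

@Entity
@Cacheable(true)
public class Phone {

    @Id
    private long id;

    // JOIN: the owner is fetched in the same statement as the Phone
    @ManyToOne(fetch = FetchType.EAGER)
    @Fetch(FetchMode.JOIN)
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
}

@Singleton
@Startup // runs on server start / deploy
public class CacheWarmer {

    @PersistenceContext
    private EntityManager em;

    @PostConstruct
    public void warmUp() {
        // touching all entities once fills the 2nd level cache
        em.createQuery("SELECT e FROM Employee e", Employee.class).getResultList();
    }
}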