How does DISTINCT work when using JPA and Hibernate

Tags: java, distinct, jpa

What column does DISTINCT work with in JPA and is it possible to change it?

Here's an example JPA query using DISTINCT:

select DISTINCT c from Customer c 

This doesn't make a lot of sense to me: which column is the distinct based on? Is it specified on the entity as an annotation? I couldn't find one.

I would like to specify the column to make the distinction on, something like:

select DISTINCT(c.name) c from Customer c 

I'm using MySQL and Hibernate.

Steve Claridge asked Aug 28 '09 10:08



2 Answers

Depending on the underlying JPQL or Criteria API query type, DISTINCT has two meanings in JPA.

Scalar queries

For scalar queries, which return a scalar projection, like the following query:

List<Integer> publicationYears = entityManager
    .createQuery(
        "select distinct year(p.createdOn) " +
        "from Post p " +
        "order by year(p.createdOn)", Integer.class)
    .getResultList();

LOGGER.info("Publication years: {}", publicationYears);

The DISTINCT keyword should be passed to the underlying SQL statement because we want the DB engine to filter duplicates prior to returning the result set:

SELECT DISTINCT
    extract(YEAR FROM p.created_on) AS col_0_0_
FROM
    post p
ORDER BY
    extract(YEAR FROM p.created_on)

-- Publication years: [2016, 2018]

Entity queries

For entity queries, DISTINCT has a different meaning.

Without using DISTINCT, a query like the following one:

List<Post> posts = entityManager
    .createQuery(
        "select p " +
        "from Post p " +
        "left join fetch p.comments " +
        "where p.title = :title", Post.class)
    .setParameter(
        "title",
        "High-Performance Java Persistence eBook has been released!"
    )
    .getResultList();

LOGGER.info(
    "Fetched the following Post entity identifiers: {}",
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

is going to JOIN the post and the post_comment tables like this:

SELECT p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN
       post_comment pc ON p.id=pc.post_id
WHERE
       p.title='High-Performance Java Persistence eBook has been released!'

-- Fetched the following Post entity identifiers: [1, 1]

But the parent post records are duplicated in the result set for each associated post_comment row. For this reason, the List of Post entities will contain duplicate Post entity references.
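The duplication described above can be pictured with a small in-memory sketch. It does not use JPA or Hibernate at all; the `Post` class and the `joinRows` helper are hypothetical stand-ins that only mimic how a LEFT JOIN yields one row per child, so the same parent reference appears once per comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the post and post_comment tables.
// This is an illustration only, not how Hibernate is implemented.
final class JoinFetchDuplicatesSketch {

    static final class Post {
        final long id;
        final List<String> comments = new ArrayList<>();

        Post(long id) {
            this.id = id;
        }
    }

    // Mimics a LEFT JOIN: one result row per (post, comment) pair,
    // or a single row for a post that has no comments.
    static List<Post> joinRows(List<Post> posts) {
        List<Post> rows = new ArrayList<>();
        for (Post post : posts) {
            int rowCount = Math.max(1, post.comments.size());
            for (int i = 0; i < rowCount; i++) {
                rows.add(post); // the same parent reference, once per child row
            }
        }
        return rows;
    }

    // Collects the identifiers of the parents in row order.
    static List<Long> ids(List<Post> rows) {
        List<Long> ids = new ArrayList<>();
        for (Post post : rows) {
            ids.add(post.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        Post post = new Post(1L);
        post.comments.add("Excellent!");
        post.comments.add("Great read");
        System.out.println(ids(joinRows(List.of(post)))); // prints [1, 1]
    }
}
```

A post with two comments produces two rows, which matches the `[1, 1]` identifiers logged by the query above.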

To eliminate the duplicate Post entity references, we need to use DISTINCT:

List<Post> posts = entityManager
    .createQuery(
        "select distinct p " +
        "from Post p " +
        "left join fetch p.comments " +
        "where p.title = :title", Post.class)
    .setParameter(
        "title",
        "High-Performance Java Persistence eBook has been released!"
    )
    .getResultList();

LOGGER.info(
    "Fetched the following Post entity identifiers: {}",
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

But then DISTINCT is also passed to the SQL query, and that's not desirable at all:

SELECT DISTINCT
       p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN
       post_comment pc ON p.id=pc.post_id
WHERE
       p.title='High-Performance Java Persistence eBook has been released!'

-- Fetched the following Post entity identifiers: [1]

When DISTINCT is passed to the SQL query, the execution plan includes an extra Sort phase. This adds overhead without bringing any value, because the parent-child combinations are already unique thanks to the child PK column:

Unique  (cost=23.71..23.72 rows=1 width=1068) (actual time=0.131..0.132 rows=2 loops=1)
  ->  Sort  (cost=23.71..23.71 rows=1 width=1068) (actual time=0.131..0.131 rows=2 loops=1)
        Sort Key: p.id, pc.id, p.created_on, pc.post_id, pc.review
        Sort Method: quicksort  Memory: 25kB
        ->  Hash Right Join  (cost=11.76..23.70 rows=1 width=1068) (actual time=0.054..0.058 rows=2 loops=1)
              Hash Cond: (pc.post_id = p.id)
              ->  Seq Scan on post_comment pc  (cost=0.00..11.40 rows=140 width=532) (actual time=0.010..0.010 rows=2 loops=1)
              ->  Hash  (cost=11.75..11.75 rows=1 width=528) (actual time=0.027..0.027 rows=1 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 9kB
                    ->  Seq Scan on post p  (cost=0.00..11.75 rows=1 width=528) (actual time=0.017..0.018 rows=1 loops=1)
                          Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text)
                          Rows Removed by Filter: 3
Planning time: 0.227 ms
Execution time: 0.179 ms

Entity queries with HINT_PASS_DISTINCT_THROUGH

To eliminate the Sort phase from the execution plan, we need to set the Hibernate-specific HINT_PASS_DISTINCT_THROUGH query hint (defined in org.hibernate.jpa.QueryHints) to false:

List<Post> posts = entityManager
    .createQuery(
        "select distinct p " +
        "from Post p " +
        "left join fetch p.comments " +
        "where p.title = :title", Post.class)
    .setParameter(
        "title",
        "High-Performance Java Persistence eBook has been released!"
    )
    .setHint(QueryHints.HINT_PASS_DISTINCT_THROUGH, false)
    .getResultList();

LOGGER.info(
    "Fetched the following Post entity identifiers: {}",
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

And now, the SQL query will not contain DISTINCT but Post entity reference duplicates are going to be removed:

SELECT
       p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN
       post_comment pc ON p.id=pc.post_id
WHERE
       p.title='High-Performance Java Persistence eBook has been released!'

-- Fetched the following Post entity identifiers: [1]

And the Execution Plan is going to confirm that we no longer have an extra Sort phase this time:

Hash Right Join  (cost=11.76..23.70 rows=1 width=1068) (actual time=0.066..0.069 rows=2 loops=1)
  Hash Cond: (pc.post_id = p.id)
  ->  Seq Scan on post_comment pc  (cost=0.00..11.40 rows=140 width=532) (actual time=0.011..0.011 rows=2 loops=1)
  ->  Hash  (cost=11.75..11.75 rows=1 width=528) (actual time=0.041..0.041 rows=1 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Seq Scan on post p  (cost=0.00..11.75 rows=1 width=528) (actual time=0.036..0.037 rows=1 loops=1)
              Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text)
              Rows Removed by Filter: 3
Planning time: 1.184 ms
Execution time: 0.160 ms
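With the hint set to false, the duplicate removal happens in memory on the Java side rather than in the database. A minimal sketch of that idea follows, assuming duplicates are detected by object reference and the first occurrence is kept. The `distinctByReference` helper is a hypothetical name for illustration and is not Hibernate's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Illustrative only: a reference-based, order-preserving deduplication,
// roughly the kind of work done in memory when DISTINCT is not sent to SQL.
final class DistinctReferencesSketch {

    // Keeps the first occurrence of each object reference, preserving order.
    static <T> List<T> distinctByReference(List<T> rows) {
        // Identity-based set: two objects count as equal only if they are
        // the very same instance, regardless of equals()/hashCode().
        Set<T> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<T> distinct = new ArrayList<>();
        for (T row : rows) {
            if (seen.add(row)) {
                distinct.add(row);
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        Object post = new Object();
        List<Object> rows = List.of(post, post); // one parent, two child rows
        System.out.println(distinctByReference(rows).size()); // prints 1
    }
}
```

This pass is linear in the number of rows and needs no sorting, which is why skipping the SQL-level DISTINCT does not change the returned list of entities.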
Vlad Mihalcea answered Sep 22 '22 12:09


You are close.

select DISTINCT(c.name) from Customer c 
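Note that this query is a scalar projection: it returns the distinct names, not Customer entities. As a rough in-memory analogy of the result shape, here is a sketch in plain Java. The `Customer` class and the `distinctNames` helper are hypothetical, and unlike SQL DISTINCT the stream version keeps the first-seen order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative analogy only: no JPA involved, Customer is a hypothetical class.
final class DistinctNamesSketch {

    static final class Customer {
        final long id;
        final String name;

        Customer(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Projects each customer to its name and drops repeated names,
    // mirroring the shape of "select distinct c.name from Customer c".
    static List<String> distinctNames(List<Customer> customers) {
        return customers.stream()
            .map(customer -> customer.name)
            .distinct()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Customer> customers = Arrays.asList(
            new Customer(1, "Ann"),
            new Customer(2, "Bob"),
            new Customer(3, "Ann"));
        System.out.println(distinctNames(customers)); // prints [Ann, Bob]
    }
}
```

Three customers collapse to two names, so the result cannot be mapped back to a single Customer row per name.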
agelbess answered Sep 18 '22 12:09