What column does DISTINCT work with in JPA and is it possible to change it?
Here's an example JPA query using DISTINCT:
select DISTINCT c from Customer c
Which doesn't make a lot of sense - what column is the distinct based on? Is it specified on the Entity as an annotation because I couldn't find one?
I would like to specify the column to make the distinction on, something like:
select DISTINCT(c.name) c from Customer c
I'm using MySQL and Hibernate.
You can add the DISTINCT keyword to your query to tell Hibernate to return each Author entity only once. But as you can see in the following log messages, Hibernate also adds the DISTINCT keyword to the SQL query.
Using distinct in the HQL Query We can notice that the distinct keyword was not only used by Hibernate but also included in the SQL query. We should avoid this because it's unnecessary and will cause performance issues.
To avoid having JPA persist the objects automatically, drop the cascade and use persist to manually add the objects to the context immediately after creation. Since a persistence context is basically a tricked-out WeakHashMap attached to a database, these approaches are pretty similar when it comes down to it.
Depending on the underlying JPQL or Criteria API query type, DISTINCT
has two meanings in JPA.
For scalar queries, which return a scalar projection, like the following query:
List<Integer> publicationYears = entityManager .createQuery( "select distinct year(p.createdOn) " + "from Post p " + "order by year(p.createdOn)", Integer.class) .getResultList(); LOGGER.info("Publication years: {}", publicationYears);
The DISTINCT
keyword should be passed to the underlying SQL statement because we want the DB engine to filter duplicates prior to returning the result set:
SELECT DISTINCT extract(YEAR FROM p.created_on) AS col_0_0_ FROM post p ORDER BY extract(YEAR FROM p.created_on) -- Publication years: [2016, 2018]
For entity queries, DISTINCT
has a different meaning.
Without using DISTINCT
, a query like the following one:
List<Post> posts = entityManager .createQuery( "select p " + "from Post p " + "left join fetch p.comments " + "where p.title = :title", Post.class) .setParameter( "title", "High-Performance Java Persistence eBook has been released!" ) .getResultList(); LOGGER.info( "Fetched the following Post entity identifiers: {}", posts.stream().map(Post::getId).collect(Collectors.toList()) );
is going to JOIN the post
and the post_comment
tables like this:
SELECT p.id AS id1_0_0_, pc.id AS id1_1_1_, p.created_on AS created_2_0_0_, p.title AS title3_0_0_, pc.post_id AS post_id3_1_1_, pc.review AS review2_1_1_, pc.post_id AS post_id3_1_0__ FROM post p LEFT OUTER JOIN post_comment pc ON p.id=pc.post_id WHERE p.title='High-Performance Java Persistence eBook has been released!' -- Fetched the following Post entity identifiers: [1, 1]
But the parent post
records are duplicated in the result set for each associated post_comment
row. For this reason, the List
of Post
entities will contain duplicate Post
entity references.
To eliminate the Post
entity references, we need to use DISTINCT
:
List<Post> posts = entityManager .createQuery( "select distinct p " + "from Post p " + "left join fetch p.comments " + "where p.title = :title", Post.class) .setParameter( "title", "High-Performance Java Persistence eBook has been released!" ) .getResultList(); LOGGER.info( "Fetched the following Post entity identifiers: {}", posts.stream().map(Post::getId).collect(Collectors.toList()) );
But then DISTINCT
is also passed to the SQL query, and that's not desirable at all:
SELECT DISTINCT p.id AS id1_0_0_, pc.id AS id1_1_1_, p.created_on AS created_2_0_0_, p.title AS title3_0_0_, pc.post_id AS post_id3_1_1_, pc.review AS review2_1_1_, pc.post_id AS post_id3_1_0__ FROM post p LEFT OUTER JOIN post_comment pc ON p.id=pc.post_id WHERE p.title='High-Performance Java Persistence eBook has been released!' -- Fetched the following Post entity identifiers: [1]
By passing DISTINCT
to the SQL query, the EXECUTION PLAN is going to execute an extra Sort phase which adds overhead without bringing any value since the parent-child combinations always return unique records because of the child PK column:
Unique (cost=23.71..23.72 rows=1 width=1068) (actual time=0.131..0.132 rows=2 loops=1) -> Sort (cost=23.71..23.71 rows=1 width=1068) (actual time=0.131..0.131 rows=2 loops=1) Sort Key: p.id, pc.id, p.created_on, pc.post_id, pc.review Sort Method: quicksort Memory: 25kB -> Hash Right Join (cost=11.76..23.70 rows=1 width=1068) (actual time=0.054..0.058 rows=2 loops=1) Hash Cond: (pc.post_id = p.id) -> Seq Scan on post_comment pc (cost=0.00..11.40 rows=140 width=532) (actual time=0.010..0.010 rows=2 loops=1) -> Hash (cost=11.75..11.75 rows=1 width=528) (actual time=0.027..0.027 rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on post p (cost=0.00..11.75 rows=1 width=528) (actual time=0.017..0.018 rows=1 loops=1) Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text) Rows Removed by Filter: 3 Planning time: 0.227 ms Execution time: 0.179 ms
To eliminate the Sort phase from the execution plan, we need to use the HINT_PASS_DISTINCT_THROUGH
JPA query hint:
List<Post> posts = entityManager .createQuery( "select distinct p " + "from Post p " + "left join fetch p.comments " + "where p.title = :title", Post.class) .setParameter( "title", "High-Performance Java Persistence eBook has been released!" ) .setHint(QueryHints.HINT_PASS_DISTINCT_THROUGH, false) .getResultList(); LOGGER.info( "Fetched the following Post entity identifiers: {}", posts.stream().map(Post::getId).collect(Collectors.toList()) );
And now, the SQL query will not contain DISTINCT
but Post
entity reference duplicates are going to be removed:
SELECT p.id AS id1_0_0_, pc.id AS id1_1_1_, p.created_on AS created_2_0_0_, p.title AS title3_0_0_, pc.post_id AS post_id3_1_1_, pc.review AS review2_1_1_, pc.post_id AS post_id3_1_0__ FROM post p LEFT OUTER JOIN post_comment pc ON p.id=pc.post_id WHERE p.title='High-Performance Java Persistence eBook has been released!' -- Fetched the following Post entity identifiers: [1]
And the Execution Plan is going to confirm that we no longer have an extra Sort phase this time:
Hash Right Join (cost=11.76..23.70 rows=1 width=1068) (actual time=0.066..0.069 rows=2 loops=1) Hash Cond: (pc.post_id = p.id) -> Seq Scan on post_comment pc (cost=0.00..11.40 rows=140 width=532) (actual time=0.011..0.011 rows=2 loops=1) -> Hash (cost=11.75..11.75 rows=1 width=528) (actual time=0.041..0.041 rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on post p (cost=0.00..11.75 rows=1 width=528) (actual time=0.036..0.037 rows=1 loops=1) Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text) Rows Removed by Filter: 3 Planning time: 1.184 ms Execution time: 0.160 ms
You are close.
select DISTINCT(c.name) from Customer c
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With