Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgres LEFT JOIN is creating more rows than in left table

I am running Postgres 9.1.3 32-bit on Windows 7 x64. (Have to use 32 bit because there is no Windows PostGIS release compatible with 64 bit Postgres.) (EDIT: As of PostGIS 2.0, it is compatible with Postgres 64 bit on windows.)

I have a query that left joins a table (consistent.master) with a temporary table, then inserts the resulting data into a third table (consistent.masternew).

Since this is a left join, the resulting table should have the same number of rows as the left table in the query. However, if I run this:

SELECT count(*)
FROM consistent.master

I get 2085343. But if I run this:

SELECT count(*)
FROM consistent.masternew

I get 2085703.

How can masternew have more rows than master? Shouldn't masternew have the same number of rows as master, the left table in the query?

Below is the query. The master and masternew tables should be identically-structured.

--temporary table created here
--I am trying to locate where multiple tickets were written on
--a single traffic stop
WITH stops AS (
    SELECT citation_id,
           rank() OVER (ORDER BY offense_timestamp,
                     defendant_dl,
                     offense_street_number,
                     offense_street_name) AS stop
    FROM   consistent.master
    WHERE  citing_jurisdiction=1
)

--Here's the insert statement. Below you'll see it's
--pulling data from a select query
INSERT INTO consistent.masternew (arrest_id,
  citation_id,
  defendant_dl,
  defendant_dl_state,
  defendant_zip,
  defendant_race,
  defendant_sex,
  defendant_dob,
  vehicle_licenseplate,
  vehicle_licenseplate_state,
  vehicle_registration_expiration_date,
  vehicle_year,
  vehicle_make,
  vehicle_model,
  vehicle_color,
  offense_timestamp,
  offense_street_number,
  offense_street_name,
  offense_crossstreet_number,
  offense_crossstreet_name,
  offense_county,
  officer_id,
  offense_code,
  speed_alleged,
  speed_limit,
  work_zone,
  school_zone,
  offense_location,
  source,
  citing_jurisdiction,
  the_geom)

--Here's the select query that the insert statement is using.    
SELECT stops.stop,
  master.citation_id,
  defendant_dl,
  defendant_dl_state,
  defendant_zip,
  defendant_race,
  defendant_sex,
  defendant_dob,
  vehicle_licenseplate,
  vehicle_licenseplate_state,
  vehicle_registration_expiration_date,
  vehicle_year,
  vehicle_make,
  vehicle_model,
  vehicle_color,
  offense_timestamp,
  offense_street_number,
  offense_street_name,
  offense_crossstreet_number,
  offense_crossstreet_name,
  offense_county,
  officer_id,
  offense_code,
  speed_alleged,
  speed_limit,
  work_zone,
  school_zone,
  offense_location,
  source,
  citing_jurisdiction,
  the_geom
FROM consistent.master LEFT JOIN stops
ON stops.citation_id = master.citation_id

In case it matters, I have run a VACUUM FULL ANALYZE and reindexed both tables. (Not sure of exact commands; did it through pgAdmin III.)

like image 581
Aren Cambre Avatar asked Dec 07 '22 15:12

Aren Cambre


2 Answers

A left join does not necessarily have the same number of rows as the number of rows in the left table. Basically, it is like a normal join, except rows of the left table that would not appear in the normal join are also added. So, if you have more than one row in the right table that matches one row in the left table, you can have more rows in your results than the number of rows of the left table.

In order to do what you want to do, you should use a group by, and a count to detect multiples.

select citation_id
from stops join master on stops.citation_id = master.citation_id
group by citation_id
having count(*) > 1
like image 84
Rémi Avatar answered Dec 28 '22 07:12

Rémi


Sometimes you know there are multiples, but don't care. You just want to take the first or top entry.
If so, you can use SELECT DISTINCT ON:

FROM consistent.master LEFT JOIN (SELECT DISTINCT ON (citation_id) * FROM stops) s
ON s.citation_id = master.citation_id

Where citation_id is the column that you want to take the first (any) row for each match.

You might want to ensure this is deterministic and use ORDER BY with some other orderable column:

SELECT DISTINCT ON (citation_id) * FROM stops ORDER BY citation_id, created_at
like image 37
Andy Hayden Avatar answered Dec 28 '22 07:12

Andy Hayden