Duplicate rows accumulate when tables lack unique constraints or data is imported from external sources without deduplication. PostgreSQL offers three main approaches to find and remove duplicates: a self-join using DELETE USING, a ROW_NUMBER() window function subquery, and an immediate table technique. All three should be wrapped in a transaction so you can ROLLBACK if something goes wrong.
Finding Duplicate Rows First
Before deleting, identify which values are duplicated and how many copies exist:
CREATE TABLE shipments (
id SERIAL PRIMARY KEY,
tracking_no VARCHAR(50) NOT NULL,
carrier VARCHAR(50) NOT NULL,
destination VARCHAR(150) NOT NULL
);
INSERT INTO shipments (tracking_no, carrier, destination)
VALUES
('TRK-1001', 'FastFreight', 'Berlin'),
('TRK-1001', 'FastFreight', 'Berlin'), -- duplicate
('TRK-2050', 'SkyMove', 'Tokyo'),
('TRK-2050', 'SkyMove', 'Tokyo'), -- duplicate
('TRK-2050', 'SkyMove', 'Tokyo'), -- duplicate again
('TRK-3099', 'RailEx', 'Paris');
Find duplicated tracking numbers:
SELECT tracking_no, COUNT(*) AS copies
FROM shipments
GROUP BY tracking_no
HAVING COUNT(*) > 1
ORDER BY tracking_no;
tracking_no | copies
-------------+--------
TRK-1001 | 2
TRK-2050 | 3
(2 rows)
Method 1: DELETE USING (Self-Join)
Join the table to itself and delete the row with the lower ID, keeping the highest ID for each duplicate group:
DELETE FROM shipments a
USING shipments b
WHERE a.id < b.id
AND a.tracking_no = b.tracking_no
AND a.carrier = b.carrier;
Output:
DELETE 3
To keep the lowest ID instead, flip the operator:
DELETE FROM shipments a
USING shipments b
WHERE a.id > b.id
AND a.tracking_no = b.tracking_no
AND a.carrier = b.carrier;
Method 2: ROW_NUMBER() Subquery
Use ROW_NUMBER() partitioned by the duplicate columns. Rows numbered greater than 1 are duplicates:
DELETE FROM shipments
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY tracking_no, carrier
ORDER BY id
) AS row_num
FROM shipments
) ranked
WHERE row_num > 1
);
To keep the highest ID, change ORDER BY id to ORDER BY id DESC inside the window function.
For duplicates defined across multiple columns, include all of them in PARTITION BY:
DELETE FROM shipments
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY tracking_no, carrier, destination
ORDER BY id
) AS row_num
FROM shipments
) ranked
WHERE row_num > 1
);
Preview the rows that would be deleted before running the DELETE:
SELECT id, tracking_no, carrier, destination
FROM (
SELECT id, tracking_no, carrier, destination,
ROW_NUMBER() OVER (
PARTITION BY tracking_no, carrier
ORDER BY id
) AS row_num
FROM shipments
) ranked
WHERE row_num > 1;
Method 3: Immediate Table
When the table has no usable unique identifier, create a deduplicated copy:
-- 1. Create a new table with the same structure
CREATE TABLE shipments_dedup (LIKE shipments INCLUDING ALL);
-- 2. Insert only the first row per duplicate group
INSERT INTO shipments_dedup (tracking_no, carrier, destination)
SELECT DISTINCT ON (tracking_no, carrier) tracking_no, carrier, destination
FROM shipments
ORDER BY tracking_no, carrier, id;
-- 3. Rename the tables (swap them)
ALTER TABLE shipments RENAME TO shipments_old;
ALTER TABLE shipments_dedup RENAME TO shipments;
-- 4. Drop the original after verifying the result
DROP TABLE shipments_old;
Note: this method fails if other tables have foreign keys referencing the original table. Use methods 1 or 2 in that case.
Testing with Vela
Deduplication is irreversible once committed. Use Vela database branching to run the deduplication logic against a copy of your production data first. Preview the DELETE with a SELECT, execute on the branch, verify the result set, and then apply the same SQL to production. After deduplication, add a UNIQUE constraint on the relevant columns so duplicates cannot accumulate again.
Production Tips
- Always run a
SELECTwith the same deduplication logic first to confirm which row IDs will be deleted. - Wrap the DELETE in an explicit transaction so you can
ROLLBACKif the result is unexpected. - After deduplication, add a
UNIQUEconstraint to prevent future duplicates:ALTER TABLE shipments ADD CONSTRAINT shipments_tracking_carrier_unique UNIQUE (tracking_no, carrier); - The self-join method is typically faster on small-to-medium tables. The
ROW_NUMBER()subquery is more flexible for complex duplicate definitions spanning many columns. - For very large tables, chunk the delete using a CTE with
LIMITto avoid long lock contention:WITH to_delete AS ( SELECT a.id FROM shipments a JOIN shipments b ON a.tracking_no = b.tracking_no AND a.carrier = b.carrier AND a.id < b.id LIMIT 10000 ) DELETE FROM shipments WHERE id IN (SELECT id FROM to_delete); - If the immediate table method is used, recreate all indexes and constraints on the new table before renaming.