Duplicate Line Remover ยท 6 min read
How Databases Handle Duplicates: DISTINCT, GROUP BY, and UNIQUE
SQL provides multiple tools for handling duplicate data โ each designed for a different purpose. DISTINCT removes duplicate rows from query results. GROUP BY aggregates duplicate values. UNIQUE prevents duplicates from being stored in the first place.
Why Duplicates Are Such a Common Problem
Relational databases are designed around the mathematical theory of relations โ in Codd's original 1970 formulation, a relation is a set of tuples, and sets contain no duplicates. A truly relational database should not contain duplicate rows by design. In practice, however, duplicates arise constantly from:
- Multiple import processes that load the same source data more than once
- Missing or unenforced UNIQUE constraints on tables that should have them
- Merging data from multiple systems with overlapping records
- Application bugs that insert a record without first checking for existence
- Distributed systems where insert operations may be retried
SQL provides tools for both preventing duplicates (UNIQUE constraints) and handling duplicates that already exist (DISTINCT, GROUP BY).
DISTINCT: Removing Duplicates From Query Results
The SELECT DISTINCT clause removes duplicate rows from the result of a query. It returns each unique combination of the selected columns exactly once.
Example: a table of website visits has one row per visit. If the same user visited 50 times, their user_id appears 50 times. To find the list of unique visitors:
SELECT DISTINCT user_id FROM visits;
This returns each user_id once, regardless of how many times that user visited. The result is the set of all distinct user identifiers.
Important nuance: DISTINCT applies to the entire row of selected columns, not just the first one. SELECT DISTINCT first_name, last_name FROM users returns distinct combinations of first name and last name โ "John Smith" is distinct from "John Jones" even though both have the same first name.
Performance: DISTINCT requires the database to sort or hash all result rows to identify duplicates. For large result sets, this is expensive. It should not be used as a shortcut to avoid thinking about why duplicates exist โ if your query produces duplicates that DISTINCT is hiding, it usually indicates a join condition problem or a data model issue worth investigating.
GROUP BY: Aggregating Duplicate Values
GROUP BY is a more powerful deduplication tool that collapses duplicate values into groups and applies aggregate functions to each group. Rather than simply removing duplicates, GROUP BY answers "how many?", "what is the sum?", "what is the maximum?"
Example: to count how many times each user visited, rather than just listing unique visitors:
SELECT user_id, COUNT(*) as visit_count
FROM visits
GROUP BY user_id
ORDER BY visit_count DESC;
This produces one row per user_id (like DISTINCT) but adds the count of visits per user. GROUP BY is the right tool when you need to know the distribution of duplicates, not just that they exist.
Common aggregate functions used with GROUP BY:
COUNT(*)โ number of rows in each groupSUM(column)โ total value within each groupAVG(column)โ average value within each groupMAX(column)andMIN(column)โ highest and lowest valuesSTRING_AGG(column, ',')(PostgreSQL) /GROUP_CONCAT(column)(MySQL) โ concatenate values within a group
UNIQUE Constraints: Preventing Duplicates at Insert Time
The most effective way to handle duplicates is to prevent them from being stored in the first place. A UNIQUE constraint on a column (or combination of columns) tells the database to reject any INSERT or UPDATE operation that would create a duplicate value.
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
UNIQUE (email)
);
With this constraint, attempting to insert two rows with the same email address will fail with a unique constraint violation error. The database enforces uniqueness at the storage level, which is far more reliable than application-level checks.
Composite UNIQUE constraints enforce uniqueness across combinations of columns. UNIQUE (user_id, product_id) on an order_items table ensures the same user cannot have the same product appear twice in a single order โ but the same product can appear in different users' orders.
PRIMARY KEY vs. UNIQUE
A PRIMARY KEY is a special case of a UNIQUE constraint. Every table should have exactly one PRIMARY KEY โ the column (or combination) that uniquely identifies each row. PRIMARY KEY implies UNIQUE (no duplicates) and NOT NULL (the value must be present). A table can have multiple UNIQUE constraints but only one PRIMARY KEY.
Finding and Removing Existing Duplicates
When duplicates already exist in a table and need to be removed, the standard approach uses a subquery to identify which rows to keep (typically the one with the lowest or highest ID) and deletes the rest:
DELETE FROM users
WHERE id NOT IN (
SELECT MIN(id)
FROM users
GROUP BY email
);
This keeps the first-inserted row for each email address and deletes all subsequent duplicates. After removing duplicates, adding a UNIQUE constraint prevents the problem from recurring.
UPSERT: Insert or Update on Conflict
A common pattern in modern SQL is the UPSERT โ insert a row if it does not exist, update it if it does. This pattern eliminates duplicates at the application layer:
INSERT INTO users (email, name)
VALUES ('[email protected]', 'Alice')
ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;
PostgreSQL's ON CONFLICT clause (MySQL uses INSERT ... ON DUPLICATE KEY UPDATE) is the idiomatic way to handle the "insert if not exists, update if exists" pattern without a separate SELECT check first. It is atomic โ safe under concurrent access โ and eliminates the class of duplicate bugs caused by race conditions between a SELECT check and a subsequent INSERT.
References
- Date, C.J. (2003). An Introduction to Database Systems (8th ed.). Addison-Wesley.
- Celko, J. (2005). SQL for Smarties: Advanced SQL Programming (3rd ed.). Morgan Kaufmann.
- Codd, E.F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377โ387.
- PostgreSQL Global Development Group. (2023). PostgreSQL 16 Documentation. postgresql.org.
- Oracle Corporation. (2023). Oracle Database SQL Language Reference, 21c. docs.oracle.com.