GROUP BY in PostgreSQL

Learn how PostgreSQL GROUP BY groups rows by shared column values, applies aggregate functions, and powers summary reports and analytics queries.


Quick Answer

GROUP BY collapses rows that share the same column values into a single summary row. It is typically paired with aggregate functions (COUNT, SUM, AVG, MIN, MAX) to compute one value per group; without any aggregates it behaves like SELECT DISTINCT. Every non-aggregated column in the SELECT list must appear in the GROUP BY clause.


The GROUP BY clause in PostgreSQL collapses rows that share the same values in one or more columns into a single summary row. It is the foundation of aggregate reporting — without GROUP BY, aggregate functions like COUNT or SUM collapse the entire table into one row. With it, you get one row per unique grouping.

Syntax

SELECT column1, aggregate_function(column2)
FROM table_name
WHERE condition
GROUP BY column1
ORDER BY column1;

Every column in the SELECT list that is not inside an aggregate must appear in the GROUP BY clause. You can group by multiple columns, expressions, or ordinal positions.
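A minimal illustration of the rule, against a hypothetical orders table with customer_id and status columns:

```sql
-- Fails: customer_id is neither aggregated nor listed in GROUP BY
-- ERROR:  column "orders.customer_id" must appear in the GROUP BY clause
--         or be used in an aggregate function
SELECT customer_id, status, COUNT(*)
FROM orders
GROUP BY status;

-- Works: every non-aggregated SELECT column appears in GROUP BY
SELECT customer_id, status, COUNT(*)
FROM orders
GROUP BY customer_id, status;
```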

Practical Example

Create a simple sales schema and load some data:

CREATE TABLE products (
  product_id   SERIAL PRIMARY KEY,
  product_name VARCHAR NOT NULL,
  category     VARCHAR NOT NULL,
  price        NUMERIC(10, 2) NOT NULL
);

CREATE TABLE order_lines (
  line_id    SERIAL PRIMARY KEY,
  product_id INT    NOT NULL REFERENCES products,
  quantity   INT    NOT NULL,
  sold_at    DATE   NOT NULL
);

INSERT INTO products (product_name, category, price) VALUES
  ('Widget A', 'Hardware', 12.50),
  ('Widget B', 'Hardware', 8.00),
  ('Gadget X', 'Electronics', 49.99),
  ('Gadget Y', 'Electronics', 29.99),
  ('Service Pack', 'Services', 99.00);

INSERT INTO order_lines (product_id, quantity, sold_at) VALUES
  (1, 3, '2025-01-05'), (2, 10, '2025-01-07'),
  (3, 1, '2025-01-10'), (4, 2, '2025-01-15'),
  (1, 5, '2025-02-03'), (3, 4, '2025-02-18'),
  (5, 1, '2025-02-20'), (2, 7, '2025-03-01');

Count units sold per product category:

SELECT
  p.category,
  COUNT(ol.line_id)      AS order_count,
  SUM(ol.quantity)       AS units_sold,
  ROUND(AVG(p.price), 2) AS avg_price
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id
GROUP BY p.category
ORDER BY units_sold DESC;

Expected result:

  category   | order_count | units_sold | avg_price
-------------+-------------+------------+-----------
 Hardware    |           4 |         25 |     10.25
 Electronics |           3 |          7 |     43.32
 Services    |           1 |          1 |     99.00

Note that AVG(p.price) averages over the joined order lines, not over distinct products, so products that sell more often weigh more heavily — Electronics averages 43.32 rather than the 39.99 you would get from its two list prices.

Group by month and category together to get a time-series breakdown:

SELECT
  date_trunc('month', ol.sold_at) AS month,
  p.category,
  SUM(ol.quantity * p.price)      AS revenue
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id
GROUP BY 1, 2
ORDER BY 1, revenue DESC;

Using the ordinal positions 1, 2 is a PostgreSQL convenience — equivalent to listing the full expressions.
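The same query with the grouping keys spelled out, which many teams prefer because the intent survives later edits to the SELECT list:

```sql
SELECT
  date_trunc('month', ol.sold_at) AS month,
  p.category,
  SUM(ol.quantity * p.price)      AS revenue
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id
GROUP BY date_trunc('month', ol.sold_at), p.category
ORDER BY month, revenue DESC;
```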

Cross-check the grouping with a LEFT JOIN, so categories whose products have no sales still appear with a zero order count:

SELECT
  p.category,
  COUNT(DISTINCT p.product_id) AS distinct_products,
  COUNT(ol.line_id)            AS total_orders
FROM products p
LEFT JOIN order_lines ol ON ol.product_id = p.product_id
GROUP BY p.category;

Aggregate Functions Reference

 Function             | Returns
----------------------+--------------------------------------
 COUNT(*)             | Total rows in the group
 COUNT(col)           | Non-NULL values in the column
 COUNT(DISTINCT col)  | Unique non-NULL values
 SUM(col)             | Total of all values
 AVG(col)             | Mean value
 MIN(col) / MAX(col)  | Smallest / largest value
 STRING_AGG(col, sep) | Concatenated strings with separator
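To see STRING_AGG and MIN/MAX together on the sample schema, the following lists each category's products alongside its price range (the ORDER BY inside the aggregate controls the concatenation order):

```sql
SELECT
  category,
  STRING_AGG(product_name, ', ' ORDER BY product_name) AS products,
  MIN(price) AS cheapest,
  MAX(price) AS priciest
FROM products
GROUP BY category
ORDER BY category;
```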

Testing with Vela

Before deploying GROUP BY-heavy reporting queries to production, use Vela’s database branching to benchmark them on a production-data clone. Create a branch from the live database, run EXPLAIN ANALYZE on the GROUP BY query, and experiment with index changes — all without impacting the production workload. This is especially useful when introducing GROUPING SETS, ROLLUP, or materialized views to replace a slow GROUP BY query.
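A benchmarking session on such a branch might look like the sketch below; the index name is illustrative, and whether the index helps depends on the plan you see:

```sql
-- Inspect the current plan and timing for the category report
EXPLAIN ANALYZE
SELECT p.category, SUM(ol.quantity) AS units_sold
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id
GROUP BY p.category;

-- Try an index on the join path, then re-run EXPLAIN ANALYZE to compare
CREATE INDEX idx_order_lines_product_id ON order_lines (product_id);
```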

Production Tips

  • Index the columns used in GROUP BY when the query runs on a reporting schedule — a B-tree index on the grouping column can turn a sequential scan into an index scan.
  • Place selective filters in WHERE rather than HAVING when possible; WHERE reduces the row count before grouping, letting the planner use indexes. HAVING is evaluated after aggregation and cannot use standard indexes.
  • COUNT(DISTINCT col) is slower than COUNT(*) on large groups because it must hash or sort within each group. For high-frequency dashboards, consider a pre-aggregated table or a materialized view with a unique index.
  • Use GROUPING SETS, ROLLUP, or CUBE to compute multiple levels of aggregation in a single table scan instead of writing several UNION ALL queries.
  • When using GROUP BY with a JOIN, filter the joined table in the WHERE clause before grouping to avoid aggregating unnecessary rows.
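As a sketch of the GROUPING SETS family on the sample schema, ROLLUP adds per-category subtotals and a grand total in a single scan:

```sql
SELECT
  p.category,
  SUM(ol.quantity * p.price) AS revenue
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id
GROUP BY ROLLUP (p.category)
ORDER BY revenue DESC;
-- The row where category IS NULL is the grand total across all categories
```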


Frequently Asked Questions

What does GROUP BY do in PostgreSQL?
GROUP BY divides a table's rows into groups based on equal values in the specified columns, then applies aggregate functions to compute one result per group. The query returns one output row per unique combination of grouped values.
Does GROUP BY lock the table during execution?
No. GROUP BY is a read-only operation and takes only an ACCESS SHARE lock, which does not block other readers or writers. Long-running GROUP BY queries on large tables can, however, consume significant I/O and CPU, which may affect overall throughput on busy systems.
What happens if I reference a non-grouped column in SELECT with GROUP BY?
PostgreSQL raises an error: 'column must appear in the GROUP BY clause or be used in an aggregate function.' Every column in the SELECT list must either be in GROUP BY or wrapped in an aggregate. Exception: if you group by a primary key, PostgreSQL allows other columns from the same table in SELECT since the key uniquely identifies each row.
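For instance, grouping by the primary key of products lets the query select product_name without listing it in GROUP BY:

```sql
-- Allowed: product_id is the primary key of products, so product_name
-- is functionally dependent on it
SELECT p.product_id, p.product_name, SUM(ol.quantity) AS units_sold
FROM products p
JOIN order_lines ol ON ol.product_id = p.product_id
GROUP BY p.product_id;
```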
Can I use expressions or aliases in GROUP BY?
You can GROUP BY any expression, including function calls like date_trunc('month', created_at). PostgreSQL also accepts ordinal position references (GROUP BY 1 groups by the first SELECT column) and, as an extension to the SQL standard, output-column aliases — so GROUP BY month works when month is defined in the SELECT list. If an alias collides with an input column name, GROUP BY resolves the name to the input column.
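For example, grouping by an expression and, equivalently, by its ordinal position:

```sql
-- Group by the full expression
SELECT date_trunc('month', sold_at) AS month, SUM(quantity) AS units
FROM order_lines
GROUP BY date_trunc('month', sold_at);

-- Group by ordinal position 1 (the first SELECT column)
SELECT date_trunc('month', sold_at) AS month, SUM(quantity) AS units
FROM order_lines
GROUP BY 1;
```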
What is the safest way to use GROUP BY in production on large tables?
Index the GROUP BY columns if the query runs frequently. Use WHERE to reduce rows before grouping rather than filtering groups with HAVING, since WHERE can exploit indexes. For heavy reporting workloads, consider a materialized view refreshed on a schedule rather than scanning the base table on every request.