Low Cardinality in PostgreSQL

março 10th, 2026 Bianchi Posted in PostgreSQL No Comments »

The Compression Hero That Quietly Kills Your Indexes

Low cardinality is one of those database traits that feels like it should be “good by default”. A column with very few distinct values is easy to reason about, it makes analytics simpler, it often compresses extremely well, and it usually matches the way the business talks about data: status, tier, is_active, region_code, plan, role.

And then you create an index on it… and PostgreSQL still chooses a sequential scan. That’s not PostgreSQL being stubborn. That’s PostgreSQL being honest about selectivity and I/O economics.

Why low cardinality is amazing for compression

Compression thrives on repetition. When the same values repeat across many rows, encodings such as dictionary-based representations and run-length encoding become very effective, reducing the overall storage footprint. Smaller data increases cache residency, reduces I/O pressure, and makes large scans cheaper. This is the halo: low cardinality can be a real win for data volume, memory efficiency, and scan-heavy workloads.

But compression is about bytes on disk and in memory. Index usefulness is about how many rows you’re going to fetch and how scattered those row fetches will be.

Those are related, but they are not the same fight.

Selectivity is the planner’s reality check

Selectivity is basically “what fraction of the table will match this predicate”.

If status = 'shipped' matches 90% of rows, an index is often a bad deal. Even if the index can locate the matching tuples, it still leads PostgreSQL to visit a huge portion of the heap. That becomes a storm of random page fetches, visibility checks, and CPU overhead. At some point it’s cheaper to just stream through the table once in physical order with a sequential scan (or a parallel sequential scan) and be done.

This is the core horn: low cardinality frequently implies low selectivity for common values. Your instinct says “index it”, but the math says “you’re still touching most of the table”.

The hidden tax of “obvious” low-cardinality indexes

A B-tree index stores an entry per row, regardless of how frequently a value occurs. So even when the planner ignores that index for your most common queries, you still pay for it on every insert and update: extra WAL, extra CPU, extra vacuum work, more memory pressure, and more places for bloat to accumulate.

Low-cardinality indexes can still be excellent in two situations. The first is when you query rare values, and those values are truly selective. The second is when you combine the low-cardinality predicate with another condition that dramatically reduces the candidate set, so the access path becomes narrow and cheap. The key is that low-cardinality access paths rarely win as standalone access paths in the common case.

A “before vs after” benchmark you can run, with track_io_timing

This demo is built to tell the truth with evidence: same dataset, same queries, different index strategies. We will capture not only execution time and buffer usage, but also I/O time using track_io_timing, so you can see when “random reads” are the real villain.

Run this in a disposable environment or staging. It loads 2 million rows.

Step 1: Create a skewed low-cardinality dataset

\pset pager off
\timing on

DROP SCHEMA IF EXISTS lc_bench CASCADE;
CREATE SCHEMA lc_bench;
SET search_path = lc_bench, public;

CREATE TABLE orders_demo (
  id          bigserial PRIMARY KEY,
  tenant_id   int NOT NULL,
  created_at  timestamptz NOT NULL,
  status      text NOT NULL,
  payload     text
);

-- 2 million rows with skew:
-- 90% shipped, 9% paid, 1% canceled
INSERT INTO orders_demo (tenant_id, created_at, status, payload)
SELECT
  (random() * 9999)::int + 1,
  now() - (g * interval '1 second'),
  CASE
    WHEN random() < 0.90 THEN 'shipped'
    WHEN random() < 0.99 THEN 'paid'
    ELSE 'canceled'
  END,
  md5(g::text || '-' || random()::text)
FROM generate_series(1, 2000000) AS g;

ANALYZE orders_demo;

SELECT status, count(*) AS rows
FROM orders_demo
GROUP BY status
ORDER BY rows DESC;

The skew is intentional. Low cardinality becomes operationally painful when one value is extremely common, because that’s where selectivity collapses.

Step 2: Enable and verify track_io_timing

track_io_timing that PostgreSQL measures time spent in read/write syscalls and expose those numbers in EXPLAIN (ANALYZE, BUFFERS) output. It adds overhead, which is why you usually keep it off until you’re diagnosing I/O behavior.

First, verify its current state:

SHOW track_io_timing;

If you’re allowed to enable it for your session:

SET track_io_timing = on;
SHOW track_io_timing;

If that fails due to permissions, you need to enable it in your server configuration (or parameter group) and reload, then confirm again with SHOW track_io_timing;. Don’t skip the confirmation, because the entire “I/O time” part of this story depends on it.

To keep runs comparable while you iterate, you can also reduce variability:

SET jit = off;
SET max_parallel_workers_per_gather = 0;

That isn’t about making the benchmark “faster”; it’s about making the before/after contrast easier to interpret.

Step 3: Capture results automatically into a table

Instead of copying/pasting EXPLAIN output, we’ll store metrics extracted from EXPLAIN (FORMAT JSON) it into a results table. With track_io_timing enabled, the JSON will include I/O Read Time and I/O Write Time.

DROP TABLE IF EXISTS bench_results;

CREATE TABLE bench_results (
  variant          text NOT NULL,
  query_name       text NOT NULL,
  execution_ms     numeric NOT NULL,
  planning_ms      numeric NOT NULL,
  shared_hit       bigint NOT NULL,
  shared_read      bigint NOT NULL,
  io_read_ms       numeric NOT NULL,
  io_write_ms      numeric NOT NULL,
  captured_at      timestamptz NOT NULL DEFAULT now(),
  plan             jsonb NOT NULL
);

CREATE OR REPLACE FUNCTION capture_explain(variant text, query_name text, sql_text text)
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
  j jsonb;
  p jsonb;
BEGIN
  EXECUTE format('EXPLAIN (ANALYZE, BUFFERS, TIMING, SUMMARY, FORMAT JSON) %s', sql_text) INTO j;
  j := j->0;
  p := j->'Plan';

  INSERT INTO bench_results
    (variant, query_name, execution_ms, planning_ms, shared_hit, shared_read, io_read_ms, io_write_ms, plan)
  VALUES
    (
      variant,
      query_name,
      COALESCE((j->>'Execution Time')::numeric, 0),
      COALESCE((j->>'Planning Time')::numeric, 0),
      COALESCE((p->>'Shared Hit Blocks')::bigint, 0),
      COALESCE((p->>'Shared Read Blocks')::bigint, 0),
      COALESCE((p->>'I/O Read Time')::numeric, 0),
      COALESCE((p->>'I/O Write Time')::numeric, 0),
      j
    );
END;
$$;

A quick note for rigor: these numbers are taken from the top plan node, which is typically sufficient for this workload to tell the story clearly. If you later want node-by-node aggregation across the whole plan tree, it’s doable, but it’s not necessary to demonstrate the low-cardinality effect.

Step 4: Define the workload we’ll compare

We’ll run three queries. One is the “common value” trap. One is the “rare value” case where a low-cardinality index can actually shine. The last is the real-world pattern that many teams actually care about: status, a recent time window, and ordering.

SELECT capture_explain(
  '00_before_indexes',
  'A_common_count_shipped',
  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'shipped'$$
);

SELECT capture_explain(
  '00_before_indexes',
  'B_rare_count_canceled',
  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'canceled'$$
);

SELECT capture_explain(
  '00_before_indexes',
  'C_shipped_recent_topN',
  $$SELECT id, tenant_id, created_at
    FROM lc_bench.orders_demo
    WHERE status = 'shipped'
      AND created_at >= now() - interval '10 minutes'
    ORDER BY created_at DESC
    LIMIT 200$$
);

If you want a cleaner signal, run the same three captures a second time under a variant like 00_before_indexes_run2because the first execution incurs some warmup costs.

Step 5: “After 1” — the naïve index on the low-cardinality column

This is the index most people create first.

CREATE INDEX idx_orders_demo_status ON orders_demo(status);
ANALYZE orders_demo;

SELECT capture_explain(
  '01_status_index',
  'A_common_count_shipped',
  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'shipped'$$
);

SELECT capture_explain(
  '01_status_index',
  'B_rare_count_canceled',
  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'canceled'$$
);

SELECT capture_explain(
  '01_status_index',
  'C_shipped_recent_topN',
  $$SELECT id, tenant_id, created_at
    FROM lc_bench.orders_demo
    WHERE status = 'shipped'
      AND created_at >= now() - interval '10 minutes'
    ORDER BY created_at DESC
    LIMIT 200$$
);

What you will typically observe is that the rare-value query gets meaningfully faster, often via an index scan or bitmap path, while the common-value query remains a sequential scan. That isn’t a failure; it’s the planner refusing to pay random I/O overhead to fetch most of the table through an index. If your storage is cold, this is exactly where io_read_ms makes the story obvious: the “wrong” access path is the one that makes the storage subsystem do scattered work.

Step 6: “After 2” — index the question, not the label

Now you stop trying to use status it as the access path by itself. You align indexes with how you actually ask for data.

CREATE INDEX idx_orders_demo_status_created_at
ON orders_demo (status, created_at DESC);

CREATE INDEX idx_orders_demo_canceled_recent
ON orders_demo (created_at DESC)
WHERE status = 'canceled';

ANALYZE orders_demo;

SELECT capture_explain(
  '02_composite_plus_partial',
  'A_common_count_shipped',
  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'shipped'$$
);

SELECT capture_explain(
  '02_composite_plus_partial',
  'B_rare_count_canceled',
  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'canceled'$$
);

SELECT capture_explain(
  '02_composite_plus_partial',
  'C_shipped_recent_topN',
  $$SELECT id, tenant_id, created_at
    FROM lc_bench.orders_demo
    WHERE status = 'shipped'
      AND created_at >= now() - interval '10 minutes'
    ORDER BY created_at DESC
    LIMIT 200$$
);

This is where the “realistic query” usually changes dramatically. Instead of searching a massive status='shipped' population and then sorting, PostgreSQL can walk a tight slice of (status, created_at) in the correct order and stop early because of the LIMIT. In many environments, you’ll see a large drop in shared_read and a very visible drop in io_read_ms, because the engine stops thrashing the heap and starts reading far fewer pages.

Meanwhile, the “common value count” query often remains a sequential scan, and you should treat that as the correct outcome. Counting nearly everything is almost always best for streaming operations.

Step 7: Print a clean “before vs after” table

SELECT
  query_name,
  variant,
  round(execution_ms, 3) AS execution_ms,
  round(planning_ms, 3)  AS planning_ms,
  shared_hit,
  shared_read,
  round(io_read_ms, 3)   AS io_read_ms,
  round(io_write_ms, 3)  AS io_write_ms
FROM bench_results
ORDER BY query_name, variant, captured_at;

How to address low-cardinality problems in real systems

The best mental shift is to stop thinking “low cardinality means index” and start thinking “low cardinality means the optimizer will be selective only sometimes”. You want to build access paths for the selective parts of the workload and avoid paying a write tax for an index that helps rarely or only helps rare values.

When the application frequently queries a common status only for recent data, a composite index that includes the time dimension is often the cleanest fix, as it restores selectivity over time. When the workload cares about rare statuses, partial indexes let you focus index maintenance on the slice that actually benefits. When the “real” filter is a tenant, user, account, or another high-cardinality attribute, that should usually lead the index, with the low-cardinality attribute following to keep scans narrow and ordered.

If you’re seeing PostgreSQL pick the “wrong” plan because it misestimates skew or correlation, improve statistics so the planner stops guessing. Raising per-column statistics targets can help with heavily skewed data, and extended statistics can help when two columns interact (for example, a tenant where one status dominates far more than the global distribution suggests). None of this changes the fundamental economics of low selectivity, but it prevents the planner from hallucinating selectivity that doesn’t exist.

Conclusion: low cardinality is a feature, not a free lunch

Low cardinality delivers real wins in compression and cache efficiency, and it often makes datasets easier to store and scan. That’s the halo.

The horn is that it seduces you into building indexes that look semantically correct but are mechanically expensive. A low-cardinality index can become the definition of “pay every day, benefit once a month”, especially when the most common value is what you query most.

The mature approach is to measure, not assume. Use EXPLAIN (ANALYZE, BUFFERS) and, when you’re diagnosing I/O behavior, enable track_io_timing intentionally and verify that it’s on. Keep in mind that it track_io_timing adds overhead, so treat it like a diagnostic instrument, not a default setting. Then design indexes that match your query shape, not your schema labels: composite indexes that restore selectivity, partial indexes that target rare slices, and statistics that keep the planner honest.

AddThis Social Bookmark Button

PostgreSQL Partitioning Automation with pg_partman and pg_cron

junho 29th, 2023 Bianchi Posted in PostgreSQL No Comments »

I recently started a new project at work that demands a log table with a jsonb column. I don’t like to have JSON columns on tables as that sounds like you don’t want to make all that information stored on a JSON document well-normalised (more relations, more columns, etc.) More about JSON Data Type support on PostgreSQL.

Based on that project, we started looking at whether the Table Partitioning on PostgreSQL would make sense due to a log table expected to grow very large once the application starts using it. The PostgreSQL version 13.7 (I know, OK, it’s old and needs a major upgrade 😎 ); it is running on AWS RDS.

The Table Partitioning feature can be thought of as a source for queries to improve performance and as a way to mitigate issues related to data vacuuming, better indexing and some other advantages. The table itself is a simple one, but the data would be a big problem over time and thinking about that, we needed not just partitioning but also a way to automate the partitioning removal and the creation of new ones.

Executing online schema changes on PostgreSQL is still an issue, as no rock-solid tool on the market will execute that for big tables without downtime or considerable overhead. So, we need to start with a partitioned table.

🔥 I could not test the pg-osc due to time restrictions, as I said some days ago when I shared the post with my LinkedIn network. (check my post on LinkedIn).

We need to start with a partitioned table and an automated way to manage partitions along the way; operations like creating new partitions over time and removing old partitions as a subset of the data aren’t needed anymore is a desire. Also, I didn’t want to write that myself in case we already have a consolidated tool available. In the PostgreSQL world, we know, it is always an extension available to perform the job, which is not different in this case.

🎯 The database and table I am going to use here are examples, as, naturally, we can’t publish official material 😉

Table Partitioning Background

PostgreSQL table partitioning is a feature that allows you to divide a large table into smaller ones, more manageable pieces called partitions. Each partition is essentially a separate table that stores a subset of the data based on a specified partitioning strategy or partitioning method. Partitioning can provide significant performance benefits by improving query execution time and simplifying data maintenance.

💡Below, I leave you with a quick but functional, manual declarative partitioning example for labs/dev – not production:

-- partitioned or parent table
drop table if exists log;
create table if not exists log (
  log_entry json,
  dt_entry timestamp,
  primary key(dt_entry)
) partition by range(dt_entry);

-- partitions or child tables
create table if not exists log_part_202307 partition of log
for values from ('2023-07-01 00:00:00') to ('2023-07-31 00:00:00');

create table if not exists log_part_202308 partition of log
for values from ('2023-08-01 00:00:00') to ('2023-08-31 00:00:00');
-- end

🎯 Overall, the benefits of Table Partitioning are below (but not limited to):

  1. Improved Query Performance: Partitioning allows for faster query execution by reducing the amount of data that needs to be scanned. When a query is executed, PostgreSQL’s query planner can eliminate irrelevant partitions based on the query conditions, resulting in more targeted data retrieval and improved performance;
  2. Efficient Data Maintenance: Partitioning simplifies data management tasks, such as archiving or removing old data. Instead of performing these operations on the entire table, you can target specific partitions or child tables, which reduces the time and resources required for data maintenance like when running a manual VACUMM process;
  3. Enhanced Data Loading and Indexing: Partitioning can speed up data loading processes. When inserting new data, PostgreSQL can distribute it across multiple partitions simultaneously, utilising parallelism and improving the overall loading performance. Additionally, partitioning allows for indexing individual partitions, enabling more efficient indexing and enhancing query performance;
  4. Easy Data Retention and Purging: You can easily implement data retention and purging strategies by partitioning data based on time ranges or other relevant criteria. For example, you can create monthly partitions and set up a process to automatically drop or archive older partitions, thus managing data growth and maintaining optimal database performance;
  5. Flexibility in Storage and Indexing: PostgreSQL allows you to store different partitions on separate tablespaces, which enables the use of different storage technologies or configurations based on the specific needs of each partition. Additionally, you can create different indexes on each partition, tailoring indexing strategies to optimise query performance for specific subsets of the data;
  6. Improved Scalability: Partitioning enables horizontal scalability by distributing data across multiple physical disks or servers. By leveraging partitioning alongside techniques like sharding or replication, you can handle larger datasets and achieve higher levels of performance and scalability;
  7. Easier Data Analysis: Partitioning can facilitate data analysis and reporting tasks. By dividing data into logical subsets, you can focus analysis on specific partitions, enhancing query performance and simplifying analytical workflows.

🔥 Most of the above topics pave the way for new blog entries, like improved scalability when creating child tables or partitions over different tablespaces. PostgreSQL table partitioning significantly benefits query performance, data maintenance, scalability, and data analysis. Partitioning optimises database operations by efficiently organising and managing data, improving your PostgreSQL’s overall efficiency and performance.

🎧 A great resource I consumed many times is Postgres.FM episode on Partitioning, I suggest you to listening to: https://postgres.fm/episodes/partitioning.

Enter the pg_partman, a PostgreSQL Extension

The pg_partman extension is available on AWS RDS for PostgreSQL from version 12.5 and newer. When working on the RDS, you don’t need to configure the shared_preload_libraries to add the pg_partman extension, as that is already available for the DBA to issue the CREATE EXTENSION command, using, e.g., the psql client.

For the complete solution presented in this blog, in case you want to follow up and build your own lab, you must also have the pg_cron extension added to the shared_preload_libraries on the parameters group, and a cluster restart is required afterwards.

The parent table must have the PARTITIONED BY with the partition method, you want to apply, and the partman schema should be on the same database as the schema where you have your tables (it can be public as well).

📚 First of all, let’s get the pg_cron extension setup (and restart the cluster):

postgres=> CREATE EXTENSION pg_cron;
CREATE EXTENSION

postgres=> select oid, extname from pg_catalog.pg_extension;
  oid  |  extname
-------+------------
 14287 | plpgsql
 16403 | pg_partman
 17786 | pg_cron
(3 rows)

Let’s start setting up the pg_parttman, connect to your database, create a new schema we will name “partman” (very creative) and install the pg_partman extension.

postgres=> \c bianchi_db
You are now connected to database "bianchi_db" as user "postgres".

bianchi_db=> create schema if not exists partman;
CREATE SCHEMA

bianchi_db=> create extension pg_partman with schema partman;
CREATE EXTENSION

bianchi_db=> select oid, extname from pg_catalog.pg_extension;
  oid  |  extname
-------+------------
 14287 | plpgsql
 18949 | pg_partman
(2 rows)

We already have a log table, but let’s recreate it to avoid any questions regarding this part of the procedure. We will also partition the log table by the dt_entry column calling the partman.create_parent() function.

bianchi_db=> drop table if exists log;
create table if not exists log (
    log_entry json,
    dt_entry timestamp,
    primary key(dt_entry)
) partition by range(dt_entry);
DROP TABLE
CREATE TABLE

bianchi_db=> SELECT partman.create_parent(p_parent_table => 'public.log',
 p_control => 'dt_entry',
 p_type => 'native',
 p_interval=> 'monthly',
 p_premake => 6
);
 create_parent
---------------
 t
(1 row)

The above has an interesting effect on partitioning our log table, as, the p_control points to the column of the table we want to use as a partition key, the p_type is configured to use native partitioning, p_interval we configured it as monthly and requested it to create 6 partitions ahead. If we list our tables now, we can see the following:

bianchi_db=> \d
                  List of relations
 Schema |     Name     |       Type        |  Owner
--------+--------------+-------------------+----------
 public | log          | partitioned table | postgres
 public | log_default  | table             | postgres
 public | log_p2022_12 | table             | postgres
 public | log_p2023_01 | table             | postgres
 public | log_p2023_02 | table             | postgres
 public | log_p2023_03 | table             | postgres
 public | log_p2023_04 | table             | postgres
 public | log_p2023_05 | table             | postgres
 public | log_p2023_06 | table             | postgres
 public | log_p2023_07 | table             | postgres
 public | log_p2023_08 | table             | postgres
 public | log_p2023_09 | table             | postgres
 public | log_p2023_10 | table             | postgres
 public | log_p2023_11 | table             | postgres
 public | log_p2023_12 | table             | postgres
(15 rows)

If you use the \d+, you can see the table’s metadata and the partitions attached. Unfortunately, the results of that command will not be OK to show here on the blog, but you can try it on your side to see not only the partitions but also the boundaries of the partitions.

The Partition Management Automation

As we have everything we need regarding partitioning and partman configuration in place, we now need to schedule a @daily check with the pg_cron help so the partman.run_maintenance_proc() can be called automatically.

postgres=> UPDATE partman.part_config
SET infinite_time_partitions = true,
    retention = '3 months',
    retention_keep_table=false
WHERE parent_table = 'public.log';
SELECT cron.schedule('@daily', $$CALL partman.run_maintenance_proc()$$);
UPDATE 0

If you configure the cron.schedule with @hourly, instead of configuring it with @daily. You can see an example of that being triggered below:

postgres=> \x
Expanded display is on.

postgres=> select command, status, start_time, end_time from cron.job_run_details order by start_time desc limit 5;
-[ RECORD 1 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 18:00:00.076273+00
end_time   | 2023-06-29 18:00:00.996104+00
-[ RECORD 2 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 17:00:00.104716+00
end_time   | 2023-06-29 17:00:00.644784+00
-[ RECORD 3 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 16:00:00.466483+00
end_time   | 2023-06-29 16:00:00.98461+00
-[ RECORD 4 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 15:00:00.213028+00
end_time   | 2023-06-29 15:00:00.421355+00
-[ RECORD 5 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 14:00:00.147603+00
end_time   | 2023-06-29 14:00:00.405463+00

🟢 That’s it. You can check this repository on GitHub for more information about the pg_partman.

References:

https://github.com/pgpartman/pg_partman

https://github.com/citusdata/pg_cron

AddThis Social Bookmark Button