{"id":1645,"date":"2026-03-10T14:51:22","date_gmt":"2026-03-10T17:51:22","guid":{"rendered":"http:\/\/wagnerbianchi.com\/blog\/?p=1645"},"modified":"2026-03-10T14:51:31","modified_gmt":"2026-03-10T17:51:31","slug":"low-cardinality-in-postgresql","status":"publish","type":"post","link":"http:\/\/wagnerbianchi.com\/blog\/?p=1645","title":{"rendered":"Low Cardinality in PostgreSQL"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><a href=\"http:\/\/wagnerbianchi.com\/blog\/wp-content\/uploads\/2026\/02\/6fcb35cb-4a7d-459e-8860-b098310fa87b-1.png\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"683\" src=\"http:\/\/wagnerbianchi.com\/blog\/wp-content\/uploads\/2026\/02\/6fcb35cb-4a7d-459e-8860-b098310fa87b-1-1024x683.png\" alt=\"\" class=\"wp-image-1651\" srcset=\"http:\/\/wagnerbianchi.com\/blog\/wp-content\/uploads\/2026\/02\/6fcb35cb-4a7d-459e-8860-b098310fa87b-1-1024x683.png 1024w, http:\/\/wagnerbianchi.com\/blog\/wp-content\/uploads\/2026\/02\/6fcb35cb-4a7d-459e-8860-b098310fa87b-1-300x200.png 300w, http:\/\/wagnerbianchi.com\/blog\/wp-content\/uploads\/2026\/02\/6fcb35cb-4a7d-459e-8860-b098310fa87b-1-768x512.png 768w, http:\/\/wagnerbianchi.com\/blog\/wp-content\/uploads\/2026\/02\/6fcb35cb-4a7d-459e-8860-b098310fa87b-1.png 1536w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<h2 class=\"wp-block-heading has-medium-font-size\">The Compression Hero That Quietly Kills Your Indexes<\/h2>\n\n\n\n<p>Low cardinality is one of those database traits that feels like it should be \u201cgood by default\u201d. 
A column with very few distinct values is easy to reason about, it makes analytics simpler, it often compresses extremely well, and it usually matches the way the business talks about data: <code>status<\/code>, <code>tier<\/code>, <code>is_active<\/code>, <code>region_code<\/code>, <code>plan<\/code>, <code>role<\/code>.<\/p>\n\n\n\n<p>And then you create an index on it\u2026 and PostgreSQL still chooses a sequential scan. That\u2019s not PostgreSQL being stubborn. That\u2019s PostgreSQL being honest about selectivity and I\/O economics.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\">Why low cardinality is amazing for compression<\/h3>\n\n\n\n<p>Compression thrives on repetition. When the same values repeat across many rows, encodings such as dictionary-based representations and run-length encoding become very effective, reducing the overall storage footprint. Smaller data increases cache residency, reduces I\/O pressure, and makes large scans cheaper. This is the halo: low cardinality can be a real win for data volume, memory efficiency, and scan-heavy workloads.<\/p>\n\n\n\n<p>But compression is about bytes on disk and in memory. Index usefulness is about how many rows you\u2019re going to fetch and how scattered those row fetches will be.<\/p>\n\n\n\n<p>Those are related, but they are not the same fight.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\">Selectivity is the planner\u2019s reality check<\/h3>\n\n\n\n<p>Selectivity is basically \u201cwhat fraction of the table will match this predicate\u201d.<\/p>\n\n\n\n<p>If <code>status = 'shipped'<\/code> matches 90% of rows, an index is often a bad deal. Even if the index can locate the matching tuples, it still leads PostgreSQL to visit a huge portion of the heap. That becomes a storm of random page fetches, visibility checks, and CPU overhead. 
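<\/p>\n\n\n\n<p>You can see the distribution the planner is reasoning about directly. As a quick sketch (assuming the <code>orders_demo<\/code> benchmark table built below, after <code>ANALYZE<\/code> has run), <code>pg_stats<\/code> exposes the estimated number of distinct values and the frequencies of the most common ones:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>-- Planner-visible distribution for the low-cardinality column\nSELECT attname, n_distinct, most_common_vals, most_common_freqs\nFROM pg_stats\nWHERE schemaname = 'lc_bench'\n  AND tablename  = 'orders_demo'\n  AND attname    = 'status';<\/code><\/pre>\n\n\n\n<p>A frequency near <code>0.90<\/code> for <code>'shipped'<\/code> is the planner\u2019s way of saying that an index lookup would still touch most of the heap.<\/p>\n\n\n\n<p>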
At some point it\u2019s cheaper to just stream through the table once in physical order with a sequential scan (or a parallel sequential scan) and be done.<\/p>\n\n\n\n<p>This is the core horn: low cardinality frequently implies low selectivity for common values. Your instinct says \u201cindex it\u201d, but the math says \u201cyou\u2019re still touching most of the table\u201d.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\">The hidden tax of \u201cobvious\u201d low-cardinality indexes<\/h3>\n\n\n\n<p>A B-tree index stores an entry per row, regardless of how frequently a value occurs. So even when the planner ignores that index for your most common queries, you still pay for it on every insert and update: extra WAL, extra CPU, extra vacuum work, more memory pressure, and more places for bloat to accumulate.<\/p>\n\n\n\n<p>Low-cardinality indexes can still be excellent in two situations. The first is when you query rare values, and those values are truly selective. The second is when you combine the low-cardinality predicate with another condition that dramatically reduces the candidate set, so the access path becomes narrow and cheap. The key is that low-cardinality access paths rarely win as standalone access paths in the common case.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\">A \u201cbefore vs after\u201d benchmark you can run, with <code>track_io_timing<\/code><\/h3>\n\n\n\n<p>This demo is built to tell the truth with evidence: same dataset, same queries, different index strategies. We will capture not only execution time and buffer usage, but also I\/O time using <code>track_io_timing<\/code>, so you can see when \u201crandom reads\u201d are the real villain.<\/p>\n\n\n\n<p>Run this in a disposable environment or staging. 
It loads 2 million rows.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-small-font-size\">Step 1: Create a skewed low-cardinality dataset<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>\\pset pager off\n\\timing on\n\nDROP SCHEMA IF EXISTS lc_bench CASCADE;\nCREATE SCHEMA lc_bench;\nSET search_path = lc_bench, public;\n\nCREATE TABLE orders_demo (\n  id          bigserial PRIMARY KEY,\n  tenant_id   int NOT NULL,\n  created_at  timestamptz NOT NULL,\n  status      text NOT NULL,\n  payload     text\n);\n\n-- 2 million rows with skew:\n-- 90% shipped, 9% paid, 1% canceled\n-- (draw random() once per row; nesting random() in each CASE\n--  branch would skew canceled down to roughly 0.1%)\nWITH seeded AS MATERIALIZED (\n  SELECT g, random() AS r\n  FROM generate_series(1, 2000000) AS g\n)\nINSERT INTO orders_demo (tenant_id, created_at, status, payload)\nSELECT\n  (random() * 9999)::int + 1,\n  now() - (g * interval '1 second'),\n  CASE\n    WHEN r &lt; 0.90 THEN 'shipped'\n    WHEN r &lt; 0.99 THEN 'paid'\n    ELSE 'canceled'\n  END,\n  md5(g::text || '-' || random()::text)\nFROM seeded;\n\nANALYZE orders_demo;\n\nSELECT status, count(*) AS rows\nFROM orders_demo\nGROUP BY status\nORDER BY rows DESC;<\/code><\/pre>\n\n\n\n<p><strong>The skew is intentional. <\/strong>Low cardinality becomes operationally painful when one value is extremely common, because that\u2019s where selectivity collapses.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Step 2: Enable and verify <code>track_io_timing<\/code><\/h4>\n\n\n\n<p><code>track_io_timing<\/code> makes PostgreSQL measure the time spent in read\/write syscalls and expose those numbers in <code>EXPLAIN (ANALYZE, BUFFERS)<\/code> output.
It adds overhead, which is why you usually keep it off until you\u2019re diagnosing I\/O behavior.<\/p>\n\n\n\n<p>First, verify its current state:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SHOW track_io_timing;<\/code><\/pre>\n\n\n\n<p>If you\u2019re allowed to enable it for your session:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SET track_io_timing = on;\nSHOW track_io_timing;<\/code><\/pre>\n\n\n\n<p>If that fails due to permissions, you need to enable it in your server configuration (or parameter group) and reload, then confirm again with <code>SHOW track_io_timing;<\/code>. Don\u2019t skip the confirmation, because the entire \u201cI\/O time\u201d part of this story depends on it.<\/p>\n\n\n\n<p>To keep runs comparable while you iterate, you can also reduce variability:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SET jit = off;\nSET max_parallel_workers_per_gather = 0;<\/code><\/pre>\n\n\n\n<p>That isn\u2019t about making the benchmark \u201cfaster\u201d; it\u2019s about making the before\/after contrast easier to interpret.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Step 3: Capture results automatically into a table<\/h4>\n\n\n\n<p>Instead of copying\/pasting <code>EXPLAIN<\/code> output, we\u2019ll store metrics extracted from <code>EXPLAIN (FORMAT JSON)<\/code> into a results table.
With <code>track_io_timing<\/code> enabled, the JSON will include <code>I\/O Read Time<\/code> and <code>I\/O Write Time<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS bench_results;\n\nCREATE TABLE bench_results (\n  variant          text NOT NULL,\n  query_name       text NOT NULL,\n  execution_ms     numeric NOT NULL,\n  planning_ms      numeric NOT NULL,\n  shared_hit       bigint NOT NULL,\n  shared_read      bigint NOT NULL,\n  io_read_ms       numeric NOT NULL,\n  io_write_ms      numeric NOT NULL,\n  captured_at      timestamptz NOT NULL DEFAULT now(),\n  plan             jsonb NOT NULL\n);\n\nCREATE OR REPLACE FUNCTION capture_explain(variant text, query_name text, sql_text text)\nRETURNS void\nLANGUAGE plpgsql\nAS $$\nDECLARE\n  j jsonb;\n  p jsonb;\nBEGIN\n  EXECUTE format('EXPLAIN (ANALYZE, BUFFERS, TIMING, SUMMARY, FORMAT JSON) %s', sql_text) INTO j;\n  j := j-&gt;0;\n  p := j-&gt;'Plan';\n\n  INSERT INTO bench_results\n    (variant, query_name, execution_ms, planning_ms, shared_hit, shared_read, io_read_ms, io_write_ms, plan)\n  VALUES\n    (\n      variant,\n      query_name,\n      COALESCE((j-&gt;&gt;'Execution Time')::numeric, 0),\n      COALESCE((j-&gt;&gt;'Planning Time')::numeric, 0),\n      COALESCE((p-&gt;&gt;'Shared Hit Blocks')::bigint, 0),\n      COALESCE((p-&gt;&gt;'Shared Read Blocks')::bigint, 0),\n      COALESCE((p-&gt;&gt;'I\/O Read Time')::numeric, 0),\n      COALESCE((p-&gt;&gt;'I\/O Write Time')::numeric, 0),\n      j\n    );\nEND;\n$$;<\/code><\/pre>\n\n\n\n<p>A quick note for rigor: these numbers are taken from the top plan node, which is typically sufficient for this workload to tell the story clearly. If you later want node-by-node aggregation across the whole plan tree, it\u2019s doable, but it\u2019s not necessary to demonstrate the low-cardinality effect.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Step 4: Define the workload we\u2019ll compare<\/h4>\n\n\n\n<p>We\u2019ll run three queries. 
One is the \u201ccommon value\u201d trap. One is the \u201crare value\u201d case where a low-cardinality index can actually shine. The last is the real-world pattern that many teams actually care about: <code>status<\/code>, a recent time window, and ordering.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT capture_explain(\n  '00_before_indexes',\n  'A_common_count_shipped',\n  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'shipped'$$\n);\n\nSELECT capture_explain(\n  '00_before_indexes',\n  'B_rare_count_canceled',\n  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'canceled'$$\n);\n\nSELECT capture_explain(\n  '00_before_indexes',\n  'C_shipped_recent_topN',\n  $$SELECT id, tenant_id, created_at\n    FROM lc_bench.orders_demo\n    WHERE status = 'shipped'\n      AND created_at &gt;= now() - interval '10 minutes'\n    ORDER BY created_at DESC\n    LIMIT 200$$\n);<\/code><\/pre>\n\n\n\n<p>If you want a cleaner signal, run the same three captures a second time under a variant like <code>00_before_indexes_run2<\/code>, because the first execution incurs some warmup costs.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Step 5: \u201cAfter 1\u201d \u2014 the na\u00efve index on the low-cardinality column<\/h4>\n\n\n\n<p>This is the index most people create first.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE INDEX idx_orders_demo_status ON orders_demo(status);\nANALYZE orders_demo;\n\nSELECT capture_explain(\n  '01_status_index',\n  'A_common_count_shipped',\n  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'shipped'$$\n);\n\nSELECT capture_explain(\n  '01_status_index',\n  'B_rare_count_canceled',\n  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'canceled'$$\n);\n\nSELECT capture_explain(\n  '01_status_index',\n  'C_shipped_recent_topN',\n  $$SELECT id, tenant_id, created_at\n    FROM lc_bench.orders_demo\n    WHERE status = 'shipped'\n      AND created_at &gt;= now() - interval '10 minutes'\n    ORDER BY created_at DESC\n    LIMIT 
200$$\n);<\/code><\/pre>\n\n\n\n<p>What you will typically observe is that the rare-value query gets meaningfully faster, often via an index scan or bitmap path, while the common-value query remains a sequential scan. That isn\u2019t a failure; it\u2019s the planner refusing to pay random I\/O overhead to fetch most of the table through an index. If your storage is cold, this is exactly where <code>io_read_ms<\/code> makes the story obvious: the \u201cwrong\u201d access path is the one that makes the storage subsystem do scattered work.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Step 6: \u201cAfter 2\u201d \u2014 index the question, not the label<\/h4>\n\n\n\n<p>Now you stop trying to use <code>status<\/code> as the access path by itself. You align indexes with how you actually ask for data.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CREATE INDEX idx_orders_demo_status_created_at\nON orders_demo (status, created_at DESC);\n\nCREATE INDEX idx_orders_demo_canceled_recent\nON orders_demo (created_at DESC)\nWHERE status = 'canceled';\n\nANALYZE orders_demo;\n\nSELECT capture_explain(\n  '02_composite_plus_partial',\n  'A_common_count_shipped',\n  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'shipped'$$\n);\n\nSELECT capture_explain(\n  '02_composite_plus_partial',\n  'B_rare_count_canceled',\n  $$SELECT count(*) FROM lc_bench.orders_demo WHERE status = 'canceled'$$\n);\n\nSELECT capture_explain(\n  '02_composite_plus_partial',\n  'C_shipped_recent_topN',\n  $$SELECT id, tenant_id, created_at\n    FROM lc_bench.orders_demo\n    WHERE status = 'shipped'\n      AND created_at &gt;= now() - interval '10 minutes'\n    ORDER BY created_at DESC\n    LIMIT 200$$\n);<\/code><\/pre>\n\n\n\n<p>This is where the \u201crealistic query\u201d usually changes dramatically.
Instead of searching a massive <code>status='shipped'<\/code> population and then sorting, PostgreSQL can walk a tight slice of <code>(status, created_at)<\/code> in the correct order and stop early because of the <code>LIMIT<\/code>. In many environments, you\u2019ll see a large drop in <code>shared_read<\/code> and a very visible drop in <code>io_read_ms<\/code>, because the engine stops thrashing the heap and starts reading far fewer pages.<\/p>\n\n\n\n<p>Meanwhile, the \u201ccommon value count\u201d query often remains a sequential scan, and you should treat that as the correct outcome. Counting nearly everything is almost always cheapest as a single streaming pass over the table.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Step 7: Print a clean \u201cbefore vs after\u201d table<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>SELECT\n  query_name,\n  variant,\n  round(execution_ms, 3) AS execution_ms,\n  round(planning_ms, 3)  AS planning_ms,\n  shared_hit,\n  shared_read,\n  round(io_read_ms, 3)   AS io_read_ms,\n  round(io_write_ms, 3)  AS io_write_ms\nFROM bench_results\nORDER BY query_name, variant, captured_at;<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\">How to address low-cardinality problems in real systems<\/h3>\n\n\n\n<p>The best mental shift is to stop thinking \u201clow cardinality means index\u201d and start thinking \u201clow cardinality means the optimizer will be selective only sometimes\u201d. You want to build access paths for the selective parts of the workload and avoid paying a write tax for an index that helps rarely or only helps rare values.<\/p>\n\n\n\n<p>When the application frequently queries a common status only for recent data, a composite index that includes the time dimension is often the cleanest fix, as it restores selectivity over time. When the workload cares about rare statuses, partial indexes let you focus index maintenance on the slice that actually benefits.
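<\/p>\n\n\n\n<p>Before adding or dropping anything, it helps to check whether an existing index is earning its keep at all. One way to list candidates is to read <code>pg_stat_user_indexes<\/code>; keep in mind the counters are cumulative since the last statistics reset, and indexes that back primary-key or unique constraints must stay even at zero scans:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>-- Indexes that have never been scanned but are still maintained on every write\nSELECT schemaname, relname, indexrelname, idx_scan,\n       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size\nFROM pg_stat_user_indexes\nWHERE idx_scan = 0\nORDER BY pg_relation_size(indexrelid) DESC;<\/code><\/pre>\n\n\n\n<p>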
When the \u201creal\u201d filter is a tenant, user, account, or another high-cardinality attribute, that should usually lead the index, with the low-cardinality attribute following to keep scans narrow and ordered.<\/p>\n\n\n\n<p>If you\u2019re seeing PostgreSQL pick the \u201cwrong\u201d plan because it misestimates skew or correlation, improve statistics so the planner stops guessing. Raising per-column statistics targets can help with heavily skewed data, and extended statistics can help when two columns interact (for example, a tenant where one status dominates far more than the global distribution suggests). None of this changes the fundamental economics of low selectivity, but it prevents the planner from hallucinating selectivity that doesn\u2019t exist.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\">Conclusion: low cardinality is a feature, not a free lunch<\/h3>\n\n\n\n<p>Low cardinality delivers real wins in compression and cache efficiency, and it often makes datasets easier to store and scan. That\u2019s the halo.<\/p>\n\n\n\n<p>The horn is that it seduces you into building indexes that look semantically correct but are mechanically expensive. A low-cardinality index can become the definition of \u201cpay every day, benefit once a month\u201d, especially when the most common value is what you query most.<\/p>\n\n\n\n<p>The mature approach is to measure, not assume. Use <code>EXPLAIN (ANALYZE, BUFFERS)<\/code> and, when you\u2019re diagnosing I\/O behavior, enable <code>track_io_timing<\/code> intentionally and verify that it\u2019s on. Keep in mind that <code>track_io_timing<\/code> adds overhead, so treat it like a diagnostic instrument, not a default setting.
Then design indexes that match your query shape, not your schema labels: composite indexes that restore selectivity, partial indexes that target rare slices, and statistics that keep the planner honest.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Compression Hero That Quietly Kills Your Indexes Low cardinality is one of those database traits that feels like it should be \u201cgood by default\u201d. A column with very few distinct values is easy to reason about, it makes analytics simpler, it often compresses extremely well, and it usually matches the way the business talks [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[50],"tags":[],"_links":{"self":[{"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1645"}],"collection":[{"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1645"}],"version-history":[{"count":9,"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1645\/revisions"}],"predecessor-version":[{"id":1657,"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/1645\/revisions\/1657"}],"wp:attachment":[{"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1645"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1645"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/wagnerbianchi.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=16
45"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}