wagnerbianchi.com

PostgreSQL Partitioning Automation with pg_partman and pg_cron

I recently started a new project at work that demands a log table with a jsonb column. I don’t like to have JSON columns on tables as that sounds like you don’t want to make all that information stored on a JSON document well-normalised (more relations, more columns, etc.) More about JSON Data Type support on PostgreSQL.

Based on that project, we started looking at whether the Table Partitioning on PostgreSQL would make sense due to a log table expected to grow very large once the application starts using it. The PostgreSQL version 13.7 (I know, OK, it’s old and needs a major upgrade 😎 ); it is running on AWS RDS.

The Table Partitioning feature can be thought of as a source for queries to improve performance and as a way to mitigate issues related to data vacuuming, better indexing and some other advantages. The table itself is a simple one, but the data would be a big problem over time and thinking about that, we needed not just partitioning but also a way to automate the partitioning removal and the creation of new ones.

Executing online schema changes on PostgreSQL is still an issue, as no rock-solid tool on the market will execute that for big tables without downtime or considerable overhead. So, we need to start with a partitioned table.

🔥 I could not test the pg-osc due to time restrictions, as I said some days ago when I shared the post with my LinkedIn network. (check my post on LinkedIn).

We need to start with a partitioned table and an automated way to manage partitions along the way; operations like creating new partitions over time and removing old partitions as a subset of the data aren’t needed anymore is a desire. Also, I didn’t want to write that myself in case we already have a consolidated tool available. In the PostgreSQL world, we know, it is always an extension available to perform the job, which is not different in this case.

🎯 The database and table I am going to use here are examples, as, naturally, we can’t publish official material 😉

Table Partitioning Background

PostgreSQL table partitioning is a feature that allows you to divide a large table into smaller ones, more manageable pieces called partitions. Each partition is essentially a separate table that stores a subset of the data based on a specified partitioning strategy or partitioning method. Partitioning can provide significant performance benefits by improving query execution time and simplifying data maintenance.

💡Below, I leave you with a quick but functional, manual declarative partitioning example for labs/dev – not production:

-- partitioned or parent table
drop table if exists log;
create table if not exists log (
  log_entry json,
  dt_entry timestamp,
  primary key(dt_entry)
) partition by range(dt_entry);

-- partitions or child tables
create table if not exists log_part_202307 partition of log
for values from ('2023-07-01 00:00:00') to ('2023-07-31 00:00:00');

create table if not exists log_part_202308 partition of log
for values from ('2023-08-01 00:00:00') to ('2023-08-31 00:00:00');
-- end

🎯 Overall, the benefits of Table Partitioning are below (but not limited to):

  1. Improved Query Performance: Partitioning allows for faster query execution by reducing the amount of data that needs to be scanned. When a query is executed, PostgreSQL’s query planner can eliminate irrelevant partitions based on the query conditions, resulting in more targeted data retrieval and improved performance;
  2. Efficient Data Maintenance: Partitioning simplifies data management tasks, such as archiving or removing old data. Instead of performing these operations on the entire table, you can target specific partitions or child tables, which reduces the time and resources required for data maintenance like when running a manual VACUMM process;
  3. Enhanced Data Loading and Indexing: Partitioning can speed up data loading processes. When inserting new data, PostgreSQL can distribute it across multiple partitions simultaneously, utilising parallelism and improving the overall loading performance. Additionally, partitioning allows for indexing individual partitions, enabling more efficient indexing and enhancing query performance;
  4. Easy Data Retention and Purging: You can easily implement data retention and purging strategies by partitioning data based on time ranges or other relevant criteria. For example, you can create monthly partitions and set up a process to automatically drop or archive older partitions, thus managing data growth and maintaining optimal database performance;
  5. Flexibility in Storage and Indexing: PostgreSQL allows you to store different partitions on separate tablespaces, which enables the use of different storage technologies or configurations based on the specific needs of each partition. Additionally, you can create different indexes on each partition, tailoring indexing strategies to optimise query performance for specific subsets of the data;
  6. Improved Scalability: Partitioning enables horizontal scalability by distributing data across multiple physical disks or servers. By leveraging partitioning alongside techniques like sharding or replication, you can handle larger datasets and achieve higher levels of performance and scalability;
  7. Easier Data Analysis: Partitioning can facilitate data analysis and reporting tasks. By dividing data into logical subsets, you can focus analysis on specific partitions, enhancing query performance and simplifying analytical workflows.

🔥 Most of the above topics pave the way for new blog entries, like improved scalability when creating child tables or partitions over different tablespaces. PostgreSQL table partitioning significantly benefits query performance, data maintenance, scalability, and data analysis. Partitioning optimises database operations by efficiently organising and managing data, improving your PostgreSQL’s overall efficiency and performance.

🎧 A great resource I consumed many times is Postgres.FM episode on Partitioning, I suggest you to listening to: https://postgres.fm/episodes/partitioning.

Enter the pg_partman, a PostgreSQL Extension

The pg_partman extension is available on AWS RDS for PostgreSQL from version 12.5 and newer. When working on the RDS, you don’t need to configure the shared_preload_libraries to add the pg_partman extension, as that is already available for the DBA to issue the CREATE EXTENSION command, using, e.g., the psql client.

For the complete solution presented in this blog, in case you want to follow up and build your own lab, you must also have the pg_cron extension added to the shared_preload_libraries on the parameters group, and a cluster restart is required afterwards.

The parent table must have the PARTITIONED BY with the partition method, you want to apply, and the partman schema should be on the same database as the schema where you have your tables (it can be public as well).

📚 First of all, let’s get the pg_cron extension setup (and restart the cluster):

postgres=> CREATE EXTENSION pg_cron;
CREATE EXTENSION

postgres=> select oid, extname from pg_catalog.pg_extension;
  oid  |  extname
-------+------------
 14287 | plpgsql
 16403 | pg_partman
 17786 | pg_cron
(3 rows)

Let’s start setting up the pg_parttman, connect to your database, create a new schema we will name “partman” (very creative) and install the pg_partman extension.

postgres=> \c bianchi_db
You are now connected to database "bianchi_db" as user "postgres".

bianchi_db=> create schema if not exists partman;
CREATE SCHEMA

bianchi_db=> create extension pg_partman with schema partman;
CREATE EXTENSION

bianchi_db=> select oid, extname from pg_catalog.pg_extension;
  oid  |  extname
-------+------------
 14287 | plpgsql
 18949 | pg_partman
(2 rows)

We already have a log table, but let’s recreate it to avoid any questions regarding this part of the procedure. We will also partition the log table by the dt_entry column calling the partman.create_parent() function.

bianchi_db=> drop table if exists log;
create table if not exists log (
    log_entry json,
    dt_entry timestamp,
    primary key(dt_entry)
) partition by range(dt_entry);
DROP TABLE
CREATE TABLE

bianchi_db=> SELECT partman.create_parent(p_parent_table => 'public.log',
 p_control => 'dt_entry',
 p_type => 'native',
 p_interval=> 'monthly',
 p_premake => 6
);
 create_parent
---------------
 t
(1 row)

The above has an interesting effect on partitioning our log table, as, the p_control points to the column of the table we want to use as a partition key, the p_type is configured to use native partitioning, p_interval we configured it as monthly and requested it to create 6 partitions ahead. If we list our tables now, we can see the following:

bianchi_db=> \d
                  List of relations
 Schema |     Name     |       Type        |  Owner
--------+--------------+-------------------+----------
 public | log          | partitioned table | postgres
 public | log_default  | table             | postgres
 public | log_p2022_12 | table             | postgres
 public | log_p2023_01 | table             | postgres
 public | log_p2023_02 | table             | postgres
 public | log_p2023_03 | table             | postgres
 public | log_p2023_04 | table             | postgres
 public | log_p2023_05 | table             | postgres
 public | log_p2023_06 | table             | postgres
 public | log_p2023_07 | table             | postgres
 public | log_p2023_08 | table             | postgres
 public | log_p2023_09 | table             | postgres
 public | log_p2023_10 | table             | postgres
 public | log_p2023_11 | table             | postgres
 public | log_p2023_12 | table             | postgres
(15 rows)

If you use the \d+, you can see the table’s metadata and the partitions attached. Unfortunately, the results of that command will not be OK to show here on the blog, but you can try it on your side to see not only the partitions but also the boundaries of the partitions.

The Partition Management Automation

As we have everything we need regarding partitioning and partman configuration in place, we now need to schedule a @daily check with the pg_cron help so the partman.run_maintenance_proc() can be called automatically.

postgres=> UPDATE partman.part_config
SET infinite_time_partitions = true,
    retention = '3 months',
    retention_keep_table=false
WHERE parent_table = 'public.log';
SELECT cron.schedule('@daily', $$CALL partman.run_maintenance_proc()$$);
UPDATE 0

If you configure the cron.schedule with @hourly, instead of configuring it with @daily. You can see an example of that being triggered below:

postgres=> \x
Expanded display is on.

postgres=> select command, status, start_time, end_time from cron.job_run_details order by start_time desc limit 5;
-[ RECORD 1 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 18:00:00.076273+00
end_time   | 2023-06-29 18:00:00.996104+00
-[ RECORD 2 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 17:00:00.104716+00
end_time   | 2023-06-29 17:00:00.644784+00
-[ RECORD 3 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 16:00:00.466483+00
end_time   | 2023-06-29 16:00:00.98461+00
-[ RECORD 4 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 15:00:00.213028+00
end_time   | 2023-06-29 15:00:00.421355+00
-[ RECORD 5 ]-----------------------------------
command    | CALL partman.run_maintenance_proc()
status     | succeeded
start_time | 2023-06-29 14:00:00.147603+00
end_time   | 2023-06-29 14:00:00.405463+00

🟢 That’s it. You can check this repository on GitHub for more information about the pg_partman.

References:

https://github.com/pgpartman/pg_partman

https://github.com/citusdata/pg_cron


You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

AddThis Social Bookmark Button

Leave a Reply