Why our PostgreSQL replicas couldn’t remove WAL files?

Last week our monitoring tool alerted that the disks of read replicas were filling up in a critical very large database cluster. This was a database cluster with 5 read replicas that were using WAL streaming. After an initial look, it was obvious the replicas were not able to purge the WAL files. To recycle WAL files we use wal_keep_segments which was adjusted to 8192 on that cluster. Each WAL file was 16MB thus 128GB of disk space usage by WAL files was expected. However, at the time of the issue, the pg_xlog was consuming over 1TB of disk space.

Let’s run the tape back a bit.

We’d renewed all 5 read replicas a week ago as a result of a replication failure. Actually, creating a read replica is the responsibility of our PaaS platform and that process does not include DBA operation at all. But something was wrong with those read replicas’ configuration.

Further investigating, we have noticed those read replicas inherited the pglogical extension from the database master. Before the renewal of replicas, we had installed pglogical on the database master only and pglogical had not been enabled on streaming replicas. Therefore, I started to suspect that the issue was somehow related to pglogical extension.

First I have tried to remove the pglogical extension on the read replicas, but it didn’t allow it.

psql> drop extension pglogical;ERROR: cannot execute DROP EXTENSION in a read-only transaction

This was guessed because streaming replicas cannot be modified.

Then I thought I can downgrade wal_level on the replicas from logical to hot_standby.

That parameter requires a restart. However, it failed at restart unexpectedly. This was what helped us to find the root cause of the issue.

FATAL,55000,”logical replication slot “”pgl_userdatabase_provider1_subscription1"” exists, but wal_level < logical”,,”Change wal_level to be logical or higher.”

It means the replication slot between the database master and its logical replicas was also inherited by streaming replicas.

To explain that better let me draw our database cluster diagram.

db_master
├── rds_logical_replica
├── streaming_replica_1
├── streaming_replica_2
├── streaming_replica_3
├── streaming_replica_4
└── streaming_replica_5

Streaming replicas were behaving like they were followed by a logical replica. Thus they were not purging the WAL files.

psql > SELECT slot_name,plugin FROM pg_replication_slots ;
slot_name | plugin
------------------------------------------+------------------
pgl_userdatabase_provider1_subscription1 | pglogical_output

To solve this, I thought removing the replication slot on streaming replicas could be possible. So tried the command below.

psql > select pg_drop_replication_slot('pgl_userdatabase_provider1_subscription1');

Bingo! That removed the replication slot and it worked.

After a couple of minutes, the used space by WAL files dropped from 1TB to 128GB.