Demystifying PostgreSQL Database Migrations

Database migration is the process of applying changes to a live database environment. This may be done with the help of a database migration library (e.g. Phinx for PHP), but in this article we’re not going to focus on how the migration takes place with a specific tool. Instead, we’re going to explore the implications and the traps you may fall into when these migrations need to be executed on a large PostgreSQL database powering a service at scale that cannot afford to be interrupted.


When we talk about “database migrations” from the application perspective, we may talk about two different things:

  • Data change: a migration that involves adding/removing/updating a large number of records
  • Schema change: a migration that changes the database structure

Each migration should have no negative impact on the service: we should expect both error rates and latencies to remain constant, unaffected by the underlying migration.

While the techniques we are going to talk about can be used with both small and large tables, the impact of a migration on small tables is often negligible and the overhead of the extra steps is not worth it. The definition of small and large can vary and is highly affected by the underlying database resources: for example, updating 1M rows can be extremely fast on a powerful server, while even updating 100K rows can be pretty slow on a shared virtual instance. Because of this, you should apply your own judgement based on your environment when it’s time to decide which technique to use.

Data Changes

Problem: Run an UPDATE query to manipulate one or more columns

UPDATE users SET feature_enabled = TRUE;

With this kind of migration, there are 3 challenges you have to consider:

  • Locks
  • Disk IOPS
  • Replication (only if you’re running a replicated setup that you manage yourself)

Locks

UPDATE takes an exclusive row-level lock on each updated row, and this lock blocks concurrent writes on the same row until the transaction commits or rolls back. A SELECT can still run concurrently on the same row, but an UPDATE or DELETE cannot and will wait until the lock is released.

The lock is taken even if the value you are setting is equal to the value already present in a specific row, so what you may think is a no-op is not a no-op.
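
One way to skip those rows entirely is to exclude them in the WHERE clause. A minimal sketch based on the UPDATE example above (IS DISTINCT FROM is used so that NULLs are treated as “different” too):

-- Only touch rows whose value actually changes, so no lock is taken
-- on rows that already hold the desired value
UPDATE users SET feature_enabled = TRUE
WHERE feature_enabled IS DISTINCT FROM TRUE;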

Impact: the longer the lock is held, the longer other “write” queries on the same rows will be blocked.

Disk IOPS

In PostgreSQL tuples are immutable, so UPDATE completely rewrites all matching rows, even if you are changing one single column. 

Because the whole row is rewritten, even when updating a single column out of 50, the larger the number of rows and/or the larger the size (in bytes) of each row, the higher the disk pressure in terms of IOPS. Depending on your environment, an increase in IOPS may impact the general performance of the system or lead to a cost increase.
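
To get a rough feeling for how much data a full-table UPDATE would rewrite, you can check the table size and row count beforehand. A quick sketch, reusing the hypothetical users table from the example above:

-- Rough estimate of the data a full-table UPDATE would rewrite
SELECT pg_size_pretty(pg_table_size('users')) AS table_size,
       (SELECT count(*) FROM users)           AS row_count;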

Replication

Hopefully you’re not managing PostgreSQL replication yourself, since the offering of managed PostgreSQL databases is pretty good, but if you are, you’re likely running asynchronous streaming replication. When the UPDATE transaction is committed, changes are written to the WAL (write-ahead log), an append-only log containing all changes, which is written sequentially and continuously shipped to the replica nodes.

The larger a committed transaction is, the longer it will take to be transferred and applied on replicas, increasing the replication latency of subsequent transactions. Also, in case of a massive rewrite, the replicas may fall so far behind that the maximum WAL retention is reached and they lose the ability to sync.
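
If you want to keep an eye on that lag while a migration runs, a couple of standard queries can help. A sketch (the *_lag columns of pg_stat_replication require PostgreSQL 10 or later):

-- On the primary: per-replica streaming replication lag
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On a replica: how far behind the last replayed transaction is
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;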

Solution

As a general rule of thumb, you want to keep the number of rows affected by each UPDATE command low, and throttle between consecutive UPDATEs.

The benefits of this approach are:

  • Fewer rows updated at once, lower disk IOPS / CPU
  • Lower disk IOPS / CPU, faster transactions
  • Faster transactions, row-exclusive locks held for a shorter time and on a smaller number of rows
  • Throttling between two transactions, reduced pressure on disk and replication

And here are a few tips you can use in the process:

  • Idempotency: the migration task should support “cancel and restart” at any time, and once restarted it should not update rows that were already processed.
  • Batch iterations on indexed columns: iterate on primary keys or indexed columns and update only the corresponding rows.
  • Update once: if a row already contains the desired value, do not UPDATE it.
  • Dynamic throttling: introduce small delays between each UPDATE statement, based on the number of rows updated by the previous command (e.g. sleep($num_updated_rows / 10000) to sleep 1s for every batch of 10K updated rows). A full sketch combining these tips follows the snippet below.

UPDATE users SET ...
-- Batch iteration
WHERE user_id > $offset AND user_id <= $offset + $length
-- Update once
AND column_to_be_changed != $expected_value
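
Putting the tips together, here is a minimal sketch of the whole pattern as a PL/pgSQL procedure. It’s one possible approach, not the only one: it assumes PostgreSQL 11+ (which allows COMMIT inside procedures) and reuses the hypothetical users / feature_enabled names from the example above.

-- Batched, throttled, idempotent backfill (sketch, PostgreSQL 11+)
CREATE PROCEDURE backfill_feature_enabled()
LANGUAGE plpgsql
AS $$
DECLARE
  batch_size INTEGER := 10000;
  last_id    BIGINT  := 0;
  max_id     BIGINT;
  updated    INTEGER;
BEGIN
  SELECT COALESCE(MAX(user_id), 0) INTO max_id FROM users;

  WHILE last_id < max_id LOOP
    UPDATE users
       SET feature_enabled = TRUE
     WHERE user_id > last_id
       AND user_id <= last_id + batch_size          -- batch iteration on the PK
       AND feature_enabled IS DISTINCT FROM TRUE;   -- update once

    GET DIAGNOSTICS updated = ROW_COUNT;
    last_id := last_id + batch_size;

    COMMIT;                                -- release row locks, let the WAL ship
    PERFORM pg_sleep(updated / 10000.0);   -- dynamic throttling
  END LOOP;
END;
$$;

CALL backfill_feature_enabled();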

Schema Changes

A schema change is any change to the database structure: that includes creating, modifying and deleting a table, but also adding and removing indexes. Each of these changes comes with its own challenges, so we’re going to cover each of them separately.

Create Table

The CREATE TABLE command only affects PostgreSQL internal metadata, so it has no negative impact. You can add tables to a live database without worrying about any performance impact.

Alter Table

ALTER TABLE is trickier, and the only answer to whether it affects the database is: it depends. But how? Let’s look at different commands and evaluate the impact of each one.

IMPORTANT: any ALTER TABLE takes an ACCESS EXCLUSIVE lock to apply the metadata changes (no reads or writes allowed, the table is fully locked), but we consider it negligible because the operation is very fast. The only downside is that to acquire the lock we have to wait for all other queries on the table to complete. From now on, the ACCESS EXCLUSIVE lock for this specific case will not be mentioned anymore and we’ll focus on the other side effects.
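
If you prefer to bound that wait rather than queue behind long-running queries, one common option is to set a lock_timeout for the session before running the ALTER. A sketch (lock_timeout is available since PostgreSQL 9.3; the value is just an example):

-- Give up (and retry later) if the ACCESS EXCLUSIVE lock cannot be
-- acquired within 2 seconds, instead of queueing other queries behind us
SET lock_timeout = '2s';
ALTER TABLE mytable ADD COLUMN mycolumn INTEGER;
RESET lock_timeout;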

Adding column

ALTER TABLE mytable ADD COLUMN mycolumn INTEGER

The impact of this command is negligible since it only affects internal metadata: there’s nothing to rewrite nor any constraint to validate.

ALTER TABLE mytable ADD COLUMN mycolumn INTEGER DEFAULT 0

The impact of this command changes based on the PostgreSQL version you’re running.

  • Before version 11, adding a column with a default requires rewriting every single row of the table to store the new column value, making the execution of the statement very slow on a large table and keeping the table locked the entire time.
  • Starting from version 11, adding a column with a default value is a metadata-only change most of the time, and it has a negligible impact. The default value will be returned the next time the row is accessed and physically written whenever the row is rewritten, making the execution of the statement very fast even on large tables. However, if the default value is volatile (e.g. clock_timestamp()), PostgreSQL still has to rewrite each row to store the value computed for it, so the old, slow behaviour applies.

When running this command would require a table rewrite, the general suggestion is to apply the migration in different stages: first add the column without the default value (i.e. ALTER TABLE mytable ADD COLUMN mycolumn INTEGER), then ALTER the column to set a default value for future INSERTs (i.e. ALTER TABLE mytable ALTER COLUMN mycolumn SET DEFAULT 0), and finally run a batched UPDATE for the existing rows (see above).
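
Put together, the staged version looks roughly like this (reusing mytable / mycolumn from the examples above and assuming an id primary key to iterate on; the backfill follows the batched pattern from the Data Changes section):

-- Stage 1: metadata-only change, no default
ALTER TABLE mytable ADD COLUMN mycolumn INTEGER;

-- Stage 2: default applies only to future INSERTs
ALTER TABLE mytable ALTER COLUMN mycolumn SET DEFAULT 0;

-- Stage 3: backfill existing rows in small, throttled batches
UPDATE mytable SET mycolumn = 0
WHERE id > $offset AND id <= $offset + $length
  AND mycolumn IS DISTINCT FROM 0;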

ALTER TABLE mytable ALTER COLUMN mycolumn SET NOT NULL

Adding a NOT NULL constraint to an existing column is typically a quite heavy command when run on a large table. In order for the constraint to be added, a full table scan is performed under an ACCESS EXCLUSIVE lock, to prevent new rows that violate the constraint from being added while the command is executed.

While there’s no complete solution to this at the moment, there are ways its impact can be reduced. For example, the check is significantly faster if the involved disk pages are cached in memory, so you could run a full scan on the table to warm it up, then run the ALTER command immediately after.
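
A minimal sketch of that warm-up, assuming the table (or at least the relevant pages) fits in the cache:

-- Warm up the cache with a full scan, then run the blocking ALTER right after
SELECT count(*) FROM mytable;
ALTER TABLE mytable ALTER COLUMN mycolumn SET NOT NULL;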

Starting with PostgreSQL 9.4, a similar result can be achieved using a table-level CHECK constraint that can be added now and validated later. In this scenario, you add the constraint by running:

ALTER TABLE mytable ADD CONSTRAINT mycolumn_notnull CHECK (mycolumn IS NOT NULL) NOT VALID;

and the NOT VALID option will make PostgreSQL skip the validation phase for the existing rows. A subsequent run of:

ALTER TABLE mytable VALIDATE CONSTRAINT mycolumn_notnull;

will then validate the content of the table while acquiring only a SHARE UPDATE EXCLUSIVE lock, instead of the ACCESS EXCLUSIVE lock that was required in PostgreSQL versions prior to 9.4.

Dropping column

ALTER TABLE mytable DROP COLUMN mycolumn

The impact of this command is negligible since it only affects internal metadata: there’s nothing to rewrite nor any constraint to validate. DROP COLUMN does not physically remove the column, but simply makes it invisible to SQL operations. The space will be reclaimed over time as existing rows are updated.

Drop Table

Assuming you already released an updated version of your software that completely stopped any read or write on the table you want to drop, the impact of the DROP TABLE command is negligible.

Create Index

In order to execute a CREATE INDEX statement, PostgreSQL needs to perform a full table scan and lock the table against writes. While this is very efficient, it causes problems when adding an index on a large table with an intensive write workload. Fortunately, PostgreSQL provides a CONCURRENTLY modifier that you can apply to your statement, so for example a statement like:

CREATE INDEX mytable_mycolumn_idx ON mytable (mycolumn);

becomes:

CREATE INDEX CONCURRENTLY mytable_mycolumn_idx ON mytable (mycolumn);

CONCURRENTLY is very useful because it’s able to create your index without locking the table, but there are a couple of caveats:

  • It requires TWO full table scans, so it takes longer and uses more disk IOPS
  • It cannot be executed inside a transaction block

Normally these caveats are not a big deal, but it’s certainly good to know about them when you use CONCURRENTLY to create your new index without locking the table.

Drop Index

The impact of the DROP INDEX command is negligible, but a CONCURRENTLY version is provided anyway to avoid any kind of locking.
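
For completeness, the concurrent variant looks like this (reusing the index name from the CREATE INDEX example above):

DROP INDEX CONCURRENTLY mytable_mycolumn_idx;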

Conclusions

As you have noticed from this post, dealing with a live production environment running at scale requires some extra care, so we hope reading this can help you face the challenges of your work. Stay tuned for further updates about what we do at Spreaker and don’t forget to check our current job openings!