When faced with the task of picking a storage solution, you have plenty of choices today: column-based stores, key-value stores, graph databases, you name it. Every solution has its own strengths and weaknesses and can be a good or bad fit depending on the use case. For sure, traditional relational databases still have a lot to offer today: when we started developing Spreaker we chose PostgreSQL, and it’s still running today, backing our CMS and API.
At our scale we deal with a pretty large amount of data, and we feel query analysis and optimization should be in the toolbelt of every developer. In this post, we are going to introduce the EXPLAIN and EXPLAIN ANALYZE commands, learn how to read their output, and spot potential bottlenecks.
A query optimization journey
Let’s assume you already know how to analyze and optimize a query. The question now is: since even small applications can run a wide variety of queries, how can we spot the problematic ones from a performance point of view?
Depending on how you run your PostgreSQL, tools like AWS Performance Insights or New Relic are a good starting point, as they point out which queries are the major offenders in terms of resource usage (I/O and execution time, just to name two).
Once you have found the query you want to work on, you need to analyze it to understand what is going on, and that is the part this article focuses on. When you know what is happening, you can then try to optimize it: rewrite the query to better use indexes, add missing indexes, use a subquery, you name it.
After that, you can deploy the updated query on production and repeat the cycle again:
Check performance -> Analyze -> Optimize -> Deploy
until you are satisfied.
An example table with some data
To keep things simple, we are going to create a table named students:
CREATE TABLE students
(student_id SERIAL, name TEXT, PRIMARY KEY(student_id));
And fill it with a good amount of data. 10 million rows should be enough.
INSERT INTO students (name) SELECT left(md5(i::text), 5)
FROM generate_series(1, 10000000) s(i);
This is the final result:
student_id | name
------------+-------
1 | c4ca4
2 | c81e7
3 | eccbc
4 | a87ff
5 | e4da3
6 | 16790
7 | 8f14e
8 | c9f0f
9 | 45c48
10 | d3d94
...
Keep in mind that when working on query optimization and performance, it is important to have an amount of test data in the same order of magnitude as the real data (or just run the analysis directly in production): that’s because table cardinality and data distribution play a role when Postgres decides how to execute a query.
Every time Postgres receives a query, it runs the Query Planner, which tries to figure out the best way to execute it: whether to use an index (and which one) or not, how to load the data, how to sort it, and so on. The output of the Query Planner is what is called a Query (Execution) Plan.
How PostgreSQL stores data
Before diving in and starting to talk about performance, let’s take a look at how Postgres stores data, introducing a few concepts that will help us understand what the database does under the hood. These concepts are:
- tuple_id
- pages
- heap
- indexes
tuple_id
When you create a table, Postgres maintains a system column which is used to reference a row and acts as an identifier. In this article we will call it tuple_id (its real Postgres counterpart is the ctid system column). Keep in mind this column is for internal usage: it won’t show up when you query the table with SELECT *.
tuple_id | student_id | name
---------+------------+-------
1 | 3000 | c4ca4
2 | 4000 | c81e7
3 | 5000 | eccbc
4 | 6000 | a87ff
5 | 7000 | e4da3
6 | 8000 | 16790
7 | 9000 | 8f14e
8 | 0000 | c9f0f
9 | 1000 | 45c48
10 | 2000 | d3d94
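If you want to peek at the real thing, the ctid column mentioned above becomes visible when you select it explicitly. On a freshly loaded table the output looks something like this (illustrative; note that a ctid pair is (page number, tuple index within the page)):
postgres=# SELECT ctid, student_id, name FROM students LIMIT 3;
 ctid  | student_id | name
-------+------------+-------
 (0,1) |          1 | c4ca4
 (0,2) |          2 | c81e7
 (0,3) |          3 | eccbc
(3 rows)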
Pages
A page is a fixed-size (8 KB in Postgres) location on disk where rows are physically stored in a binary format. To give you a feeling for what that could look like (this is not a real representation, it’s just for the sake of this explanation), assuming each page contains 3 rows, pages would look like the following:
page 0
1,3000,c4ca4|2,4000,c81e7|3,5000,eccbc
--------------------------------------
page 1
4,6000,a87ff|5,7000,e4da3|6,8000,16790
...
--------------------------------------
page N
(rows x, y, z)...
1,3000,c4ca4|2,4000,c81e7|3,5000,eccbc, for instance, is the content of page 0. For each row, we have the tuple_id (1), the student_id (3000), and the name (c4ca4).
Pages are the smallest I/O unit: if you ask for a single row, the entire page will be read and the unused rows will be discarded. The same concept applies to single values (e.g. name): the whole page will be loaded and the unused values will be filtered out.
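You can check the page size of your installation with SHOW; 8192 bytes (8 KB) is the default compile-time setting:
postgres=# SHOW block_size;
 block_size
------------
 8192
(1 row)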
Heap
For every table there is a heap, where the table’s pages are stored one after another. I/O operations are costly, and traversing the heap is no exception: getting data from the heap without knowing the page would be a costly operation. That is one of the reasons why indexes exist: to help us find in which part of the heap (and in particular in which page) a given row is stored.
HEAP
_________________________________________
| page 0 |
| 1,3000,c4ca4|2,4000,c81e7|3,5000,eccbc |
| -------------------------------------- |
| page 1 |
| 4,6000,a87ff|5,7000,e4da3|6,8000,16790  |
| |
| ... |
| -------------------------------------- |
| page N |
| (rows x, y, z)... |
|_________________________________________|
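To get a feeling for how much space the heap takes, you can ask Postgres directly: pg_relation_size returns the size of the table’s main data files, indexes excluded (on our 10-million-row table, expect a few hundred megabytes, depending on row layout and bloat):
postgres=# SELECT pg_size_pretty(pg_relation_size('students'));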
Indexes
Indexes are data structures maintained alongside a table that summarize its contents, helping you locate the data you are looking for.
Indexes work like an address book: when you look for a person’s phone number you don’t need to go through all the names; since an address book is an ordered list of names, you can jump directly to the section that starts with the initial of the name you are looking for.
Indexes are ordered by default, so they also come in handy when sorting data with ORDER BY.
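For instance, on our students table the primary key index can satisfy an ORDER BY on student_id with no separate sort step. Using the EXPLAIN command we’ll introduce shortly, you should see an Index Only Scan and no Sort node (a sketch; exact output depends on your setup):
postgres=# EXPLAIN SELECT student_id FROM students ORDER BY student_id LIMIT 10;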
Indexes are stored separately from the heap and contain references to pages. Getting back to our fictional example, this is what an index on the table students could look like:
index page 1
3000 (1,0) | 4000 (2,0) | ...
6000 (4,1) | 7000 (5,1) | ...
---------------------------
...
index page n
...
page 0
1,3000,c4ca4|2,4000,c81e7|...
----------------------------
page 1
4,6000,a87ff|5,7000,e4da3|...
----------------------------
page N
(rows x, y, z)...
Let’s analyze an index entry, like the first one: 3000 (1,0).
- 3000 is the primary key, student_id
- (1,0) are the tuple_id and the page number, respectively
So if we need to get the first row, we search the index and find that the row is located on page 0. Then we load the page, discard the other rows, and return the one we need. Easy, isn’t it?
Postgres provides several index types; by default it creates a Balanced Tree (B-Tree) index, which is optimized for search. If you are into big O notation, the time complexity for a search in this data structure is O(log N).
When you create a table and define a primary key, Postgres creates a unique index on that column. That is not the only index you can have on the table: multiple indexes are allowed, spanning several columns if needed.
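You can check which indexes exist on a table by querying the pg_indexes view; right after creating students you should see something like this (output is illustrative):
postgres=# SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'students';
   indexname   |                                   indexdef
---------------+--------------------------------------------------------------------------------
 students_pkey | CREATE UNIQUE INDEX students_pkey ON public.students USING btree (student_id)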
You may think that indexes can solve all performance problems but, unfortunately, that’s far from the truth. While they are useful in a lot of cases, maintaining an index comes at a cost, which manifests itself when writing data to the table (i.e. INSERT, UPDATE, DELETE), in particular for bigger indexes.
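The disk-space side of that cost is easy to measure: pg_indexes_size returns the total size of all the indexes attached to a table (the exact figure will vary):
postgres=# SELECT pg_size_pretty(pg_indexes_size('students'));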
EXPLAIN and EXPLAIN ANALYZE
The first thing we can do to analyze a query is to use the EXPLAIN command. Let’s run it on a simple query, where we select a single student_id.
postgres=# EXPLAIN SELECT student_id FROM students WHERE student_id = 2000;
QUERY PLAN
-------------------------------------------------------------------------
Index Only Scan using students_pkey on students (cost=0.43..4.45 rows=1 width=4)
The output is a Query Plan: the way Postgres is going to execute the query and return the results. A lot of information is provided, even for a simple query like this one:
- Index Only Scan is the table access method, that is, the way Postgres is going to retrieve the data. In this case, since we are returning only student_id and student_id is part of the index, no access to the table itself is needed
- cost=0.43 is the startup cost: the cost required to find the first page containing the data, expressed in disk page fetch units. It can vary based on the operations the database performs before fetching the rows, like aggregating or sorting data: if we added an ORDER BY clause, this number would increase (see the sketch after this list)
- 4.45 is the total cost needed to fetch all the rows
- rows=1 is the estimated number of rows returned by the query. Postgres may scan more rows than the number displayed here, for example as a result of filtering on a WHERE clause condition
- width=4 is the average width of the rows output by this query (in bytes)
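To see the startup cost in action, try sorting by the (for now) unindexed name column. This is a sketch and the exact plan will vary, but you should see a Sort step and a startup cost far above 0.43:
postgres=# EXPLAIN SELECT student_id FROM students ORDER BY name;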
Remember that EXPLAIN is based on internal statistics and does not execute the actual query. For that, EXPLAIN ANALYZE should be used:
postgres=# EXPLAIN ANALYZE SELECT student_id FROM students WHERE student_id = 6;
QUERY PLAN
-----------------------------------------------------------------------
Index Only Scan using students_pkey on students (cost=0.43..4.45 rows=1 width=4) (actual time=4.666..4.685 rows=1 loops=1)
Index Cond: (student_id = 6)
Heap Fetches: 0
Planning Time: 0.058 ms
Execution Time: 0.260 ms
(5 rows)
As the actual query was executed, we have more data to reason about:
- Planning Time: 0.058 ms is the time the Query Planner took to plan the query. It can vary based on query complexity (i.e. aggregation, sorting, …)
- Execution Time: 0.260 ms is the total time required to execute the query
- Heap Fetches: 0 is the number of times we needed to get data from the heap. In this particular case, since it is an Index Only Scan, no fetches were needed
Index Only Scan vs Index Scan
Let’s try another query:
postgres=# EXPLAIN ANALYZE SELECT name FROM students WHERE student_id = 300000;
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using students_pkey on students (cost=0.43..8.45 rows=1 width=6) (actual time=1.322..1.344 rows=1 loops=1)
Index Cond: (student_id = 300000)
Planning Time: 0.072 ms
Execution Time: 1.400 ms
(4 rows)
This time we are asking for the name column, which is not part of the index, so we need to access the heap to get the data. Still, the index is used, since we have a WHERE condition on student_id. This is what Index Scan means: Postgres was able to use the index to locate the tuple, but then it had to access the heap to fetch the data.
As you can see, accessing the heap comes at a cost: compared to the index-only query, Execution Time went up from 0.260 ms to 1.400 ms.
Caching, caching everywhere
Be aware that when doing this kind of investigation, caching will get in your way: executing the exact same query twice will lead to different Planning and Execution Times. Why? Caching.
postgres=# EXPLAIN ANALYZE select name from students where student_id = 300000;
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using students_pkey on students (cost=0.43..8.45 rows=1 width=6) (actual time=0.028..0.050 rows=1 loops=1)
Index Cond: (student_id = 300000)
Planning Time: 0.070 ms
Execution Time: 0.124 ms
(4 rows)
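If you want to see where the data is actually coming from, the BUFFERS option is your friend: it adds a Buffers: shared hit=... read=... line to the output, where hit counts pages found in Postgres’ shared buffer cache and read counts pages fetched from outside it. A sketch on our query:
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT name FROM students WHERE student_id = 300000;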
Sequential Scan
Retrieving one record using an index is for sure a happy path from the Query Planner’s point of view: check the index, fetch the page, filter out the unused rows, done. Things are usually more complicated than that, like in this example, where we’d like to fetch a student by name. Back at school, roll calls were done by name, right?
postgres=# EXPLAIN ANALYZE SELECT student_id FROM students WHERE name = 'fec8d';
QUERY PLAN
-----------------------------------------------------------------------
 Gather (cost=1000.00..107138.80 rows=1 width=4) (actual time=3.554..335.991 rows=11 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on students (cost=0.00..106137.70 rows=5 width=4) (actual time=95.067..306.439 rows=4 loops=3)
         Filter: (name = 'fec8d'::text)
         Rows Removed by Filter: 3333333
 Planning Time: 0.053 ms
 Execution Time: 336.813 ms
Before talking about what is going on here, let’s spend a few words clarifying how to read this information. As the topic deserves a whole blog post of its own, we are reporting just enough to help you get the idea:
First, a query plan is a tree of nodes. Each node has a certain type, and may have one or more child nodes (which child nodes depends on the node type). Different node types behave differently, but the overall mechanism is the same: a parent node pulls data from its children row by row. The children either produce data directly (e.g., by reading it from tables) or pull from their children.
In our example we have only one child node, -> Parallel Seq Scan on students, while the root of the tree is Gather (cost=1000.00..107138.80 rows=1 width=4) (actual time=3.554..335.991 rows=11 loops=1).
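If you want to see the tree structure explicitly, EXPLAIN can emit the plan in a structured format, where each node carries its children in a "Plans" array. A quick sketch (any of the queries above works):
postgres=# EXPLAIN (FORMAT JSON) SELECT student_id FROM students WHERE name = 'fec8d';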
Here no index can help us, so Postgres goes through all the pages looking for rows matching the WHERE condition. Fortunately, it can do that in a parallel fashion, as highlighted by Workers Launched: 2 and -> Parallel Seq Scan. Still, the vast majority of the rows are discarded, since:
- loops=3 means the Parallel Seq Scan node was executed 3 times (once per worker, plus the leader process)
- Rows Removed by Filter: 3333333 is a per-loop average, so the actual number of rows removed is higher: with 3 loops we discarded about 3 × 3333333 = 9999999 rows, that is, essentially the whole table except the handful of rows returned. High values here are a good indicator that there is room for improvement
Avoid Sequential Scan
Creating an index on the name column can solve this problem:
postgres=# CREATE INDEX students_name ON students(name);
Be aware that index creation takes time and locks the table against writes. For big tables, the CONCURRENTLY option is highly recommended: even though it’s generally slower to execute, because a second table scan is required, it doesn’t lock the table against writes (see the sketch below). With the new index in place, we can see its benefits on the previous query immediately, as Execution Time dropped from 336.813 ms to 0.542 ms.
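For reference, the concurrent variant of the statement above looks like this (note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block):
postgres=# CREATE INDEX CONCURRENTLY students_name ON students (name);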
postgres=# EXPLAIN ANALYZE SELECT * FROM students WHERE name = 'eec9d';
QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using students_name on students (cost=0.43..48.63 rows=11 width=10) (actual time=0.197..0.353 rows=10 loops=1)
Index Cond: (name = 'eec9d'::text)
Planning Time: 0.109 ms
Execution Time: 0.542 ms
(4 rows)
Sequential Scan and LIKE expressions
Things can get worse than that. Take a look at this query:
postgres=# EXPLAIN ANALYZE SELECT * FROM students WHERE name LIKE '%eec%';
QUERY PLAN
-------------------------------------------------------------------------
 Gather (cost=1000.00..107238.39 rows=1000 width=10) (actual time=2.719..389.191 rows=7260 loops=1)
Workers Planned: 2
Workers Launched: 2
   ->  Parallel Seq Scan on students (cost=0.00..106138.39 rows=417 width=10) (actual time=3.297..357.868 rows=2420 loops=3)
Filter: (name ~~ '%eec%'::text)
Rows Removed by Filter: 3330917
Planning Time: 1.384 ms
Execution Time: 439.882 ms
A sequential scan again. Why is that? We do have an index on the name column, so why is it not used? Well, for LIKE expressions where the wildcard appears at the beginning of the pattern, B-Tree indexes are not useful.
For this use case, trigram indexes or Full-Text Search will do the job.
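As a sketch, here is how a trigram index could be created using the pg_trgm extension (the index name students_name_trgm is our choice); with it in place, Postgres can use the index even for patterns with a leading wildcard:
postgres=# CREATE EXTENSION IF NOT EXISTS pg_trgm;
postgres=# CREATE INDEX students_name_trgm ON students USING gin (name gin_trgm_ops);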
Keep the statistics updated
EXPLAIN uses statistics to come up with a Query Plan, and outdated statistics can lead to strange results:
postgres=# EXPLAIN SELECT name FROM students;
QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on students (cost=0.00..308107.35 rows=19999835 width=6)
postgres=# ANALYZE students;
postgres=# EXPLAIN SELECT name FROM students;
QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on students (cost=0.00..207783.16 rows=9967416 width=6)
rows=19999835 looks suspicious, since our table has only 10 million records. After running ANALYZE students, which updates the statistics, we get rows=9967416, which is much more reasonable.
Keeping the stats updated has a bigger impact than this: the Query Planner relies on them to reason about the best way to execute a query, so outdated statistics may lead to degraded performance.
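If you are wondering how fresh the statistics for a table are, the pg_stat_user_tables view keeps track of the last (auto)analyze runs:
postgres=# SELECT relname, last_analyze, last_autoanalyze FROM pg_stat_user_tables WHERE relname = 'students';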
As a rule of thumb, the more a table is written to, the more frequently you should update its statistics. AUTOVACUUM takes care of this and is active by default in most configurations, so if you use a managed solution like AWS Aurora, for example, you don’t need to think about it: it runs out of the box. Still, AUTOVACUUM may require tweaking on a per-table basis when certain tables are particularly write-intensive and others are not.
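As a sketch of such a tweak, and assuming students were write-heavy, you could lower the analyze threshold from the default 10% of changed rows to 2% (the exact value is something to tune for your workload):
postgres=# ALTER TABLE students SET (autovacuum_analyze_scale_factor = 0.02);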
The journey just began…
Query analysis and optimization is a vast topic. In this article we only scratched the surface and simplified things a lot, but you can still get a long way with what we have covered here:
- Be aware of the Table Access Method: is an index used?
- Keep an eye on high values in Rows Removed by Filter
- Be aware of caching when performing analysis
- Prefer EXPLAIN ANALYZE over EXPLAIN
- Make sure your statistics are updated
We know we left out some important and interesting stuff, like dealing with JOINs, multi-column indexes, LIMIT and OFFSET, more index types… but don’t worry, it is good material for another blog post, so stay tuned!