A couple of days ago I saw an email about MVCC catalog access on the PostgreSQL hackers mailing list. When evaluating if a given heap tuple of a table is valid or not, PostgreSQL uses a transaction snapshot which is more or less a list of active transactions IDs, but this snapshot can be used in many flavors to check if the heap tuple satisfies a given visibility condition or not. One of the tuple validation methods using snapshots is called SnapshotNow which is not MVCC-safe. Why? Because the validation using SnapshotNow includes the effects of all the committed transactions as of the instant when validation is done, while MVCC validation considers the transactions that were in progress or unstarted when snapshot was taken as uncommitted.
Validation with SnapshotNow is good for performance as it limits the interactions within backends when operations are done on system tables, and consistency is protected by the necessary lock levels taken when performing a DDL operation.
Trying to solve such problems is particularly interesting for operations like concurrent DDLs (and even for other things) that need to be performed with low level locks, meaning that such DDLs can run while other database operations like common write/read process on a table. The first set of features coming to my mind that could be implemented on top of MVCC catalogs are: REINDEX CONCURRENTLY, CLUSTER CONCURRENTLY and ALTER TABLE CONCURRENTLY. Note that those operations can now be performed out-of-the-box up to a certain level with extensions like pg_reorg or pg_repack, but even those operations need to take an exclusive lock on the objects reorganized when performing switch on the old and new objects.
MVCC catalog access is a necessary milestone in making such features bullet-proof from the consistency point of view, but people are concerned about the performance impact it could have. Using the set of tests created and written by Robert Haas on the mail previously linked, I did some measurements with a 32-core machine to try to have results with large snapshots by having open transactions with up to 2000 connections. The results of those measurements are posted on the mailing list, but they are in a raw shape, not really user-friendly. So I thought that having them in a better reshaped on tables with performance comparison would be nice.
In order to test the backup startup, the following test is done.
$ time for s in `seq 1 1000`; do rm -f
bin/pgsql/master/pg_internal.init && psql -c 'SELECT 2+2' >/dev/null;
done
The measurements have the following characteristics.
- Tests done while 250, 500, 1000 or 2000 remain open with a transaction ID acquired
- Test 1 is the creation and drop of 100,000 within 1 backend
- Test 2 is measure the time to open a backend and run a simple query on it with no relcache file
- Performance difference percentage is evaluated with a simple formula: abs(nonMVCC – VMCC)/avg(nonVMCC, MVCC)
Those measurements do not really take advantage of the multi-core architecture of the mechine used as they run on a single backend, but well… That’s a good machine…
After re-treating the data…
| Series of DROP/CREATE |
|
|
|
|
CREATE (ms) |
DROP (ms) |
| Connections |
MVCC |
Non-MVCC |
% perf |
MVCC |
Non-MVCC |
% perf |
| 250 |
26904.755 |
24554.849 |
9.133012372 |
32891.556 |
29755.146 |
10.01300914 |
| 500 |
29105.713 |
25872.886 |
11.76031059 |
33674.336 |
30434.019 |
10.10887582 |
| 1000 |
33281.246 |
28178.21 |
16.60618669 |
36618.166 |
31747.451 |
14.24901936 |
| 2000 |
39987.815 |
28708.095 |
32.83956789 |
43157.006 |
32510.057 |
28.14156802 |
|
| Backend startup time |
|
| Connections |
MVCC real CPU (s) |
Non-MVCC real CPU (s) |
% perf |
|
| 250 |
9.297 |
9.124 |
1.878291081 |
|
| 500 |
9.592 |
9.221 |
3.94408122 |
|
| 1000 |
9.746 |
9.302 |
4.661906762 |
|
| 2000 |
9.974 |
9.113 |
9.021847331 |
|
With a really high number of connections, the performance loss is close 30% for the DDL case and 10% for the connection case. Those test cases are really the worst scenarios possible when working with DDL, as you would hardly find someone dropping that many objects at the same time (btw why not?), and users would use a connection pooling facility before reaching such a number of active connections. For a *low* number of open connections, this performance loss is really low for the backend startup case. For the DDL case, a loss of 10% can be seen even with a low number of connections but just by thinking the possible functionality gain if MVCC catalogs are implemented this looks acceptable, no?