Michael

Creating RPM packages for your own needs can be quite an adventure the first time, especially with a code base as complex as Postgres considering all the submodules, options and extensions that the core code can come up with. But thanks to all the work done by the community, the building process is quite easy.

First, be sure to get the RPM specs from the official RPM repository.
git svn clone http://svn.pgrpms.org/repo/rpm/redhat

The spec repository has a structure depending on the version of postgresql, the Linux distribution on which the RPMs are based, and the package that needs to be built. There is also a common part called common/ where are defined all the rules used by all the Makefiles of each package, like some rules for calls as rpmbuild.

Depending on the version or the package you want to build, simply move into the dedicated folder:
cd $VERSION/$PACKAGE_NAME/$DISTRIBUTION
$VERSION should be something like 9.3, $PACKAGE_NAME named as postgresql, and $DISTRIBUTION be like EL-6 or F-17 depending on the Linux distribution where you want to build the packages.

Then run this command to create the packages.
make rpm
And you should basically be done… But actually not if your environment lacks some libraries.

You might run into dependency issues.
error: Failed build dependencies:
tcl-devel is needed by postgresql93-9.3beta1-4PGDG.rhel6.x86_64
e2fsprogs-devel is needed by postgresql93-9.3beta1-4PGDG.rhel6.x86_64
uuid-devel is needed by postgresql93-9.3beta1-4PGDG.rhel6.x86_64

The necessary libraries can be found with the flag BuildRequires in postgresql.spec, and depend on the build options wanted, like ldap or python activation for example. You might not need that many things for your own build though.

The RPM build is also dependent on some files obtained from some external sources, of course the source tarball, but also some pdf documentation always difficult to generate if a particular environment with jade is not set up. The files obtained from external sources can be found with the tags Source$NUM with an URL in the spec file, here is for example what happens in the case of the PDF documentation.
Source12: http://www.postgresql.org/files/documentation/pdf/%{majorversion}/%{oname}-%{majorversion}-A4.pdf

When writing this post, the 9.3 beta1 build failed only because of this PDF file not accessible:
+ cp -p $HOME/git/rpm/9.3/postgresql/EL-6/postgresql-9.3-A4.pdf .
cp: cannot stat `$HOME/git/rpm/9.3/postgresql/EL-6/postgresql-9.3-A4.pdf': No such file or directory

Note that you can as well hack a bit the spec to remove this file when generating your RPM, this needs only the removal of 2 lines.

But it was working correctly for 9.2. Note that if you want to even go deeper in the hacking of the spec file, as you might need to adjust the build process to your own environment, be sure to always use a command of that type that will generate the RPMs for you:
rpmbuild --define "_sourcedir $HOME/rpm/9.2/postgresql/EL-6" \
--define "_specdir $HOME/rpm/9.2/postgresql/EL-6" \
--define "_builddir $HOME/rpm/9.2/postgresql/EL-6" \
--define "_srcrpmdir $HOME/rpm/9.2/postgresql/EL-6" \
--define "_rpmdir $HOME/rpm/9.2/postgresql/EL-6" \
--define "dist .rhel6" -bb "postgresql-9.2.spec"

This is only a command defined in common/Makefile.global, but I think it is good to know that it is the central piece of the build process commanded by the spec file.

Once done, the following RPMs (here for 9.2) will be generated:

  • postgresql92-*.rpm, containing some client binaries (pg_dump, createdb…)
  • postgresql92-libs-*.rpm, containing some client libraries (expg, libpq, libpgtype)
  • postgresql92-server-*.rpm, containing server side binaries (initdb, postgres…)
  • postgresql92-docs-*.rpm, with the documentation
  • postgresql92-contrib-*.rpm, with all the contrib modules
  • postgresql92-devel-*.rpm, with all the development libs and headers
  • postgresql92-plperl-*.rpm, containing the extension plperl
  • postgresql92-plpython-*.rpm, containing the extension plpython
  • postgresql92-pltcl-*.rpm, containing extension pltcl
  • postgresql92-test-*.rpm, containing all the regression tests
  • postgresql92-debuginfo-*.rpm, heavy package with all the information for debugging purposes

And all you RPMs are here! Be sure to check that their content fits your needs with a command of the type “rpm -qpl”.

pg_top is a monitoring tool designed for PostgreSQL in a way similar to top. When using it you can get a grab at the process activity of your server with an output similar to that:
last pid: 425; load avg: 0.07, 0.03, 0.02; up 0+00:52:08 12:22:02
1 processes: 1 sleeping
CPU states: 0.5% user, 0.0% nice, 0.1% system, 99.8% idle, 0.0% iowait
Memory: 581M used, 15G free, 18M buffers, 211M cached
Swap:
 
 PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
 426 postgres 20 0 171M 5388K sleep 0:00 0.00% 0.00% postgres: postgres postgres [local] idle

The project has a dedicated GIT repository on postgresql.org as well as a mirror on github. There is also a mailing list on pgfoundry dedicated to the project.

It has many customization options:

  • Connection to a given PostgreSQL server with a certain IP, port, username, etc.
  • Possibility to monitor server from remote
  • An interactive mode with the possibility change the output script or even to kill a list of processes

pg_top supports many OSes (OSX, Linux, FreeBSD, hpux, aix). It is designed with a structure such as there is one set of generic APIs designed to get results for different purposes (process-related information mainly), the same function being written multiple times for each different OS but with the same spec. It is then decided which file to include in compilation at configure step by choosing among files named as machine/m_$OS_NAME.c.

These days I have been working on improving the user experience with pg_top, the first patch I sent consisting in improving a bit the documentation, the help message when an incorrect option is used and add support for long options. This was just based on my first impressions with this module, that looks to be powerful but difficult to apprehend for a newcomer.

However, the plan is not really to stop to that… Based on some optimizations added in the vPostgres version of pg_top, a couple of extra features have been proposed.

1. Database activity

The idea here is to query pg_stat_database and get back raw statistics based on transactions committed, rollbacked, tuples inserted, etc. Then those fields are analyzed and recompiled based on the time diff between two displays to get some TPS values. This is platform-independent as it relies entirely on pg_stat_database and libpq, so a generic function available to all the OS supported would be fine. The output could look like that:
Activity: 1 tps, 0 rollbs/s, 0 buffer r/s, 100 hit%, 42 row r/s, 0 row w/s

2. Disk space of PGDATA

Here, the plan is to get the disk space available for the partition where data folder of server is located, similarly to the output you can get with a plain df command. The result would look like that:
PGDATA disk: 48.6 GB total, 14.2 GB free
This would be OS-dependent, so an additional type of API in the set dedicated to the machine/* files would be necessary, and return an error if this is not supported for a certain OS.

3. Disk I/O

In order to do that, it is necessary to gather statistics from for example the parsing of a file like /proc/diskstats. The output could look like that:
Disk I/O: 0 reads/s, 0 KB/s, 0 writes/s, 32 KB/s
Once again this is OS-dependent, so it would need an extra generic API.

All of those additional outputs could be printed to screen using some dedicated options, making the default the layer the same as the current one.

And you, do you have other ideas about what could be added in pg_top?

Based on a 2nd version of the patch implementing MVCC catalog access, here is a second set of results after the first try done last week.

The same tests as last week are done, aka the backend startup time and the CREATE/DROP of multiple objects. The formula used to calculate performance comparison is the same as last week. The vanilla measurement use SnapshotNow, while MVCC uses the MVCC catalogs (Oh surprise!).


Series of DROP/CREATE
CREATE (ms) DROP (ms)
Connections MVCC vanilla % perf MVCC vanilla % perf
250 25685.094 24846.060 3.32 31492.377 30391.713 3.55
500 28557.882 24339.449 15.94 33673.266 30084.741 11.25
1000 32210.093 26007.960 21.30 36754.888 31019.918 16.92
2000 40374.754 26900.324 40.05 43442.528 30741.989 34.24
Backend startup time
Connections MVCC real CPU (s) vanilla real CPU (s) % perf
250 9.067 8.993 0.82
500 9.034 9.004 0.33
1000 9.303 9.072 2.15
2000 9.916 9.257 6.87

Except for the numbers of backend startup case for 500 connections which look to be victim of some noise, performance degradation is acceptable for the backend test, topping at 6% for 2000 connections. During the first battery of tests, this reached 10% of performance degradation.

For the CREATE/DROP test, performance drops up to 40% for 2000 connections which is a little bit worse than the 32% degradation noticed after the 1st battery of tests.

Still, all those results look to be acceptable even for a high number of connections.

A couple of days ago I saw an email about MVCC catalog access on the PostgreSQL hackers mailing list. When evaluating if a given heap tuple of a table is valid or not, PostgreSQL uses a transaction snapshot which is more or less a list of active transactions IDs, but this snapshot can be used in many flavors to check if the heap tuple satisfies a given visibility condition or not. One of the tuple validation methods using snapshots is called SnapshotNow which is not MVCC-safe. Why? Because the validation using SnapshotNow includes the effects of all the committed transactions as of the instant when validation is done, while MVCC validation considers the transactions that were in progress or unstarted when snapshot was taken as uncommitted.

Validation with SnapshotNow is good for performance as it limits the interactions within backends when operations are done on system tables, and consistency is protected by the necessary lock levels taken when performing a DDL operation.

Trying to solve such problems is particularly interesting for operations like concurrent DDLs (and even for other things) that need to be performed with low level locks, meaning that such DDLs can run while other database operations like common write/read process on a table. The first set of features coming to my mind that could be implemented on top of MVCC catalogs are: REINDEX CONCURRENTLY, CLUSTER CONCURRENTLY and ALTER TABLE CONCURRENTLY. Note that those operations can now be performed out-of-the-box up to a certain level with extensions like pg_reorg or pg_repack, but even those operations need to take an exclusive lock on the objects reorganized when performing switch on the old and new objects.

MVCC catalog access is a necessary milestone in making such features bullet-proof from the consistency point of view, but people are concerned about the performance impact it could have. Using the set of tests created and written by Robert Haas on the mail previously linked, I did some measurements with a 32-core machine to try to have results with large snapshots by having open transactions with up to 2000 connections. The results of those measurements are posted on the mailing list, but they are in a raw shape, not really user-friendly. So I thought that having them in a better reshaped on tables with performance comparison would be nice.

In order to test the backup startup, the following test is done.
$ time for s in `seq 1 1000`; do rm -f
bin/pgsql/master/pg_internal.init && psql -c 'SELECT 2+2' >/dev/null;
done

The measurements have the following characteristics.

  • Tests done while 250, 500, 1000 or 2000 remain open with a transaction ID acquired
  • Test 1 is the creation and drop of 100,000 within 1 backend
  • Test 2 is measure the time to open a backend and run a simple query on it with no relcache file
  • Performance difference percentage is evaluated with a simple formula: abs(nonMVCC – VMCC)/avg(nonVMCC, MVCC)

Those measurements do not really take advantage of the multi-core architecture of the mechine used as they run on a single backend, but well… That’s a good machine…

After re-treating the data…


Series of DROP/CREATE
CREATE (ms) DROP (ms)
Connections MVCC Non-MVCC % perf MVCC Non-MVCC % perf
250 26904.755 24554.849 9.133012372 32891.556 29755.146 10.01300914
500 29105.713 25872.886 11.76031059 33674.336 30434.019 10.10887582
1000 33281.246 28178.21 16.60618669 36618.166 31747.451 14.24901936
2000 39987.815 28708.095 32.83956789 43157.006 32510.057 28.14156802
Backend startup time
Connections MVCC real CPU (s) Non-MVCC real CPU (s) % perf
250 9.297 9.124 1.878291081
500 9.592 9.221 3.94408122
1000 9.746 9.302 4.661906762
2000 9.974 9.113 9.021847331

With a really high number of connections, the performance loss is close 30% for the DDL case and 10% for the connection case. Those test cases are really the worst scenarios possible when working with DDL, as you would hardly find someone dropping that many objects at the same time (btw why not?), and users would use a connection pooling facility before reaching such a number of active connections. For a *low* number of open connections, this performance loss is really low for the backend startup case. For the DDL case, a loss of 10% can be seen even with a low number of connections but just by thinking the possible functionality gain if MVCC catalogs are implemented this looks acceptable, no?

On top of bringing the latest PostgreSQL version at hand inside a virtual machine, vFabric Postgres 9.2.4 includes some new scripts to facilitate the integration of replication features of community version of PostgreSQL.

Here is a list of the new functionalities:

  • Separate virtual disk for archives with a default size of 2G mounted on /var/vmware/vpostgres/9.2-archive.
  • New default parameters in postgresql.conf to support replication by default (wal_level, max_wal_senders, archive_command, archive_mode, etc.).
  • New scripts for the management of replication between vFabric Postgres nodes: slave creation, node promotion, replication monitoring. Those scripts are located in folder /opt/vmware/vpostgres/current/share.

The replication configuration that can be achieved thanks to those scripts is really simple and convenient for most of real-world applications.

  • All the slaves use streaming replication to catch up with master in asynchronous mode
  • If a slave is disconnected for a too long time from the cluster and it cannot get necessary WAL files from archives of master, a new base backup is needed. This helps in maintaining the archives at a low size level (size by the way customizable by changing the disk size of archives in the virtual machine settings) by keeping in memory only the WAL files that are needed by slaves to catch up with a master
  • Recovery default is the latest timeline. This is to ensure that a slave trying to reconnect to a node freshly promoted will catch up with the latest changes that occurred in cluster
  • Slave/slave connections are possible with cascading replication

Here are more details about the scripts implemented. They are aimed to be used only with vFabric Postgres nodes whose version is higher than 9.2.4.

run_as_replica

Transform the existing vFabric Postgres server into a slave by getting it in sync with a given root node (either slave or master), or reconnect to an existing node in the cluster. When creating a new slave on a freshly-installed vFabric Postgres 9.2.4, you need to use options -b to take a new base backup and -W to specify password to connect to remote node, recommended user being “postgres”.

A single command based on this script is enough to set up a read-only slave becoming the replica of an existing root node in order to leverage read activity of an application with load balancing through multiple virtual machines of vFabric Postgres.
$ /opt/vmware/vpostgres/current/share/run_as_replica -h $IP_MASTER -b -W -U postgres
This script registers automatically SSH authorization key of slave node on root node for archive transfer between nodes so there is no need to worry in doing additional settings for pg_hba.conf, recovery.conf, postgresql.conf or SSH on root node side. Note that as default value max_wal_senders is set to 3, so you might want to increase this value on root node when connecting setting up more slave on a given root node.
This script can also be used to reconnect an existing slave to a new root node. In the case where slave node is not able to catch up with the root node because of missing WAL archive files or because the slave node was in advance compared to the root node in term of WAL replay, a new base backup is necessary.

A generated recovery.conf looks like that:
standby_mode= 'on'
primary_conninfo = 'host=''$IP_ADDRESS'' user=''postgres'' application_name=''localhost.localdom'' password=''$PASSWORD'''
recovery_target_timeline = 'latest'
restore_command = 'scp $IP_ADDRESS:/var/vmware/vpostgres/9.2-archive/%f %p'

$IP_ADDRESS is the IP used by slave node to connect to a root node (either slave or master). $PASSWORD is the password that has been specified when calling run_as_replica.

show_replication_status

This script can be used to monitor the WAL replay activity of slave nodes connected to the node where script is launched. The following query is used on the server.
with xl as (
select pg_last_xlog_receive_location() as log_receive_position,
pg_last_xlog_replay_location() as log_replay_position)
select xl.log_receive_position as log_receive_position,
xl.log_replay_position as log_replay_position,
pg_xlog_location_diff(xl.log_receive_position,
xl.log_replay_position) as replay_delta
from xl;

Compared to the default columns of pg_stat_replication (catalog table used to report activity of replicated nodes) a new column called replay_delta is used to monitor how a given slave is late compared to a master node when replaying WAL.
When monitoring replication activity, results similar to that are obtained:
postgres@localhost:~> /opt/vmware/vpostgres/current/share/show_replication_status
sync_priority | slave | sync_state | log_receive_position | log_replay_position | receive_delta | replay_delta
---------------+--------------------+------------+----------------------+---------------------+---------------+--------------
0 | localhost.localdom | async | 0/8000000 | 0/8000000 | 0 | 0
(1 row)

The lower the values of receive_delta and replay_delta are, the closer slave node is getting to the master in term of WAL replay (replication state).

promote_replica_to_primary

This script allows promoting a given slave to master using the default settings in recovery.conf.
$ /opt/vmware/vpostgres/current/share/promote_replica_to_primary
server promoting

Other slave nodes can reconnect to a new master using run_as_replica.

archive_command

This script is used as a command for archiving WAL files (set by archive_command in postgresql.conf, kicked by vFabric Postgres server each time a WAL file is ready to be archived, or archive_timeout is reached). This script includes some checks on the existence of WAL file to be archived and some checks on the size of archive disk to ensure that only required WAL files are maintained in archives based on the size of disk available for archives. WAL files are archived if root node is part of a cluster as primary or replica.

create_replication_user

This script can be used to create a new replication user on a master node. It generates a CREATE USER query using the password asked at prompt.

Having those scripts already in place inside the vFabric Postgres appliance and RPMs has several advantages, two of them being:

  • Having unique and consistent scripts used for the management of vFabric Postgres cluster without deploying any additional and home-made tools
  • Minimizing node replication/promotion operations, only a basic understanding of internal mechanics of database server is enough: failover, reconnection and slave/master structure

As a DBA and an operator, such things make the management of clusters and applications really easier in an ESX cluster of hundreds of vFabric Postgres servers for example.

©2010-2013 Michael Paquier All content is ©Copyright of Otacoo.com 2010-2013. Privacy Policy - Terms of Use