Among one of the many new features implemented in 9.3, pg_dump now offers the possibility to perform parallel dumps. This feature has been introduced by the commit below.
commit 9e257a181cc1dc5e19eb5d770ce09cc98f470f5f
Author: Andrew Dunstan
Date: Sun Mar 24 11:27:20 2013 -0400
 
Add parallel pg_dump option.
 
New infrastructure is added which creates a set number of workers
(threads on Windows, forked processes on Unix). Jobs are then
handed out to these workers by the master process as needed.
pg_restore is adjusted to use this new infrastructure in place of the
old setup which created a new worker for each step on the fly. Parallel
dumps acquire a snapshot clone in order to stay consistent, if
available.
 
The parallel option is selected by the -j / --jobs command line
parameter of pg_dump.
 
Joachim Wieland, lightly editorialized by Andrew Dunstan

This is an extremely nice improvement of pg_dump as it allows accelerating the speed a dump is taken, particularly for machines having multiple cores as the load can be shared among separate threads.

Note that this option only works with the format called directoyy that can be specified with option -Fd or –format=directory, which outputs the database dump as a directory-format archive. A new option -j/–jobs can also be used to define the number of jobs that will run in parallel when performing the dump.

When using parallel pg_dump, it is important to remember that n+1 connections are opened to the server, n being the number of jobs defined, with an extra master connection to control the shared locks taken on the objects dumped. So be sure that max_connections is set up to a number high enough in accordance to the number of jobs that are planned.

Thanks to synchronized snapshots shared among the backends managed by the jobs, the dump is taken consistently ensuring that all the jobs share the same data view. However, as synchronized snapshots are only available since PostgreSQL 9.2, you need to be sure that no external sessions are doing any DML or DDL when performing a dump on servers whose version is lower than 9.2. It is also necessary to specify the option –no-synchronized-snapshots in this case.

Now, using a server having 16 cores, let’s check how this feature performs. For this test, the schema of the database dumped is extremely simple: 16 tables with a constant size of approximately 200MB each (5000000 rows with a single int4 column), for a database having a total size of 3.2GB. Tests are conducted with 1, 2, 4, 8 and 16 jobs, so in the case of 16 jobs one table would be dumped by a unique job running on a single connection. This is of course an unrealistic schema for a production database, but here the point is to give an idea of how this feature can speed up a dump in an ideal case. 5 successive runs are done for each case.

Each test case has been run with the following command:
time pg_dump -Fd -f $DUMP_DIRECTORY -j $NUM_JOBS $DATABASE_NAME

Jobs – Runs(s) 1 2 3 4 5 Avg
1 56.714 54.385 54.242 59.300 57.705 56.47
2 27.023 26.207 27.211 26.112 25.206 26.35
4 12.641 12.797 12.484 12.604 12.486 12.60
8 7.641 7.013 7.913 7.081 6.702 6.27
16 5.086 5.045 5.079 5.216 5.054 5.10

As expected, dump time is halved each time job number is doubled with this ideal database schema. However, due to some I/O disk bottleneck, the time gain is not that important with a high number of jobs. For example, in those series of tests, there is not much difference between 8 and 16 jobs, so be always aware of the I/O your dump disk can manage at most and choose carefully the number of jobs used for dumps based on that.

pg_xlogdump is a new contrib module introduced in PostgreSQL 9.3 by this commit.
commit 639ed4e84b7493594860f56b78b25fd113e78fd7
Author: Alvaro Herrera
Date: Fri Feb 22 16:46:24 2013 -0300
 
Add pg_xlogdump contrib program
 
This program relies on rm_desc backend routines and the xlogreader
infrastructure to emit human-readable rendering of WAL records.
 
Author: Andres Freund, with many reworks by Alvaro
Reviewed (in a much earlier version) by Peter Eisentraut

Mainly useful for educational and debugging purposes, pg_xlogdump can be used to understand the internals of PostgreSQL by dumping the WAL (Write-ahead log, which is the basic mechanism used by the server for transaction replay during recovery) into a shape humanly readable.

Just a little bit more information about WAL… Its information is stored in files located in pg_xlog of $PGDATA whose name respect a format name subdivided into 3 sequences of 8 hexa digits defining:

  • Timeline ID
  • Block ID
  • Segment ID

The counter for blocks is incremented once segments are filled.

Postgres includes a couple of functions that can help you to determine in which file is located a given WAL record (pg_xlogfile_name) or what is the current WAL position (pg_current_xlog_location).
michael=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
4A/1799988
(1 row)
michael=# select pg_xlogfile_name('4A/1799988');
pg_xlogfile_name
--------------------------
000000010000004A00000001
(1 row)

Here the server is currently on timeline 1, with a Block ID of 4A and a segment ID of 1. The composition of Block ID + segment ID is a LSN or log sequence number, for example ’4A/1799988′. The XLOG files are located in the folder pg_xlog of $PGDATA. Each file has a size of 16MB, and server switches to a new file once this maximum size is reached or once no new file has been written since a time of archive_timeout, parameter of postgresql.conf.

After this digression, let’s have a look at what this utility can do.
The only mandatory option is to specify a start from where the dump will be taken, by either specifying a start LSN with –start or a start WAL file with commands similar to that.
pg_xlogdump --start 0/010EA4D0
pg_xlogdump 000000010000000000000001

If no path is specified to scan the segment files in a given directory, the default is to look if there is a folder called pg_xlog in current directory and get results from it.
In a dump you will get information like that for each WAL record:
rmgr: Heap len (rec/tot): 143/ 175, tx: 688, lsn: 0/02010BE8, prev 0/02010BA0, bkp: 0000, desc: insert: rel 1663/16384/11782; tid 41/54
Each field being in more details:

  • rmgr, the resource manager involved, can be filtered with option -r/–rmgr
  • len, about the length of the record
  • tx, ID of transaction involved with this record, can be filtered with option -x/–xid
  • lsn, log sequence number, including the previous and current lsn, can be filtered with –start and –end.
  • bkp, for the backup block
  • desc, for the description of the action record is doing, and some information related to the relation and page item with which interacts the action

WAL records are divided by resource managers, like database, tablespace, sequence, heap, etc. Based on that you can filter the dump depending on resource manager you want to see. To get a complete list of the resource managers available, simply do that:
pg_xlogdump --rmgr=list

For example, let’s create a sequence on server.
postgres=# create sequence aas;
CREATE SEQUENCE

Here is what you dump when filtering for the resource manager Sequence.
$ pg_xlogdump --start 0/02000000 -r Sequence
rmgr: Sequence len (rec/tot): 158/ 190, tx: 688, lsn: 0/02013C18, prev 0/02013B60, bkp: 0000, desc: log: rel 1663/16384/16390

You can then for example dump all the WAL records for this transaction by running a command like that:
pg_xlogdump --start 0/02000000 -x 688

Only a short introduction of what you can do this module is provided in this post, so for fore details feel free to have a look at the documentation about the available options and what you can do (and cannot do) with pg_xlogdump. Honestly I think it is a good tool to understand the internals of Postgres for newcomers.

Modifying the format name of dump file in a Linux system can be made with sysctl like this.
sysctl -w kernel.core_pattern=core.%e.%p
However, making this modification command-based will not make it effective at next reboot.

In order to make the modification permanent, you need to edit the file /etc/sysctl.conf. Here the core file has the executable name %e and the process ID %p.
kernel.core_pattern = core.%e.%p

Here is a list of the possible keywords usable:

  • %p, PID of dumped process
  • %u, (numeric) real UID of dumped process
  • %g, (numeric) real GID of dumped process
  • %s, number of signal causing dump
  • %t time of dump, expressed as seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC)
  • %h, hostname (same as nodename returned by uname(2))
  • %e, executable filename (without path prefix)
  • %c, core file size soft resource limit of crashing process (since Linux 2.6.24)
©2010-2013 Michael Paquier All content is ©Copyright of Otacoo.com 2010-2013. Privacy Policy - Terms of Use