Last week, I had an interesting discussion in the Postgres hackers mailing list about integrating pg_reorg features (possibility to reorganize a table without locks on it) directly into postgres core. Community strongly suggested that pg_reorg cannot be integrated as-is in the contribution modules of postgres core, and instead postgres should provide native ways to reorganize a table without taking heavy locks. This means that a table could be reindexed or clustered, and at the same time read and writes operations could still happen in parallel. What is particularly useful when an index is broken in a production database, as you could keep your table free of access for the other sessions running while the table is reorganized.

So, the following suggestions have been made:

  • Implementation of CLUSTER CONCURRENTLY
  • Implementation of REINDEX CONCURRENTLY
  • ALTER TABLE CONCURRENTLY
  • Extend autovacuum to perform REINDEX and CLUSTER in parallel automatically

ALTER TABLE, CLUSTER and REINDEX share a common thing: they need high-level locks to be performed. So there is a risk that the table being manipulated by one of those operations could not be accessible for a long time, especially of the table is huge. The locks taken would block read and/or write operations for the other sessions, what is not acceptable for production environment if a critical table is touched.

Working on ALTER TABLE might be a huge piece of work, CLUSTER and REINDEX look more accessible. So I took some week-end spare time while a typhoon was on Tokyo area to write some code and studied the case of REINDEX CONCURRENTLY. And I finished with a patch, yeah!

Here are more details about the feature proposed…
You can rebuild a table or an index concurrently with such commands:
REINDEX INDEX ind CONCURRENTLY;
REINDEX TABLE tab CONCURRENTLY;

REINDEX CONCURRENTLY has the following restrictions:

  • REINDEX [ DATABASE | SYSTEM ] cannot be run concurrently.
  • REINDEX CONCURRENTLY cannot run inside a transaction block.
  • Shared tables cannot be reindexed concurrently
  • indexes for exclusion constraints cannot be reindexed concurrently.
  • toast relations are reindexed non-concurrently when table reindex is done and that this table has toast relations

Here are more details about the algorithm used. Roughly, a secondary index is created in parallel of the first one, it is completed. Then the old and fresh indexes are switched. For a more complete description (the beginning of the process is similar to CREATE INDEX CONCURRENTLY):

  1. creation of a new index based on the same columns and restrictions as the index that is rebuilt (called here old index). This new index has as name $OLDINDEX_cct. So only a suffix _cct is added. It is marked as invalid and not ready
  2. Take session locks on old and new index(es), and the parent table to prevent unfortunate drops
  3. Commit and start a new transaction
  4. Wait until no running transactions could have the table open with the old list of indexes
  5. Build the new indexes. All the new indexes are marked as indisready
  6. Commit and start a new transaction
  7. Wait until no running transactions could have the table open with the old list of indexes
  8. Take a reference snapshot and validate the new indexes
  9. Wait for the old snapshots based on the reference snapshot
  10. mark the new indexes as indisvalid
  11. Commit and start a new transaction. At this point the old and new indexes are both valid
  12. Take a new reference snapshot and wait for the old snapshots to insure that old indexes are not corrupted,
  13. Mark the old indexes as invalid
  14. Swap new and old indexes, consisting here in switching their names.
  15. Old indexes are marked as invalid.
  16. Commit and start a new transaction
  17. Wait for transactions that might use the old indexes
  18. Old indexes are marked as not ready
  19. Commit and start a new transaction
  20. Drop the old indexes

This feature will be normally submitted for review to the PostgreSQL 9.3 commit fest. For the time being patch has been given to community.

Some technical details…

  • A new set of functions has been created in index.c to manage concurrent operations.
  • Code is relying a maximum on existing index creation, building and validation functions for maintainability.
  • Documentation, as well as regression tests have been added in the first version of the patch.
  • Concurrent operations are longer, require additional CPU, IO and memory but they are lock free. The parent relation and indexes cannot be dropped during process.
  • If an error occurs during process, the table will finish with invalid indexes (marked with suffix _cct in their names). It is the responsability of the user to drop them.
  • If you are looking for the patch, have a look here.

Please note that those specification notes are based on the first version of the patch proposed, and are subject to change depending on the community and reviewers’ feedback.

Edit 2012/10/14: A new version of the patch has been submitted with the following enhancements:

  • Support for toast relations to be reindexed concurrently as well as other indexes
  • Correction of drop behavior for constraint indexes
  • Correction of bugs
  • Support for exclusion constraints, looks to work as far as tested

The patch has been submitted to pgsql-hackers in this email.

Postgres-XC, read/write-scalable multi-master symmetric cluster based on PostgreSQL, version 1.0.1 has been released today.

This minor release is based on the latest PostgreSQL 9.1.5+alpha, meaning that all the patches in PostgreSQL 9.1 stable branch have been merged up to commit d10ddf4 (3rd of September 2012).

You can download the source tarball directly from here.
Like PostgreSQL, this tarball contains all the HTML and man documentation.

The documentation of 1.0, including release notes, is available here.

Around 20 bugs have been fixed since 1.0.0, with in particular those fixes:

  • Applications like pgadmin had problems to connect to Postgres-XC server.
  • Drop of sequence was not managed correctly when its database was dropped

You can find all the details in the release notes here.

The project can be followed on Source Forge:
And a couple of GIT repositories are used for development:

  • SourceForge
  • Github
  • Twitter: @PostgresXCBot, bot giving tweets about the commits in Postgres-XC GIT repository

The project members are currently working hard on the next version of Postgres-XC that will include those features:

  • triggers (being implemented)
  • Merge with PostgreSQL 9.2 code (already committed)
  • RETURNING, WHERE CURRENT OF (being implemented)
  • Insure consistency of utilities that cannot run inside transaction block (ex: CREATE DATABASE safely insured in multiple nodes, being implemented)
  • Change table distribution type with ALTER TABLE (already committed)
  • Support for cursors (already committed)
  • Stuff related to node addition and deletion
  • and other things…

The project is under the same license as PostgreSQL, and is managed under a single entity called “Postgres-XC Development Group”.

Have fun with this stable release.

For the last couple of days I have been working on merging the code of Postgres-XC from 9.1 to 9.2. The release 1.0 of XC has been based on 9.1, but it is time to move forward and steal the latest PostgreSQL amazing features :) . The plan was to plug in the code of XC up to the intersection of PostgreSQL master branch and 9.2 stable branch, pretty interesting for two things:

  • Possibility to create stable branches of Postgres-XC based on the 9.2 stable branch of Postgres
  • Preserve the code for merges with future PostgreSQL releases

So the code stays in sync with the latest stable branch of Postgres and the master branch.

However, after this small regression… It has been honestly an interesting experience, allowing me to have a look at the latest evolutions inside PostgreSQL itself between 9.1 and 9.2, and I picked up a couple of items that changed since last year. This list is of course not a complete one, as it results from all the merge conflicts I have seen between Postgres-XC code and PostgreSQL code, so I might not have seen everything on board. More a memo than anything else, perhaps this will help some hackers when upgrading their own code and features not in core.

  • Separation of PGPROC into PGXACT and PGPROC. The new structure PGXACT contains information related to vacuum and snapshot. This really improves the performance for multi-core.
  • Removal of RecentGlobalXmin from snapshot data
  • Addition of clause IF EXISTS in several DDL. You should avoid to use missing_ok in RangeVarGetRelid to false in order to get correct error messages
  • Check if a table column has a dependency with a rule and do not drop it when it is the case
  • Return messages presenting details of tuple data back to client for tuples that failed to satisfy a constraint
  • Removal of inner_plan and outer_plan from deparse_namespace
  • Creation of a new utility command for CREATE TABLE AS, called CreateTableAsStmt. Before CTAS was included in Query with intoClause, this clause has been removed and now both SELECT INTO and CREATE TABLE AS are transformed into this utility at query analysis. Honestly this is cleaner like this
  • Management of cached plans largely modified. In short, the way plans cached have been changed in plancache.c has been really refactored
  • Management of tuple sorting in tuplesort.c. SortSupport is now used instead of ScanKeys
  • Use of PlannerInfo instead of PlannerGlobal when setting references in a plan
  • Setup of parameters in a planner path (see pathnode.c)
  • Modifications for RangeVarGetRelid functions in namespace.h. The calling protocol is extended with no-wait and callback options
  • New functionalities in var.c to pull out Var clauses for a given varno. Some of the APIs were already implemented in XC, but better to use the stuff from core
  • DROP of TRIGGER and RULES is now groupped with DropStmt, before it was under the banner DropPropertyStmt

Of course this list is far from showing all the things done in Postgres for 1 year… This is just the top of the iceberg.

Postgres-XC, as a sharding cluster (write-scalable, multi-master based on PostgreSQL) has currently a huge limitation related to the way tables are distributed.
Just to recall, tables can be either replicated, distributed by round robin, hash or modulo. For hash and modulo the distribution can be done based on the values of one column. Distribution type is defined thanks to an extension of CREATE TABLE.
CREATE TABLE...
[ DISTRIBUTE BY { REPLICATION | ROUND ROBIN | { [HASH | MODULO ] ( column_name ) } } ]
[ TO ( GROUP groupname | NODE nodename [, ... ] ) ]

However once defined it cannot be changed while a cluster is running. There is still the method consisting in using a CREATE TABLE AS consisting in fetching all the data of the table into an intermediate one, then dropping the old table and remaming the intermediate table as the old one. This is enough for 1.0 but the table Oid is definitely lost.

One of the features I have been working these days is to provide to the applications a simple SQL interface that would allow to change a table distribution on the fly, meaning that all the data is transferred automatically between nodes with a single SQL.
This feature uses an extension to ALTER TABLE as follows:
ALTER TABLE
DISTRIBUTE BY { REPLICATION | ROUND ROBIN | { [HASH | MODULO ] ( column_name ) } }
TO { GROUP groupname | NODE ( nodename [, ... ] ) }
ADD NODE ( nodename [, ... ] )
DELETE NODE ( nodename [, ... ] )

This basically means that you can change the distribution type of a table and the subset of nodes where data is located. The node list where data is distributed can be reset, increased or reduced at will.

The redistribution funcionality is still pretty basic, but what is simply does is:

  1. fetch all the data of the table to be redistributed on Coordinator
  2. Truncate the table
  3. Update the catalogs to the new distribution type
  4. Redistribute the data cached on Coordinator

A tuple store is used to cache the data on Coordinator at phase 1, which can be customized with work_mem. A COPY protocol is used to exchange the data between nodes as fastly as possible. This functionality also includes some new stuff to materialize in a tuple slot the data received with COPY protocol (reverse operation also implemented), essential when a tuple has to be redirected to a given node based on a hash value. And it looks that such a materialization mechanism would be a milestone to a more complex mechanism for global constraints and triggers in XC.
This is still a basic implementation, and the following improvements are planned once the basic stuff is committed:

  • Save materialization if it is not necessary (new distribution set to round robin, replication)
  • Truncate the table on a portion of nodes if a replicated table has its subset of nodes reduced
  • COPY only necessary data for a replicated table to new nodes if its subset of nodes is increased
  • And a couple of other things

So how does it work? Let’s take an example with this simple cluster, 1 Coordinator and 3 Datanodes:
postgres=# select node_name, node_type from pgxc_node;
node_name | node_type
-----------+-----------
coord1 | C
dn1 | D
dn2 | D
dn3 | D
(4 rows)

A table aa is created as replicated with 10,000 rows on all the nodes.
postgres=# CREATE TABLE aa (a int);
CREATE TABLE
postgres=# INSERT INTO aa VALUES (generate_series(1,10000));
INSERT 0 10000
postgres=# EXECUTE DIRECT ON (dn1) 'SELECT count(*) FROM aa';
count
-------
10000
(1 row)
postgres=# EXECUTE DIRECT ON (dn2) 'SELECT count(*) FROM aa';
count
-------
10000
(1 row)
postgres=# EXECUTE DIRECT ON (dn3) 'SELECT count(*) FROM aa';
count
-------
10000
(1 row)

So here there are 10,000 tuples on each nodes, nothing fancy for a replicated table.

Let’s change it to a hash-based distribution…
postgres=# ALTER TABLE aa DISTRIBUTE BY HASH(a);
NOTICE: Copying data for relation "public.aa"
NOTICE: Truncating data for relation "public.aa"
NOTICE: Redistributing data for relation "public.aa"
ALTER TABLE
postgres=# EXECUTE DIRECT ON (dn1) 'SELECT count(*) FROM aa';
count
-------
3235
(1 row)
postgres=# EXECUTE DIRECT ON (dn2) 'SELECT count(*) FROM aa';
count
-------
3375
(1 row)
postgres=# EXECUTE DIRECT ON (dn3) 'SELECT count(*) FROM aa';
count
-------
3390
(1 row)

Now one third of the data is on each node.

What happens if the set of nodes is reduced? Let’s now remove the data on node dn2.
postgres=# ALTER TABLE aa DELETE NODE (dn2);
NOTICE: Copying data for relation "public.aa"
NOTICE: Truncating data for relation "public.aa"
NOTICE: Redistributing data for relation "public.aa"
ALTER TABLE
postgres=# EXECUTE DIRECT ON (dn1) 'SELECT count(*) FROM aa';
count
-------
5039
(1 row)
postgres=# EXECUTE DIRECT ON (dn2) 'SELECT count(*) FROM aa';
count
-------
0
(1 row)
postgres=# EXECUTE DIRECT ON (dn3) 'SELECT count(*) FROM aa';
count
-------
4961
(1 row)

The data is now hashed on nodes dn1 and dn3. There is no more data on dn2.

This implementation is still pretty basic, but opens a couple of possibilities for clustering applications, no?

Postgres-XC, read&write-scalable multi-master symmetric cluster based on PostgreSQL, version 1.0.0 is released.
This project is seen as an open-source alternative to costly products such as OracleRAC. Postgres-XC is based on the code of PostgreSQL, so it can naturally use all its technologies, which are enhaunced to have a shared-nothing multi-master PostgreSQL-based database cluster.

This first stable version is based on PostgreSQL 9.1.4. All the patches in PostgreSQL 9.1 stable branch have been merged up to commit 873d1c1 (1st of June 2012).
This includes the security fix related to pg_crypto dated of 30th of May.
You can download the source tarball directly from here
This tarball contains all the HTML and man documentation.

30 bug fixes have been made since release of beta2, with some notable enhancements:

  • Support for EXTENSION is fixed
  • Stabilization of the use of slave nodes in cluster
  • Fix of a bug related to read-only transactions, improving performance by 15%
  • Support of compilation for MacOSX

About the scalability of this release, Postgres-XC 1.0.0 scales to a factor of 3 when compared to a standalone server PostgreSQL 9.1.3 on 5 nodes using a benchmark called DBT-1.

Compared to version Postgres-XC 0.9.7, the following features have been added:

  • Fast query shipping (FQS), quick identification of expressions in a query that can be pushed down to remote nodes
  • SERIAL types
  • TABLESPACE
  • Utility to clean up 2PC transactions in cluster (pgxc_clean)
  • Utility for initialization of GTM (global transaction manager, utility called initgtm)
  • Relation-size functions and locking functions
  • Regression stabilization

The documentation of 1.0, including release notes, is available here.

The project can be followed on SourceForge.
And a couple of GIT repositories are used for development:

The core team is currently working in the addition of new features for the next major release including:

  • Merge with PostgreSQL 9.2
  • Data redistribution functionality, changing table distribution in cluster with a simple ALTER TABLE
  • New functionalities related to online node addition and deletion for a better user experience
  • Triggers
  • Planner improvements
  • Global constraints

The roadmap of the project is located here in section Roadmap.

The project is under the same license as PostgreSQL, now managed under a single entity called “Postgres-XC Development Group”.
In order to keep in touch with the project, whose development follows the same model as PostgreSQL, you can register to the following mailing lists:

  • postgres-xc-general@lists.sourceforge.net, for general questions. Registration can be done here
  • postgres-xc-developers@lists.sourceforge.net. hachers mailing list. Registration can be done here
©2010-2013 Michael Paquier All content is ©Copyright of Otacoo.com 2010-2013. Privacy Policy - Terms of Use