This short manual targets PostgreSQL users looking for a smooth introduction to dblink.

dblink is a PostgreSQL contrib module that can be found in the folder contrib/dblink. It is treated as an extension, meaning that the installation of this module is in two phases, explained in this post a bit later.
The goal of this module is to provide simple functionalities to connect and interact with remote database servers from a given PostgreSQL server to which your client application or driver is connected.

The first thing that you need to do is to install the sources of dblink. You can do it easily by installing all the modules of PostgreSQL at once from source code.
./configure --prefix $INSTALL_FOLDER
make install-world

$INSTALL_FOLDER is the folder where you wish to install PostgreSQL binaries.

Or if you wish only to install dblink (you might have already installed PostgreSQL ressources), do it directly from its source folder.
cd contrib/dblink
make install

The installed files for dblink can be found in $INSTALL_FOLDER/share/extensions.
$ cd $INSTALL_FOLDER/share/extension
$ ls dblink*
dblink--1.0.sql dblink--unpackaged--1.0.sql dblink.control

For the purpose of this demonstration, two PostgreSQL servers called server1 and server2 are created on the same local server with port values respectively of 5432 and 5433.

Some data will be inserted on server2, and the goal is to fetch this data to server1 using dblink.

Let’s first prepare server 2 and create some data on it.
$ psql -p 5433 postgres
psql (9.2beta1)
Type "help" for help.
postgres=# create table tab (a int, b varchar(3));
CREATE TABLE
postgres=# insert into tab values (1, 'aaa'), (2,'bbb'), (3,'ccc');
INSERT 0 3

So now that the remote server2 is ready to work, all the remaining tasks need to be done on server1.

The sources of dblink have been installed, but they are not yet active on server1. dblink is treated as an extension, which is a functionality that has been introduced since PostgreSQL 9.1. In order to activate a new extension module, here dblink, on a PostgreSQL server, the following commands are necessary.
$ psql postgres
psql (9.2beta1)
Type "help" for help.
postgres=# CREATE EXTENSION dblink;
CREATE EXTENSION
postgres=# \dx
List of installed extensions
Name | Version | Schema | Description
---------+---------+------------+--------------------------------------------------------------
dblink | 1.0 | public | connect to other PostgreSQL databases from within a database
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)

You can then confirm that the extension has been activated by using \dx from a psql client.

Now let’s fetch the data from server2 with dblink while connecting on server1. The function dblink can be invocated to fetch data as it uses as return type “SETOF record”. This implies that the function has to be called in FROM clause.
postgres=# select * from dblink('port=5433 dbname=postgres', 'select * from tab') as t1 (a int, b varchar(3));
a | b
---+-----
1 | aaa
2 | bbb
3 | ccc
(3 rows)

Do not forget to use aliases in the FROM clause to avoid errors of the following type:
postgres=# select * from dblink_exec('port=5433 dbname=postgres', 'select * from tab');
ERROR: statement returning results not allowed

It is also possible to do more fancy stuff with dblink functions.
dblink_connect allows you to create a permanent connection to a remote server. Such connections are defined by names you can choose. This avoids to have to create new connections to remote servers all the time at invocating of function dblink, allowing to gain more time by maintaining connections alive. In case you wish to use the connection created, simply invocate its name when using dblink functions.

Execution of other queries, like DDL or DML, can be done with function dblink_exec.
postgres=# select dblink_exec('port=5433 dbname=postgres', 'create table aa (a int, b int)');
dblink_exec
--------------
CREATE TABLE
(1 row)

dblink has a dozen of functions that allows to control remote database servers from a single connection point.
be sure to have a look at it!

If you came at this page, it means that you got interest in a cluster solution based on PostgreSQL.
Currently developed for version 0.9.7, Postgres-XC has been largely improved with the way cluster is being set.

Just lately, I committed this commit.
Support for dynamic pooler/session connection information cache reload
 
A new system function called pgxc_pool_reload has been added.
If called, this function reloads connection information to remote nodes
in a consistent way with the following process:
1) A lock is taken on pooler forbidding new connection requests
2) Database pools (user and database-dependant pools) are reloaded
depending on the node information located on catalog pgxc_node.
The following rules are followed depending on node connection
information modification:
- node whose node and port value is changed has its connections
dropped and this node pool is deleted from each database pool
- node deleted is deleted from each database pool
- node unchanged is kept as is. However, its index value is changed
depending on the new cluster configuration.
- node created is added to each database pool
3) Lock is released
4) Session that invocated pgxc_pool_reload signals all the other
server sessions to reconnect to pooler to allow each agent to update
with newest connection information and reload session information
related to remote node handles. This has as effect to abort current
transactions and to remove all the temporary and prepared objects
on session. Then a WARNING message is sent back to client to inform
about the cluster configuration modification.
5) Session that invocated pgxc_pool_reload reconnects to pooler by
itself and reloads its session information related to remote
node handles. No WARNING message is sent back to client to inform
about the session reload.
This operation is limited to local Coordinator and returns a boolean
depending on the success of the operation. If pooler data is consistent
with catalog information when pgxc_pool_reload is invocated, nothing is
done but a success message is returned.
 
This has the following simplifications for cluster settings:
- cluster_nodes.sql is deleted.
- a new mandatory option --nodename is used to specify the node name
of the node initialized. This allows to set up pgxc_node catalog
with the node itself. pgxc_node_name in postgresql.conf is also
set automatically.
- CREATE/ALTER/DROP node are launched on local Coordinator only, meaning
that when a cluster is set up, it is necessary to create node information
on each Coordinator and then upload this information to pooler and sessions
by invocaing pgxc_pool_reload.
 
This optimization avoids to have to restart a Coordinator when changing
cluster configuration and solves security problems related to cluster_nodes.sql
that could be edited with all types of SQL even if its first target was only NODE
DDL.

So what is behing this looooong commit text? Well, it is a feature that will simplify your life.
It is strongly related the feature called Node DDL that has been committed at the end of October. Just to recall, node DDL is a feature allowing to manage the cluster nodes with catalog tables such as you don’t have to bother about heavy settings in postgresql.conf. However, even if node DDL have been supported, it does not mean that dropping, creating or altering a node is visible to the connection pooling. You had to restart a node, increasing by that much the downtime of each Coordinators.

This commit, in one word, introduces this => pgxc_pool_reload. It is a new system function used to check whose details are described here used to reload all the server sessions and pooler connection information without having to restart a Coordinator. In other words, it simplifies the way to set up a cluster.

Now let’s enter in the main subject: the cluster setting, what can be done with the following steps:

  • Initialize the nodes with initdb
  • Create a global transaction manager and start it
  • Start up all the nodes
  • Connect to a Coordinator
  • Create all the nodes initialized with node DDL
  • Reload connection data with “select pgxc_node_reload();”

Here are a couple of details:

  • There is a new mandatory option in initdb called –nodename that is used to setup the name of the node being initialized. This is a Postgres-XC specific option. This option is used to define itself in pgxc_node catalog the node being initialized. It also sets automatically pgxc_node_name in postgresql.conf.
  • You can check the consistency of the information cached in pooler and catalogs by calling the system function pgxc_pool_check. It returns a boolean on operation success or failure.
  • The specifications of node DDL is located at those pages: CREATE NODE, DROP NODE and ALTER NODE
  • Invocating pgxc_pool_reload aborts the current transaction, and drops all the prepared and temporary objects in session. This is effective in all the session of the server
  • Node DDL run locally, so you need to launch the same node DDL on all Coordinators of the cluster. This allows more smoothness in case Coordinators view the same Datanode with different IPs.

It is also possible to manipulate cluster nodes even after initialization. It doesn’t matter how many times you change it as long as pgxc_pool_reload is used to update data cached in sessions and connection pool.

Here is also a bonus, a script that you can use to setup easily a cluster with a chosen number of Coordinators and Datanodes on a local machine. Port numbers are fixed, but it helps in trying Postgres-XC.
#!/bin/bash
#Otacoo.com
 
#Build cluster from scratch and run pg_regress
#1) Build the XC cluster: 1GTM with Coordinators (default 1) and Datanodes (default 2) defined
#2) Run pg_regress if wanted
 
#Take and check options
EXPECTED_ARGS=0
FLAG_REGRESS=0
NUM_COORDS=1
NUM_DATANODES=2
 
#Treat options
while getopts 'c:n:r' OPTION
do
  case $OPTION in
  c) #Number of Coordinators
    NUM_COORDS="$OPTARG"
    EXPECTED_ARGS=$(($EXPECTED_ARGS + 2))
    ;;
  n) #Number of Datanodes
    NUM_DATANODES="$OPTARG"
    EXPECTED_ARGS=$(($EXPECTED_ARGS + 2))
    ;;
  r) #Run regressions or not?
    FLAG_REGRESS=1
    EXPECTED_ARGS=$(($EXPECTED_ARGS + 1))
    ;;
  ?) echo "Usage: `basename $0` [-c num_coords] [-n num datanodes] [-r]\n"
    echo "Example: `basename $0` -c 4 -n 4 -r"
    exit 0
    ;;
  esac
done
 
#Check number of arguments
if [ $# -ne $EXPECTED_ARGS ]
then
  echo "Usage: `basename $0` [-c num_coords] [-n num datanodes] [-r]\n"
  echo "Example: `basename $0` -c 4 -n 4 -r"
  exit 1
fi
 
#Setup Default values
#GTM has a unique value
#Coordinator ports are mapped from 5432
#Datanode ports are mapped from 15432
#All the machines run on local host
COORD_PORT_START=5431
DN_PORT_START=15432
COORD_PORTS[1]=$COORD_PORT_START
DN_PORTS[1]=$DN_PORT_START
for i in $(seq 1 $NUM_COORDS)
do
  COORD_PORTS[$i]=$(($COORD_PORT_START + $i))
done
for i in $(seq 1 $NUM_DATANODES)
do
  DN_PORTS[$i]=$(($DN_PORT_START + $i))
done
GTM_PORT=7777
PSQL_FOLDER=$HOME/pgsql
 
#Finish calculating dependencies between folders
PSQL_SHARE=$PSQL_FOLDER/share
PSQL_BIN=$PSQL_FOLDER/bin
GTM_DATA=$PSQL_FOLDER/gtm
LOG_DATA=$PSQL_FOLDER/log
 
#Setup data folders
for i in $(seq 1 $NUM_COORDS)
do
  COORD_DATAS[$i]=$PSQL_FOLDER/coord$i
done
for i in $(seq 1 $NUM_DATANODES)
do
  DN_DATAS[$i]=$PSQL_FOLDER/datanode$i
done
 
#Kill all the processes that may remain
#in the most atrocious way possible as they meritated it
#OK this is not very clean...
echo "Take out Postgres-XC processes"
kill -9 `ps ux | grep "bin/gtm" | cut -d " " -f 2-3`
killall postgres gtm psql
sleep 2
 
#Check if data folders exist or not and create them
echo "Creating data folders"
for folder in $GTM_DATA $LOG_DATA ${COORD_DATAS[@]} ${DN_DATAS[@]}
do
  if [ ! -d $CODE_REPO_GIT ]
  then
    mkdir $folder
  fi
done
 
#Clean up all the data folders
echo "Clean up data folders"
for folder in $GTM_DATA $LOG_DATA ${COORD_DATAS[@]} ${DN_DATAS[@]}
do
  rm -r $folder/*
done
sleep 1
 
#OK, let's begin the show...
 
#make initialization
echo "Initializing PGXC nodes"
for i in $(seq 1 $NUM_DATANODES)
do
  $PSQL_BIN/initdb --locale=POSIX --nodename dn$i -D ${DN_DATAS[$i]}
done
for i in $(seq 1 $NUM_COORDS)
do
  $PSQL_BIN/initdb --locale=POSIX --nodename coord$i -D ${COORD_DATAS[$i]}
done
 
#copy all configuration files to remote machin
echo "Copy of configuration files"
#Create an empty GTM conf file and add host/port data
touch $GTM_DATA/gtm.conf
echo "nodename = 'one'" >> $GTM_DATA/gtm.conf
echo "listen_addresses = '*'" >> $GTM_DATA/gtm.conf
echo "port = 7777" >> $GTM_DATA/gtm.conf
echo "log_file = 'gtm.log'" >> $GTM_DATA/gtm.conf
 
#Node common settings
OPTIONS="logging_collector = on\n"\
"gtm_port = $GTM_PORT\n"\
"datestyle = 'postgres, mdy'\n"\
"timezone = 'PST8PDT'\n"\
"default_text_search_config = 'pg_catalog.english'\n"\
"log_statement = 'all'\n"\
"log_min_messages = debug1\n"\
"log_min_error_statement = debug1\n"\
"max_prepared_transactions = 20\n"
 
#Pooler options
POOLER_BASE_PORT=6667
#Coordinator settings
for i in $(seq 1 $NUM_COORDS)
do
  echo -e $OPTIONS >> ${COORD_DATAS[$i]}/postgresql.conf
  POOLER_NUM=$(($POOLER_BASE_PORT + $i))
  echo -e "pooler_port = $POOLER_NUM\n" >> ${COORD_DATAS[$i]}/postgresql.conf
done
#Datanode settings
for i in $(seq 1 $NUM_DATANODES)
do
  echo -e $OPTIONS >> ${DN_DATAS[$i]}/postgresql.conf
done
 
#launch gtm
echo "launch GTM"
$PSQL_BIN/gtm -x 10000 -D $GTM_DATA &
sleep 1
 
#launch datanodes
echo "launch Datanodes..."
for i in $(seq 1 $NUM_DATANODES)
do
  $PSQL_BIN/postgres -X -i -p ${DN_PORTS[$i]} -D ${DN_DATAS[$i]} > $LOG_DATA/datanode$i.log &
done
sleep 1
 
#launch coordinators
echo "launching Coordinators..."
for i in $(seq 1 $NUM_COORDS)
do
  $PSQL_BIN/postgres -C -i -p ${COORD_PORTS[$i]} -D ${COORD_DATAS[$i]} > $LOG_DATA/coord$i.log &
done
sleep 1
 
#Initialize Coordinators with cluster data
echo "initializing Coordinators..."
for i in $(seq 1 $NUM_COORDS)
do
  #Datanode connection info
  for j in $(seq 1 $NUM_DATANODES)
  do
    NODE_NAME=dn$j
    NODE_PORT=${DN_PORTS[$j]}
    $PSQL_BIN/psql -p ${COORD_PORTS[$i]} -c "CREATE NODE $NODE_NAME WITH (HOSTIP = 'localhost', NODE MASTER, NODEPORT = $NODE_PORT);" postgres
  done
  #Other Coordinator info
  for j in $(seq 1 $NUM_COORDS)
  do
    if [ "$i" -eq "$j" ]
    then
      continue
    fi
    NODE_NAME=coord$j
    NODE_PORT=${COORD_PORTS[$j]}
    $PSQL_BIN/psql -p ${COORD_PORTS[$i]} -c "CREATE NODE $NODE_NAME WITH (HOSTIP = 'localhost', COORDINATOR MASTER, NODEPORT = $NODE_PORT);" postgres
  done
  #reload data
  $PSQL_BIN/psql -p ${COORD_PORTS[$i]} -c "SELECT pgxc_pool_reload();" postgres
done
 
if [ "$FLAG_REGRESS" == 1 ]
then
  echo "running pg_regress"
  pgregress
fi
 
exit `echo $?`

You can also download it from this link.

19th and 20th of May have been days of the PostgreSQL conference.
During those 2 days, I saw some good presentations and stuff like that may help (perhaps) in increasing my own database knowledge related to PostgreSQL.
By the way, for sure, what I heard from this conference will help me not only for my current work but also it could give ideas for future design tasks.

As a lucky one, there were never two presentations happening at the same time even if conference was on 3 tracks. So let’s take chronologically each presentation I had the chance to see. In this post I don’t give my impression about everything seen, but just on the main matters that I think have a relative importance to facilitate your lecture.

On the 19th I saw first a presentation about Sharding for unlimited growth, given by Robert Treat. Sharding is a technique that could bring scale growth to large database systems (millions of operations, users) through an horizontal scaling (scaling in/out, increase the number of database nodes in terms of servers, and not in term of local resources). The idea behind it is to try to bring to a database the possibility to grow without losing its scalability and resolve SPOF (single point of failure) problems within a database system. For this purpose a couple of solutions were proposed based on data mapping or on the division of applications data into various databases located on multiple nodes (for example a website application may have its user data and forum data on separated nodes). All the ideas based their assumptions on making the application taking care of data mapping, so the database does not need to do anything but just deal with data. So I would say that Sharding is an up-layer of a database application that is focused on the optimization of applications running on top of database node(s).

Then there was an interesting presentation about the review of patches by Stephen Frost. The goal of this presentation was to teach the attendance about all the tools and formats used within and for PostgreSQL. Useful stuff such as when you want to send a patch, who you should contact. If you want to help in reviewing a patch, you can contact such or such person. This presentation told also about the formatting used in PostgreSQL: code refactoring, code quality, code duplication. Still, if a project has no such an organization, it for sure can become a mess quickly. So I personally keep a good impression about it.

The first day was full of surprises, there were another 2 presentations that caught my attention: something about Foreign data wrapper and another by Tom Lane, “How to hack the planner”.
Always by being focused on what I do for Postgres-XC, I am not very familiar with the functionalities introduced since 9.0. So it was a pleasure to find a presentation that introduced the foreign data wrapper functionality and some additional stuff a Japanese functionality is developing based on the feature of 9.0. A foreign data wrapper adds functionalities to enable a Postgres server to interact with a remote database or remote data files and show it in a nice way in your PostgreSQL instance. For instance you can show cvs files stored somewhere directly on a psql terminal. By the way, the presentation by Yotaro Nakayama shew a couple of additional features for foreign data wrappers: the capacity to interact with additional database systems and not only Postgres instances. His team has developed some extra features to be able to create foreign tables that can be seen from Oracle or MySQL instances. This consists more or less of taking into account the specificities used in each db softwares and to translate them in a Postgres-way. Fascinating. For the impression I keep, it looked that the development was at a fairly advanced stage but it wasn’t in the plans of Nakayama’s team to release publicly the work done :( .

By the end of the day came a presentation about PostgreSQL planner. For sure the presentation which was the most difficult to access to not only by the level of understanding which is necessary to understand what is dealt about but also by the quantity of information that has been discussed about. So in two terms this presentations can be qualified as: qualitative and quantitative. The planner of PostgreSQL is perhaps the hardest part of the code in terms of complexity, so making a presentation about it is even more complex. The presentation begun with some general explanation about Postgres’ parser/rewriter/planner/executor but after a couple of minutes quickly came the main dish, and the audience became aware of how planner is complicated not by its general way of working, but by all the cases that have to be taken into account in the most generalized way in their implementation to increase dependencies between each case. However, some cases such as the analysis of JOIN planning made the comprehension even easier. Some general explanation about the key structures also came at the good time to light up the basics of planning. The part that personally caught the most my attention was about costs planning of queries, and particularly the fact that sometimes a cost estimation could lead to cost higher than expected (case of LIMIT). However, to conclude on it, there are still areas of improvement of planner and Postgres is in need of people who could work on it.

The second day, one presentation in particular caught my attention. PostgreSQL 9.1 introduces SSI level, serializable snapshot isolation. One result is particular is amazing… Let’s tell more about that. In serializable transactions, you have to take care of cycles of transactions due to their read/write conflicts. For example, let’s imagine that you have a transaction performing a read on a tuple being written (by DML, UPDATE, DELETE, INSERT) by another transaction, you need to check if the transaction performing the write does not perform a read on a tuple being modified by a third transaction… This continuing until you know that there are no transactions trying to perform a read on something that has been modified by the first transaction. If you have a read/write conflict cycle, you need to abort one transaction to break the cycle and save all the other transactions from a deadlock condition. However, in order to check that, you have to go through all the transactions that could enter in the cycle, which is really resource consuming. By the way, the idea that caught my mind was that you do not need to check that all the cycle of transactions. You just need to check if the transaction you are on has not at the same time an in and an out read/write conflict. An in read/write conflict means that your transaction reads something that is being updated. An out read/write conflict means that what is updated by your transaction is not read. In case your transaction has at the same time and in and out read/write conflict, you need to abort something on the cycle it is on. However, if such a check is made on each transaction, doesn’t it increase the number of transaction being aborted, as there could be transactions in a semi cycle not closed, what would not need to be aborted, but would be aborted to satisfy the SSI check?

©2010-2013 Michael Paquier All content is ©Copyright of Otacoo.com 2010-2013. Privacy Policy - Terms of Use