This is a short memo on installing Hadoop and sqoop (a Hadoop interface to database backends) on Ubuntu Lucid.

First, add the following Debian repository from Cloudera, which hosts the Hadoop and sqoop packages.
This can be done from System -> Update manager -> Settings (bottom-left) -> Other sources (tab) -> add.
deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
On Lucid, <RELEASE> has to be replaced by lucid, giving:
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
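The repository can also be registered from a terminal instead of the Update manager. The sketch below only builds the source line for a given release codename. The function name and the target file /etc/apt/sources.list.d/cloudera.list are assumptions made for illustration, not something the Cloudera packages require.

```shell
# Build the apt source line for a given Ubuntu/Debian release codename.
cloudera_repo_line() {
    printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$1"
}

# Print the line for Lucid. To register it, pipe it to a sources file
# (the file name is an assumption) and refresh the package index:
#   cloudera_repo_line lucid | sudo tee /etc/apt/sources.list.d/cloudera.list
#   sudo apt-get update
cloudera_repo_line lucid
```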

A Java environment is necessary; you should have at least default-jdk 1.6.
Then install the software itself:
sudo apt-get install hadoop
sudo apt-get install sqoop
sudo apt-get install hadoop-hbase
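A quick way to confirm that the packages landed on the PATH is sketched below. check_tools is a hypothetical helper, and the command names passed to it in the example (in particular hbase for the hadoop-hbase package) are assumptions.

```shell
# Report which of the given commands are available on PATH.
# Returns 0 when all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Typical call after the installation (command names are assumptions):
#   check_tools java hadoop sqoop hbase
```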

When running sqoop against tables stored in PostgreSQL, you may hit the following error:
sqoop import --table test --connect jdbc:postgresql://localhost/postgres --verbose
...
ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: org.postgresql.Driver

This means that the PostgreSQL JDBC driver is not installed correctly.
You have to download it from here.
Then copy it into /usr/lib/sqoop/.
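The copy step can be wrapped in a small helper that refuses to run when the jar is missing. This is only a sketch: install_jdbc_driver and the jar file name in the example are hypothetical, and the default destination is the directory given above.

```shell
# Copy a JDBC driver jar into the directory where sqoop looks for it.
# Usage: install_jdbc_driver <jar> [dest_dir]
install_jdbc_driver() {
    jar="$1"
    dest="${2:-/usr/lib/sqoop}"
    if [ ! -f "$jar" ]; then
        echo "driver jar not found: $jar" >&2
        return 1
    fi
    cp "$jar" "$dest"/ && echo "installed $(basename "$jar") in $dest"
}

# Example (hypothetical file name; writing to /usr/lib/sqoop needs root,
# so run it from a root shell):
#   install_jdbc_driver ~/Downloads/postgresql-jdbc.jar
```

Once the jar is in place, rerunning the sqoop import command should no longer report the missing org.postgresql.Driver class.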

More details about the installation can be found here.

pg_regress is a PostgreSQL test module that lets you check whether a PostgreSQL server has been installed correctly.

Until now, the development of Postgres-XC has focused on scalability and performance, without always checking whether the implementation stuck to PostgreSQL standards.
However, for Postgres-XC to be considered a product, it has to pass those regression tests.
This is also the easiest way to check that it respects the SQL behavior guaranteed by PostgreSQL, which makes it friendlier to users.

So, why pass the regression tests?

  1. Prove that XC can be stable
  2. Make the implementation of new functionality more efficient. All the SQL test cases are already in the regression tests, so checking whether an implementation is correct becomes faster and safer. Passing the regression tests also makes the basics of Postgres-XC much stronger.

Well, are those regression tests sufficient?
No. They are a baseline that protects the basics of the cluster product when running SQL queries. As a cluster, Postgres-XC also needs tests for:

  1. High availability (node failure, security)
  2. Performance (write scalability)
  3. Regression tests specific to Postgres-XC (CREATE TABLE has been extended with DISTRIBUTE BY [REPLICATION | HASH(column) | ROUNDROBIN | MODULO(column)])

Let’s talk a little bit more about pg_regress.

All its files are located in src/test/regress.
The most common usage is an installation check, which basically consists of typing the following command in src/test/regress:
make installcheck
This command launches the regression tests against a PostgreSQL server listening on the default port 5432. It invokes pg_regress along these lines:
./pg_regress --inputdir=. --dlpath=. --multibyte=SQL_ASCII --psqldir=/home/ioltas/pgsql/bin --schedule=./serial_schedule
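When the server under test does not listen on 5432, pg_regress can be pointed at another port with its --port option. The helper below is a hedged sketch that only prints such a command line, reusing the options shown above. regress_command, the binary directory, and port 5433 are assumptions chosen for illustration.

```shell
# Print a pg_regress command line targeting a server on a given port.
# Usage: regress_command <psql_bin_dir> [port]
regress_command() {
    psqldir="$1"
    port="${2:-5432}"
    echo "./pg_regress --inputdir=. --dlpath=. --multibyte=SQL_ASCII" \
         "--psqldir=$psqldir --port=$port --schedule=./serial_schedule"
}

# Example: a server assumed to listen on port 5433.
regress_command /usr/local/pgsql/bin 5433
```

The printed command is meant to be run from src/test/regress, like the one above.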

Let’s have a look at what pg_regress is made of… You can find the following folders:

  • data, all the external data, used mainly for COPY
  • input, input data for SQL queries that depend on the environment where the regression tests are launched: COPY, TABLESPACE… Those files have the suffix .source, and the generated files are saved in the folder sql
  • output, output files whose content is modified depending on the environment where the regression tests are installed
  • expected, all the expected results. Those files have the suffix .out and share the same base name as the corresponding sql or source files
  • sql, all the files containing the SQL queries to run for the regression tests. They share the same base name as the corresponding expected result files (.out).
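The naming convention between sql and expected can be checked mechanically. The sketch below lists test cases that have a .sql file but no matching .out file. missing_expected is a hypothetical helper, and it only looks for the primary expected file, not for alternative result files.

```shell
# List test cases in sql/ that have no matching file in expected/.
# Usage: missing_expected [regress_dir]
missing_expected() {
    dir="${1:-.}"
    for sql in "$dir"/sql/*.sql; do
        [ -e "$sql" ] || continue              # sql/ holds no .sql file
        name=$(basename "$sql" .sql)           # strip folder and suffix
        [ -f "$dir/expected/$name.out" ] || echo "$name"
    done
}

# Example, from src/test/regress:
#   missing_expected .
```

Files generated from the .source templates only exist once pg_regress has run, so a fresh tree may report nothing for those cases.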

In Postgres-XC, the default table type is round robin, or hash if the first column can be distributed, so the order of the output data for SELECT queries cannot be controlled.
As the regression tests have to give the same results whatever the cluster configuration (they cannot depend on the number of Coordinators and Datanodes), SELECT queries are sometimes completed with ORDER BY.
For types where ORDER BY has no effect, like box or point, the table is created as a replicated one (using the keywords DISTRIBUTE BY REPLICATION at the end of CREATE TABLE).
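The two adjustments can be illustrated with a pair of statements. The helper below only prints them. Table and column names are hypothetical and are not taken from the actual regression files.

```shell
# Print two illustrative statements (table and column names are hypothetical).
xc_ordering_examples() {
    cat <<'EOF'
-- Distributed table: row order depends on the nodes, so sort explicitly.
CREATE TABLE t_int (id int, val text);
SELECT id, val FROM t_int ORDER BY id;

-- point cannot be sorted, so the table is created as a replicated one.
CREATE TABLE t_point (p point) DISTRIBUTE BY REPLICATION;
SELECT p FROM t_point;
EOF
}

# Example: feed the statements to a Coordinator (connection settings assumed).
#   xc_ordering_examples | psql postgres
```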

There are 121 test cases that have to be checked in pg_regress.
Most of them can be corrected by taking the current limitations of Postgres-XC into account (update, delete, case, guc…).
But some of them require more fundamental work (select_having, subselect, returning).
Others currently cause the cluster to stall (errors, constraints).
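To follow progress across that many test cases, the result log can be summarised. The sketch below assumes that each test line in the log ends with "... ok" or "... FAILED", as pg_regress prints for a serial schedule. regress_summary and the log file name in the example are assumptions.

```shell
# Summarise a pg_regress result log.
# Usage: regress_summary <logfile>
regress_summary() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence || true.
    ok=$(grep -c '\.\.\. ok' "$log" || true)
    failed=$(grep -c '\.\.\. FAILED' "$log" || true)
    echo "ok=$ok failed=$failed"
}

# Example, from src/test/regress after a run (file name assumed):
#   regress_summary regression.out
```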

This is a huge task. But once this is completed, Postgres-XC will have the base that will make it a great cluster product!

©2010-2013 Michael Paquier. All content is ©Copyright of Otacoo.com 2010-2013.