Ten New Features that Make PostgreSQL 8.3 a Must-Have

PostgreSQL 8.3 offers new and nifty features that could turn the head of even the MySQL faithful, making it a must-have upgrade.

What do you expect from a software upgrade? Bugfixes? Dozens of new, useful features, all clearly documented? Double or triple performance increases?

The latest version of PostgreSQL provides all of those things, plus an active, growing community of developers, administrators and end-users. The growth of the community is reflected in the breadth and depth of changes in version 8.3, likely to be released January 2008.

This article describes 10 new features system administrators and application developers can immediately use. Each feature includes examples for implementation, and descriptions are meant to be introductions to both PostgreSQL and the database-specific technology implemented. Examples use data from the recent Perl Survey (www.perlsurvey.org).

You can find more information about PostgreSQL at www.postgresql.org.

Statistics Collection

Because of many improvements in the performance of the collector process, statistics collection is now enabled by default. Database statistics are tracked in the pg_stat_* system views, which are described in Chapter 26 of the PostgreSQL documentation. (http://tinyurl.com/yrxbhr)

One way to identify performance problems is to look for long-running queries. A new column, xact_start, was added to pg_stat_activity to help determine how long a transaction associated with a particular query has been running.

Here is an example query that looks at xact_start, query_start and backend_start:

SELECT datname, procpid, usename, current_query, xact_start, query_start, backend_start
FROM pg_stat_activity;

The return values are below, separated by ‘|’, with the headers stripped away for clarity:

test |    5552 | selena | SELECT datname,procpid,usename, \
current_query, xact_start, query_start, backend_start \
FROM pg_stat_activity; | 2007-12-31 08:01:08.043345-08 | \
2007-12-31 08:01:19.514915-08 | 2007-12-28 16:35:43.270908-08

There is a slight time difference between xact_start and query_start. In this case I used BEGIN to start a transaction before I issued the SELECT query. You can also see that the client connection the query was issued from has been running for a few days.

Another source of performance problems is sequential scans. Sequential scans are database table scans that walk through a table one row at a time searching for matching data. These can be computationally expensive when compared to a query that uses an index.

You can look at pg_stat_user_tables to find information about sequential scans occurring in your system. An unexpected increase in sequential scans typically means one of two things to an administrator: a query needs to be rewritten, or a new index on the database table is needed.

SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;

          relname             | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
------------------------------+----------+--------------+----------+---------------
response                      |       26 |       133475 |       11 |            48
industries_response           |        1 |         7053 |        0 |             0
perl_versions_response        |        0 |            0 |        0 |             0
platforms_response            |        0 |            0 |        0 |             0
country                       |        0 |            0 |        0 |             0
programming_languages_response|        0 |            0 |        0 |             0

If this query on pg_stat_user_tables is run repeatedly over time, you can see when the number of sequential scans increases or decreases as queries and usage patterns change. In combination with query logging, you could then pinpoint the queries causing problems.
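
One way to capture that query logging is with log_min_duration_statement in postgresql.conf, which logs only statements that run longer than a given threshold. A sketch of such an entry (the 250ms threshold is an arbitrary choice of mine):

log_min_duration_statement = 250   # log any statement running longer than 250ms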

Per-statement statistics can be reported to the server log with log_parser_stats, log_planner_stats, log_executor_stats and the catch-all: log_statement_stats. These are all disabled by default. Such log entries may provide a rough indicator of problem areas with no programming or query writing required.

Two third-party monitoring tools released in 2007 are also available: a Simple Network Management Protocol (SNMP) agent and a command-line tool called ptop.

Statistics collection overhead has been estimated to be about 1%. You can turn statistics collection off in the server configuration file. Be aware that turning statistics collection off can have side-effects, and is not recommended for typical use in version 8.3.

Server log output in CSV

In PostgreSQL 8.3, server logs can be saved in Comma-Separated Values (CSV) format. With this feature, you can load the logfile directly into a spreadsheet application or a database table without any special text processing. Server configuration for this feature is done in the postgresql.conf file, located in the $PGDATA directory for your database instance.

1 logging_collector		= on
2 log_destination		= 'csvlog'
3 log_min_error_statement	= error
4 log_error_verbosity		= verbose
5 log_directory			= '/var/log/postgresql'
6 log_filename			= 'postgresql-%a'
7 log_rotation_age		= 1440
8 log_truncate_on_rotation	= on
9 log_rotation_size		= 0   # disabled

In lines 1 and 2, set logging_collector to on (this parameter requires a database restart) and log_destination to csvlog. By default, system logging is turned off.

In lines 3 and 4, set log_min_error_statement to error and log_error_verbosity to verbose; this captures the statements and detail expected by a table definition you'll see later in this article.

In line 5, set log_directory, which controls where log files are written, to an appropriate directory. I've chosen /var/log/postgresql, but this is flexible.

Looking at lines 6, 7, and 8, log_filename and log_rotation_age (in minutes) are set to values that produce predictable filenames, making it easy to automate loading the logs into a database. Set log_truncate_on_rotation to on to make sure old data doesn't get mixed up with new data in the same file.

Setting log_filename to postgresql-%a will create a single logfile per day, named postgresql-Mon.csv, postgresql-Tue.csv, and so on. After seven days, the first logfile in the series will be overwritten rather than appended to, because we set log_truncate_on_rotation to on.

In line 9 I disable size-based log rotation by setting log_rotation_size to 0 (the default is 10MB). By disabling this type of rotation, your filenames will remain predictable and make automating database loads easier.

To generate some log data for testing, try setting log_duration to on, which will log the duration of every query executed on the database. You could also log different types of queries that are executed (log_statement), or connections to your database (log_connections).
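
For example, a minimal postgresql.conf snippet for generating test log data might look like this (log_statement = 'all' is verbose, so treat it as a testing-only setting):

log_duration = on        # log the duration of every completed statement
log_statement = 'all'    # log every statement: none, ddl, mod or all
log_connections = on     # log each successful client connection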

The PostgreSQL documentation offers this example table and command for loading CSV logfiles:

CREATE TABLE postgres_log
(
  log_time timestamp(3) with time zone,
  user_name text,
  database_name text,
  process_id integer,
  connection_from text,
  session_id text,
  session_line_num bigint,
  command_tag text,
  session_start_time timestamp with time zone,
  virtual_transaction_id text,
  transaction_id bigint,
  error_severity text,
  sql_state_code text,
  message text,
  detail text,
  hint text,
  internal_query text,
  internal_query_pos integer,
  context text,
  query text,
  query_pos integer,
  location text,
  PRIMARY KEY (session_id, session_line_num)
);

COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;

Combined with the configuration above, you can call the COPY command in a shell script and use cron to load your logfiles daily.

NULLS FIRST or LAST

Sorting NULLs can be a major issue for portability between databases. MySQL (like Microsoft SQL Server) sorts NULLs as less than other values, while PostgreSQL (like Oracle and DB2) sorts NULLs as greater than other values. The SQL standard doesn’t specify which ordering is correct, but that doesn’t make the difference any less annoying.

As of 8.3, PostgreSQL allows you to choose where NULLs end up with an addition to the ORDER BY clause. Let’s look at an example:

SELECT * FROM response ORDER BY income DESC NULLS LAST;

With NULLS LAST, all the responses where a person chose not to state their income will appear after responses where an income was stated.
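
NULLS FIRST works the same way, placing the unstated incomes at the top. As of 8.3, the same ordering options can also be used in index definitions, so an index can match your query's sort order exactly. A small sketch (the index name is my own):

CREATE INDEX response_income_idx
ON response (income DESC NULLS LAST);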

Full-Text Search Integration

Full-text search (FTS) is invaluable whether you’re storing documentation, hosting a Web site or managing an enterprise resource planning suite. Tsearch2, the search package that also powers OpenFTS, was available as a contrib module in PostgreSQL releases prior to 8.3. It required a little extra work to get it compiled and running, but now it’s fully integrated and requires no special database configuration to use.

Full-text search in PostgreSQL supports a data type called tsvector. This data type is a sorted list of lexemes gathered from a source document. A lexeme roughly corresponds to the root of a word without its prefixes or suffixes. For example, fluent, fluentness, fluently and fluency are all forms of the lexeme “fluent.”

Using lexemes instead of the actual words in search data types makes the search functions more flexible and useful than the more familiar SQL search operators such as ~, ~*, LIKE and ILIKE. Those SQL operators treat search strings literally, and don’t have the linguistic smarts to know that ‘fluency’ and ‘fluently’ have the same root word.

To use full-text search, you must first convert your document to the tsvector data type with the to_tsvector() function. Then, to search the document, you create a search query of type tsquery with the to_tsquery() function. This example searches for the word ‘fluent’ and uses @@ (the MATCH operator):

CREATE TABLE response (
	comments TEXT
);

SELECT comments FROM response WHERE to_tsvector('english', comments) @@ to_tsquery('english', 'fluent');

You’ll notice the first argument in both functions is ‘english’. This specifies which text search configuration PostgreSQL will use to parse and search the document. PostgreSQL supports multiple configurations and dictionaries; the default configuration is set at run time with the default_text_search_config parameter in postgresql.conf.

You can even create your own, special-purpose dictionaries for doing things like normalizing similar URLs, or searching in domain-specific languages. If you are looking for non-English dictionaries, they are available for eight languages at http://tinyurl.com/6lljm. The site also includes several detailed tutorials, including one for creating your own, specialized dictionaries.
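
As a sketch of what defining your own dictionary looks like, here is roughly how an Ispell-based dictionary is registered, assuming english.dict, english.affix and english.stop files have been installed in PostgreSQL’s tsearch_data directory:

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,    -- loads english.dict
    AffFile = english,     -- loads english.affix
    StopWords = english    -- loads english.stop
);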

Simple queries that use tsvector conversions on text work fine for small amounts of data. As the number and size of documents grow, you will need to create indexes to speed up searches. Creating an index is straightforward:

CREATE INDEX comments_idx ON response USING gin(to_tsvector('english', comments));

Generalized Inverted Index (GIN) is a special index type well suited to composite values such as tsvector. You can find more information about this index type in the PostgreSQL documentation.

The comments_idx index we just created will be used on any queries that include to_tsvector('english', comments).

Creating a separate column to hold the tsvector is another way to speed up full-text search. If you do this, you need to include triggers to automatically update the tsvector column when the related text is updated. You can write your own trigger functions, but PostgreSQL conveniently provides two: tsvector_update_trigger and tsvector_update_trigger_column.

Assuming your table looks like this:

CREATE TABLE response (
	comments TEXT,
	tsv_comments tsvector
);

You would define your trigger this way:

CREATE TRIGGER tsvector_update BEFORE INSERT or UPDATE
ON response FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv_comments, 'english', comments);
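
With the trigger in place, tsv_comments is maintained automatically. A quick test might look like this (the sample comment text is mine):

INSERT INTO response (comments) VALUES ('I write Perl fluently.');

SELECT comments FROM response
WHERE tsv_comments @@ to_tsquery('english', 'fluent');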

Another very useful function is ts_headline, which allows you to show fragments of the returned document with the search terms highlighted. The example below will wrap all instances of fluency with HTML bold tags:

SELECT ts_headline(comments, to_tsquery('english', 'fluency')) \
FROM (SELECT comments FROM response
WHERE to_tsvector('english',comments)
@@ to_tsquery('english', 'fluency'))
AS FOO;

There are many more functions that allow you to do things like rewrite query search terms (ts_rewrite), gather document statistics (ts_stat), or examine exactly which part of a query will be used to search for lexemes (querytree).

There were some minor syntax changes between the contrib module and integration into version 8.3, so if you were already using tsearch2, have a look in the contrib directory for help with migrating your existing installation.

XML data type

The XML data type included in 8.3 supports the ANSI SQL:2003 standard and requires a special compile-time option for use. It was initially developed through a Google Summer of Code project in 2006, and is now fully supported with an array of helper functions and supporting documentation.

To create a value of type xml, you would use the function xmlparse():

SELECT xmlparse( DOCUMENT '<?xml version="1.0" encoding="UTF-8"?> \
<plist version="1.0">...</plist>');

As with other PostgreSQL data types, many functions are available to work with the XML data type. These include xmlforest() and xmlelement(). For exporting XML data from a database table, you might use xmlforest():

SELECT xmlforest (
"confirmation_key" as "Key",
"sex" as "Gender",
"country_of_residence" as "Country")
FROM response
WHERE notify = 1;

This will produce XML formatting for each row returned from the query:

 <Key>gelalsndicvqvpkx</Key><Gender>male</Gender><Country>uk</Country>
 <Key>pmxrtuggtludzxtt</Key><Gender>male</Gender><Country>au</Country>
 <Key>rwqlcrigstztdadt</Key><Gender>female</Gender><Country>nl</Country>
 <Key>ikkqjadxqkmcvxtk</Key><Gender>male</Gender><Country>us</Country>
 <Key>ebhhsgzfvqdubmsm</Key><Gender>female</Gender><Country>uk</Country>
 <Key>dkletumfbheustia</Key><Gender>male</Gender><Country>it</Country>

If you wanted to wrap each of these individual lines in a higher level XML element, you can use xmlelement():

SELECT xmlelement (
 name respondent, xmlforest (
"confirmation_key" as "Key",
"sex" as "Gender",
"country_of_residence" as "Country" ))
FROM response
WHERE notify = 1;

This will wrap each element in the forest in the element ‘respondent’:

<respondent><Key>gelalsndicvqvpkx</Key><Gender>male</Gender><Country>uk</Country></respondent>
<respondent><Key>pmxrtuggtludzxtt</Key><Gender>male</Gender><Country>au</Country></respondent>
<respondent><Key>rwqlcrigstztdadt</Key><Gender>female</Gender><Country>nl</Country></respondent>
<respondent><Key>ikkqjadxqkmcvxtk</Key><Gender>male</Gender><Country>us</Country></respondent>
<respondent><Key>ebhhsgzfvqdubmsm</Key><Gender>female</Gender><Country>uk</Country></respondent>
<respondent><Key>dkletumfbheustia</Key><Gender>male</Gender><Country>it</Country></respondent>

PostgreSQL 8.3 also has functions for mapping tables, queries, and cursors to XML. Each has the option of returning XML documents or fragments. Many helper functions are also available for generating the matching schema for the data generated by any of the mapping functions. The XML functions section of the PostgreSQL 8.3 documentation describes each of these utilities in detail.
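
For example, table_to_xml() maps an entire table to an XML document. This sketch uses the country table seen in the statistics output earlier; the remaining arguments control null handling, “forest” output and a target namespace:

SELECT table_to_xml('country', true, false, '');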

Use of this data type requires PostgreSQL to be built with configure --with-libxml.

ENUM data type

An enumerated data type is an ordered and sortable list of items. You can compare items contained in this data type using standard operators like =, <, >.

If you search the Web for “enum” and “postgresql” you will find many useful tutorials for getting around the fact that PostgreSQL has, until now, not supported them. Suggestions included creating a DOMAIN with a constraint, creating a lookup table and using foreign keys, or writing a bit of code to create your very own ENUM type. You don’t have to do that anymore! Here’s how you use ENUM:

CREATE TYPE headset_diameter
AS ENUM ( '1', '1-1/8', '1-1/4', '1-1/2');

My headset_diameter ENUM is a list of diameters in inches. This is useful for producing sorted lists of items based on the diameter, without storing a decimal. Sort order is determined by the order the values are listed when the ENUM is created.
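
Here’s a small sketch of the sort order in action (the headset table and its sample rows are my own):

CREATE TABLE headset (
    model text,
    diameter headset_diameter
);

INSERT INTO headset VALUES
    ('Quill', '1'),
    ('Oversize', '1-1/2'),
    ('Threadless', '1-1/8');

SELECT model, diameter FROM headset
ORDER BY diameter;  -- returns 1, 1-1/8, 1-1/2: declaration order, not text order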

New type-casting behavior

PostgreSQL is known for its strict adherence to standards. A common practice is to convert, or cast, values from one data type to another to make SQL queries easier to write. In the past, if a non-character value was supplied to an operator or function that could use TEXT, the value would be automatically converted to a string without reporting an error. In version 8.3, values from non-character types are no longer automatically cast. Instead, an error will be thrown.

To help ease the transition, a generalized conversion was created to manage the casts to and from any string type (TEXT, CHAR, VARCHAR) for every other supported data type.

For example, a function like substr() expects a value of type TEXT as its first argument, and a DATE value is no longer automatically cast (http://tinyurl.com/39qmne):

test=# SELECT substr(current_date, 1, 4);
ERROR:  function substr(date, integer, integer) does not exist
LINE 1: SELECT substr(current_date, 1, 4);
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Now, you must explicitly cast the value if you would like to use it this way:

test=# SELECT substr(current_date::TEXT, 1, 4);
 substr
--------
 2007
(1 row)

You can also use CAST() to convert types explicitly.
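
The CAST() spelling is standard SQL and produces the same result:

SELECT substr(CAST(current_date AS TEXT), 1, 4);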

What’s new in PL/PgSQL

PostgreSQL supports many procedural languages. PL/PgSQL, PL/Tcl, PL/Perl, and PL/Python are all available in the standard distribution. Support for Java, PHP, Ruby, R, Scheme, and Unix shells is also available through third parties.

PL/PgSQL got some special attention in 8.3 with RETURN QUERY, which is syntactic sugar for FOR...LOOP statements like the following:

FOR foo IN SELECT country, years_programming FROM response LOOP
	RETURN NEXT foo;
END LOOP;

In PostgreSQL 8.3, this loop can be replaced with the following:

RETURN QUERY SELECT country, years_programming FROM response;
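
In context, a complete set-returning function might look like the following sketch. The function name and the OUT parameter types are my assumptions about the response table; the table alias avoids a clash between the OUT parameters and the column names:

CREATE OR REPLACE FUNCTION country_experience(
    OUT country text,
    OUT years_programming integer)
RETURNS SETOF record AS $$
BEGIN
    -- RETURN QUERY appends the query's entire result set to the function's output
    RETURN QUERY SELECT r.country, r.years_programming FROM response r;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM country_experience();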

FOR...LOOP statements and RETURN QUERY in PL/PgSQL all use cursors to conserve memory. Cursors are a way of accessing data returned from a query a single row at a time.

Added MOVE to PL/PgSQL

Another useful PL/PgSQL feature is the addition of MOVE for use with cursors. In the past, you could use FETCH on a scrollable cursor to move forward and backward through the query data. This, however, would also return the data from each row. MOVE works exactly like FETCH, but does not return the data.

DECLARE programmers CURSOR FOR SELECT key, country, \
years_programming, salary FROM response ORDER BY country;

MOVE RELATIVE 4 FROM programmers;

MOVE allows arbitrary movement through a query’s results without the overhead of returning the data. In long loops of procedural code, this can be a performance aid.
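
Here is a sketch of MOVE inside a PL/PgSQL function (the function name is mine) that skips the first four rows and returns only the fifth:

CREATE OR REPLACE FUNCTION fifth_programmer()
RETURNS SETOF response AS $$
DECLARE
    programmers CURSOR FOR
        SELECT * FROM response ORDER BY country;
    r response%ROWTYPE;
BEGIN
    OPEN programmers;
    MOVE RELATIVE 4 FROM programmers;  -- skip four rows without fetching their data
    FETCH programmers INTO r;          -- fetch only the row we want
    RETURN NEXT r;
    CLOSE programmers;
END;
$$ LANGUAGE plpgsql;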

Updatable Cursors

Another new feature in both the core database and PL/PgSQL is updatable cursors. In the past, PostgreSQL cursors were read-only. If a cursor is “simple,” that is, it does not contain a JOIN or a GROUP BY clause, it is possible to update or delete the data in the row the cursor is pointing to.

When creating a cursor you’d like to update, you may want to add FOR UPDATE to the SELECT statement to prevent rows from being deleted or altered while you are examining them.

DECLARE programmers CURSOR FOR SELECT key, country, years_programming, \
salary FROM response FOR UPDATE;

To update the row the programmers cursor is currently positioned on:

UPDATE response SET salary = salary + 5000 WHERE CURRENT OF programmers;

When you use FOR UPDATE in a cursor declaration, the data returned to the cursor is issued a lock. This prevents the data you’re about to apply an update to from being altered while you’re in the middle of changing it.
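
Putting the pieces together in a psql session might look like this sketch (DECLARE must be run inside a transaction):

BEGIN;

DECLARE programmers CURSOR FOR
    SELECT key, country, years_programming, salary
    FROM response FOR UPDATE;

FETCH NEXT FROM programmers;  -- position the cursor on the first row

UPDATE response SET salary = salary + 5000 WHERE CURRENT OF programmers;

COMMIT;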

Wrapping Up

You now have ten new reasons to try the latest version of PostgreSQL:

  1. Statistics collection is now enabled by default.
  2. Server log output is available in CSV format, making analysis and maintenance easier.
  3. The NULLS FIRST/LAST sort ordering option.
  4. New type-casting behavior and generalized casts to standard string types.
  5. A new enumerated data type (ENUM).
  6. XML data type and many supporting functions.
  7. Full Text Search (FTS) integrated into the core database code.
  8. The PL/PgSQL RETURN QUERY statement.
  9. The PL/PgSQL MOVE command.
  10. Last but not least: the PL/PgSQL and core updatable cursors.

If that’s not enough, there are many major performance improvements for any database workload. Most require no special knowledge or configuration to use. Here’s a list of a few: HOT (Heap-Only Tuples), multi-worker autovacuum, asynchronous commit, and synchronized scanning.

If you’re interested in finding out more, the release notes are a good place to start. The latest distribution packages for PostgreSQL are available for download.
