New and nifty features in PostgreSQL 8.3 that could turn the head of even the MySQL faithful make it a must-have upgrade.
What do you expect from a software upgrade? Bugfixes? Dozens of new, useful features, all clearly documented? Double or triple performance increases?
The latest version of PostgreSQL provides all of those things, plus an active, growing community of developers, administrators and end-users. The growth of the community is reflected in the breadth and depth of changes in version 8.3, likely to be released January 2008.
This article describes 10 new features system administrators and application developers can immediately use. Each feature includes examples for implementation, and descriptions are meant to be introductions to both PostgreSQL and the database-specific technology implemented. Examples use data from the recent Perl Survey (www.perlsurvey.org).
You can find more information about PostgreSQL at www.postgresql.org.
Because of many improvements in performance of the collector process, statistics collection is now enabled by default. Database statistics are all tracked in the pg_stat_* tables. The pg_stat_* tables are described in section 26 of the PostgreSQL documentation. (http://tinyurl.com/yrxbhr)
One way to identify performance problems is to look for long-running queries. A new column,
xact_start, was added to
pg_stat_activity to help determine how long a transaction associated with a particular query has been running.
Here is an example query that looks at
SELECT datname, procpid, usename, current_query, xact_start, query_start, backend_start
The return values are below, separated by ‘|’, with the headers stripped away for clarity:
test | 5552 | selena | SELECT datname,procpid,usename, \
current_query, xact_start, query_start, backend_start \
FROM pg_stat_activity; | 2007-12-31 08:01:08.043345-08 | \
2007-12-31 08:01:19.514915-08 | 2007-12-28 16:35:43.270908-08
There is a slight time difference between
query_start. In this case I used
BEGIN to start a transaction before I issued the
SELECT query. You can also see that the client connection the query was issued from has been running for a few days.
Another source of performance problems are sequential scans. Sequential scans are database table scans that walk through a table one row at a time searching for matching data. These can be computationally expensive when compared to a query that uses an index.
You can look at
pg_stat_user_tables to find information about sequential scans occurring in your system. An unexpected increase in sequential scans typically means one of two things to an administrator: a query needs to be rewritten, or a new index on the database table is needed.
SELECT relname, seq_scan, seq_tup_read, idx_scan, \
idx_tup_fetch FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
response | 26 | 133475 | 11 | 48
industries_response | 1 | 7053 | 0 | 0
perl_versions_response | 0 | 0 | 0 | 0
platforms_response | 0 | 0 | 0 | 0
country | 0 | 0 | 0 | 0
programming_languages_response| 0 | 0 | 0 | 0
If this query on
pg_stat_user_tables is run repeatedly over time, you can see when the number of sequential scans increases or decreases as queries and usage patterns change. In combination with query logging, you could then pinpoint the queries causing problems.
Per-statement statistics can be reported to the server log with
log_executor_stats and the catch-all:
log_statement_stats. These are all disabled by default. Such log entries may provide a rough indicator of problem areas with no programming or query writing required.
Two third-party monitoring tools released in 2007 are also available, including an Simple Network Management Protocol (SNMP) agent and a command-line tool called ptop.
Statistics collection overhead has been estimated to be about 1%. You can turn statistics collection off in the server configuration file. Be aware that turning statistics collection off can have side-effects, and is not recommended for typical use in version 8.3.
Server log output in CSV
In PostgreSQL 8.3 server logs can be saved in (Comma Separated Values) CSV format. With this feature, you can load the logfile directly into a spreadsheet application or a database table without any special text processing. Server configuration for this feature is done in the postgresql.conf file, located in the $PGDATA directory for your database instance.
1 logging_collector = on
2 logging_destination = csvlog
3 log_min_error_statement = on
4 log_error_verbosity = verbose
5 log_directory = /var/log/postgresql
6 log_filename = postgresql-%a
7 log_rotation_age = 1440
8 log_truncate_on_rotation = on
9 log_rotation_size = 0 # disabled
In lines 1 and 2, set
logging_collector to on (this parameter requires a database restart) and
csvlog. By default, system logging is turned off.
A table definition youâ€™ll see later in this article requires setting
on and setting
verbose in lines 3 and 4.
In line 5 set
log_directory, which controls where log files are written, to an appropriate directory I’ve chosen /var/log/postgresql, but this is flexible.
Looking at lines 6, 7, and 8,
log_rotation_age (in minutes) are set to values that make them easy to predict to automate the loading of logs into a database. Set
log_truncate_on_rotation to on to make sure old data doesnâ€™t get mixed up with new data in the same file.
postgresql-%a will create a single logfile per day, named postgresql-Mon.csv, postgresql-Tue.csv, etc. After seven days, the first logfile in the series would be overwritten, rather than appended because we set
In line 9 I disable size-based log rotation by setting
0 (the default is 10MB). By disabling this type of rotation, your filenames will remain predictable and make automating database loads easier.
To generate some log data for testing, try setting
on, which will log the duration of every query executed on the database. You could also log different types of queries that are executed (
log_statement), or connections to your database (
The PostgreSQL documentation offers this example table and command for loading CSV logfiles:
CREATE TABLE postgres_log
log_time timestamp(3) with time zone,
session_start_time timestamp with time zone,
PRIMARY KEY (session_id, session_line_num)
COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
Combined with the configuration above, you can call the COPY command in a shell script and use cron to load your logfiles daily.
NULLS FIRST or LAST
Sorting NULLS can be a major issue for portability between databases. MySQL (like MSSQL server) sorts NULLs as less than other values, while PostgreSQL (like Oracle and DB2) sorts NULLs as greater than other values. The SQL standard doesnâ€™t specify which ordering is correct, but that doesnâ€™t make the difference less annoying.
As of 8.3, PostgreSQL allows you to chose where NULLs end up with an addition to the
ORDER BY clause. Let’s look at an example:
SELECT * FROM response ORDER BY income DESC NULLS LAST;
NULLS LAST, all the responses where a person chose not state their income will appear after responses where an income was stated.
Full-Text Search Integration
Full-text search (FTS) is invaluable whether youâ€™re storing documentation, hosting a Web site or managing an enterprise resource planning suite. Tsearch2, a search package based on OpenFTS, has been available as a contrib module to PostgreSQL prior to 8.3. It required a little extra work to get it compiled and running, but now it’s fully-integrated and requires no special database configuration to use.
Full-text search in PostgreSQL supports a data type called tsvector. This data type is a sorted list of lexemes gathered from a source document. A lexeme roughly corresponds to the root of a word without its prefixes or suffixes. For example, fluent, fluentness, fluently and fluency are all forms of the lexeme “fluent.”
Using lexemes instead of the actual words in search data types makes the search functions more flexible and useful than the more familiar SQL search operators such as ~, ~*, LIKE and ILIKE. Those SQL operators treat search strings literally, and don’t have the linguistic smarts to know that ‘fluency’ and ‘fluently’ have the same root word.
To use full-text search, you must first convert your document to the tsvector data type with the
to_tsvector() function. Then, to search the document, you create a search query of type tsquery with the
to_tsquery() function. This example searches for the word ‘fluent’ and uses @@ (the
CREATE TABLE response (
SELECT comments FROM response WHERE to_tsvector('english', comments) @@ to_tsquery('english', 'fluent');
You’ll notice the first argument in both functions is ‘english’. This specifies which dictionary PostgreSQL will use to parse and search the document. PostgreSQL allows multiple dictionaries, which are configured at run-time in postgresql.conf.
You can even create your own, special-purpose dictionaries for doing things like normalizing similar URLs, or searching in domain-specific languages. If you are looking for non-English dictionaries, they are available for eight languages at http://tinyurl.com/6lljm. The site also includes several detailed tutorials, including one for creating your own, specialized dictionaries.
Simple queries that use
tsvector conversions on text work fine for small amounts of data. As the number and size of documents grow, you will need to create indexes to speed up searches. Creating an index is straightforward:
CREATE INDEX comments_idx ON response USING gin(to_tsvector('english', comments));
Generalized Inverted Index (GIN) is a special index type used only with the tsvector data type. To find more information about this index, you can visit the website.
comments_idx index we just created will be used on any queries that include
Creating a separate column to hold the tsvector is another way to speed up full-text search. If you do this, you need to include triggers for automatically updating the tsvector column when the related text is updated. You can write your own trigger functions but PostgreSQL conveniently provides two:
Assuming your table looks like this:
CREATE TABLE response (
You would define your trigger this way:
CREATE TRIGGER tsvector_update BEFORE INSERT or UPDATE
ON response FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv_comments, 'english', comments);
Another very useful function is
ts_headline, which allows you to show fragments of the return document, but highlighting the search terms. The example below will wrap all instances of
fluency with HTML bold tags:
SELECT ts_headline(comments, to_tsquery('english', 'fluency')) \
FROM (SELECT comments FROM response
@@ to_tsquery('english', 'fluency'))
There are many more functions that allow you to do things like rewrite query search terms (
ts_rewrite), gather document statistics (
ts_stat), or examin exactly what the query generator uses for lexemes (
There were some minor syntax changes between the contrib module and integration into version 8.3, so if you were already using
tsearch2, have a look in the contrib directory for help with migrating your existing installation.
XML data type
The XML data type included in 8.3 supports the ANSI SQL:2003 standard and requires a special compile-time option for use. It was initially developed through a Google Summer of Code project in 2006, and is now fully supported with an array of helper functions and supporting documentation.
To create a value of type xml, you would use the function
xmlparse( DOCUMENT '<?xml version="1.0" encoding="UTF-8"?> \
As with other PostgreSQL data types, many functions are available to work with the XML data type. These include
xmlelement(). For exporting XML data from a database table, you might use
SELECT xmlforest (
"confirmation_key" as "Key",
"sex" as "Gender",
"country_of_residence" as "Country")
where notify = 1;
This will produce XML formatting for each row returned from the query:
If you wanted to wrap each of these individual lines in a higher level XML element, you can use
SELECT xmlelement (
name respondent, xmlforest (
"confirmation_key" as "Key",
"sex" as "Gender",
"country_of_residence" as "Country" ))
where notify = 1;
This will wrap each element in the forest in the element ‘respondent’:
PostgreSQL 8.3 also has functions for mapping tables, queries, and cursors to XML. Each has the option of returning XML documents or fragments. Many helper functions are also available for generating the matching schema for the data generated by any of the mapping functions. Section 9.1 of the PostgreSQL 8.3 documentation describes each of these utilities in detail.
Use of this data type requires PostgreSQL to be built with
ENUM data type
An enumerated data type is an ordered and sortable list of items. You can compare items contained in this data type using standard operators like =, <, >.
If you search the Web for “enum” and “postgresql” you will find many useful tutorials for getting around the fact that PostgreSQL has, until now, not supported them. Suggestions included creating a DOMAIN with a constraint, creating a lookup table and using foreign keys, or writing a bit of code to create your very own ENUM type. You donâ€™t have to do that anymore! Here’s how you use ENUM:
CREATE TYPE headset_diameter
AS ENUM ( '1', '1-1/8', '1-1/4', '1-1/2');
My headset_diameter ENUM is a list of diameters in inches. This is useful for producing sorted lists of items based on the diameter, without storing a decimal. Sort order is determined by the order the values are listed when the ENUM is created.
New type-casting behavior
PostgreSQL is known for its strict adherence to standards. A common practice is to convert, or cast, values from one data type to another to make SQL queries easier to write. In the past, if a non-character value was supplied to an operator or function that could use TEXT, the value would be automatically converted to a string without reporting an error. In version 8.3, values from non-character types are no longer automatically cast. Instead, an error will be thrown.
To help ease the transition, a generalized conversion was created to manage the casts to and from any string type (
VARCHAR) for every other supported data type.
For example, a function like
substr() data type to TEXT for its first argument, and it is not automatically cast (http://tinyurl.com/39qmne):
test=# SELECT substr(current_date, 1, 4);
STATEMENT: SELECT substr(current_date, 1, 4);
ERROR: function substr(date, integer, integer) does not exist
LINE 1: SELECT substr(current_date, 1, 4);
Hint: no function matches the given name and argument types. You might need to add explicit type casts. Now, you must explicitly cast values if you would like to compare them this way:
test=# SELECT substr(current_date::TEXT, 1, 4);
You can also use
CAST() to convert types explicitly.
Whatâ€™s new in PL/PgSQL
PostgreSQL supports many procedural languages. PL/PgSQL, PL/Tcl, PL/Perl, and PL/Python are all available in the standard distribution. Support for Java, PHP, Ruby, R, Scheme, and Unix shells are also available through third parties.
PL/PgSQL got some special attention in 8.3 with
RETURN QUERY, which is syntactic-sugar for
FOR...LOOP statements like the following:
FOR foo in SELECT country, years_programming FROM response LOOP
RETURN NEXT x;
In PostgreSQL 8.3, this loop can be replaced with the following:
RETURN QUERY SELECT country, years_programming FROM response;
FOR...LOOP statements and
RETURN QUERY in PL/PgSQL all use cursors to conserve memory. Cursors are a way of accessing data returned from a query a single row at a time.
Added MOVE to PL/PgSQL
Another useful PL/PgSQL feature is the addition of
MOVE for use with cursors. In the past you could use
SCROLL to move forward and backward among the query data. This however, would also return the data from the row.
MOVE works exactly like
FETCH but does not return the data.
DECLARE programmers CURSOR FOR SELECT key, country, \
years_programing, salary FROM response ORDER BY country;
MOVE RELATIVE 4 FROM programmers;
MOVE allows arbitrary movement through a query results without the overhead of returning the data. In long loops of procedural code, this can be a performance aid.
Another new feature in both the core database and PL/PgSQL are updatable cursors. In the past, PostgreSQL cursors were read-only. If a cursor is “simple,” that is it does not contain a
JOIN, or a
GROUP BY clause — it is possible to update or delete the data in the row a cursor is pointing to.
When creating a cursor you’d like to update, you may want to add
FOR UPDATE to the
SELECT statement to prevent rows from being deleted or altered while you are examining them.
DECLARE programmers CURSOR FOR SELECT key, country, years_programing, \
salary FROM response FOR UPDATE;
To update the cursor:
UPDATE table SET salary = salary + 5000 WHERE CURRENT OF cursor;
When you use
FOR UPDATE in a cursor declaration, the data returned to the cursor is issued a lock. This prevents the data youâ€™re about to apply an update to from being altered while youâ€™re in the middle of changing it.
You now have ten new reasons to try the latest version of PostgreSQL:
- Statistics collection is now enabled by default.
- Server log output is available in CSV format, making analysis and maintenance easier.
NULL sort ordering option.
- New type-casting behavior and generalized casts to standard string types.
- A new enumerated data type (ENUM).
- XML data type and many supporting functions.
- Full Text Search (FTS) integrated into the core database code.
- The PL/PgSQL
RETURN QUERY function.
- The PL/PgSQL
- Last but not least: the PL/PgSQL and core updatable cursors.
If thatâ€™s not enough, there are many major performance improvements for any database workload. Most require no special knowledge or configuration to use. Hereâ€™s a list of a few: HOT (Heap-Only Tuple), multi-worker autovacuum, asynchronous commit delay, and synchronized scanning.
If youâ€™re interested in finding out more, the release notes are a good place to start. The latest distribution packages for PostgreSQL are available for download.
is a system administrator and co-organizer of the PostgreSQL Conference and PostgreSQL Conference East scheduled for this year on March 29 and 30, 2008.