dcsimg

PostgreSQL 7.4

The latest release of PostgreSQL contains a host of new features for advanced database design. Here's an introduction to some of the best

The “most advanced open source database” just got more advanced. More than a year in the making, the latest release of PostgreSQL is the work of an entirely decentralized, global, and staunchly independent community that values innovation, pragmatism, and novelty as equals. PostgreSQL 7.4 has a host of performance improvements and a panoply of new database programming features for databases of all sizes. The new PostgreSQL is extensible, flexible, and clever. If you thought MySQL was the only game in town, think again.

Since the POSTGRES project was incepted in 1986 by Michael Stonebraker’s University of California at Berkeley team, PostgreSQL’s developers have emphasized both implementing the most recent ideas on relational database design and standards and enabling users to perform complex tasks inside the database. Many call PostgreSQL the “database administrator’s database,” because it maximizes the use of SQL, data integrity controls, database procedures, and relational theory to do work inside PostgreSQL instead of in middleware code.

PostgreSQL’s architecture and Object-Relational features make modular extensions of the database functionality easy and accessible, much like Linux and Apache. And like those other open source projects, PostgreSQL has attracted some unique features to solve unique problems.

PostgreSQL 7.4 release is in many ways a “performance release,” with numerous changes designed to speed up queries and ease maintenance and management of large, enterprise databases. However, during the twelve month development and beta testing period, numerous contributors from around the world added new features useful for any size database, some of which we cover below.

Data Type DOMAINs

One of the features that makes PostgreSQL popular is its built-in support for custom data types. For example, PostgreSQL can support geometrical and geographic data types that other SQL databases cannot support.

However, creating a data type is a lot of work: you need to create an input and output function, typically in C, and you typically want to create new data type-specific operators, aggregate functions, and casts.

Fortunately, there’s an easier way. DOMAINs are one of those great ideas introduced in the ANSI SQL-92 standard, yet few SQL databases support them. A DOMAIN defines a range of values within a standard data type — it’s sort of a data “sub-type.” DOMAINs make it easy to standardize formats and restrictions for special columns that are reused in several places in your database.

A common example is a 5-digit zip code, which could be expressed as a CHAR(5) field with values between 00210 and 99950. To make the example more interesting, assume that instead of allowing NULLs, you also want to use 00000 to indicate missing zip codes. Now, you could do this by creating constraints and column definitions on each of the tables that need zip codes, or by maintaining a table of all zip codes (in the United States, say). But it would be easier, not to mention more elegant and readable, to simply do this:


CREATE DOMAIN ZIP_CODE CHAR(5) NOT NULL
DEFAULT ’00000′
CHECK ( VALUE ~ ‘^[0-9]{5}$’
AND ( ( VALUE = ’00000′ ) OR
( VALUE BETWEEN ’00210′ AND ’99950′
) ) );

~ (tilde) is PostgreSQL’s regular expression comparison operator, useful here for checking the format of the incoming value.

You can create similar DOMAINs for state codes, email addresses, and other commonly used formats. With those DOMAINs defined, you can then create tables like this:


CREATE TABLE contact_addresses (
contact_id INT NOT NULL
REFERENCES contacts(contact_id),

state STATE_CODE,
zip ZIP_CODE );

Expression Indexes

For several versions, PostgreSQL has offered the ability to index tables in multiple ways. Function indexes, partial indexes, and multi-column indexes are all weapons in the arsenal of the database administrator or tuner. With them, you can index on precisely the data which is used most and avoid slow procedural code to do complex comparisons. Now 7.4 allows you to index on anything you can formulate as an immutable, value-returning expression.

A very simple example is the classic “sort by total compensation” problem for a personnel list. Imagine that a personnel list of 10,000 employees has three columns — salary, bonus, and commission — and you want to query the top ten employees by the total of these three figures, known as “compensation.” In PostgreSQL 7.4, you can now realize the sort via an indexed search by creating an expression index:


CREATE INDEX idx_compensation
ON personnel(
(salary + bonus + commission) );

The double parentheses are required for expression indexes.

Then you can do a very fast indexed sort by querying:


SELECT first_name, last_name,
(salary + bonus + commission)
as compensation
FROM personnel
ORDER BY compensation DESC LIMIT 10;

As a more complex example, imagine that you are an airline IT person in charge of implementing the FBI’s “no fly” list. Since the Soundex name matching algorithm doesn’t work (as well-documented on CNN), you should use a sophisticated comparison based on the Metaphone phonetic matching scheme. So, you go to your PostgreSQL source code and install the fuzzystrmatch module, which you can find in the contrib directory. (The fuzzystrmatch README file has instructions.)

Assuming that you arrive at the strategy of comparing the two-character metaphone of each passenger’s first name and the 6-character metaphone of their last name with the no-fly list, you could store those values in a derivative table, but that’s a big waste of database space. Instead, you can create an index on this calculated expression for both tables, as shown in the first two SQL statements of Listing One.




Listing One: An example of an expression index


CREATE INDEX idx_passenger_meta
ON passengers( (metaphone(first_name, 2)
|| metaphone(last_name, 6)) );

CREATE INDEX idx_nofly_meta
ON no_fly( (metaphone(first_name, 2)
|| metaphone(last_name, 6)) );

SELECT p.ticket_no, p.first_name,
p.last_name
FROM passengers p, no_fly n
WHERE
( metaphone(p.first_name, 2) ||
metaphone(p.last_name, 6) )
= ( metaphone(n.first_name, 2) ||
metaphone(n.last_name, 6) );

Once the indexes are created, the simple query shown at the end of Listing One gives you a list of people to send to security. Because you are filtering by exactly the expression that was indexed, PostgreSQL can use that index for any large data set. So you can scan all 110,000 passengers for the day without bringing the server to its electronic knees.

Array Improvements

In PostgreSQL, an array is a collection of elemental datums, all of the same base data type. PostgreSQL 7.4 includes substantial array functionality and a syntax more compliant with SQL 99.

For example, in Postgres version 7.3 and earlier, an integer array composed of the numbers 1 to 4 would be specified as the string literal ‘{1,2,3,4}’. In 7.4, you can also use ARRAY[1,2,3,4]. This latter version, an array constructor, complies with SQL 99 and also benefits from the fact that it is an expression instead of a string literal. This makes it possible for you to do…


SELECT ARRAY[2 * 3, 4 * 5, 6 * 7];
array
———–
{6,20,42}

… or even …


SELECT ARRAY[unique1, unique2, even]
FROM tenk1 LIMIT 1;
array
————
{8800,0,1}

An additional difference between array constructors and array string literals is seen when creating an array from a string data type such as text. A two element array composed of the string hello world and the string happy birthday would look like the following when built as a string literal:


SELECT ‘{“hello world”,”happy birthday”}’;
?column?
———————————-
{“hello world”,”happy birthday”}

Notice that the individual elements are double-quote delimited. This seems a bit unnatural in the world of SQL, doesn’t it? However with the SQL 99 syntax, the elements themselves are literals and the ARRAY construct is an expression. It would look like this:


SELECT ARRAY['hello world','happy birthday'];
array
———————————-
{“hello world”,”happy birthday”}

There are too many new array related features to address all of them here, but one other bears discussing. If in the past you made use of contrib/array, an equivalent feature is now built into the PostgreSQL server engine directly. That is, you can now apply a boolean operator comparing a single datum to an array’s elements. If all or some of the array elements satisfy the operator (depending on how the call is made), the test passes. Listing Two shows a couple of examples to help clarify this concept.




Listing Two: User and group array query

SELECT g.grosysid, g.groname,
s.usesysid, s.usename
FROM pg_shadow s, pg_group g
WHERE s.usesysid = any (g.grolist);

grosysid | groname | usesysid | usename
———-+———+———-+———
100 | g1 | 100 | user1
101 | g2 | 100 | user1
100 | g1 | 101 | user2
101 | g2 | 101 | user2
101 | g2 | 102 | user3

SELECT g.grosysid, g.groname,
s.usesysid, s.usename
FROM pg_shadow s, pg_group g
WHERE s.usesysid < all (g.grolist);

grosysid | groname | usesysid | usename
———-+———+———-+———-
100 | g1 | 1 | postgres
101 | g2 | 1 | postgres

In both of the examples in Listing Two, grolist is an array of integers, populated with the usesysid values for the users who are members of the group. The first example, a very simple SQL statement, shows all the users, by name, that belong to each group. The second example (admittedly contrived), shows all the users with usesysid values less than all of those in each group. (These are queries against the PostgreSQL system tables, so you can try them in your own database.)

This is just a small sample of the rich set of array features that have been built into PostgreSQL over the last three versions. Please see the official documentation on the PostgreSQL home page for more information. Most of the array features are covered under Data Types: Arrays and The SQL Language: Array Functions and Operators.

Polymorphic Functions

PostgreSQL functions are similar to functions in many other programming languages. They are named, the definition is stored for reuse, they accept input parameters, and they return some form of value. Functions allow you to extend SQL according to your own needs, as well as providing the capabilities of “stored procedures” in other databases.

PostgreSQL supports four kinds of functions: query language (otherwise known as SQL functions), procedural language, C-language, and internal. Users may define their own functions in all forms but the latter, and at this time, there are ten languages supported to varying degrees as “procedural languages,” including Perl, Java, Tcl, Python, Ruby, R, PHP, and PostgreSQL’s own PL/pgSQL.

Up until this release, though, functions were limited to strictly data-typed input parameters and results. This is useful because it allows for the hundreds of “overloaded” name-alike functions that do different things for different data types. But it can also be very limiting and forces users to create the same function many times to cover various data types.

Two special data types were added in PostgreSQL 7.4, anyelement and anyarray, otherwise known as polymorphic types. Poymorphic types allow functions to accept and return arbitrary data types. The arguments and results are tied to each other and are resolved to a specific data type when a query calling a polymorphic function is executed.

For example, a function declared as foo(anyelement, anyelement) returns anyarray takes any two input values as long as they are of the same data type. The return value would be an array of the input value type.

Polymorphism is useful because it allows you to define a single function that handles many different data types. The same result could be accomplished by using PostgreSQL’s standard function name overloading, but where polymorphism can be used, it reduces your maintenance burden significantly.

As a simple example, let’s say we want to define a function called greatest(), implemented as greatest (anyelement, anyelement) returns anyelement. Listing Three shows what it would look like.




Listing Three: An example of a polymorphic function


CREATE OR REPLACE FUNCTION
greatest(anyelement, anyelement)
RETURNS anyelement AS ‘
SELECT CASE WHEN $1 > $2 THEN $1
ELSE $2 END
‘ LANGUAGE SQL IMMUTABLE STRICT;

SELECT greatest(1,2);
greatest
———-
2

SELECT greatest(‘a’::text,’b');
greatest
———-
b

As you can see, one and the same function was used to handle both integer and text data types. In the second query, you may notice that the first argument was explicitly cast to text (‘a’::text). This is required; otherwise, the string literal is given the data type unknown, and the function does not give the parser any assistance in resolving the data type.

An interesting aspect of this example is that PostgreSQL 7.4 inlines simple SQL functions. The net effect of polymorphism and inlining is that greatest(x ,y) is essentially a “macro” for CASE WHEN x y THEN x ELSE y END. In a single function, you get the notational simplification, without the function call overhead. Nice, huh?

Now for a slightly more complex example. The PL/pgSQL function in Listing Four accepts any one-dimensional array data type, and returns the elements as individual rows.




Listing Four: Using the polymorphic array function


CREATE OR REPLACE FUNCTION unroll(anyarray)
RETURNS SETOF anyelement AS ‘
DECLARE
v_array ALIAS FOR $1;
lb integer;
ub integer;
i integer;
BEGIN
IF split_part(array_dims(v_array),
”:”, 3) != ”” THEN
RAISE EXCEPTION
”Input array must not exceed
one dimension”;
END IF;

lb := array_lower(v_array, 1);
ub := array_upper(v_array, 1);

FOR i IN lb..ub LOOP
RETURN NEXT v_array[i];
END LOOP;

RETURN;
END;
‘ LANGUAGE plpgsql STRICT IMMUTABLE;

SELECT * FROM unroll(ARRAY[1,3,5,7,9]);

unroll
——–
1
3
5
7
9

SELECT * FROM
unroll(ARRAY[now(),'yesterday','tomorrow']);

unroll
——————————-
2003-11-10 19:04:02.608704-08
2003-11-09 00:00:00-08
2003-11-11 00:00:00-08

The first three lines after BEGIN ensure that the function’s been given a one-dimensional array. The next two lines make use of two new array functions in PostgreSQL 7.4: array_ lower() and array_upper(). These functions allow you to get the lower and upper array index for the given array dimension (in this case, one). Then the code loops over the array, returning each element one row at a time. The final RETURN is needed to let PL/pgSQL know that the code’s all done outputting rows. Since the function is polymorphic, it can be used on any array data type, as shown by the two executed queries.

Full Text Indexing with Tsearch2

For several years, the OpenFTS Project has been developing an open source, full text search engine to rival high-end proprietary offerings. Accompanying PostgreSQL 7.4 is their Tsearch2 module, which moves much of the OpenFTS functionality into the database for better performance and greater flexibility.

Like other optional modules, the Tsearch2 module is located in the contrib directory of the PostgreSQL source code. Building it is simple and is detailed in the documentation, so let’s go straight to a test case.

One company that uses PostgreSQL keeps a large database of company failures, including detailed information about the failed companies’ business models. This company was recently enticed to migrate from Microsoft SQL Server to PostgreSQL through a demonstration of how they could use TSearch2 to “mine” their business model database. First, they set up a full text index (FTI) on their main Companies table, as shown in Listing Five .




Listing Five: Setting up a TSearch2 index


ALTER TABLE companies
ADD COLUMN fti_business tsvector;

UPDATE companies SET fti_business
= to_tsvector(‘default’,business_model);

VACUUM FULL ANALYZE companies;

CREATE INDEX idx_fti_business
ON companies USING gist(fti_business);

CREATE TRIGGER tg_fti_companies
BEFORE UPDATE OR INSERT ON companies
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(fti_business, business_model);

In the code, default is the dictionary and filtering to be used with TSearch2, which supports multiple dictionaries and filtering criteria in multiple languages. (TSearch2 is highly customizable and this is just the simplest example.)

Listing Six shows the kind of text search that was available through TSearch version 1, just doing a simple “has these words” search. The query shown is searching the up to 20,000-character business_model fields in the companies table for the words “bushing” or “engine.” This could be done through a regular SQL query using LIKE but it would be much slower.




Listing Six: Simple full-text search query


SELECT company_name, business_model
FROM companies
WHERE fti_business @@
to_tsquery(‘default’,'bushing | engine’);

company_name | business_model
————————————+——————————
VALVES CO. | Valves Co., is primarily a valve, engine …
NORDIC ENGINE, INC. | Maryland corporation with its principal …
REDDY SOLID BUSHING | Produced pre-formed bushings, solid metal …
FLEX ENGINEERING | Exclusive South Dakota distributor of …
IDAHO ENGINES, INC. | Idaho-based marine and industrial engine …

Note: all of the companies listed are fictitious.

The search criteria string is a boolean construction of search words. So ‘engine | bushing’ means “containing engine or bushing or both.” These boolean strings can get quite complex if needed. For example, this string says “containing building materials or concrete but not containing plaster:”


‘( (building & materials) | concrete )
& !plaster’

The second search, shown in Listing Seven, employs two of the new features of TSearch2. rank_cd() employs an academically respected proximity and frequency algorithm to return an ordered ranking of relevance for our results. headline() selects a snippet of text where several of the search terms appear and highlights them for display. By combining the two and using some of the configurable settings and data, you can easily create your own search engine for your web site.




Listing Seven: More sophisticated full-text search query


SELECT company_name,
headline(‘default’, business_model,
‘bushing | engine’) as results,
rank_cd(fti_business, ‘bushing | engine’)
as s_rank
FROM companies
WHERE fti_business @@
to_tsquery(‘default’,'bushing | engine’)
ORDER BY s_rank DESC;

company_name | results | s_rank
——————–+———————————————-+——–
FLEX ENGINEERING | bushings until 1976, when they | 6
REDDY SOLID BUSHING | bushings, solid metal bushings | 3
NORDIC ENGINE, INC. | has been a manufacturer of engines | 2
IDAHO ENGINES, INC. | specializing in centrifugal engines | 2
VALVES CO. | engine and boiler manufacturer and | 1

And That’s Not All, Folks

That’s just a sample of the many new things PostgreSQL 7.4 offers you, let alone what you can expect to see from the project in the future. See the sidebar for a short list of items under development a few months ago, some of which may already be complete by the time you read this.




PostgreSQL Features Under Development


Feature-freeze for PostgreSQL 7.4 was in July 2003, so there are already several major features in development for succeeding versions. The features listed below are some of those most actively discussed as of November 2003, but not all of them will make it into PostgreSQL 7.5. Some will be postponed, others will be still under development, and a few will have already been completed by the time you read this.

If you really want to have an up-to-date idea of what development is going on for PostgreSQL, subscribe to the “Announce” mailing list, which includes the PostgreSQL Weekly News, giving you a rundown of development activity. For those with industrial-strength inboxes, you can subscribe to the “Hackers” mailing list as well.

ARC and the Background Writer: New buffering and disk sync algorithms that will result in significantly more efficient use of disk I/O. Hopefully, this work will also eliminate bottlenecks for a lot of people with very large databases.

SLONY-I Replication: A more automated and easier-setup replication system for PostgreSQL than the current eRServer, this system of master-slave replication is aimed fulfilling high-availability server needs.

Multi-Column Index Statistics: This patch, already committed, lets PostgreSQL keep accurate track of data correlation for multi-column indexes. This may result in substantially better performance in many queries on large, complex tables and views.

Tablespaces: The project has a partially-completed tablespaces feature, which would allow users to designate the disk storage location of tables and indexes. Tablespaces are important to users with multi-gigabyte tables or unusual disk configurations.

Two-Phase Commit: Several parties are working on patches for PostgreSQL which allow synchronous commit of the same transaction across multiple servers. This is a sought-after feature by WAN application developers, but needs work on transaction rollback.

Windows-Native Port: The single, most requested feature by members of the public, the PostgreSQL Windows port is still in development. Bruce Momjian, core developer, heads this effort and you can check progress on his Win32 web page (see Resources).

Point In Time Recovery: A couple of contributors have offered to pick up the work on this feature, halted in July 2003. No new progress has been reported as of November 2003.

Auto-Configure Script: A script to configure PostgreSQL’s system parameters based on information from the host OS and an interactive questionnaire. Combined with the Auto Vacuum Daemon released with 7.4, this should take a lot of the guesswork out of beginning PostgreSQL administration.

Named Function Parameters These eliminate the need for ALIAS statements and anonymous variables in PL/pgSQL and SQL functions.

Your Feature Here: If you don’t see a feature above that really matters to you, it’s time to join the PostgreSQL community and pitch in to make it happen. That’s what Open Source is about, after all.


Other version 7.4 features include the AutoVacuum Daemon, eRServer replication, the SQL-standard Information Schema, join rewriting, IPv6 and Apple Rendezvous support, and more powerful regular expressions.

So that just means that you have to download the code and try them yourself. Enjoy!



Josh Berkus is on the PostgreSQL core team, working on project PR and database performance. Contact him at josh@postgresql.org. Joe Conway has been involved with PostgreSQL as a contributor since 2001, and can be reached at mail@joeconway.com.

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62