
Keeping the Web in Web Services

Web Services have unleashed a raft of technologies, but are any effective? A number of developers say no and suggest that the Web already offers everything you need to build a great Web Service. And more.

The Web is a vast information system that has significantly changed the way people use computers. But what about Web Services? How many Web Services do you depend upon daily? Have Web Services changed your life in any meaningful way? What can you do today that you couldn’t do yesterday?

While there’s no doubt that Web Services are helping some people build valuable systems, Web Services have failed (so far, at least) to become catalysts for Internet-wide applications like email or instant messaging.

A growing number of developers argue that Web Services are failing “in the large” because the underlying Web Services technologies — namely, XML-RPC and SOAP — ignore the hard-earned lessons of the Web. Systems like XML-RPC and SOAP, they say, are not extensible, lack support for “little tools” that can be combined in novel ways, and don’t build upon existing, pervasive Web infrastructure. Worst of all, they say, XML-RPC and SOAP have seemingly ignored the key feature that differentiates the Web from information systems that preceded it: the ubiquitous hyperlink, the shorthand that links one resource on one site to other resources on any other site.

Those same developers say that if we want Web Services to succeed broadly, we should adopt the important ideas of the Web. Instead of adopting a raft of new Web Services specifications, so the idea goes, simply re-use the principles and protocols of the Web, perhaps no more than HTTP and XML, to create robust Web Services. Then any developer familiar with HTTP and XML could create a working, valuable service right away, leveraging the tools, scripts, code, expertise, and Web he or she already has.

This approach to building Web Services (or, more generally, distributed applications) is called REST, an acronym for “Representational State Transfer.” REST is not a toolkit, an API, or a specification. Rather, it argues that the lowly but irreplaceable Uniform Resource Identifier (or URI, commonly called a “URL”) is all the interface you’ll ever need to expose a Web Service.

For example, a Web Service that searches for products and prices can be launched simply by placing a CGI script at the URI http://www.cheappricesrus.com/find?product=ipod. While a SOAP client would require a special toolkit to call a SOAP-based service, any software with HTTP support could easily call the shopping service with a common HTTP GET.
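
To make that concrete, here is a minimal sketch of such a client using only Python's standard library. The URI is the article's illustrative example rather than a live service, and the function names product_uri() and find_product() are assumptions made for this sketch.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# The article's example URI; it is illustrative, not a live service.
BASE_URI = "http://www.cheappricesrus.com/find"


def product_uri(product):
    """Build the URI that names the search for one product."""
    return BASE_URI + "?" + urlencode({"product": product})


def find_product(product, timeout=10):
    """Issue a plain HTTP GET and return the raw response body as bytes."""
    with urlopen(product_uri(product), timeout=timeout) as response:
        return response.read()


# Building the URI needs no network access; find_product() would perform the GET.
print(product_uri("ipod"))
```

Nothing here is specific to the shopping service: the same two standard modules can call any service that is exposed as a URI.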

Surely, the client still requires a specification of the HTTP response, but the response could be as simple as a set of standard headers. Of course, XML is also easily returned to the client where it can be parsed with countless XML tools.

Let’s look at the REST approach and compare a Web Service built in the REST style to the same service built with XML-RPC and SOAP. (You can read more about SOAP in the August issue of Linux Magazine, available online at http://www.linux-mag.com/2002-08/web_services_01.html. To learn more about XML-RPC, see the Java Matters column in the January 2003 issue, available online at http://www.linux-mag.com/2003-01/java_01.html.) As you’ll see, applying the REST principles can yield an effective Web Service that is more extensible and flexible, yet conceptually simple to build.

REST: An Architectural Style, Not A Standard

In the early days of the Internet, email (via SMTP), news (via NNTP), and file transfer (via FTP) were the predominant applications. While those applications are still invaluable today, it was the introduction of links (via URIs, HTML, and HTTP) that turned a hodge-podge of online resources into a web, well, the Web. Given a single addressing scheme across all resources — the URI — and a common way to provide and access resources — HTTP — sites transformed from islands to stepping stones.

Indeed, those two fundamental concepts — a universal resource addressing scheme and a common way to provide and access resources — are the very core of the Web and of REST. In a REST-like architecture (including the Web) all clients and servers are able to process a variety of standard data types, all resources are identified using a common scheme, and all clients and servers use the same protocol to effectuate data transfers.

Again, REST is an acronym for “Representational State Transfer,” a term originally coined by Roy Fielding in his doctoral dissertation titled, “Architectural Styles and the Design of Network-based Software Architectures” (available online at http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm). (Fielding was one of the original authors of the Apache Web server, was a co-founder of the Apache Software Foundation, and is now on the Board of Directors of the Apache Software Foundation.) “Representational State Transfer” is a little opaque, so what does it mean?

Here’s how Fielding describes it in his dissertation:

REST components communicate by transferring a representation of a resource in a format matching one of an evolving set of standard data types, selected dynamically based on the capabilities or desires of the recipient and the nature of the resource. Whether the representation is in the same format as the raw source, or is derived from the source, remains hidden behind the interface… REST therefore gains the separation of concerns of the client-server style without the server scalability problem and allows information hiding through a generic interface to enable encapsulation and evolution of services.

Or, put another way:



  • Everything in REST is a resource. A resource could be a file, a PNG image, a Web page, a calendar, a database, or a printer.

  • All resources have a unique identifier.

  • Resources have an internal representation that is private and known only to the owner of that resource. For example, Web pages can be static HTML, or can be produced by PHP, CGI, or JSP. The internal representation of the content of the page is hidden, allowing the owner to change its internal representation at will.

  • Resources are provided to clients (be they remote procedure calls, browsers, or scripts) in one of many standard data types. The actual data type used — say, XML or HTML or HTTP headers — depends on the client and the resource being requested. For example, a Web browser might want database query results returned as HTML, while a script might prefer it as XML or even comma-separated values. The fact that the data is actually stored in a flat file database must remain irrelevant to the client.

  • The kinds of representations a resource supports are described in metadata. For example, the resource http://www.anothersite.com/image/logo might generate a logo image on the fly, perhaps in GIF. There, the HTTP header Content-Type: image/gif provides the metadata description.

And although it’s not mentioned in the quote above (but is discussed in the dissertation), REST also includes control data, which defines the purpose of a message between components. For example, if a resource consists of multiple representations, control data can be used to negotiate which representation (which data type) is best to use.
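
The points above can be sketched in a few lines of Python. The sample rows, the media types offered, and the function names are all assumptions for illustration. The idea is only that one private, internal form can be handed out as several standard representations, with the Content-Type acting as the metadata.

```python
import csv
import io
from xml.sax.saxutils import escape

# Private, internal form of the resource; clients never see this structure.
_ROWS = [
    {"product": "ipod", "price": "299.00"},
    {"product": "radio", "price": "19.95"},
]


def to_xml(rows):
    items = "".join(
        "<item><product>%s</product><price>%s</price></item>"
        % (escape(r["product"]), escape(r["price"]))
        for r in rows
    )
    return "<results>%s</results>" % items


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["product", "price"])
    for r in rows:
        writer.writerow([r["product"], r["price"]])
    return buffer.getvalue()


def to_html(rows):
    cells = "".join(
        "<tr><td>%s</td><td>%s</td></tr>" % (escape(r["product"]), escape(r["price"]))
        for r in rows
    )
    return "<table>%s</table>" % cells


# One renderer per representation the resource is willing to provide.
RENDERERS = {"text/xml": to_xml, "text/csv": to_csv, "text/html": to_html}


def represent(media_type, rows=_ROWS):
    """Return (Content-Type, body) for the requested media type."""
    if media_type not in RENDERERS:
        raise ValueError("no representation available as %s" % media_type)
    return media_type, RENDERERS[media_type](rows)
```

A script might call represent("text/csv") while a browser asks for "text/html"; neither learns how the rows are actually stored.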

By the way, if the provider of the resource is able and willing, it’s also possible to return a representation of the resource in a non-standard, custom form. For instance, results from a database search might be presented in an interactive Java applet. Unanticipated (or custom) representational forms are desirable.

Ultimately, the principles of REST are intended to explain how a well-designed Web application behaves. Again from Fielding, “[Given] a network of web pages (a virtual state-machine), the user progresses through an application by selecting links (state transitions), resulting in the next page (representing the next state of the application) being transferred to the user and rendered for their use.” Also as intended, REST says nothing about what operations a resource should support.

And, as you can see, REST is not a standard, nor is it a technology, nor does it address implementation details such as what messages to send, which programming language to use, or which server-side toolkit to adopt. Instead, REST describes why the Web works so well.

Got Bugs? Get “Fly Paper”

To demonstrate REST and compare its approach to XML-RPC and SOAP, let’s build a hypothetical Web Service called Fly Paper that tracks issue (also called “bug”) reports. Like other issue tracking systems, Fly Paper should be able to create, modify, query, and delete issue reports.

Here’s an outline of a workable design:


  • For the unique identifier, let’s use globally unique identifiers (GUIDs), which are long pseudo-random strings. GUIDs are a good choice because we can mix them with GUIDs from other issue tracking programs without any danger that they’ll clash.

  • Let’s use HTTP as the protocol between clients and server. HTTP is ubiquitous, and a huge assortment of software and hardware already exists to manage, relay, cache, and monitor HTTP traffic.

  • Given the prevalence of XML and the availability of XML-related tools in almost every programming language, let’s use XML as our media type.

  • An issue report should have a number of fields, including a unique identifier (described above), the submitter’s email address, a product name, a description of the bug, a list of identifiers of related bugs, and issue status fields.

And, at a minimum, here are the operations that the issue tracking system must support:


  • Submit. When a new issue is submitted, a new issue report is created, and Fly Paper returns the new permanent identifier associated with that issue.

  • Retrieve. Given an identifier, this feature returns the proper issue report.

  • Update. Given an identifier, change the associated issue report. Attachments can be added to an issue report using this feature, too.

  • Delete. Given an identifier, remove the issue report. (In most tracking systems, issue reports are rarely removed. Instead, they are deactivated. You can think of this feature as deactivation.)

  • List. Return a list of all identifiers stored in the service.

  • Query. Given a product name, return all issue identifiers associated with the given product name.
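
The six operations can be sketched independently of any wire protocol. The class below is a hypothetical in-memory stand-in, not the IssueDatabase module the listings in this article rely on. It uses Python's uuid module to mint identifiers and treats delete as deactivation, as suggested above.

```python
import uuid


class InMemoryIssueStore:
    """Hypothetical stand-in for an issue store; keeps reports in a dict."""

    def __init__(self):
        self._issues = {}

    def submit(self, email, product, description, related=()):
        # A random 128-bit identifier rendered as 32 hexadecimal digits.
        issue_id = uuid.uuid4().hex
        self._issues[issue_id] = {
            "email": email,
            "product": product,
            "description": description,
            "related": list(related),
            "active": True,
        }
        return issue_id

    def retrieve(self, issue_id):
        return dict(self._issues[issue_id])

    def update(self, issue_id, **fields):
        self._issues[issue_id].update(fields)

    def delete(self, issue_id):
        # Deactivate rather than erase the report.
        self._issues[issue_id]["active"] = False

    def list(self):
        return [i for i, issue in self._issues.items() if issue["active"]]

    def query(self, product):
        return [i for i in self.list() if self._issues[i]["product"] == product]
```

In this sketch, list() and query() skip deactivated reports. A real tracker might prefer to return them with a status flag instead.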

At this point, you could easily derive an XML-RPC- or SOAP-based Web Service for Fly Paper. Conceptually, either kind of service would have methods (with the required arguments) for each operation, and each operation would return XML. For XML-RPC, you would publish the API. In the case of SOAP, you might create and publish a WSDL document to help others consume your service.

Doing It With XML-RPC

For example, Listing One shows a snippet of an XML-RPC server written in Python using the CGI-based SimpleXMLRPCServer module from Python 2.3. (Python is a simple enough language that you should be able to follow along even if you’re not a Python programmer.) For brevity, the IssueDatabase module has been omitted, but its purpose is to access the underlying persistent store for Fly Paper. Listing Two shows a very short Python XML-RPC client that calls two of the Fly Paper service methods: submit() and retrieve().




Listing One: An XML-RPC, CGI-based Fly Paper server


#!/usr/bin/python

import SimpleXMLRPCServer
import IssueDatabase
from GUID import GUID
import os

# Persistent store for issue reports (module omitted for brevity).
issuesDB = IssueDatabase.IssueDatabase()

class IssueTracker:
    def submit(self, email, product, description, related):
        # Mint a new GUID and store the report under it.
        issue_id = GUID().guid
        issuesDB.set(issue_id, email, product, description, related)
        return issue_id

    def retrieve(self, issue_id):
        return issuesDB.get(issue_id)

    def update(self, issue_id, email, product, description, related):
        issuesDB.set(issue_id, email, product, description, related)
        return 1

    def delete(self, issue_id):
        issuesDB.delete(issue_id)
        return 1

    def list(self):
        return issuesDB.list()

    def queryByProduct(self, productname):
        return issuesDB.queryByProduct(productname)

# Expose every IssueTracker method through the CGI XML-RPC handler.
handler = SimpleXMLRPCServer.CGIXMLRPCRequestHandler()
handler.register_instance(IssueTracker())
handler.handle_request()




Listing Two: An XML-RPC Fly Paper client written in Python


#!/usr/bin/python
import xmlrpclib

# Every call goes to the same service URL.
service = xmlrpclib.ServerProxy("http://localhost/cgi-bin/issueXMLRPC")
description = "I would like a new feature."
guid = service.submit("name@server.com", "myproduct", description, [])
service.retrieve(guid)

You may notice that Listing Two only refers to the URL http://localhost/cgi-bin/issueXMLRPC. All messages go to this URL and we’re required to pass in a GUID as the first parameter. (This is reminiscent of older windowing systems where a “handle” is passed as the first parameter.) Similarly, all of the methods exposed in Listing One are associated with a single class. As we’ll see momentarily, REST encourages us to assign a URL to each issue, and treat each issue as a first-class object with its own methods.

XML-RPC is easy to use, and an XML-RPC Fly Paper Web Service could be launched quickly. However, XML-RPC is very similar to procedure calls in an application: the results of the procedure call, including the format of the results, are fixed. If Fly Paper were to expand the content of a bug report, for example, some of the clients would readily break. The problem: while XML-RPC uses XML to express requests and results, XML-RPC does not take advantage of the “X” in XML, namely “eXtensibility.”
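
A small sketch shows the kind of breakage meant here. It assumes, purely for illustration, that the service returns a report as a positional array and that a later version appends a hypothetical "priority" field. Neither detail comes from the listings above.

```python
def old_client(record):
    """A client written against the original four-field result."""
    email, product, description, related = record
    return "%s reported '%s' against %s" % (email, description, product)


original = ["name@server.com", "myproduct", "I would like a new feature.", []]
# Hypothetical later version of the service that appends a priority field.
extended = original + ["high"]

print(old_client(original))
try:
    old_client(extended)
except ValueError as exc:
    # Unpacking five values into four names fails, so the old client breaks.
    print("old client breaks:", exc)
```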

X Is for Extensible

The great thing about XML is its extensibility. Even once a well-designed, well-established XML vocabulary has thousands of users, it’s still possible for any one of them to add some extra information and not affect deployed tools that do not understand the extension. Using extension attributes or namespaces, vocabularies can be extended by decentralized parties without recourse to a centralized maintainer. Older tools know to ignore unknown constructs.

For many people this is the only reason to use XML for Web Services and demonstrates a key weakness in XML-RPC: XML-RPC is inflexible because it does not allow namespaces, attributes, or extension elements.

Luckily, SOAP does not inherit this flaw, and recent versions of SOAP build more fully on XML’s extensibility features. If we adopt our own XML syntax specifically designed for issue reports (call it IssueML, shown in Figure One) and adopt the rule that any elements or attributes in another namespace are ignored by Fly Paper, extensibility is assured. The latter, simple rule will help the issue markup language grow and evolve once we have released it into the wild.




Figure One: IssueML, a custom XML syntax for issue reports


<?xml version="1.0"?>
<issue xmlns="http://www.prescod.net/issuesNS" guid="000000f1ed3d2a8600c000a8000a00c5d68">
<submitter href="name@server.com"/>
<product name="myproduct" version="1.1"/>
<description>I would like a new feature.</description>
<related>
<issue-ref guid="000000f1ed3d2a8600c000a8000a00c5d82"/>
<issue-ref guid="000000f1ed3d2a8600c000a8000a00c5b62"/>
</related>
</issue>

(In fact, now that we have defined IssueML, issue documents can take on a life of their own separate from our service. They can be saved to disk as files, can be edited in XML editors before being uploaded, can be styled with CSS stylesheets, and can be transformed into other XML vocabularies. IssueML could even be standardized by the Internet Engineering Task Force (IETF) or by one of the XML standardization groups.)
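
The "ignore what you don't understand" rule is easy to sketch with Python's xml.etree.ElementTree. The sample document assumes the issue vocabulary is declared as the default namespace, and the ext:priority element and its namespace URI are hypothetical third-party extensions invented for this example.

```python
import xml.etree.ElementTree as ET

ISSUE_NS = "http://www.prescod.net/issuesNS"

# An IssueML document extended with a hypothetical element in a foreign namespace.
DOCUMENT = """<?xml version="1.0"?>
<issue xmlns="http://www.prescod.net/issuesNS"
       xmlns:ext="http://example.org/hypothetical-extension"
       guid="000000f1ed3d2a8600c000a8000a00c5d68">
  <submitter href="name@server.com"/>
  <product name="myproduct" version="1.1"/>
  <description>I would like a new feature.</description>
  <ext:priority>high</ext:priority>
</issue>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}name' form into (namespace, name)."""
    if tag.startswith("{"):
        namespace, _, name = tag[1:].partition("}")
        return namespace, name
    return "", tag


def parse_issue(text):
    """Read only the elements in the issue namespace; skip everything else."""
    root = ET.fromstring(text)
    fields = {"guid": root.get("guid")}
    for child in root:
        namespace, name = split_tag(child.tag)
        if namespace != ISSUE_NS:
            continue  # unknown vocabulary: ignore it rather than fail
        if name == "submitter":
            fields["submitter"] = child.get("href")
        elif name == "product":
            fields["product"] = child.get("name")
        elif name == "description":
            fields["description"] = child.text
    return fields


print(parse_issue(DOCUMENT))
```

A reader built this way keeps working when someone else adds their own elements, which is the property the ignore rule is meant to guarantee.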

So, XML has let us separate the definition of our data format from the syntax of our protocol. That’s yielded big payoffs in flexibility. However, if our Web Service is to participate in the Web, it should be possible for issues in one database to naturally link to issues in another.

In the current scheme, that’ll be difficult because GUIDs do not indicate where they “came from.” To link properly, we’d also need to keep track of the origin or owner of each report, and the owner’s query API. Ugh.

What we need is an identifier that is unique and universally addressable. What we really need is URIs.

The Power of U-R-I

As mentioned earlier, REST treats everything as a resource, and each resource is accessed with a unique identifier. Following that approach, let’s assign a URI to every issue report, and like an object-oriented program, let’s call methods on each individual issue (which we’re modeling as an object) to retrieve, modify, and delete it.

We also need to create a separate resource for Fly Paper itself, which has its own methods. After all, the submit and query operations work on the database as a whole and do not use identifiers. Therefore, operations on a single issue (retrieve, update, and delete) will be sent to URIs of the form http://flypaper.com/cgi-bin/bug/4830982, where the trailing number is some unique identifier for the issue. Operations that apply to Fly Paper as a system (submit, query) will be sent to a URI like http://flypaper.com/cgi-bin/issues. (If we had other resources such as attached files or users, they would be modelled as separate resources with their own interfaces. REST strongly encourages a clean separation of concerns.)

This may seem like we’ve just converted the GUID parameter to a URI, but there are much more interesting things going on:


  1. It’s desirable to separate the service interface from the issue resource interface so that somebody building a similar service could adopt only one or the other without necessarily adopting both. Separation of concerns buys some flexibility.

  2. Although GUIDs are widely used in COM and CORBA contexts, in a Web context they can only be considered a proprietary identification scheme.

  3. One of the fundamental principles of Web architecture is that once a URI has been created, its internal structure should remain “opaque.” It should not matter to clients that the resource is internally numbered by GUID, an integer, the names of Bond movies, or anything else that helps us keep them straight in our implementation. An HTTP URI disconnects the universal name from any internal name.

  4. A GUID isn’t needed anymore because the URI is (in and of itself) guaranteed to be globally unique! Where we called a URI with a GUID before, now we simply have a single URI. Instead of just using a URI to find (“locate”) a resource, we’ve also used it as an identifier. (This is why “Uniform Resource Identifier” is a more accurate term than “Uniform Resource Locator.”)

Now, if multiple sites provide an issue tracking service, the URI identifies both the site and the issue. Indeed, having HTTP URIs for issues allows some extremely interesting extensions to our basic model.

For instance, references to related issues can also be URIs. Issues in one tracking service can reference related issues in other services, perhaps hosted by other vendors.

The query and list methods can return a list of links instead of GUIDs. A client that wants to work with one of those could just dereference the URI and start calling methods. This has the nice side effect that it actually does not matter whether the referenced URIs are delivered by the same server or not. It doesn’t even matter whether they’re in the same database or not.

And because the query interface is separate, we can link to issues stored in an issue database even if that issue tracking service has a radically different query “API.” Using this feature, you could merge the query and list interfaces from two issue databases (let’s say if two projects merged into one) without actually moving the referenced issues.
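
Merging two list interfaces can be sketched as follows. It assumes list documents shaped like the query output shown later in this article (an issues element whose issue children carry href attributes); the two tracker hosts are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.prescod.net/issues}"

# Hypothetical list documents from two independent trackers.
PROJECT_A = """<issues xmlns='http://www.prescod.net/issues'>
  <issue href='http://tracker-a.example/cgi-bin/issues/1001'/>
  <issue href='http://tracker-a.example/cgi-bin/issues/1002'/>
</issues>"""

PROJECT_B = """<issues xmlns='http://www.prescod.net/issues'>
  <issue href='http://tracker-b.example/bugs/7'/>
  <issue href='http://tracker-a.example/cgi-bin/issues/1002'/>
</issues>"""


def issue_links(list_document):
    """Extract the issue URIs from one list document."""
    root = ET.fromstring(list_document)
    return [element.get("href") for element in root.iter(NS + "issue")]


def merge_lists(*list_documents):
    """Combine several lists, keeping first-seen order and dropping repeats."""
    seen, merged = set(), []
    for document in list_documents:
        for href in issue_links(document):
            if href not in seen:
                seen.add(href)
                merged.append(href)
    return merged


print(merge_lists(PROJECT_A, PROJECT_B))
```

The merged list mixes URIs from both hosts, and no issue had to move: each URI still points at the service that owns the report.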

Finally, REST for the Weary

Finally, let’s implement Fly Paper according to REST. To do so, we must attach an HTTP interface to each method.

For example, to retrieve an issue, we should use a simple HTTP GET. The REST version of the service uses straight HTTP for maximum compatibility. With URI addressing and HTTP, our service becomes interoperable with literally hundreds of Web tools and standards.




Listing Four: HTTP/CGI server code for Fly Paper


#!/usr/bin/python
import os
import sys
from IssueDatabase import IssueDatabase
import IssueXML
import UrlDispatcher

issuesDB = IssueDatabase()

class Issue:
    # Matches a PATH_INFO such as "/1001"; the digits identify one issue.
    matches = r"/(\d+)"

    def __init__(self, path, query):
        self.issue_id = path[0]

    def handle_GET(self):
        email, name, desc, rel = issuesDB.get(self.issue_id)
        return IssueXML.generateIssue(email, name, desc, rel)

    def handle_PUT(self):
        # Under CGI, the request body arrives on standard input.
        email, name, desc, related = \
            IssueXML.parseIssue(sys.stdin.read())
        issuesDB.set(self.issue_id, email, name, desc, related)
        return "<status>OK</status>"

    def handle_DELETE(self):
        issuesDB.delete(self.issue_id)
        return "<status>OK</status>"

class Container:
    # Matches the service resource itself (a PATH_INFO of "/").
    matches = r"/$"

    def __init__(self, path, query):
        if query:
            self.prodname = query["product"][0]
        else:
            self.prodname = ""

    def handle_GET(self):
        # With a product name this is a query; without one it is a listing.
        if self.prodname:
            issues = issuesDB.queryByProduct(self.prodname)
        else:
            issues = issuesDB.list()
        return IssueXML.generateIssueList(issues)

    def handle_POST(self):
        issue_id = issuesDB.nextCounter()
        email, product, desc, related = \
            IssueXML.parseIssue(sys.stdin.read())
        issuesDB.set(issue_id, email, product, desc, related)
        return IssueXML.generateIssueRef(issue_id)

handler = UrlDispatcher.UrlDispatcher()
handler.register(Issue)
handler.register(Container)
handler.dispatch()

A Python method handling GET requests for issues might look something like this (you can see this code in context in Listing Four):


class Issue:
    .
    .
    .
    def handle_GET(self):
        email, name, desc, related = issuesDB.get(self.issue_id)
        return IssueXML.generateIssue(email, name, desc, related)

The code to parse and dispatch the incoming URL is shown in Listing Three. The UrlDispatcher code can be used for any HTTP CGI application whatsoever. It has no special code for handling issues or even XML.




Listing Three: The UrlDispatcher library


import traceback
import cgi
import os, sys
import re

class UrlDispatcher:
    def __init__(self):
        self.handlers = []

    def register(self, handler):
        self.handlers.append(handler)

    def dispatch(self):
        path = os.environ.get("PATH_INFO", "")
        query = os.environ.get("QUERY_STRING", "")
        method = os.environ["REQUEST_METHOD"]
        error = None

        try:
            # Find the first handler class whose pattern matches the path.
            obj = None
            for handler in self.handlers:
                matchobj = re.match(handler.matches, path)
                if matchobj:
                    data = matchobj.groups()
                    query_dict = cgi.parse_qs(query)
                    obj = handler(data, query_dict)
                    break

            if not obj:
                raise AttributeError
            # Look up handle_GET, handle_PUT, and so on by HTTP method name.
            method_obj = getattr(obj, "handle_" + method)
            data = method_obj()
        except:
            excdata = traceback.format_exception(
                sys.exc_type, sys.exc_value, sys.exc_traceback)
            data = "<error>%s</error>" % "\n ".join(excdata)
            if sys.exc_type in (AttributeError, KeyError, IndexError):
                error = "Status: 404 Object Not Found"
            else:
                error = "Status: 500 CGI Error"

        if error:
            print error

        print "Content-Type: text/xml"
        print
        print data
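
For readers who want to experiment with the routing idea outside a CGI environment, here is a hedged sketch in current Python. It is not the article's library: the stub handler classes, the dispatch() function, and the choice to answer an unsupported method with 405 (rather than the 404 used in Listing Three) are all assumptions of this sketch.

```python
import re
from urllib.parse import parse_qs


class Issue:
    matches = r"/(\d+)$"

    def __init__(self, path_groups, query):
        self.issue_id = path_groups[0]

    def handle_GET(self):
        return "<issue id='%s'/>" % self.issue_id


class Container:
    matches = r"/$"

    def __init__(self, path_groups, query):
        # parse_qs maps each name to a list of values.
        self.product = query.get("product", [""])[0]

    def handle_GET(self):
        return "<issues product='%s'/>" % self.product


def dispatch(handlers, method, path, query_string=""):
    """Route one request and return (status, body)."""
    for handler in handlers:
        match = re.match(handler.matches, path)
        if match:
            resource = handler(match.groups(), parse_qs(query_string))
            break
    else:
        return 404, "<error>no resource at %s</error>" % path
    method_obj = getattr(resource, "handle_" + method, None)
    if method_obj is None:
        return 405, "<error>%s not supported</error>" % method
    return 200, method_obj()


print(dispatch([Issue, Container], "GET", "/1001"))
```

As in the library above, the dispatcher knows nothing about issues or XML. It only maps a path to a class and an HTTP method to a handle_ method.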

Next we need to handle the update method. Not many people know that HTTP has a method that exactly corresponds to this operation — PUT. Our implementation of PUT is very similar to our XML-RPC implementation of update, except that we parse the XML ourselves and print a little XML status report. The routine handle_PUT() in class Issue in Listing Four implements the update issue report operation.

HTTP has a method called DELETE for deleting. The delete operation is implemented in handle_DELETE().

So far, we’ve implemented three of the four important HTTP methods: GET, PUT, and DELETE. The last is POST. In our service, POST is for creating a new issue. POST must be called on the service resource, “http://www.flypaper.com/cgi-bin/issues.” The method handle_POST() in class Container in Listing Four creates a new issue report in response to an HTTP POST.

Next, we have to implement the list operation. Let’s add a handler for GET to the service URI. (If DELETE and PUT made sense, we could define those also.)

The output of the list operation is very informative. It looks like this:


<issues xmlns='http://www.prescod.net/issues'>
<issue href='http://localhost/cgi-bin/issues/1001'/>
<issue href='http://localhost/cgi-bin/issues/1002'/>
<issue href='http://localhost/cgi-bin/issues/1003'/>
.
.
.
</issues>

Each element is actually a hyperlink. This is our first taste of the hyperlinking power that you get using HTTP and XML as they were designed to be used. Remember that these links are not in principle limited to being on the same server. REST is the first network application style that leverages the power of hypertext linking.

Next, we have to handle queries. Queries are a kind of GET: queries get information from the server and return it to the client. The client represents a query with a single constructed URI whose query string carries the product name, for example http://localhost/cgi-bin/issues/?product=myproduct.

The handle_GET() method in the Container class in Listing Four combines the listing and querying features.

That’s the end of our service. You can see the complete server source code in Listing Four. You can see the client source in Listing Five.




Listing Five: HTTP Client Code


import httplib
import IssueXML

h1 = httplib.HTTPConnection(‘localhost’)
email, product, description, rel = “a@f.com“, “abc”, “2.0″, “blah”, []
issue = IssueXML.generateIssue(email, product, description, rel)

h1.request(“POST”, “/cgi-bin/issues/”, issue)
data = h1.getresponse().read()
issueref = IssueXML.parseIssueRef(data)

h1.request(“GET”, issueref)
data = h1.getresponse().read()
print data

Benefits of using REST and HTTP

As the code demonstrates, HTTP is logically a very object-like model. Services are resources, and are composed of sub-resources (which could themselves have sub-sub-resources). Parent resources are usually addressed as if they were directories in a file system. Child resources are addressed as if they were files or sub-directories. All resources have consistent methods: GET, PUT, POST and DELETE. You can think of these as analogous to SQL’s SELECT, UPDATE, INSERT and DELETE. Just as those methods are all you need for the relational model, the REST verbs are all you need for the REST model.
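
The SQL analogy can be made literal with a short sketch using Python's built-in sqlite3 module. The table layout and the handle() function are assumptions for illustration only. A real service would sit behind HTTP, not a function call.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a single issues table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, description TEXT)")
    return db


def handle(db, method, issue_id=None, description=None):
    """Apply one HTTP-style verb to the issues table and return the result."""
    if method == "POST":  # create -> INSERT
        cursor = db.execute(
            "INSERT INTO issues (description) VALUES (?)", (description,))
        return cursor.lastrowid
    if method == "GET":  # read -> SELECT
        row = db.execute(
            "SELECT description FROM issues WHERE id = ?", (issue_id,)).fetchone()
        return row[0] if row else None
    if method == "PUT":  # change -> UPDATE
        db.execute(
            "UPDATE issues SET description = ? WHERE id = ?", (description, issue_id))
        return issue_id
    if method == "DELETE":  # remove -> DELETE
        db.execute("DELETE FROM issues WHERE id = ?", (issue_id,))
        return issue_id
    raise ValueError("unsupported method: %s" % method)
```

A typical round trip would POST a description, GET it back by the returned id, PUT a correction, and finally DELETE it, with each verb touching exactly one SQL statement.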

This model is very flexible. First, we could add attachments to the model by just adding another class for them. A POST to an issue would generate an attachment, just as POST to the issue container generates an issue. Each attachment would have its own URI. A DELETE to that URI would remove it. A PUT would update the attachment and a GET would retrieve it.

HTTP also supports content negotiation and client detection. You could serve up different kinds of data depending on the client application: HTML for legacy browsers, raw XML for purely programmatic clients, and formatted XML for newer browsers.

There are caching tools like Squid that can be interposed between an HTTP server and its clients. Caching can dramatically improve your performance and reduce your server and network load. Your REST application can turn on, turn off, and configure caching using cache-control headers. HTTP has standard error messages which are understood by these tools, whereas every XML-RPC or SOAP-based Web Service invents its own.
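
As a rough illustration of what configuring caching can look like, the sketch below builds the headers for a GET response and answers a revalidation request with 304. The function names are assumptions, and the If-None-Match handling is deliberately simplified: a real header may list several validators or contain "*".

```python
import hashlib


def make_etag(body):
    """Derive a validator from the bytes of the representation."""
    return '"%s"' % hashlib.sha1(body).hexdigest()


def respond(body, if_none_match=None, max_age=60):
    """Return (status, headers, body) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "max-age=%d" % max_age}
    if if_none_match == etag:
        # The client's (or cache's) copy is still current: send no body.
        return 304, headers, b""
    headers["Content-Type"] = "text/xml"
    return 200, headers, body


status, headers, body = respond(b"<status>OK</status>")
print(status, headers["Cache-Control"])
```

Because these are ordinary HTTP headers and status codes, an intermediary cache can act on them without knowing anything about issue reports.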

HTTP 1.1 is very sophisticated: it supports authentication, pipelining, conditional operations, long-lived connections, protocol tunnelling, proxying, redirection, and so forth. Most SOAP and XML-RPC implementations do not give you access to even a fraction of that power.

REST Stop Ahead

In short, most Web Services toolkits have not tapped the full potential of HTTP because they do not leverage its underlying architecture, REST.

The REST version of our service is more extensible because it uses the IssueML XML representation, the extensible HTTP protocol and URI-based hyperlinking. It has a clean division of responsibilities between the XML-handling, URL-handling and protocol handling features. Every object is independently addressable and can be incorporated into XML-processing systems as an XML document. Most important, objects can be directly linked across services of all types, turning our service into a single strand in the distributed information system we call the Web.

When you set out to build a Web Service yourself, you’ll need to decide between the service-centric RPC architecture promoted by XML-RPC and SOAP and the resource-centric model described by REST, HTTP, and the rest of Web architecture. If your service is a one-off with limited need for extensibility and integration, then you probably cannot beat the simplicity of XML-RPC. But if you want more extensibility and flexibility, REST is often a better choice.

REST is the architecture of the Web as it exists today. As more and more XML technology and data is integrated into the Web infrastructure, REST will become the natural architecture for the interfaces to Web services. Services will be routinely composed of a complex graph of XML-described hyperlinks. At that point, the “Web” in Web Services will become more than a buzzword. It will become a reality.



Paul Prescod implements XML-based systems for Blast Radius in Vancouver, BC. He is co-author of the XML Handbook and helped to define the original XML specification. You can reach Paul at paul@prescod.net. Martin Streicher is the Editor of Linux Magazine and wrote about SOAP, Web Services, and Apache Axis in the August 2002 issue of the magazine. You can reach him at lmed@linux-mag.com. The Python code shown in this feature can be downloaded from http://www.linux-mag.com/downloads/2003-02/rest.
