Information is power -- but only if it isn't locked up. But how can organizations balance the need for ready access to critical data with the onus to protect that same information from misuse? One solution: hide the information in plain sight. Here's a primer on how to do just that.
Databases are very powerful tools. Often full of vital, proprietary, and sensitive information, a database captures confidences and grants insights to its owner. However, the same easy access to information (credit card numbers, medical records, a rolodex full of clients) that helps run an endeavor can be subverted by a traitorous insider, an eavesdropper, or a clever hacker to achieve some illicit gain.
So, as the truism goes, “With great power comes great responsibility.” Keeping a database filled with sensitive information comes with an onus to protect that information. But at what cost? Can you secure every last byte? And would you even want to? Perhaps not — data is worthless if it’s inaccessible. Ideally, your data is accessible and safe at the same time. And that’s the notion of translucent databases.
What’s the safest place to hide your data? In plain sight.
Always Use Protection
There are two traditional solutions to protecting sensitive information: destroy the data or lock it up in a fortress. The first is an easy choice that’s often overlooked, in part because the familiar “pack rat” instinct compels IT staff to keep the data around just in case. That’s why many managers choose the second approach and rely on the security features of databases to keep would-be trespassers out.
However, in some cases, neither choice makes sense. The data can’t be destroyed and it can’t be protected with adequate security for a reasonable fee. For example, many database administrators know that leaks occur in unlikely places. The database itself may be secure, but a hole in the operating system might open up the raw files. Plugging the operating system works, but only if the insiders are honest. And firewalls can block some outsiders, but that doesn’t protect against inside jobs (whether perpetrated by employees or savvy spies).
So, every database administrator balances the value of the data with the cost of protecting it adequately. No one really cares about the database that tracks when someone changed the light bulbs in the parking lot, but it’s hard to estimate just how badly someone might abuse a service matching baby sitters to parents. And, in general, when making trade-offs between accessibility and security, access wins every time. Database administrators often have to cross their fingers and hope the server is secure enough.
But there is another alternative. You can destroy some of the information, leaving just enough to allow effective analyses. For example, judicious encryption can combine the hand-washing simplicity of destroying data with the ability to do useful work. Databases like these are translucent: partly, but not completely, transparent. Applied in the right circumstances, algorithms from the worlds of cryptography, quantization, and authentication can obscure proprietary or confidential information without destroying the usefulness of the rest.
There are four basic techniques that can be combined to protect data:
* ONE-WAY FUNCTIONS. One-way functions are classic encryption algorithms that scramble data with a complicated and irreversible algorithm. The best ones promise that given f(x), it’s practically impossible to find x. One-way functions scramble data without blocking the ability to search for it. In other words, f(x) acts as a surrogate for x and protects the value at the same time. If someone knows x, they can compute f(x) and search the database. If x is unknown, then it should be practically impossible to infer x from f(x).
* QUANTIZATION. Quantization is a fancy term for rounding off the data by dividing a range of data into a small number of sets known as quanta. The simplest algorithms just round off data, while more complicated algorithms create custom collections of quanta to fit the data. In the ideal case, the quantization adds just enough blurring to protect the sensitive information without blurring it enough to block less sensitive uses.
* AUTHENTICATION. Digital signatures and message authentication codes can vouch for the accuracy of information. Mixing fake information with real information can fool unauthorized users. Those who know how to test for authenticity can separate the wheat from the chaff.
* STEGANOGRAPHY. Many of the standard techniques for hiding information or disguising it can also be used to create a second channel for a database. If a keyed algorithm is used, only a person who knows a key can extract the data.
These solutions provide a wide range of protection against eavesdroppers, attackers, or traitorous insiders. In the most extreme case, a solution effectively destroys the information, rendering it unrecoverable. In other cases, a technique might simply obscure the information enough to protect it. In some cases, the data can even be recovered with knowledge of the correct key.
One-Way Functions
Some of the simplest and most powerful solutions use one-way functions. One-way functions scramble (some) data so completely that it’s practically impossible to run the function in reverse to find the input from the output. One-way algorithms are close cousins of cryptographic algorithms like DES or AES, and often share some of the same structure. In fact, one simple one-way function can be constructed by using a random key with any good encryption function and destroying the key afterwards.
One of the best-known one-way functions is the Secure Hash Algorithm (SHA), a function created and endorsed by the National Institute of Standards and Technology (NIST) with the help of the National Security Agency (NSA). The SHA-1 version of the algorithm takes an entire file and reduces it to a 160-bit number. The algorithm was designed to be as fast as possible, while still offering no weaknesses that might help someone reverse it.
Using the word “hash” to describe the function underscores how it can be used in databases. One-way hash functions like SHA offer cryptographic security and the efficiency of indexing information. For example, if a database holds SHA(data) instead of data itself, someone perusing the database cannot learn the value of the data. The cryptographically secure structure of SHA guarantees it’s not practical to use SHA(data) to recover data.
At first blush, scrambling the data this way seems to make it useless. One-way functions behave similarly to random number generators: columns containing SHA(name) or SHA(address) look like random noise. Sure, you can sort these columns and index them for fast lookup, but that is seemingly worthless if the noise bears little relationship to the data it’s supposed to represent.
In fact, scrambling effectively cloaks the data, but doesn’t make it completely inscrutable. Anyone can still look up particular values and test for equality, a feature used frequently to protect passwords. For example, many systems don’t store passwords in the clear because of the danger from eavesdroppers and attackers. Instead, the systems store the result of applying a one-way function to the password, that is, SHA(password) instead of password. Then, if someone tries to log in, the password he or she types is hashed and compared to the database. If there’s an exact match, the password is authentic. There’s no easy way, however, for someone to use the database of hashed passwords because the one-way functions aren’t reversible.
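The login check described above can be sketched in a few lines of Java. This is a minimal illustration, not a production design: the class name HashedPasswordStore and its in-memory map are hypothetical stand-ins for a real user table, and SHA-256 from Java's built-in MessageDigest is swapped in for the 160-bit SHA discussed in the text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Sketch: store only a one-way hash of each password and compare hashes at login. */
class HashedPasswordStore {
    // Hypothetical in-memory "table": user name -> hex-encoded hash of the password.
    private final Map<String, String> table = new HashMap<>();

    /** Returns the hex-encoded SHA-256 digest of the text (assumption: SHA-256 stands in for SHA). */
    static String sha(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Stores SHA(password); the clear-text password is never kept. */
    void register(String user, String password) {
        table.put(user, sha(password));
    }

    /** Hashes the typed password and tests it for equality with the stored hash. */
    boolean login(String user, String typedPassword) {
        String stored = table.get(user);
        return stored != null && stored.equals(sha(typedPassword));
    }

    public static void main(String[] args) {
        HashedPasswordStore store = new HashedPasswordStore();
        store.register("alice", "correct horse");
        System.out.println(store.login("alice", "correct horse")); // true
        System.out.println(store.login("alice", "wrong guess"));   // false
    }
}
```

Someone who copies the map sees only hash values, yet the equality test still works for anyone who presents the right password.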
Password databases may be the most famous example of translucent databases, but they’re far from the only application that can benefit from the inscrutability provided by one-way functions. Practically any database with personal information can be quite useful even after hashing a name, social security number, or other personal identifier. A medical study, for instance, might store the medical records of each person under SHA(name) instead of name itself. If new information about a person arrives, the records for that person can be looked up, but anyone browsing the files won’t be able to tie an individual with his or her data.
A caveat: this approach does have limitations. While a casual browser can’t look at the value of SHA(name) and determine the name, anyone who knows a particular name can search for it quickly. This can be changed by adding a password to the mix and using SHA(name+password) for the index instead of SHA(name) (the + symbol implies concatenation).
A more sophisticated mechanism for adding a password to a hash function is the HMAC algorithm (described in more detail in the September 2002 issue of Linux Magazine, available online at http://www.linux-mag.com/2002_09/cryptography_01.html). HMAC mixes the password and the name in a more secure way before passing them to a hash function like SHA. Concatenation is secure in most cases, but HMAC is more secure.
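As a rough sketch of the HMAC variant, the following Java uses the standard javax.crypto.Mac class to compute a keyed index from a name and a password. The class and method names (KeyedIndex, hmacIndex) are made up for illustration, and HmacSHA256 is an assumed choice of underlying hash; the article does not prescribe one.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Sketch: a keyed index column computed with HMAC instead of SHA(name+password). */
class KeyedIndex {
    /**
     * Returns the hex-encoded HMAC-SHA256 of the name, keyed with the password.
     * The password must be non-empty because SecretKeySpec rejects an empty key.
     */
    static String hmacIndex(String name, String password) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(password.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] tag = mac.doFinal(name.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : tag) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC computation failed", e);
        }
    }

    public static void main(String[] args) {
        // The same name with two different passwords lands in two unrelated index values.
        System.out.println(hmacIndex("Bob Smith", "first password"));
        System.out.println(hmacIndex("Bob Smith", "second password"));
    }
}
```

Without the password, knowing the name alone is no longer enough to compute the index and search for a record.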
Table One shows some columns from a pharmacy database containing customer records. The items, the prescription sizes, and the dates of purchase are stored along with the hash of the person’s name and password. A customer who knows his password can retrieve his old purchases if he wants to reorder the same prescription again, but a nosy clerk can’t find the names of people taking the powerful Oxycontin (the pills have a sizeable value on the black market).
TABLE ONE: Confidential information is hidden by obscuring the patient name
SHA(NAME) | DATE OF PURCHASE | DRUG | PRESCRIPTION SIZE |
7A849BB938E48CA… | August 10, 2002 | Oxycontin | 28 |
7A849BB938E48CA… | August 10, 2002 | Fenfluramine | 28 |
7A849BB938E48CA… | August 10, 2002 | Acetaminophen | 12 |
AA8482FE9384FED… | August 11, 2002 | Oxycontin | 16 |
7E6469A7B7AAABC… | August 11, 2002 | Tylenol 3 | 7 |
8AAA8B8CD8838FF… | August 12, 2002 | Tylenol 3 | 7 |
8AAA8B8CD8838FF… | August 12, 2002 | Claritin | 7 |
Here are four columns from a pharmacist’s database. The names are protected by the secure hash algorithm. It’s possible to warn a patient about new drug interactions by looking at past records, but it’s not possible for someone to find the names of those taking Oxycontin.
Databases like these show how to block personal information without obscuring the impersonal. For example, every time a customer fills a new prescription, the database can check for dangerous drug interactions. And while these schemes block personal information, they leave enough information for, say, a marketing department to use. A marketing analyst can study the database in Table One and match individuals with their spending habits. If a similar system were used in a department store, an analyst could determine how many people bought a red hat and a blue shirt in the same month. The analyst can’t discern the names or credit card numbers of the consumers, but can make generalizations all the same.
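The kind of aggregate question an analyst might ask can be answered without ever seeing a name. The Java below is a small sketch under stated assumptions: the Purchase class and boughtBoth method are hypothetical, and the sample rows reuse the truncated hash prefixes and drug names from Table One purely as opaque identifiers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: analysis over hashed customer identifiers, with no names involved. */
class TranslucentAnalysis {
    /** One purchase row; the customer is known only by an opaque hash string. */
    static final class Purchase {
        final String customerHash;
        final String item;

        Purchase(String customerHash, String item) {
            this.customerHash = customerHash;
            this.item = item;
        }
    }

    /** Counts the distinct customer hashes that appear with both items. */
    static int boughtBoth(List<Purchase> rows, String itemA, String itemB) {
        Set<String> buyersOfA = new HashSet<>();
        Set<String> buyersOfB = new HashSet<>();
        for (Purchase p : rows) {
            if (p.item.equals(itemA)) buyersOfA.add(p.customerHash);
            if (p.item.equals(itemB)) buyersOfB.add(p.customerHash);
        }
        buyersOfA.retainAll(buyersOfB); // keep only hashes present in both sets
        return buyersOfA.size();
    }

    /** Sample rows: truncated hash prefixes standing in for full hash values. */
    static List<Purchase> sampleRows() {
        return Arrays.asList(
            new Purchase("7A849BB938E48CA", "Oxycontin"),
            new Purchase("7A849BB938E48CA", "Fenfluramine"),
            new Purchase("7A849BB938E48CA", "Acetaminophen"),
            new Purchase("AA8482FE9384FED", "Oxycontin"),
            new Purchase("7E6469A7B7AAABC", "Tylenol 3"),
            new Purchase("8AAA8B8CD8838FF", "Tylenol 3"),
            new Purchase("8AAA8B8CD8838FF", "Claritin"));
    }

    public static void main(String[] args) {
        System.out.println(boughtBoth(sampleRows(), "Tylenol 3", "Claritin")); // 1
        System.out.println(boughtBoth(sampleRows(), "Oxycontin", "Tylenol 3")); // 0
    }
}
```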
One-way Functions for Commerce
Protecting the identities of prescription drug users is just the first way that one-way functions can be used to protect personal information without impeding quantitative analysis or clever business practices. Many online stores, for instance, may keep customer names and addresses to make it easier for a returning customer to fill out shipping and billing forms. The persisted values are ready to go whenever a customer arrives at check out. Of course, the stored information can also be abused by insiders, hackers, and identity thieves.
In a slight variation of one-way functions, sensitive values can be locked with a hash function, but done in a reversible way so data can be extracted when the customer returns. If SHA(name+password) produces one value that’s inscrutable, then SHA(password+name) produces an entirely different value. The latter result can be used as a key to encrypt the personal information with a standard cipher like AES.
By definition, a good hash function guarantees that there will be no relationship between the two values. In practice, short names and passwords can yield problems with some standard hash functions in extreme cases. A more sophisticated approach is to compute SHA(SHA(name+s1)+SHA(password+s2)), where s1 and s2 are two random passwords chosen when a database is initialized. s1 and s2 should be kept as secret as possible because they cannot be easily changed once a database starts collecting data.
Creating two numbers from the name and password string separates the personal information into two parts in a way that both of them are scrambled and can’t be abused by insiders or hackers. When a woman presents her name and password to a web site, the site can compute SHA(name+password), search the database for past purchases, and tailor the web site to the woman’s historical interests. The value of SHA(password+name) can be computed to unlock any personal information like a mailing address or a credit card number.
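One way to picture the two derived values is the Java sketch below. It is an illustration under several assumptions that go beyond the text: the names CustomerVault, index, keyBytes, lock, and unlock are hypothetical, SHA-256 stands in for SHA, the first 16 bytes of SHA(password+name) are used as an AES-128 key, and AES runs in GCM mode with a random 12-byte IV stored in front of the ciphertext. It uses the simple concatenation form, so it inherits the weakness with short or identical names and passwords noted above.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Sketch: SHA(name+password) as a lookup index, SHA(password+name) as an AES key. */
class CustomerVault {
    private static final int IV_LEN = 12; // bytes, a common choice for GCM

    static byte[] sha(String text) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Lookup index stored in the database: hex of SHA(name+password). */
    static String index(String name, String password) {
        StringBuilder hex = new StringBuilder();
        for (byte b : sha(name + password)) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    /** Encryption key, never stored: first 16 bytes of SHA(password+name). */
    static byte[] keyBytes(String name, String password) {
        return Arrays.copyOf(sha(password + name), 16);
    }

    /** Encrypts the secret; the result is IV followed by ciphertext and tag. */
    static byte[] lock(String name, String password, String secret) {
        try {
            byte[] iv = new byte[IV_LEN];
            new SecureRandom().nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(keyBytes(name, password), "AES"),
                   new GCMParameterSpec(128, iv));
            byte[] ct = c.doFinal(secret.getBytes(StandardCharsets.UTF_8));
            byte[] out = Arrays.copyOf(iv, IV_LEN + ct.length);
            System.arraycopy(ct, 0, out, IV_LEN, ct.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decrypts a value produced by lock; a wrong name or password raises IllegalStateException. */
    static String unlock(String name, String password, byte[] stored) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(keyBytes(name, password), "AES"),
                   new GCMParameterSpec(128, stored, 0, IV_LEN));
            byte[] plain = c.doFinal(stored, IV_LEN, stored.length - IV_LEN);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not unlock record", e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = lock("Ann Lee", "s3cret!", "12 Elm St");
        System.out.println(index("Ann Lee", "s3cret!"));           // value saved as the row key
        System.out.println(unlock("Ann Lee", "s3cret!", stored));  // 12 Elm St
    }
}
```

The database would hold only the index and the locked bytes; the key exists only while the customer's name and password are in hand.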
Combining these two techniques can solve many common customer privacy concerns. A store like Amazon can offer almost all of their customization and cross-selling features without worrying about leaks of sensitive data.
One-way Pitfalls
One-way functions are, by their nature, very sensitive — a consequence of ensuring that it’s difficult to deduce the value of x that produced some value SHA(x). If the one-way function is working correctly, a small change in the value of x will lead to a large and unpredictable change in SHA(x).
Unfortunately, this sensitivity can lead to glitches. Because hash functions are sensitive to case, format, and the addition of extra non-printing characters, even small spelling errors or slight changes can leave records irreversibly scrambled. These dangers are well-known to many user interface designers, but fixing them is more important here because there’s no way to reconstruct the error after the fact. If someone types “Bob Smith” into the form on one trip to the store and “Rob Smith” on a second trip, then SHA(“Bob Smith”) and SHA(“Rob Smith”) are going to generate two entirely different numbers.
Many of the standard solutions work well. Removing non-alphanumeric characters and converting all lower-case letters to upper-case remove many of the problems. More sophisticated algorithms like SOUNDEX can convert two English words that sound alike into the same code.
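The simplest of these clean-up steps might look like the following Java sketch, which would run before the name is handed to the hash function. The class and method names are hypothetical, and the pattern keeps only ASCII letters and digits, so accented characters would also be dropped.

```java
import java.util.Locale;

/** Sketch: canonicalize a typed name before hashing it. */
class NameNormalizer {
    /** Removes everything except ASCII letters and digits, then converts to upper case. */
    static String normalize(String raw) {
        return raw.replaceAll("[^A-Za-z0-9]", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Bob Smith"));       // BOBSMITH
        System.out.println(normalize("  bob  smith\t"));  // BOBSMITH
        System.out.println(normalize("Rob Smith"));       // ROBSMITH
    }
}
```

Differences in case and spacing now hash to the same value, while a genuine misspelling such as “Rob” for “Bob” still produces a different record.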
The one danger with these solutions is that they shrink the size of the space that must be searched by someone trying to guess a name and password. Increasing the usability means decreasing the security. If the passwords and names are long enough, this may not be much of a concern, but it can be a potential problem if they’re short.
Blurring the Data
Quantization is another very useful technique that can make a database translucent. The right amount of mathematical blurring can obscure sensitive information without destroying its usefulness. The essential information isn’t destroyed, just the precision that might be abused.
Rounding off numbers is one of the simplest forms of quantization. An algorithm quantizes real numbers by assigning each value x to the quantum i for which i - 0.5 <= x < i + 0.5. This technique is usually used to speed calculations or strip away false precision, but it also adds some secrecy by obscuring the detail.
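A minimal Java sketch of this rounding rule follows. The names Quantizer, quantize, and nearestDegree are illustrative, and the degree conversion assumes non-negative degrees, minutes, and seconds.

```java
/** Sketch: quantization by rounding to the nearest integer. */
class Quantizer {
    /** Returns the integer i such that i - 0.5 <= x < i + 0.5. */
    static long quantize(double x) {
        return (long) Math.floor(x + 0.5);
    }

    /** Converts degrees, minutes, and seconds to decimal degrees, then rounds to the nearest degree. */
    static long nearestDegree(int degrees, int minutes, int seconds) {
        return quantize(degrees + minutes / 60.0 + seconds / 3600.0);
    }

    public static void main(String[] args) {
        System.out.println(quantize(2.49));              // 2
        System.out.println(quantize(2.5));               // 3
        System.out.println(nearestDegree(45, 20, 12));   // 45
        System.out.println(nearestDegree(23, 55, 58));   // 24
    }
}
```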
Table Two shows the positions of some naval ships in latitude and longitude. Let’s assume, for instance, that the Navy wants to effectively declassify the latitude and longitude of the ship by rounding off to the nearest degree. (In practice, the circle of declassification could be larger or smaller.) Removing the restrictions on the less precise data can make life a bit easier for everyone: sailors can tell their families where they are without endangering the ship, and schedulers can arrange for mail, fuel, and food to reach the ships without knowing enough information to help a missile find the ship. The Navy might even want to place the less precise information on a web site, while keeping the detailed data locked up in a secure database.
TABLE TWO: A seemingly accurate table of ship locations
SHIP | LATITUDE D | LATITUDE M | LATITUDE S | LONGITUDE D | LONGITUDE M | LONGITUDE S |
Ticonderoga | 45 | 20 | 12 | 20 | 10 | 4 |
Indomitable | 48 | 12 | 18 | 14 | 10 | 14 |
Inscrutable | 40 | 15 | 21 | 27 | 12 | 50 |
Impertinent | 41 | 18 | 25 | 23 | 55 | 58 |
This database shows the location of several naval ships, but the numbers in the minutes and seconds column are scrambled by adding an encrypted value. Someone who knows the right key (“Beat Army”) can uncover the true values, but everyone else must live with imprecision embedded in the data.
Rounding off values is not the only way to effectively quantize information. There are a number of sophisticated algorithms for finding the optimal number of quanta to describe a different data set. The best ones adapt the size and the shape of the quanta to minimize the error.
One close cousin to rounding off the information is adding (seemingly) random noise to each point. This doesn’t break the data into discrete units, but it does ensure that close points become virtually indistinguishable.
This approach can even store secret, precise values in the same database as imprecise values. The trick is to encrypt the noise added to each value. Anyone with the right key can decrypt the extra precision, but the rest of the world just sees the imprecise version.
Table Two shows the positions of some naval ships with seemingly great precision. The locations aren’t accurate, however, because a random value was added to each position with the following algorithm (see also Listing One):
1. Concatenate the name of the ship with the password “Beat Army.”
2. Pass the previous result to a version of SHA.
3. Take bytes from the result in the previous step and add them to the minutes and seconds of each position.
4. Blur the longitude by plus-or-minus one degree by converting the random value from step (3) into minutes and seconds and adding it to the ship’s longitude.
5. Repeat the process for the latitude — using different words in step (1).
LISTING ONE: Java code to blur ship locations
This Java code uses a JDBC prepared statement to insert information on the position of ships into a database. The degrees of latitude and longitude (lad and lod) are left untouched, but the minutes and seconds are scrambled by adding a byte from the hashed value of the passcode and the name of the ship. MD5Bean is a Java object that implements the MD5 standard. Java’s built-in cryptography functions offer a wide array of one-way functions. Read the documentation to discover which are more secure.
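// Fragment of a larger class. Requires java.sql.PreparedStatement and
// java.sql.SQLException. The fields dw (a database wrapper), navyInsert
// (a PreparedStatement), and MD5Bean are declared elsewhere.
public void initAppointments1()
{
    // INSERT ... SET is MySQL syntax; the statement has seven placeholders.
    String q = "INSERT INTO navy1 SET ship=?,";
    q = q + "latitudeD=?,latitudeM=?,latitudeS=?,";
    q = q + "longitudeD=?,longitudeM=?,longitudeS=?;";
    navyInsert = dw.createPreparedStatement(q);
}

// Adds an offset of 0..59 derived from b, wrapping around at 60.
public int scrambleValue(int in, byte b)
{
    return (in + Math.abs(b % 60)) % 60;
}

// Subtracts the same offset, undoing scrambleValue for inputs in 0..59.
public int unscrambleValue(int in, byte b)
{
    int i = in - Math.abs(b % 60);
    if (i < 0) i += 60;
    return i;
}

public void encryptedInsert(String passcode, String ship,
                            int lad, int lam, int las,
                            int lod, int lom, int los)
{
    try {
        // Take care of the unencrypted columns first: ship name and degrees.
        navyInsert.setString(1, ship);
        navyInsert.setInt(2, lad);
        navyInsert.setInt(5, lod);

        // Derive the scrambling bytes from the key phrase, passcode, and ship name.
        MD5Bean.clear();
        MD5Bean.Update("Beat Army!");
        MD5Bean.Update(passcode + ship);
        byte[] b = MD5Bean.Final();

        // Scramble minutes and seconds, each with its own byte of the hash.
        int temp = scrambleValue(lam, b[0]);
        navyInsert.setInt(3, temp);
        temp = scrambleValue(las, b[1]);
        navyInsert.setInt(4, temp);
        temp = scrambleValue(lom, b[2]);
        navyInsert.setInt(6, temp);
        temp = scrambleValue(los, b[3]);
        navyInsert.setInt(7, temp);

        navyInsert.executeUpdate();
    } catch (SQLException e) {
        System.out.println("SQL Exception:" + e);
    }
}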
Practically any choice of words works as long as they are distinct. This guarantees that the two random values are different. This database lets anyone see the unclassified location of the ship, and allows anyone who knows the password, “Beat Army,” to recover the true location.
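To make the recovery step concrete, here is a self-contained Java sketch of the round trip without a database. It is an approximation of the scheme rather than the listing itself: the class BlurredPosition and its methods are hypothetical, and Java's built-in MessageDigest with SHA-256 is assumed in place of MD5Bean. The scramble and unscramble arithmetic mirrors Listing One.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: blur minutes and seconds with hash-derived noise, then recover them. */
class BlurredPosition {
    /** Derives the noise bytes from the passcode and the ship name. */
    static byte[] noise(String passcode, String ship) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest((passcode + ship).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Adds an offset of 0..59 derived from b, wrapping around at 60. */
    static int scramble(int in, byte b) {
        return (in + Math.abs(b % 60)) % 60;
    }

    /** Subtracts the same offset, undoing scramble for values in 0..59. */
    static int unscramble(int in, byte b) {
        int i = in - Math.abs(b % 60);
        return i < 0 ? i + 60 : i;
    }

    /** Blurs up to 32 minute/second values; degrees would be stored untouched. */
    static int[] blur(String passcode, String ship, int[] minSec) {
        byte[] b = noise(passcode, ship);
        int[] out = new int[minSec.length];
        for (int i = 0; i < minSec.length; i++) out[i] = scramble(minSec[i], b[i]);
        return out;
    }

    /** Recovers the true values for anyone who knows the passcode. */
    static int[] recover(String passcode, String ship, int[] blurred) {
        byte[] b = noise(passcode, ship);
        int[] out = new int[blurred.length];
        for (int i = 0; i < blurred.length; i++) out[i] = unscramble(blurred[i], b[i]);
        return out;
    }

    public static void main(String[] args) {
        int[] truth = {20, 12, 10, 4}; // latitude M, S and longitude M, S (illustrative values)
        int[] stored = blur("Beat Army", "Ticonderoga", truth);
        int[] back = recover("Beat Army", "Ticonderoga", stored);
        System.out.println(java.util.Arrays.toString(stored));
        System.out.println(java.util.Arrays.toString(back)); // [20, 12, 10, 4]
    }
}
```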
Embracing Translucency
All of these data-hiding tricks are useful in the right circumstances, but they often require some political changes in organizations. Many database administrators and IT managers instinctively save as much information as possible because the facts and details about customers, products, contracts, and relationships are the glue that holds together the organization. Almost everyone can tell a story about how an odd, almost forgotten backup copy saved the day.
But few people consider how backup tapes and too much information can ruin some days — even though the news is filled with examples. Errant clerks keep a copy of credit card numbers for their own use; stalkers often abuse databases; and criminals often find ways to read police databases.
Database designers must learn to balance the potential for harm against the potential to help when deciding what data to keep. In many cases, company lawyers are leading the way. Many now require employees to delete all email soon after it loses its relevancy. Others are weighing the costs of complying with a subpoena when they store information.
Building a good translucent database is as much a political act as a technical job. The database administrator — perhaps that’s you — must determine what information can and should be destroyed. In the best situations, a good translucency technique can extend the usefulness of information long past the time when it may have otherwise been destroyed.
Unfortunately, not every situation lends itself to these tricks. For example, most employees want a business to keep an accurate database cataloging the names and the amount of taxes deducted from their paycheck. Destroying the link between identity and information can sometimes hurt people.
Finding the right amount of translucency is an art for the database administrator. In the right situations, the right amount of scrambling can produce a very efficient database that can withstand attacks from without and within.
Peter Wayner is the author of a dozen books including Translucent Databases, a book filled with dozens of examples of how to keep just the right information around.