Although nobody wants to build slow and corrupt databases, it seems that many people do a very good job of slowing down and corrupting their databases. I have listed some of the methods they use below. (In case my sarcasm is not apparent, I am suggesting that you avoid the actions indicated by the section headings.)
Note: In the following discussion, I have included links for some of the terms to items in Wikipedia. This is done for convenience. The SQL for entering the sample database shown below can be found by clicking here.
Don’t use constraints
All of the relational database management packages allow you to apply constraints on a database. These are rules that must be satisfied before a new record is inserted or a change is made to an existing record.
- Not null – A specification of a column in a table that may not contain null values.
- Uniqueness – A set of columns in a table whose set of values must be unique within the table. These sets of columns are called candidate keys, with one of the sets being the primary key.
- Foreign key – For every given record in one table, there must be a corresponding record in another table. For example, records in an inventory list may contain part numbers. For each record in the inventory list, the part number must agree with a record in the list of part numbers.
The following is an example of SQL code that will use constraints when setting up the tables.
CREATE TABLE COUNTRY (
COUNTRY VARCHAR(2) NOT NULL,
PRIMARY KEY (COUNTRY)
CREATE TABLE STATE (
STATE VARCHAR(2) NOT NULL,
COUNTRY VARCHAR(2) NOT NULL,
PRIMARY KEY (COUNTRY, STATE),
FOREIGN KEY (COUNTRY) REFERENCES COUNTRY(COUNTRY)
CREATE TABLE CITY (
CITY VARCHAR(40) NOT NULL,
COUNTRY VARCHAR(2) NOT NULL,
STATE VARCHAR(2) NOT NULL,
PRIMARY KEY (COUNTRY, STATE, CITY),
FOREIGN KEY (COUNTRY, STATE) REFERENCES STATE(COUNTRY, STATE)
Identifying and inserting the constraints does take time, and the following are the reasons that some people have used for not implementing them.
- If the code is written correctly, none of the records in the database will violate the constraints. Therefore, they argue, time spent in applying constraints is time wasted. However, how can you insure that the code is written correctly if you haven’t identified all of the constraints? Furthermore, it is never safe to assume that your code is error free. As the saying goes, “trust, but verify”.
- Sometimes the data in the database is already corrupt. By not applying the constraints, you can keep operating until some time in the future when you clean up the data. However, as time goes on, it becomes more and more difficult to clean the data, and that time in the future never seems to come. Furthermore, processing corrupt data can have strange results.
- I have seen people apply both of the first two arguments simultaneously. The code was incorrect in the past and corrupted the data, but is now correct and no further corruption will take place. My counterarguments for both of these cases still apply.
Even if your current programmers have achieved perfection, there is no assurance that programmers in the future will share this blissful state. Furthermore, if you haven’t identified all of the constraints, how can you check that your software is correct.
In addition to preventing corruption of the database, these constraints are also used by the optimizer in the database management system to find the best way of accessing the data. (There are often multiple sets of indices that can be used to locate the records, and the job of the optimizer is to find the most efficient set, which is referred to as a strategy.) Failure to implement the constraints will result in non-optimal strategies being selected. With some database managers, you can get around this by inserting “hints” in the data requests. These hints override some of the heuristic algorithms in the optimizer. However, I have a few problems with hints.
- In every case that I have seen, the use of hints was required because the constraints were not fully applied.
- Valid changes to the database can cause the hints to become invalid.
- Depending on the constraints in the database, the database manager may ignore some of your hints. The database manager knows that the constraints are true, and will often apply more weight to the constraints than to the hints when determining a search path.
There is another reason for using foreign keys. The API’s for databases can examine the metadata and automatically generate an entity relationship diagram using the foreign keys.
Don’t Use Indexes
Indexes can speed up your transactions, but there are a few things to consider.
- If the records are inserted infrequently but read very often, indexes will speed performance tremendously, and the creation of additional indexes doesn’t pose much of a problem. If the data is read infrequently, but records are frequently inserted, additional indexes can impose high demands on your system, and may not save you much. However, you will still need indexes on the primary and candidate keys to insure that the database isn’t corrupted.
- Hashed keys are only useful if you know the values of all of the columns in the index. This may seem obvious, but it is sometimes overlooked.
- For sorted (ordered) keys, the order of the columns is significant. For a table that is read very often, you may want additional ordered indexes. For example, a table of medical activities might have the patient as the first item in one index and the provider as the first item in another index. In addition, adding additional columns to the end of an ordered index doesn’t introduce a heavy burden on the system, but may be useful.
- The impact of the indexes will change as the database grows. Make sure that you test the indexes on a data set that is comparable in size to the production data.
In some cases, the database manager may add indexes based on the primary and foreign keys. In other cases, it may search the entire table if keys aren’t provided.
Don’t Use Transactions
Transactions allow the system to insure that SQL queries are executed properly. When you group queries within a transaction, all of the queries will be executed or all will be cancelled. However, some people still don’t use transactions. Another problem is that some people misuse transactions. The period between the start and end of a transaction should be kept short. If the database operations require user interaction or extensive preparations, prepare the data first and then carry out the data modifications in a single short transaction. This will reduce the chance of the database locking up.
Documentation for a database serves a number of purposes.
- A data dictionary defines the tables and columns in a database and enables all of the people working with the database to share a common understanding of the schema.
- Creating documentation forces you to think about how the database is organized, and can help to avoid errors. It also allows others to review the design.
- It reduces the effort required and the number of mistakes that are made when changes are made to the system.
Before you begin the design, you need to identify the most important tables for the application. These are the tables where changes will require major changes to the system, and they need to be defined as soon as possible. For these tables, you will need a list of the columns, the primary and alternate keys, and the foreign keys relating the tables.
Some parts of the documentation will be required for testing of the system and its components. If you are going to test for corruption of the database, you will need to know the constraints on the database.
These hierarchical sets will contain most of the key tables in the database. Changes to these tables will normally involve massive reworking of the code processing the database. Your data dictionary will need to include these tables first.
Ignore Errors and Crashes
Some people will say that it doesn’t matter if a system freezes so long as it works when you restart it. However, if you don’t know why the system froze, how do you know that the data hasn’t been corrupted. (Hopefully, the proper use of transactions will help avoid problems.) If the system crashes or produces error messages, you really need to know what is happening. The story of the Therac-25 is often used as a cautionary tale of what happens when you ignore error messages. (In this case, ignoring error messages resulted in a number of people dying from radiation exposure.)
When you run the database queries from your programs, be sure to check the return codes. If you are using Java, do catch blocks for expected return codes first. Unexpected return codes should generate a log entry. I have seen some people make assumptions about the error being returned. For example, they assume that an error code returned from a “CREATE TABLE” command will either be success or that the table already exists. Assumptions can be devastating.
There may also be questionable areas that you want to mark down for future review. For example, if you find hard limits for some values, you may want to make a list for consideration in future updates.
Don’t Run Sanity Checks
When two tables are linked together using a one to many relationship, the constraint doesn’t restrict how many records in the second table can be linked to the first table. The following code is an example for testing the relationship between a table containing countries and a table list states/provinces within each country.
mysql> SELECT MULT, COUNT(MULT) AS FREQ
-> ( SELECT COUNT(COUNTRY) AS MULT, COUNTRY
-> FROM STATE GROUP BY COUNTRY ) FIRST
-> GROUP BY MULT
-> ORDER BY MULT;
This indicates that there is one country with 13 states (Canada) and one country (USA) with 51 states. If I had a database using 300 countries and the query said that one country had 3000 states while all of the others had under 100, I would be very suspicious of the country that had 3000 states.
For large databases, these queries can take a long time to run but you only have to run them during testing sessions. Some of the outliers may actually be legitimate, but looking at the outliers will reduce the number of cases that have to be examined.
The goal of these sanity checks is to identify statistical anomalies in the data. Upon inspection, some may be valid, but some may point to problems in the underlying software.
And some may point to problems in other areas. There was a bank where a few employees were carrying out fraudulent transactions for cash values that were just below the trigger level for audits. If they had been running statistical distributions of the value of the transactions, the bank would have been able to detect this problem before they lost a few million dollars.
Don’t Allow Users to Make Temporary Tables and View
In addition to the database administrator, there are other programmers that use the database. If they aren’t allowed to generate their own tables and views, it can make their job much more difficult. (These temporary files are placed in a separate schema, so that they can’t affect the actual production database.) There are workarounds to not being able to create these, but they are messy and expensive. Furthermore, if the creation of temporary tables can corrupt the production database, you have very big problems.
Worship the Optimizer
When you run an SQL query, there are often multiple ways of searching through the tables for the desired information. The optimizer selects what it assumes will be the best method for the search. Some people assume that the optimizer is so good that they don’t have to worry about the quality of their code. The optimizer is not magic, and it is essential to understand it and its limitations. (See Freefall, strip 255, http://freefall.purrsia.com)
It is preferable to give statements to the database manager rather than hints. Since it knows the statements are true, it can make many inferences. Hints are viewed as “might be true”. The statements are the constraints on the tables, the contents of the indexes, and the size of the tables.
Don’t Update Software
If you don’t apply updates to the drivers and database managers, very strange things can happen. One major item is that security holes are fixed, and this often includes security holes that the vendor never told anybody about.
Of course, you have to test the software when you carry out updates. But you have to test it anyway, and the testing may reveal problems unrelated to the updates. And if you do find errors and haven’t updated the software, there is a chance that the problems are fixed by the update and aren’t actually due to your code.