Content area
Virtually every major database manage- ment systems vendor-long touting strengths in online transaction processing (OLTP) systems-has announced recently that it can now handle the very different decision-support needs of a data warehouse.
A key aspect of this new-found functionality is the use of bitmap indexing technology, which significantly improves response time over traditional indexing methods by greatly reducing the number of read operations to the data. u Besides allowing more users to access the warehouse simultaneously, bitmap indexing also makes it easier for users to pose a series of queries to analyze the data. Moreover, an acceptable level of responsiveness can be achieved with a lower hardware expenditure than with traditional indexes-and often without either specialized logical database designs, such as star schemas, specialized physical database designs required by massively parallel computers, or specialized DBMSs that are dedicated to multidimensional analysis.
- In some ways, bitmap indexing has become the latest battleground in the increasingly nasty database wars. Sybase Inc.'s IQ database, announced on Oct. 30, is a fully inverted database built on new indexing methods, of which bitmapped access is an integral part. Both Oracle and Red Brick Systems will incorporate bitmaps into their next releases.
To scrutinize vendor claims, make informed purchasing decisions, and do a proper job of database design, database administrators need to better understand bitmaps and the different ways that the DBMS vendors are incorporating them.
B-Trees And Hashing
Most database management systems provide a variety of ways to physically access the data. The mix of tasks you are trying to accomplish (the workload) and the nature of the data determine which access method or methods will deliver the best performance.
Virtually all relational databases support both sequential and indexed access as fundamental ways of finding data. Sequential access is simply reading the data from the start of a file until you locate the desired row or record, at which time you can read it, update it, or delete it. This may be an effective access method for small tables where all of the rows are physically clustered together, or in applications that need to perform some action on every record in the database.
But suppose you want to find one customer out of hundreds of thousands who are stored in customer number sequence-and all you know is the last name. A sequential search of the collection of customers would take a considerable amount of time.
More likely, you would have made Customer_Last_Name an index field. Typically, the DBMS would build a B-tree index and store information in a hierarchy (or tree) of pages. Each page would contain many index entries as well as pointers to the next page of index entries. The advantage of B-tree indexes over sequential access is that, instead of looking at tens of thousands of data pages, the DBMS needs to look at only a few-perhaps just three or four-index pages until it finds the requested rows. Then it retrieves just those data pages from the database.
Customer_Last_Name is an example of data with many different values, which is called high-cardinality data. The most extreme case of high-cardinality data is a database column-such as Customer_ID-where every value is unique. For these columns, some DBMSs, such as Oracle and CA/Ingres, provide hashed method access. Instead of building an index, hashing uses a mathematical function to directly calculate the page on which a data row is stored. Provided that the function distributes the records fairly evenly across the pages of the database, only one read operation-to the data itself-is required to retrieve the requested row.
At the other extreme is low-cardinality data-when the number of different values for a column is low. In an employee database, for example, the gender column would typically have only two values-male or female. In a database with a roughly equal number of men and women spread evenly throughout the database, building a B-tree often does not save read operations (and hence does not speed access) because to look at either the male or female rows, you would still need to retrieve all the database pages.
Hashed and B-tree indexes work well in transaction-oriented systems where the number of rows being accessed by a transaction is relatively low. They also perform reasonably well on update operations, although to maintain performance, B-trees may need to be rebuilt after many changes have been made to the database.
But hashing and B-tree indexes begin to have problems with queries that have a complex set of conditions-in other words, exactly the kind of query for which people use data warehouses. When selection conditions are combined with ANDs and ORs in the SQL WHERE clause, it is common for a query optimizer to choose the index that produces the fewest number of rows and then further search that data for the remaining query conditions. A few systems will do a little more. For example, Oracle will process ANDs of key fields in the index, and IBM's DB2 version 2 Common Server will perform ORs of key fields in the index, but this is the barest minimum.
Enter Bitmaps
Bitmaps are attractive for data warehousing applications because, when properly implemented as part of an index scheme, they can vastly improve query response time. This is because complex searches on many conditions, aggregations, calculations, and joins can be done entirely in the indexes without reading data. Even limited use of bitmaps can reduce read operations to the database and improve the performance of some queries.
To best see how bitmaps work, lets look at an example-an automobile database with 10 million cars. Each row of the database has a unique internal system identifier called the row-id, or RID (not to be confused with the unique data identifier, or primary key). Each distinct value in this column would have its own bitmap index consisting of 10 million bits-one for each record in the database. If a bit is on (1), that value occurs in the record. If the bit is off (0), the value does not occur in the record. Bitmaps don't need pointers to data records because those records are identified through their position in the bitmap; thus the first bit represents record 1, the second bit, record 2, and so on.
Building a bitmap index with an implementation as simplistic as I have described would lead to two problems: handling high-cardinality data and update performance.
As a rule of thumb, somewhere between 100 and 1,000 distinct values (depending on your implementation and database size) would be considered high-cardinality data, and thus would be inappropriate for a pure bitmap index.
For example, a range retrieval using simple bitmaps on high-cardinality data, such as "Find all cars bought between Feb. 5, 1994 and July 8, 1995," would require evaluating more than 500 bitmaps-one for each day in the range. Unless this problem is solved, bitmaps will work well only with low-cardinality data. Query performance will suffer because the database system will still need to scan data pages to find the records that meet the selection criteria.
The second problem is that updating a bitmap to reflect changes in data can take substantially longer than updating a plain B-tree. Modifying bitmaps takes a fair amount of overhead in finding the bits to switch on or off, as well as the need to add bitmaps when new values are added to the database.
Advanced Indexing
The most common methodology of database vendors today is to use complex indexes based on combinations of B-trees, bitmaps, and record ID lists.
For example, each Vehicle ID number in our sample database is unique, and therefore, if we used bitmaps alone, there would be 10 million bitmaps of 10 million bits each, for a total of 100 trillion bits or 12.5 terabytes. Even if the DBMS compresses the bitmaps, storage and performance are still a problem, so some kind of RID list is frequently used. On the other hand, the make of car, for which there are only a few distinct manufacturers, could be implemented as a small B-tree resulting in 100 bitmaps at the leaves. On average, each bitmap contains 100,000 bits (10 million/100) indicating a particular manufacturer.
The complexity and variety of the index structures reveal two facts. First, there is no simple solution for query performance in large databases. Simply adding bitmaps is not enough. A variety of integrated, high-performing structures are necessary to allow flexible querying with fast response. Second, a corporation needs a database designer who understands what is in the database and how it is used. Good physical database design is the prerequisite for good performance.
Sybase IQ
Sybase IQ is designed to handle moderately large data warehouses of up to a few hundred gigabytes without requiring expensive hardware. It is a descendant of Expressway, which Sybase acquired in late 1994. IQ differs from Expressway in that it is integrated with Sybase's SQL Server database catalog, can perform ad hoc joins (while still supporting predefined joins), and has additional SQL support, such as correlated subqueries and inter-column arithmetic.
Sybase also added support in IQ for symmetric multiprocessing computers with parallelization of loading and building indexes, processing bitmaps, and performing sorts. Now in beta, IQ will be generally available in the first quarter of 1996.
IQ's designers concentrated on quickly performing operations and calculations on any data type of any cardinality. This led Sybase to develop a variety of index structures for which bitmaps are a key component, but not the only piece. Sybase provides a choice of five index techniques: High Group, High Non-Group, Low Fast, Low Disk, and Fast Projection Index. The appropriate selection of index (or indexes) depends on the cardinality of the data and how the data is accessed in queries.
For low-cardinality data, IQ has two index types: Low Fast and Low Disk. Both indexes use a B-tree with a compressed bitmap in the leaves. Low Disk makes better use of disk space, but is more processor-intensive.
For high-cardinality data, Sybase uses two types of index, High Group and High Non-Group. High Group creates a B-tree and bitmaps, whereas High Non-Group is a pure bitmap. Both are designed for range retrievals and aggregations, but for fields where equality and grouping in queries is important, a High Group index is preferable because the standard B-tree processes such a request more effectively than bitmaps.
The last index type, the Fast Projection Index (also referred to as the column store), is designed for vertical partitioning of data. Typically, this indexing is performed on fields as a second index, in addition to using one of the other index types. The Fast Projection Index contains a compressed list with a value for each row of the specified column. It is designed for operations such as "like" comparisons, inter-column comparisons or calculations, or ad hoc joins.
Unlike most DBMSs, IQ is designed to be used with a fully inverted database-every column in the database is indexed. Even though the user sees the data as standard relational tables, in fact, no data rows, per se, are stored; all the data is contained in the index structures. IQ uses compression to keep the database size smaller than the space the raw data requires, even when columns have two indexes. With tens of millions of rows in a table, the bitmaps can be very large, so Sybase also applies a variety of compression techniques based on the index structures.
IQ's support of predefined joins significantly speeds up linking tables. Ad hoc joins also are supported, using the row ids identified in the bitmaps. The join algorithms are either a sort-merge, nested loop, or a modified hash join.
Because IQ is designed for read-only data warehouses, it does not support interactive updates, which would be especially time-consuming for a fully inverted system. Instead, all updates are done periodically in batch mode. Sybase has also designed its index structures so that large incremental changes to indexes do not require them to be rebuilt. This obviously avoids the problems of update performance for bitmaps.
Red Brick Warehouse
Red Brick Warehouse version 4.0, due in early December, will add a column index, called a TargetIndex, This new index type is based on bitmaps and designed for fast selection from a given table. It supplements Red Brick's existing B-trees, pattern indexes, and the StarIndexes. Queries can be answered using a combination of all index types.
Unlike Sybase IQ, TargetIndexes can be interactively updated. Although users do not typically update data in a data warehouse, Red Brick sees this as important for data administration and data cleaning.
To address the problems associated with high-cardinality data, Red Brick provides three index structures in building a TargetIndex: very low cardinality (2 to 16 values), medium cardinality (8 to 256 values), and high cardinality (64 to 128,000 values).
The low-cardinality index is implemented as a B-tree with bitmaps in the leaves and the value of the bitmap in the header. These indexes typically take only 20% of the time to build as pure B-tree indexes. Red Brick does not compress the bitmaps, because its measurements showed that compression for very-low cardinality data yielded no significant increase in performance.
As the cardinality increases, the number of occurrences of each value decreases. So for medium-cardinality data, Red Brick supplies a B-tree with a compressed list of RIDs at the leaves. For high-cardinality data, Red Brick supplies a B-tree with a non-compressed list of RIDs at the leaves.
To get good update performance with bitmaps, Red Brick sets up pointers into the bitmap or compressed RID list to make it quicker to find the bits or RIDs that need to be changed. It also carefully manages the RID list in the high-cardinality index to maintain update performance.
Red Brick has paid a lot of attention to the query optimizer to extract the maximum performance from this relatively straightforward indexing. For example, the optimizer compares whether it is faster to combine results of each selection condition as a bitmap or a list of RIDs. With multiple selection criteria, Red Brick uses a heuristic algorithm to determine when it is more cost-effective to fetch the data rather than combine all of the bitmaps.
The StarIndex, always one of Red Brick Warehouse's distinctive features, allows it to rapidly join star schema data. This schema is characterized by a central fact table (such as the amount of a purchase) that is surrounded by dimension tables (such as date of sale, product, or sales organization) on which calculations are based. The fact table usually consists of many rows-possibly hundreds of millions-while the dimension tables are usually less than 10% the size of the fact table.
The TargetIndex speeds star joins by finding selected rows, which are then used with the StarIndex. The TargetIndex also can work with StarIndexes to join fact tables or to speed ad hoc joins if no StarIndex is available.
While interactive updates are supported, Red Brick Warehouse is designed especially-as its name implies-for data warehousing. So it is not surprising that most updates are performed by the database administrator using the data loader, which allows updates and inserts, but not deletes. Most sites update the warehouse daily, although some update weekly. Loading can be done in parallel, with parallelized index constraints coming early next year.
Oracle7.3
Oracle is adding a bitmap index to go along with its hash clusters and B-tree indexes. Oracle bitmaps are intended for low-cardinality data, while B-trees will continue to be used for high-cardinality data. Oracle7.3 is scheduled to ship in the first quarter of 1996, with bitmap indexes to be added later.
Oracle's approach is a straightforward implementation of bitmaps in the leaves of B-trees. The bitmaps are compressed using a method adopted from Oracle's text server. A particularly neat feature of this compression mechanism is that it allows logical manipulation of the bitmaps in their compressed state.
Oracle intends that bitmaps be used as part of the regular mix of access methods and allows bitmaps to be updated in real-time.
To support row locking, Oracle locks only a portion of the bitmap. While the company believes the update performance should be comparable to that of B-trees when the product is released, it is still too early for measurements to show this.
When multiple selection criteria reference bitmaps, the results can be logically combined. Counts can be performed in the index without the necessity of going to the data pages. In a complex query that mixes selection criteria that are indexed with both B-trees and bitmaps, the two index types cannot be logically combined; the more selective result (typically the B-tree) would be used to find the requested data.
Knowledge Is Power
By applying the appropriate indexing techniques for the specific data cardinalities and query types found in their corporate databases, administrators can greatly increase the responsiveness of large databases-without a big investment in hardware.
Herb Edelstein is a principal of Euclid Associates, a database and document image management consulting firm in Potomac, Md.
(Copyright 1995 CMP Publications, Inc. All rights reserved.)
