Projection Segmentation

To optimize query performance, projections can be segmented (distributed evenly across multiple database nodes). The term segment refers to the portion of a projection stored at a particular node. A node can store a collection of segments of various projections.

Dimension tables are assumed to be much smaller than the fact table and are not segmented; they are replicated on multiple nodes. The system catalogs are also stored redundantly at all nodes; they are small and are updated infrequently.

Hash segmentation is the preferred method of segmentation. It allows you to segment a projection based on a built-in hash function that provides even distribution of data across multiple nodes, resulting in optimal query execution. A hash function is a reproducible method of converting data into a number that can serve as a digital "fingerprint" of the data. In a projection, the data to be hashed consists of one or more column values, each having a large number of unique values and an acceptable amount of skew in the value distribution. Primary key columns that meet the criteria may be an excellent choice for hash segmentation.

Hash segmentation is used in segmented projections produced by the

Database Designer.

The Database Designer is a tool that analyzes a logical schema definition, sample queries, and sample data and generates a set of projections in the form of an SQL script to be executed after you create the tables but before you load any data. The script creates a minimal set of superprojections to ensure K-Safety, and optionally pre-join projections. In most cases, the projections created by the Database Designer provide excellent query performance within physical constraints. You can, however, write a custom projection script should the Database Designer not meet your needs.

Range segmentation consists of designating one column of a projection as the segmentation column. The range of possible values of a segmentation column is decomposed into disjoint sub-ranges which are allocated to specific nodes.

Range segmentation based on dates and/or timestamps is not recommended because it results in a skewed data distribution. For example, suppose you segment by year using a DATE column. All rows inserted within the current year would loaded into the same segments on specific nodes. Thus, a query asking for rows inserted during any specific year would be executed only on those nodes. The rest of the nodes would be idle.