Book Contents

Book Index

Next Topic

Home

Using an Artificial Segmentation Column

The Database Designer recognizes a fact table column named specifically ARTIFICIAL_SEGMENTATION_COLUMN and automatically uses that column as the range segmentation column in superprojections and pre-join projections in which that fact table is the anchor table.

The purpose of the artificial segmentation column in Vertica V1.3 was to provide a way to get even data distribution for tables that have no good range segmentation column. Because Vertica V2.1 includes built-in hash segmentation, the use of an artificial segmentation column is supported only for the maintenance of existing databases that use the feature.

In Vertica V2.0 and later, you can use an artificial segmentation column to force the Database Designer to use range segmentation instead of hash segmentation. For example, if your table has an excellent choice of segmentation columns, you can name it ARTIFICIAL_SEGMENTATION_COLUMN in order to to avoid the overhead of using the hash function on that column.

Modifying the Logical Schema

Modifying the logical schema requires nothing more than adding a column to the CREATE TABLE statement that creates the fact table. For example:

CREATE TABLE Retail_Sales_Fact (

Date_Key INTEGER NOT NULL,

Product_Key INTEGER NOT NULL,

Store_Key INTEGER NOT NULL,

Promotion_Key INTEGER NOT NULL,

POS_Transaction_Number INTEGER,

Sales_Quantity INTEGER,

Sales_Dollar_Amount INTEGER,

Cost_Dollar_Amount INTEGER,

Gross_Profit_Dollar_Amount INTEGER,

ARTIFICIAL_SEGMENTATION_COLUMN INTEGER NOT NULL);

Use INTEGER as the data type of the artificial segmentation column (in order to get properly formatted output, should you require it). The column obviously cannot contain nulls. The ordinal position of the column makes no difference.

Preparing the Data

The process of generating segmentation data is quite simple. The goal is to create a wide range of data values with even frequency distribution within statistical requirements, which need not be stringent. You can use any valid value for the INTEGER data type, whose range is -2^63+1 to 2^63-1. Vertica Systems, Inc. recommends a range consisting of at least twenty thousand (20K) unique values.

Do not use a small range of numbers for this purpose. For example, the number of nodes in the cluster may seem like a good choice but it does not allow for future cluster expansion and is not suited to the Database Designer's algorithms.

Two methods are provided below. If you choose a different method, keep in mind that segmentation column values are used only when writing the data to disk and thus need not be reversible.

1. Random Data

This involves using a random number generator to produce values within a specific range. A random number generator is a function that generates an extremely long sequence of numbers that appear random. The sequence is determined by an initial value called a seed. A computer's real time clock is often used as a seed. High-quality random number generators use physical processes such as atmospheric noise as the seed. The syntax may look something like:

data-value = random(ceiling,floor,seed)

where the parameters specify the range of the generated numbers and the seed.

2. Hashed Data

The Hash Segmentation available in Vertica V2.0 and later is similar to this method but does not require any data to be prepared in advance. It also has the advantage of using all column data types equally well.

This involves using a hash function and a modulo function. A hash function is a reproducible method of converting some sort of data into a number that serves as a "fingerprint" of the data. These fingerprints are called hash values. A modulo function causes values to "wrap around" after they reach a certain value (the modulus). The syntax may look something like:

data-value = mod(hash(d),n)

where:

d is an integer that is derived by manipulating the values of one or more columns in each row of data.

n is an integer where n-1 specifies the ceiling (upper limit) of the range of data values. The floor (lower limit) of the range is zero (0).