Glossary of Terms

Please note that some of the HTML cross-reference links in this document jump to the target term rather than displaying it in an expanding block, as is common in the rest of the Vertica documentation. This is due to a limitation in the HTML generator program, which does not handle "circular references" correctly.

Administration Host

The host on which the Vertica

rpm package was manually installed. Always run the Administration Tools on this host if possible.

rpm is a powerful package manager, which can be used to build, install, query, verify, update, and erase individual software packages. A package consists of an archive of files and meta-data used to install and erase the archive files. The meta-data includes helper scripts, file attributes, and descriptive information about the package. Packages come in two varieties: binary packages, used to encapsulate software to be installed, and source packages, containing the source code and recipe necessary to produce binary packages.

Administration Tools

The tools needed for administering a Vertica database are provided in the form of a graphical user interface that allows you to perform various tasks quickly and easily. The tools also provide a convenient way to connect to a database using

vsql. Always run the Administration Tools on the Administration Host if possible.

vsql is the Vertica implementation of psql, a character-based, interactive, front-end that is part of PostgreSQL and used by other database management systems. It allows you to type in SQL statements and see the results. It also provides a number of meta-commands and various shell-like features to facilitate writing scripts and automating a wide variety of tasks.

AHM

An abbreviation that provides an alternative name for the Ancient History Mark. The Ancient History Mark (AHM), is the

epoch prior to which historical data can be purged from physical storage.

An epoch represents committed changes to the data stored in a database between two specific points in time. In other words, an epoch contains all COPY, INSERT, UPDATE, and DELETE operations that have been executed and committed since the end of the previous epoch.

Anchor Table

Ancient History Mark

The Ancient History Mark (AHM), is the

epoch prior to which historical data can be purged from physical storage.

ATM

An abbreviation that provides an alternative name for the (Automatic) Tuple Mover. The tuple mover is the component of Vertica that moves the contents of the

Write Optimized Store (WOS) into the

Read Optimized Store (ROS). This data movement is known as a moveout. Normally, the tuple mover runs automatically in the background at preset intervals and is referred to as the ATM.

The ROS (Read Optimized Store) is a highly optimized, read-oriented, physical storage structure that is organized by projection and that makes heavy use of compression and indexing. You can use the COPY...DIRECT and INSERT (with direct hint) statements to load data directly into the ROS.

The WOS (Write Optimized Store) is a memory-resident data structure into which INSERT, UPDATE, DELETE, and COPY (without DIRECT hint) actions are recorded. Like the ROS, the WOS is arranged by projection but it stores tuples without sorting, compression, or indexing and thus supports very fast load speeds. The WOS organizes data by epoch and holds uncommitted transaction data.

Authentication

Authentication is the process of attempting to verify the identity of a user attempting to connect to a database.

Authorization

Authorization is the process of verifying that a user has permission to perform a certain operation, such as query a specific table.

Blade Server

Blade servers consist of some number of small servers, called blades, in a common chassis that allows blades to share common components such as power supplies, cooling fans, and switching capabilities.

Buddy Projection

Buddy projections are required for K-Safety. Two projections are considered to be buddies if they contain the same columns and have the same range segmentation or hash segmentation using different node ordering. Buddy projections usually have different sort orders for query performance purposes.

Bulk loading

A process of loading large amount of data, such as an initial load of historic data.

C-Store

A research project at MIT, Brandeis, Brown, and UMass (Boston) on which Vertica is based.

Catalog

In Vertica, the catalog is a set of files that contain information (metadata) about the objects in a database (nodes, tables, constraints, projections, etc.) The catalog is replicated on all nodes in a cluster.

Cluster

A cluster generally refers a collection of hosts or a collection of nodes bound to a database. A cluster is not part of a database definition and thus does not have a name.

Compression

LZO for compressible data. Although compression is generally considered to be a form of encoding, the terms have different meanings in Vertica.

LZO is an abbreviation for Lempel-Ziv-Oberhumer. It is a data compression algorithm that is focused on decompression speed. The algorithm is lossless and the reference implementation is thread safe.

Connection

A SQL connection is an occurrence of an interactive user or an application program requesting and being granted access to a database via a network connection. In Vertica, the scope of a connection is the same as that of a session.

Data Warehouse

A data warehouse is a relational database that is designed for query and analysis rather than transaction processing. Data warehouses:

Database

A database is a

cluster of nodes that, when active, can perform distributed data storage and SQL statement execution through administrative, interactive, and programmatic user interfaces.

A cluster generally refers a collection of hosts or a collection of nodes bound to a database. A cluster is not part of a database definition and thus does not have a name.

Database Administrator

The database administrator (DBA) is the Linux user account that owns the database catalog and data storage on disk. The DBA can bypass all database authorization rules. However, the DBA must supply a password to connect to a running database and to use Administration Tools commands that affect a running database. The DBA can drop a stopped database without supplying a password.

Database Designer

The Database Designer is a tool that analyzes a logical schema definition, sample queries, and sample data and generates a set of projections in the form of an SQL script to be executed after you create the tables but before you load any data. The script creates a minimal set of

superprojections to ensure K-Safety, and optionally pre-join projections. In most cases, the projections created by the Database Designer provide excellent query performance within physical constraints. You can, however, write a custom projection script should the Database Designer not meet your needs.

A superprojection is a projection that contains every column of a table in the Logical Schema. A table can have multiple superprojections with different sort orders.

Database Superuser

The database superuser is the automatically-created database user who has the same name as the Linux database administrator account and who can bypass all GRANT/REVOKE authorization. Superuser status cannot be granted to another user. (The concept of a database superuser should not be confused with Linux superuser (root) privilege. In fact, a database superuser cannot have Linux superuser privilege.)

DBA

DDL

SQL Data Description (or Definition) Language generally consists of CREATE, ALTER, and DROP commands, which operate on the Logical Schema metadata (tables, users, etc.)

DELTAVAL (Delta Encoding)

DELTAVAL (Delta Encoding) stores only the differences between sequential data values rather than the values themselves.

Dialog

Dialog is a Linux utility that creates user interfaces to shell scripts or other scripting languages, such as perl. In Vertica V2.1, Dialog is used to implement the Administration Tools, which run in a terminal window.

Dimension Table

A dimension table (sometimes called a lookup or reference table) is one of a set of companion tables to a fact table in a star schema. It contains the PRIMARY KEY column corresponding to the join columns in fact tables. For example, a business might use a dimension table to contain item codes and descriptions.

Dimension tables can be connected to other dimension tables to form a hierarchy of dimensions in a snowflake schema.

DML

SQL Data Manipulation Language generally consists of INSERT, UPDATE, and DELETE commands, which modify existing data (single

tuples or sets of tuples) in the Logical Schema.

A tuple is a collection of data values corresponding conceptually to a row of a table, projection, or result set. A tuple in Vertica does not have the same physical representation as in a traditional RDBMS; it is a virtual entity that may have entirely different storage representations.

Dynamic SQL

Dynamic SQL is a programmatic interface to a database management system that allows SQL statements to be defined and executed at run time, usually based on user input.

Encoding

Encoding is the process of transforming data from one format into another. In Vertica, encoded data can be processed directly, which distinguishes it from compression. Vertica uses a number of different encoding strategies, depending on column data type, table cardinality, and sort order.

Epoch

ETL

ETL (Extract, Transform, Load) is a process in data warehousing that involves extracting data from outside sources, transforming it to fit a specific schema, and ultimately loading it into the database.

Executor Node

An executor node is any

node that participates in executing a specific SQL statement. The initiator node can, and usually does, also function as an executor node.

A node is a host configured to run an instance of Vertica. It is a member of a database cluster (see Node Definition). For a database to have the ability to recover from the failure of a node requires at least three nodes. Vertica Systems, Inc. recommends that you use a minimum of four nodes.

Fact Table

A fact table is a table that represents quantitative or factual data. For example, in a business, a fact table might represent orders.

A fact table is often located at the center of a star schema or snowflake schema. It typically has a large number of tuples and is surrounded by a collection of dimension tables, each with a lesser number of tuples. The fact table participates in a join with every dimension table. In other words, a fact table can contain data but generally contains many join columns (with optional FOREIGN KEY constraints), each of which corresponds to the primary key column of a dimension table.

Functional Dependency

A functional dependency is a relationship between two sets of column values in a table where one column value can determine the other. It represents knowledge of the data that cannot otherwise be expressed in the logical schema. In Vertica functional dependencies are expressed as CORRELATION constraints and are used by the Database Designer to produce more efficient physical schema designs.

Grid Computing

Grid computing is a software environment based on open standards and protocols that make it possible to share disparate, loosely coupled resources across organizations and geographies. Resources can potentially include almost any component: computer cycles, storage spaces, databases, applications, files, sensors or scientific instruments.

Hash Function

A hash function is a reproducible method of converting data into a number that can serve as a digital "fingerprint" of the data. In Vertica, the built-in HASH function is used in segmentation to evenly distribute data over a cluster of nodes.

Hash Segmentation

Hash segmentation allows you to segment a projection based on a built-in hash function that provides even distribution of data across some or all of the nodes in a cluster, resulting in optimal query execution.

Historical Query

Vertica can execute a query from a snapshot of the database taken at a specific date and time. The syntax is:

The command queries all data in the database up to and including the

epoch representing the specified date and time without holding a lock or blocking write operations.

Host

A host is a computer system with a 32-bit or 64-bit Intel or AMD processor, RAM, hard disk, and TCP/IP network interface (IP address and hostname). Hosts share neither disk space nor main memory with each other.

Initiator Node

In the context of a client

connection, the initiator node is the

node associated with the specific

host to which the connection was made. The initiator node can, and usually does, also function as an

executor node.

An executor node is any node that participates in executing a specific SQL statement. The initiator node can, and usually does, also function as an executor node.

Instance

An instance of Vertica consists of the running Vertica process and disk storage (catalog and data) on a host. There can be only one instance of Vertica running on a host at any time.

JDBC

JDBC (Java Database Connectivity) is a call-level API that provides connectivity between Java programs and data sources (SQL databases and other non-relational data sources, such as spreadsheets or flat files). JDBC is included in the Java 2 standard and enterprise editions.

K-Safety

K represents the maximum number of nodes in a database that can fail and recover with no loss of data. In Vertica V2.1, the value of K can be zero (0) or one (1). The value of K can be one (1) only when the

Physical Schema design meets certain requirements. The designs generated by the Database Designer are K-Safe.

A physical schema consists of a set of projections used to store data on disk. The projections in the physical schema are based on the objects in the Logical Schema.

Last Good Epoch

The last good

epoch is a term used in manual recovery to refer to the most recent epoch that can be recovered.

LGE

An abbreviation for Last Good Epoch. The last good

epoch is a term used in manual recovery to refer to the most recent epoch that can be recovered.

Logical Schema

A logical schema consists of a set of tables and referential integrity constraints in a Vertica database. The objects in the logical schema are visible to SQL users. The logical schema does not include projections, which make up the physical schema.

LZO

LZO is an abbreviation for Lempel-Ziv-Oberhumer. It is a data compression algorithm that is focused on decompression speed. The algorithm is lossless and the reference implementation is thread safe.

Materialized View

A materialized view is similar to a standard SQL view with one major exception: the data is actually stored on disk rather than computed each time the view is used in a query. A materialized view, then, must be refreshed whenever the data in the underlying tables is changed. A projection is a special case of a materialized view.

Mergeout

ROS containers are subsets of the Read Optimized Store (ROS) that are created as the result of changes to the data stored within a projection as a result of bulk loads and DML. The Tuple Mover periodically merges ROS containers in order to maximize performance. A segmented projection can be temporarily stored within several ROS containers on any node at any moment but never fewer than one.

The tuple mover is the component of Vertica that moves the contents of the Write Optimized Store (WOS) into the Read Optimized Store (ROS). This data movement is known as a moveout. Normally, the tuple mover runs automatically in the background at preset intervals and is referred to as the ATM.

Moveout

Node

A node is a host configured to run an

instance of Vertica. It is a member of a database

cluster (see Node Definition). For a database to have the ability to recover from the failure of a node requires at least three nodes. Vertica Systems, Inc. recommends that you use a minimum of four nodes.

A cluster generally refers a collection of hosts or a collection of nodes bound to a database. A cluster is not part of a database definition and thus does not have a name.

An instance of Vertica consists of the running Vertica process and disk storage (catalog and data) on a host. There can be only one instance of Vertica running on a host at any time.

Node Definition

A node definition is a metadata object that binds a host to a database. A node definition contains a symbolic node name that is used to specify

segmentation in projections.

Segmentation is the horizontal partitioning of a projection so that it can be stored on multiple nodes. The goal is to distribute physical data storage evenly across a database so that all nodes can participate in query execution. See also:

ODBC

Open DataBase Connectivity is a standard application programming interface (API) for access to database management systems.

Out-of-Date

A projection is out-of-date if it requires a refresh in order to participate in query execution.

Physical Schema

A physical schema consists of a set of projections used to store data on disk. The projections in the physical schema are based on the objects in the Logical Schema.

PostgreSQL

PostgreSQL is an open source database manager whose front-end components are used in Vertica. For more information, see the PostgreSQL documentation.

Pre-Join Projection

Projection

A projection is a special case of a

materialized view that provides physical storage for data. A projection can contain some or all of the columns of one or more tables. A projection that contains all of the columns of a table is called a

superprojection. A projection that joins one or more tables is called a pre-join projection. Most projections are used for ad-hoc query processing and K-safety but it is possible to have query-specific projections.

A superprojection is a projection that contains every column of a table in the Logical Schema. A table can have multiple superprojections with different sort orders.

Projection Set

A projection set is a group of buddy projections that are safe for a given level of K-safety. (When K=1 there are two buddies in a set; when K=2 there are three buddies.) A projection must be part of a projection set before it is refreshed. Once a projection set is created (by creating buddies) the set is refreshed in a single transaction.

psql

psql is a character-based, interactive, front-end that is part of PostgreSQL and used by other database management systems. It allows you to type in SQL statements and see the results. It also provides a number of meta-commands and various shell-like features to facilitate writing scripts and automating a wide variety of tasks. On a Vertica host, psql is actually a symbolic link to vsql.

Query Optimizer

The query optimizer is the component that evaluates different strategies for executing a query and picks the best one.

Query-Specific Projection

Range Segmentation

Range segmentation allows you to segment a projection based on a known range of values stored in a specific column chosen to provide even distribution of data across a set of nodes, resulting in optimal query execution.

Recovery

Recovery is the process of restoring the database to a fully-functional state after one or more nodes in the system has experienced a software or hardware related failure. Vertica has a unique approach to recovering a node that is based on querying replicas of the data stored on other nodes. For example, a hardware failure may cause a node to lose database objects or to miss changes made to the database (INSERTs, UPDATEs, etc.) while offline. When the node comes back on line, it recovers lost objects and catches up with changes by querying the other nodes.

Referential Integrity

Referential integrity in Vertica consists of a set of constraints (logical schema objects) that define primary key and foreign key columns. In a star schema or snowflake schema:

Refresh

A refresh operation ensures that all projections on a node are up-to-date (can participate in query execution). This process may take a long time, depending on how much data is in the table(s).

RLE (Run Length Encoding)

RLE (Run Length Encoding) replaces sequences of the same data values within a column by a single value and a count number.

ROS (Read Optimized Store)

The ROS (Read Optimized Store) is a highly optimized, read-oriented, physical storage structure that is organized by projection and that makes heavy use of

compression and indexing. You can use the COPY...DIRECT and INSERT (with direct hint) statements to load data directly into the ROS.

Compression is the process of transforming data into a more compact format. Compressed data cannot be directly processed; it must first be decompressed. Vertica uses integer packing for unencoded integers and LZO for compressible data. Although compression is generally considered to be a form of encoding, the terms have different meanings in Vertica.

ROS Container

ROS containers are subsets of the

Read Optimized Store (ROS) that are created as the result of changes to the data stored within a projection as a result of bulk loads and DML. The

Tuple Mover periodically merges ROS containers in order to maximize performance. A

segmented projection can be temporarily stored within several ROS containers on any node at any moment but never fewer than one.

rpm

Safe

SAN

A SAN (Storage Area Network) is a dedicated hardware/software platform consisting of networked servers that provide access to storage-related resources such as such as disk array controllers. It provides high data transfer speeds similar to those used for internal disk drives (ATA, SCSI, etc.) and is highly scalable.

Schema

Secure Shell (SSH)

Secure Shell or SSH is a set of standards and an associated network protocol that establishes a secure TCP/IP data transmission channel between a local and a remote computer. It utilizes strong encryption and authentication to ensure confidentiality, integrity, and authenticity of the transferred data.

SSH is typically used to login to a remote machine and execute commands. Use SSH only between two devices that are both under your own administration, when both devices are trustworthy.

Segmentation

Session

A SQL session is an occurrence of a user interacting with a database through the use of SQL statements. A session can be invoked using

vsql or a

JDBC application. In Vertica, the scope of a session is the same as that of a connection.

Site

Site is a deprecated term that is used to mean node in some contexts, such as the CREATE PROJECTION statement.

Snapshot Isolation

Vertica can run any SQL query in snapshot isolation mode in order to obtain the fastest possible execution. To be precise, snapshot isolation mode is actually a form of

historical query. The syntax is:

Vertica can execute a query from a snapshot of the database taken at a specific date and time. The syntax is:

AT TIME 'timestamp' SELECT...

The command queries all data in the database up to and including the epoch representing the specified date and time without holding a lock or blocking write operations.

The command queries all data in the database up to but not including the current epoch without holding a lock or blocking write operations. This may cause the query to miss tuples loaded by other users up to (but no more than) a specific number of minutes before execution.

Snowflake Schema

A snowflake schema is the same as a star schema except that a dimension table can be normalized (hierarchically decomposed) into additional dimension tables. Every dimension table participates in a 1::n join with the fact table or another dimension table.

Spread

Spread is an open source toolkit used in Vertica to provide a high performance messaging service that is resilient to network faults. Each running node in a database includes a Spread daemon in addition to a Vertica process. Spread daemons start automatically when a database starts up for the first time.

SQL

SQL (Structured Query Language) is a widely-used, industry standard data definition and data manipulation language for relational databases.

SSH

SSH is typically used to login to a remote machine and execute commands. Use SSH only between two devices that are both under your own administration, when both devices are trustworthy.

Star Schema

The star schema (sometimes called a star join schema) is the simplest

data warehouse schema. In a star schema design there is a central fact table with a large number of tuples, optionally surrounded by a collection of dimension tables, each with a lesser number of tuples. Every dimension table participates in a 1::n join with the fact table.

A data warehouse is a relational database that is designed for query and analysis rather than transaction processing. Data warehouses:

are often subjected to a heavy load of periodic and ad-hoc queries.
contain historical information that enables analysis of correlations and trends over long periods of time.
integrate data from various production (transactional) databases. Extraction, transformation, and loading (ETL) software converts the data to a common format and copies it into a data warehouse at regular intervals.
ETL (Extract, Transform, Load) is a process in data warehousing that involves extracting data from outside sources, transforming it to fit a specific schema, and ultimately loading it into the database.
typically consist of one or more star or snowflake schemas.

Superprojection

A superprojection is a projection that contains every column of a table in the Logical Schema. A table can have multiple superprojections with different sort orders.

Superuser

Table

In the relational model, a table (relation) is a set of data elements (values) organized into horizontal rows (tuples) and vertical columns. A table has a specified number of columns but can have any number of rows.

In Vertica, a table is a metadata-only entity referred to in queries. The physical representation of data is a projection.

Tuple

Tuple Mover

The tuple mover is the component of Vertica that moves the contents of the

Write Optimized Store (WOS) into the

Read Optimized Store (ROS). This data movement is known as a moveout. Normally, the tuple mover runs automatically in the background at preset intervals and is referred to as the ATM.

Up-To-Date

A projection is up-to-date if it does not require a refresh in order to participate in query execution.

User Agent

A user agent is a client application program used to access resources on networks such as the World Wide Web. User agents include web browsers, search engine crawlers, PDAs, cell phones, and so forth.

UTC

UTC is an abbreviation of Coordinated Universal Time (in English), the high-precision atomic time standard that replaced Greenwich Mean Time on 1 January 1972 as the basis for the main reference time scale or civil time in various regions. UTC is also referred to by the military and civil aviation as Zulu time (Z).

UTF-8

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It is able to represent any universal character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is coincident with ASCII (requiring little or no change for software that handles ASCII but preserves other values).

vsql

WOS (Write Optimized Store)

The WOS (Write Optimized Store) is a memory-resident data structure into which INSERT, UPDATE, DELETE, and COPY (without DIRECT hint) actions are recorded. Like the

ROS, the WOS is arranged by projection but it stores tuples without sorting,

compression, or indexing and thus supports very fast load speeds. The WOS organizes data by epoch and holds uncommitted transaction data.