Use COPY...DIRECT to load the database for the first time and load the dimension tables before loading the
fact table. Loads run at 30-50GB/hour and can be done on several nodes at once, raising the load rate to as high as 200GB/hour.
Extracting Data from an Existing Database
If possible, export the data in text form to a local file or attached disk.
Oracle does not provide a tool than can export to text. To export data, run a SELECT query in Oracle's SQL*Plus command line query tool using a specified column delimiter, suppressed headers, etc. Redirect the output to a local file.
Dimension tables generally fit into a single load file. Split the fact table data into 250-500GB load files. For example, a 10 TB fact table would require 20-40 load files.
Choose a column-value delimiter character that does not appear in any CHAR(N) or VARCHAR(N) data values. The vertical bar (|) may be a good choice. You can use a query to test for the existence of a certain character in a column. For example:
SELECT COUNT(*) FROM T WHERE X LIKE '%|%'
If only a few rows contain |, you can eliminate them from the load file using a WHERE clause and load them separately using a different delimiter.
Oracle has a REGEX_REPLACE function that can substitute one substring with another but this will slow down the unload operation significantly. A better approach is to use a WHERE clause to avoid problem rows in the main load file, and the negated WHERE clause with REGEX_REPLACE for just the problem rows.
Moving Data from an Existing Database
Consider using:
Deliver chunks of data to the different Vertica nodes, by connecting the transport disk or writing files from network copy.
Loading From a Local Hard Disk
USB 2.0 disks can deliver data at about 30 MB per second, or 108 GB per hour, which is fast enough. USB 2.0 disks are easy to use for transporting data from Linux to Linux. Simply set up a ext3 filesystem on the disk and write large files there. Linux 2.6 has USB plug-and-play support, so a USB 2.0 disk is instantly usable on various Linux systems.
For other variants of UNIX, if there is no common filesystem format available, you can use the disk without a filesystem for a single large file. For example:
$ cp bigfile /dev/sdc1
Even without a filesystem on the disk, plug-and-play support still works on Linux to provide a device node for the disk. To find out the assigned device, plug in the disk and enter:
$ dmesg | tail -40
SATA disks are usually internal, but can be external, or unplugged safely internally.
Loading Over the Network
A 1Gbps (gigabit per second) network can deliver about 50 MB/s, or 180GB/hr. Vertica can load about 200GB in four hours on each of four nodes. Thus a dedicated 1Gbps LAN should be usable. Slower LANs will be proportionally slower, and non-local networks are probably untenable because the delays over distance slow down the TCP protocol to a small fraction of its apparent bandwidth, even without competing traffic.
Loading From Windows
For loading files directly from Windows to Linux, use NTFS. Although Red Hat Linux as originally installed can read Windows FAT32 file systems, this is not recommended.