COMMAND NAME: hawq load

Runs a load job as defined in a YAML formatted control file.


*****************************************************
SYNOPSIS
*****************************************************

hawq load -f <control_file> [-l <log_file>] 
[-h <hostname>] [-p <port>] [-U <username>] [-d <database>] [-W] 
[--gpfdist_timeout <seconds>] [[-v | -V] | [-q]] [-D]

hawq load -? 

hawq load --version


*****************************************************
PREREQUISITES
*****************************************************

The client machine where hawq load is executed must have the following:

* Python 2.6.2 or later, pygresql (the Python interface to PostgreSQL),
  and pyyaml. Note that Python and the required Python libraries are 
  included with the HAWQ server installation, so if you have HAWQ 
  installed on the machine where hawq load is running, you do not need 
  a separate Python installation.
  Note: HAWQ Loaders for Windows supports only Python 2.5 
  (available from www.python.org).

* The gpfdist parallel file distribution program installed and 
  in your $PATH. This program is located in $GPHOME/bin of your 
  HAWQ server installation.

* Network access to and from all hosts in your HAWQ array 
  (master and segments).

* Network access to and from the hosts where the data to be loaded
  resides (ETL servers).

*****************************************************
DESCRIPTION
*****************************************************

hawq load is a data loading utility that acts as an interface
to HAWQs external table parallel loading feature. Using a load
specification defined in a YAML formatted control file, hawq load 
executes a load by invoking the HAWQ parallel file server (gpfdist), 
creating an external table definition based on the source data 
defined, and executing an INSERT, UPDATE or MERGE operation to 
load the source data into the target table in the database.

*****************************************************
OPTIONS
*****************************************************

-f control_file

 Required. A YAML file that contains the load specification details.
 See Control File Format below.

--gpfdist_timeout seconds
  
  Sets the timeout for the gpfdist parallel file distribution program
  to send a response. Enter a value from 0 to 30 seconds (entering 0 to disables timeouts). Note that you might need to increase this value when operating on high-traffic networks.
 
-l log_file
 
  Specifies where to write the log file. Defaults to 
  ~/hawqAdminLogs/hawq_load_YYYYMMDD. 

-v (verbose mode) 

  Show verbose output of the load steps as they are executed.

-V (very verbose mode)

  Shows very verbose output.

-q (no screen output)
 
  Run in quiet mode. Command output is not displayed on the screen, 
  but is still written to the log file.

-D (debug mode)
 
  Check for error conditions, but do not execute the load. 

-? (show help)
  Show help, then exit.

--version
 
  Show the version of this utility, then exit.


*********************
CONNECTION OPTIONS
*********************

-d database

  The database to load into. If not specified, reads from the 
  load control file, the environment variable $PGDATABASE or 
  defaults to the current system user name. 

-h hostname
  
  Specifies the host name of the machine on which the HAWQ master
  database server is running. If not specified, reads from the load 
  control file, the environment variable $PGHOST or defaults to localhost.

-p port
  
  Specifies the TCP port on which the HAWQ master database server 
  is listening for connections. If not specified, reads from the load 
  control file, the environment variable $PGPORT or defaults to 5432.

-U username

  The database role name to connect as. If not specified, reads
  from the load control file, the environment variable $PGUSER or 
  defaults to the current system user name.

-W (force password prompt)

  Force a password prompt. If not specified, reads the password 
  from the environment variable $PGPASSWORD or from a password file  
  specified by $PGPASSFILE or in ~/.pgpass. If these are not set, then 
  hawq load will prompt for a password even if -W is not supplied.

*****************************************************
CONTROL FILE FORMAT
*****************************************************

The hawq load control file uses the YAML 1.1 document format and 
then implements its own schema for defining the various steps 
of a HAWQ load operation. The control file must be a valid YAML document.
The hawq load program processes the control file document in order 
and uses indentation (spaces) to determine the document hierarchy 
and the relationships of the sections to one another. The use of white
space is significant. White space should not be used simply for formatting
purposes, and tabs should not be used at all.

The basic structure of a load control file is:
---

VERSION: 1.0.0.1
DATABASE: <db_name>
USER: <db_username>
HOST: <master_hostname>
PORT: <master_port>
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - <hostname_or_ip>
         PORT: <http_port>
       | PORT_RANGE: [<start_port_range>, <end_port_range>]
         FILE: 
           - </path/to/input_file>
         SSL: true | false
         CERTIFICATES_PATH: /path/to/certificates
    - COLUMNS:
           - <field_name>: <data_type>
    - TRANSFORM: 'transformation'
    - TRANSFORM_CONFIG: 'configuration_file_path' | 'OFF'
    - MAX_LINE_LENGTH: integer
    - FORMAT: text | csv
    - DELIMITER: '<delimiter_character>'
    - ESCAPE: '<escape_character>' | 'OFF'
    - NULL_AS: '<null_string>'
    - FORCE_NOT_NULL: true | false
    - QUOTE: '<csv_quote_character>'
    - HEADER: true | false
    - ENCODING: <database_encoding>
    - ERROR_LIMIT: <integer>
    - ERROR_TABLE: <schema>.<table_name>
   OUTPUT:
    - TABLE: <schema>.<table_name>
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
           - <target_column_name>
    - UPDATE_COLUMNS:
           - <target_column_name>
    - UPDATE_CONDITION: '<boolean_condition>'
    - MAPPING:
             <target_column_name>: <source_column_name> | '<expression>'
   PRELOAD:
    - TRUNCATE: true | false
    - REUSE_TABLES: true | false
   SQL:
    - BEFORE: "<sql_command>"
    - AFTER: "<sql_command>"


*****************************************************
CONTROL FILE SCHEMA ELEMENT DESCRIPTIONS
*****************************************************

VERSION - Optional. The version of the hawq load control file schema. 
          The current version is 1.0.0.1. 

DATABASE - Optional. Specifies which database in HAWQ to connect to. 
           If not specified, defaults to $PGDATABASE if set or the current 
           system user name. You can also specify the database on the command
           line using the -d option. 

USER - Optional. Specifies which database role to use to connect. 
       If not specified, defaults to the current user or $PGUSER 
       if set. You can also specify the database role on the command 
       line using the -U option. If the user running hawq load is not a 
       HAWQ superuser, then the server configuration parameter 
       'gp_external_grant_privileges' must be set to on in order 
       for the load to be processed.

HOST - Optional. Specifies HAWQ master host name. If not 
       specified, defaults to localhost or $PGHOST if set. You can 
       also specify the master host name on the command line using 
       the -h option.

PORT - Optional. Specifies HAWQ master port. If not specified, 
       defaults to 5432 or $PGPORT if set. You can also specify the 
       master port on the command line using the -p option.

GPLOAD - Required. Begins the load specification section. A GPLOAD 
         specification must have an INPUT and an OUTPUT section 
         defined.

INPUT - Required. Defines the location and the format of the input 
        data to be loaded. hawq load will start one or more instances of 
        the gpfdist file distribution program on the current host and 
        create the required external table definition(s) in HAWQ 
        that point to the source data. Note that the host 
        from which you run hawq load must be accessible over the network 
        by all HAWQ hosts (master and segments).

SOURCE - Required. The SOURCE block of an INPUT specification defines 
         the location of a source file. An INPUT section can have more 
         than one SOURCE block defined. Each SOURCE block defined 
         corresponds to one instance of the gpfdist file distribution 
         program that will be started on the local machine. Each 
         SOURCE block defined must have a FILE specification.

LOCAL_HOSTNAME - Optional. Specifies the host name or IP address of 
                 the local machine on which hawq load is running. If this 
                 machine is configured with multiple network interface 
                 cards(s), you can specify the host name or IP of each 
                 individual NIC to allow network traffic to use all 
                 NICs simultaneously. The default is to use the local 
                 machine's primary host name or IP only. 

PORT - Optional. Specifies the specific port number that the gpfdist 
       file distribution program should use. You can also supply a 
       PORT_RANGE to select an available port from the specified range. 
       If both PORT and PORT_RANGE are defined, then PORT takes 
       precedence. If neither PORT or PORT_RANGE are defined, the 
       default is to select an available port between 8000 and 9000. 
       If multiple host names are declared in LOCAL_HOSTNAME, this port 
       number is used for all hosts. This configuration is desired if 
       you want to use all NICs to load the same file or set of files 
       in a given directory location.

PORT_RANGE - Optional. Can be used instead of PORT to supply a range 
             of port numbers from which hawq load can choose an available 
             port for this instance of the gpfdist file distribution 
             program.

FILE - Required. Specifies the location of a file, named pipe, or 
       directory location on the local file system that contains data 
       to be loaded. You can declare more than one file so long as the 
       data is of the same format in all files specified. If the files 
       are compressed using gzip or bzip2 (have a .gz or .bz2 file 
       extension), the files will be uncompressed automatically 
       (provided that gunzip or bunzip2 is in your path). When 
       specifying which source files to load, you can use the wildcard 
       character (*) or other C-style pattern matching to denote 
       multiple files. The files specified are assumed to be relative 
       to the current directory from which hawq load is executed (or you 
       can declare an absolute path).

SSL - Optional. Specifies usage of SSL encryption.

CERTIFICATES_PATH - Required when SSL is true; cannot be specified when 
                    SSL is false or unspecified. The location specified 
                    in CERTIFICATES_PATH must contain the following 
                    files:

                    - The server certificate file, server.crt
                    - The server private key file, server.key
                    - The trusted certificate authorities, root.crt

                    The root directory (/) cannot be specified 
                    as CERTIFICATES_PATH.

COLUMNS - Optional. Specifies the schema of the source data file(s) 
          in the format of <field_name>: <data_type>. The DELIMITER 
          character in the source file is what separates two data value 
          fields (columns). A row is determined by a line feed character 
          (0x0a). If the input COLUMNS are not specified, then the 
          schema of the output TABLE is implied, meaning that the source 
          data must have the same column order, number of columns, and 
          data format as the target table. The default source-to-target 
          mapping is based on a match of column names as defined in this 
          section and the column names in the target TABLE. This default 
          mapping can be overridden using the MAPPING section.

TRANSFORM - Optional. Specifies the name of the input XML transformation 
            passed to hawq load.

TRANSFORM_CONFIG - Optional. Specifies the location of the XML transformation
                   configuration file that is specified in the TRANSFORM 
                   parameter, above.

MAX_LINE_LENGTH - Optional. An integer that specifies the maximum length
                  of a line in the XML transformation data passed to hawq load.

FORMAT - Optional. Specifies the format of the source data file(s); 
         either plain text (TEXT) or comma separated values (CSV) format. 
         Defaults to TEXT if not specified.

DELIMITER - Optional. Specifies a single ASCII character that separates
            columns within each row (line) of data. The default is a tab 
            character in TEXT mode, a comma in CSV mode. You can also 
            specify a non-printable ASCII character via an escape sequence\
            using the decimal representation of the ASCII character. 
            For example, \014 represents the shift out character..

ESCAPE - Optional. Specifies the single character that is used for C 
         escape sequences (such as \n,\t,\100, and so on) and for 
         escaping data characters that might otherwise be taken as row
         or column delimiters. Make sure to choose an escape character 
         that is not used anywhere in your actual column data. The 
         default escape character is a \ (backslash) for text-formatted 
         files are a "(double quote for csv-formatted files), however it 
         is possible to use another character to represent an escape. It 
         is also possible to disable escaping by specifying the value 
         'OFF' as the escape value. This is very useful for data such as 
         text-formatted web log data that has many embedded backslashes 
         that are not intended to be escapes.

NULL_AS - Optional. Specifies the string that represents a null value. 
          The default is \N (backslash-N) in TEXT mode, and an empty value 
          with no quotations in CSV mode. You might prefer an empty string 
          even in TEXT mode for cases where you do not want to distinguish 
          nulls from empty strings. Any source data item that matches this 
          string will be considered a null value.

FORCE_NOT_NULL - Optional. In CSV mode, processes each specified column as 
                 though it were quoted and hence not a NULL value. For the 
                 default null string in CSV mode (nothing between two 
                 delimiters), this causes missing values to be evaluated 
                 as zero-length strings.

QUOTE - Required when FORMAT is CSV. Specifies the quotation character 
        for CSV mode. The default is double-quote (").

HEADER - Optional. Specifies that the first line in the data file(s) is 
         a header row (contains the names of the columns) and should not 
         be included as data to be loaded. If using multiple data source 
         files, all files must have a header row. The default is to assume 
         that the input files do not have a header row.

ENCODING - Optional. Character set encoding of the source data. 
           Specify a string constant (such as 'SQL_ASCII'), an 
           integer encoding number, or 'DEFAULT' to use the default 
           client encoding. 

ERROR_LIMIT - Optional. Enables single row error isolation mode for 
              this load operation. When enabled, input rows that have 
              format errors will be discarded provided that the error 
              limit count is not reached on any HAWQ segment instance 
              during input processing. If the error limit is not reached, 
              all good rows will be loaded and any error rows will either 
              be discarded or logged to the table specified in ERROR_TABLE. 
              The default is to abort the load operation on the first error 
              encountered. Note that single row error isolation only 
              applies to data rows with format errors; for example, extra 
              or missing attributes, attributes of a wrong data type, or 
              invalid client encoding sequences. Constraint errors, such 
              as primary key violations, will still cause the load 
              operation to abort if encountered.

ERROR_TABLE - Optional when ERROR_LIMIT is declared. Specifies an error 
              table where rows with formatting errors will be logged when 
              running in single row error isolation mode. You can then 
              examine this error table to see error rows that were not 
              loaded (if any). If the error_table specified already exists, 
              it will be used. If it does not exist, it will be 
              automatically generated.

OUTPUT - Required. Defines the target table and final data column values 
         that are to be loaded into the database.

TABLE - Required. The name of the target table to load into.

MODE - Optional. Defaults to INSERT if not specified. There are three 
       available load modes:

       * INSERT - Loads data into the target table using the following 
            method: INSERT INTO <target_table> SELECT * FROM <input_data>;

       * UPDATE - Updates the UPDATE_COLUMNS of the target table where 
           the rows have MATCH_COLUMNS attribute values equal to those 
           of the input data, and the optional UPDATE_CONDITION is true.

       * MERGE - Inserts new rows and updates the UPDATE_COLUMNS of 
           existing rows where MATCH_COLUMNS attribute values are equal
           to those of the input data, and the optional UPDATE_CONDITION
           is true. New rows are identified when the MATCH_COLUMNS value 
           in the source data does not have a corresponding value in the
           existing data of the target table. In those cases, the entire 
           row from the source file is inserted, not only the MATCH and 
           UPDATE columns. If there are multiple new MATCH_COLUMNS values 
           that are the same, only one new row for that value will be 
           inserted (use UPDATE_CONDITION to filter out the rows you want 
           to discard). 

MATCH_COLUMNS - Required if MODE is UPDATE or MERGE. Specifies the 
                column(s) to use as the join condition for the update.
                The attribute value in the specified target column(s)
                must be equal to that of the corresponding source data
                column(s) in order for the row to be updated in the
                target table.

UPDATE_COLUMNS - Required if MODE is UPDATE or MERGE. Specifies 
                 the column(s) to update for the rows that meet the 
                 MATCH_COLUMNS criteria and the optional UPDATE_CONDITION.

UPDATE_CONDITION - Optional. Specifies a Boolean condition (similar to 
                   what you would declare in a WHERE clause) that must 
                   be met in order for a row in the target table to be 
                   updated (or inserted in the case of a MERGE).

MAPPING - Optional. If a mapping is specified, it overrides the default 
          source-to-target column mapping. The default source-to-target 
          mapping is based on a match of column names as defined in the 
          source COLUMNS section and the column names of the target 
          TABLE. 
          A mapping is specified as either:
            target_column_name: source_column_name
           or
            target_column_name: 'expression'
          Where expression is any expression that you would specify in 
          the SELECT list of a query, such as a constant value, a column 
          reference, an operator invocation, a function call, and so on.

PRELOAD - Optional. Specifies operations to run prior to the load operation.
          Right now the only preload operation is TRUNCATE.

TRUNCATE - Optional. If set to true, hawq load will remove all rows in the 
           target table prior to loading it.

REUSE_TABLES - Optional. If set to true, hawq load will not drop the external 
               table objects and staging table objects it creates. These 
               objects will be reused for future load operations that use 
               the same load specifications. This improves performance of 
               trickle loads (ongoing small loads to the same target table).

SQL - Optional. Defines SQL commands to run before and/or after the load 
      operation. Commands that contain spaces or special characters must 
      be enclosed in quotes. You can specify multiple BEFORE and/or 
      AFTER commands. List commands in the order of desired execution.

BEFORE - Optional. An SQL command to run before the load operation starts.
         Enclose commands in quotes.

AFTER - Optional. An SQL command to run after the load operation completes.
        Enclose commands in quotes.


*****************************************************
NOTES
*****************************************************

If your database object names were created using a double-quoted 
identifier (delimited identifier), you must specify the delimited 
name within single quotes in the hawq load control file. For example, 
if you create a table as follows:

     CREATE TABLE "MyTable" ("MyColumn" text);

Your YAML-formatted hawq load control file would refer to the above 
table and column names as follows:

        - COLUMNS:
          - '"MyColumn"': text
    
    OUTPUT:
        - TABLE: public.'"MyTable"'

*****************************************************
LOG FILE FORMAT
*****************************************************

Log files output by hawq load have the following format:

<timestamp>|<level>|<message>

Where <timestamp> takes the form: YYYY-MM-DD HH:MM:SS, 
<level> is one of DEBUG, LOG, INFO, ERROR, and <message> is 
a normal text message. Some INFO messages that may be of 
interest in the log files are (where # corresponds to the 
actual number of seconds, units of data, or failed rows):

INFO|running time: #.## seconds
INFO|transferred #.# kB of #.# kB.
INFO|hawq load succeeded
INFO|hawq load succeeded with warnings
INFO|hawq load failed
INFO|1 bad row
INFO|# bad rows

*****************************************************
EXAMPLES
*****************************************************

Run a load job as defined in my_load.yml:

hawq load -f my_load.yml


Example load control file:

---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE: 
           - /var/load/data/*
    - COLUMNS:
           - name: text
           - amount: float4
           - category: text
           - desc: text
           - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"


*****************************************************
SEE ALSO
*****************************************************

gpfdist, CREATE EXTERNAL TABLE
