COMMAND NAME: gpfdist

Serves data files to or writes data files out from HAWQ segments.

*****************************************************
SYNOPSIS
*****************************************************


gpfdist [-d <directory>] [-p <http_port>] [-l <log_file>] [-t <timeout>] [-c <config_file>]
[-S] [-v | -V] [-m <max_length>] [--ssl <certificate_path>]

gpfdist [-? | --help] | --version

*****************************************************
DESCRIPTION
*****************************************************


gpfdist is HAWQ's parallel file distribution program.
It is used by readable external tables and gpload to serve 
external table files to all HAWQ segments in parallel. 
It is used by writable external tables to accept output 
streams from HAWQ segments in parallel and write them out to a file.

In order for gpfdist to be used by an external table, the 
LOCATION clause of the external table definition must specify 
the correct file location using the gpfdist:// protocol 
(see CREATE EXTERNAL TABLE). 

NOTE: If the --ssl option is specified to enable SSL security, 
create the external table with the gpfdists:// protocol.
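
As a minimal sketch of how the protocol shows up in the LOCATION clause,
the helper below prints a CREATE EXTERNAL TABLE statement for either
protocol. The host name etlhost, port 8081, table name ext_expenses,
column list, and *.txt file pattern are all assumptions chosen for
illustration; see CREATE EXTERNAL TABLE for the full syntax.

```shell
# Print a readable external table definition that points at a gpfdist server.
# All object names below are hypothetical.
ext_table_ddl() {
    protocol=$1   # gpfdist, or gpfdists when gpfdist was started with --ssl
    host=$2
    port=$3
    cat <<EOF
CREATE EXTERNAL TABLE ext_expenses (name text, amount float4)
LOCATION ('${protocol}://${host}:${port}/*.txt')
FORMAT 'TEXT' (DELIMITER '|');
EOF
}

ext_table_ddl gpfdist etlhost 8081
```

The printed statement can be reviewed and then piped to psql. Passing
gpfdists as the first argument produces the SSL variant.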

The benefit of using gpfdist is that all segments read from or
write to external tables in parallel, which improves load and
unload performance and makes external tables easier to
administer.

For readable external tables, gpfdist parses and serves data 
files evenly to all the segment instances in the HAWQ
system when users SELECT from the external table. For writable 
external tables, gpfdist accepts parallel output streams from 
the segments when users INSERT into the external table, and 
writes to an output file.

For readable external tables, if load files are compressed using
gzip or bzip2 (have a .gz or .bz2 file extension), gpfdist
uncompresses the files automatically before loading, provided
that gunzip or bunzip2 is in your path.

NOTE: Currently, readable external tables do not support 
compression on Windows platforms, and writable external 
tables do not support compression on any platforms.
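
Because automatic decompression depends on gunzip or bunzip2 being in
the path, it can help to check for them on the gpfdist host before
starting a load. The following is a sketch only; missing_tools is a
hypothetical helper name, not part of gpfdist.

```shell
# Print the name of each listed tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools gunzip bunzip2)
if [ -n "$missing" ]; then
    echo "not on PATH:" $missing >&2
fi
```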

Most likely, you will want to run gpfdist on your ETL machines 
rather than the hosts where HAWQ is installed. 
To install gpfdist on another host, simply copy the utility 
over to that host and add gpfdist to your $PATH.
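
The two installation steps might look like the sketch below. The host
name etl1 and the directory /opt/gpfdist/bin are assumptions; the copy
step is shown as a comment because it depends on your environment.

```shell
# Step 1 (not run here; etl1 and /opt/gpfdist/bin are hypothetical):
#   scp "$(command -v gpfdist)" etl1:/opt/gpfdist/bin/

# Step 2, on the ETL host: append a directory to PATH unless it is already there.
path_add() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

path_add /opt/gpfdist/bin
export PATH
```

To make the change permanent, the same lines can go in the shell
startup file of the user who runs gpfdist.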

NOTE: When using IPv6, always enclose the numeric IP address 
in brackets.
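
For scripts that build gpfdist:// locations, the bracket rule can be
applied with a small helper such as the one sketched here. url_host is
a hypothetical name, 2001:db8::10 is a documentation-only address, and
the check is deliberately simple: any unbracketed value containing a
colon is treated as a numeric IPv6 address.

```shell
# Wrap a numeric IPv6 address in brackets; leave host names and IPv4 untouched.
url_host() {
    case $1 in
        \[*\]) printf '%s\n' "$1" ;;     # already bracketed
        *:*)   printf '[%s]\n' "$1" ;;   # contains a colon: treat as IPv6
        *)     printf '%s\n' "$1" ;;
    esac
}

echo "gpfdist://$(url_host 2001:db8::10):8081/data.txt"
# prints gpfdist://[2001:db8::10]:8081/data.txt
```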

You can also run gpfdist as a Windows Service. See RUNNING GPFDIST
AS A WINDOWS SERVICE below for details.

*****************************************************
OPTIONS
*****************************************************


-d <directory>

The directory from which gpfdist will serve files for 
readable external tables or create output files for writable
external tables. If not specified, defaults to the current directory.


-l <log_file>

The fully qualified path and log file name where standard output 
messages are to be logged.


-p <http_port>

The HTTP port on which gpfdist will serve files. Defaults to 8080.


-t <timeout>

Sets the time (in seconds) allowed for HAWQ to
establish a connection to a gpfdist process. Default is 5 seconds.
Valid values are 2 to 30 seconds. This value may need to be increased
on systems with a lot of network traffic.

-m <max_length>

Sets the maximum allowed data row length in bytes. Default is 32768.
Use this option when user data includes very wide rows, i.e., when a
"line too long" error message is received. Do not use it otherwise,
because it increases resource allocation.
Valid range is 32K to 256MB. (The upper limit is 1MB on Windows systems.)
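
A wrapper script can check the documented ranges for -t (2 to 30
seconds) and -m (32768 to 268435456 bytes, that is 32K to 256MB)
before starting gpfdist. This is a sketch: in_range is a hypothetical
helper, the values 10 and 1048576 are examples, and the script only
prints the command line instead of running it.

```shell
# in_range VALUE MIN MAX: succeed if VALUE is an integer within [MIN, MAX].
in_range() {
    [ "$1" -ge "$2" ] 2>/dev/null && [ "$1" -le "$3" ]
}

timeout=10        # seconds; documented range is 2 to 30
maxlen=1048576    # bytes (1MB); documented range is 32768 to 268435456

if in_range "$timeout" 2 30 && in_range "$maxlen" 32768 268435456; then
    echo "gpfdist -d /var/load_files -p 8081 -t $timeout -m $maxlen"
else
    echo "timeout or max_length is out of range" >&2
fi
```

On Windows systems the upper bound for -m would be 1048576 rather than
268435456.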


-S (use O_SYNC)

Opens the file for synchronous I/O with the O_SYNC flag. Any writes to 
the resulting file descriptor block gpfdist until the data is 
physically written to the underlying hardware.

--ssl <certificate_path>

Adds SSL encryption to data transferred with gpfdist. After executing 
gpfdist with the --ssl certificate_path option, the only way 
to load data from this file server is with the gpfdists protocol. 
The location specified in certificate_path must 
contain the following files:

- The server certificate file, server.crt
- The server private key file, server.key
- The trusted certificate authorities, root.crt

The root directory (/) cannot be specified as certificate_path.
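
The requirements above can be verified before gpfdist is started. The
function below is a sketch; check_ssl_dir is a hypothetical name and
/home/gpadmin/certs in the usage comment is an assumed location.

```shell
# Succeed only if DIR is not / and holds the three files listed above.
check_ssl_dir() {
    dir=$1
    [ "$dir" != "/" ] || { echo "certificate_path must not be /" >&2; return 1; }
    for f in server.crt server.key root.crt; do
        [ -f "$dir/$f" ] || { echo "missing $dir/$f" >&2; return 1; }
    done
}

# Usage (not run here):
#   check_ssl_dir /home/gpadmin/certs &&
#       gpfdist -d /var/load_files -p 8081 --ssl /home/gpadmin/certs &
```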

-c <config_file>

Configuration file for transformations. The option config_file specifies
the location of the transformation configuration file, passed to gpload via -c.
The gpfdist configuration is expected to be a YAML file with the following format:
--- 
VERSION: 1.0.0.1 
TRANSFORMATIONS: 
  transformname1:
      TYPE:    input | output
      COMMAND: command1
      CONTENT: data | paths
      SAFE:    posix-regex

  transformname2: 
      TYPE:    input | output
      COMMAND: command2
  ...
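
As an illustration of the format, the sketch below writes a
configuration with a single input transformation. The function name
write_transform_config, the transformation name to_text, the script
input_transform.sh, and the file name gpfdist_transform.yaml are all
hypothetical. The %filename% placeholder follows the convention used in
Greenplum's gpfdist transformation documentation and is assumed to
apply here as well.

```shell
# Write a one-transformation gpfdist configuration to the given path.
write_transform_config() {
    cat > "$1" <<'EOF'
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  to_text:
      TYPE:    input
      COMMAND: /bin/bash input_transform.sh %filename%
      CONTENT: data
EOF
}

# Usage (not run here):
#   write_transform_config ./gpfdist_transform.yaml
#   gpfdist -c ./gpfdist_transform.yaml -d /var/load_files -p 8081 &
```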

-v (verbose)

Verbose mode shows progress and status messages.


-V (very verbose)

Very verbose mode shows all output messages generated by this utility.


--version 

Prints out the version of this utility.


-?
--help 

Displays online help.

*****************************************************
RUNNING GPFDIST AS A WINDOWS SERVICE
*****************************************************

HAWQ Loaders allow gpfdist to run as a Windows Service.

Follow the instructions below to download, register and
activate gpfdist as a service: 

1. Update your HAWQ Loader package to the latest 
   version. This package is available from the 
   EMC Download Center (https://emc.subscribenet.com) 

2. Register gpfdist as a Windows service:
   * Open a Windows command window
   * Run the following command (enter it on a single line):
      sc create gpfdist binpath= "path_to_gpfdist.exe -p 8081
      -d External\load\files\path -l Log\file\path"
      
     You can create multiple instances of gpfdist by 
     running the same command again, with a unique 
     name and port number for each instance, for example:
       sc create gpfdistN binpath= "path_to_gpfdist.exe
       -p 8082 -d External\load\files\path -l Log\file\path"

3. Activate the gpfdist service:
   * Open the Windows Control Panel and select 
     Administrative Tools>Services.
   * Right-click the gpfdist service in the list
     of services.
   * Select Properties from the right-click menu.
     The Service Properties window opens.
     Note that you can also stop this service
     from the Service Properties window.
   * Optional: Change the Startup Type to 
     Automatic (after a system restart, this 
     service will be running), then under Service 
     status, click Start.
   * Click OK.
Repeat the above steps for each instance of 
gpfdist that you created. 


*****************************************************
EXAMPLES
*****************************************************

Serve files from a specified directory using port 8081 
(and start gpfdist in the background):

gpfdist -d /var/load_files -p 8081 &


Start gpfdist in the background and redirect output and 
errors to a log file:

gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &


To stop gpfdist when it is running in the background:

--First find its process id:

ps ax | grep gpfdist 

  OR, on Solaris:

ps -ef | grep gpfdist

--Then kill the process, for example:

kill 3456
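

As an alternative to searching the process list, a wrapper script can
record the process ID when gpfdist is started and use it later to stop
the process. The sketch below is one way to do this; start_bg, stop_bg,
and the PID file path are hypothetical and not part of gpfdist.

```shell
# Start a command in the background and save its process ID in PIDFILE.
start_bg() {
    pidfile=$1; shift
    "$@" &
    echo $! > "$pidfile"
}

# Stop the process recorded in PIDFILE and remove the file.
stop_bg() {
    pidfile=$1
    pid=$(cat "$pidfile") || return 1
    kill "$pid" 2>/dev/null
    rm -f "$pidfile"
}

# Usage (not run here):
#   start_bg /tmp/gpfdist_8081.pid gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log
#   stop_bg  /tmp/gpfdist_8081.pid
```

Using one PID file per port keeps multiple gpfdist instances on the
same host separate.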


*****************************************************
SEE ALSO
*****************************************************

CREATE EXTERNAL TABLE
gpload
