Java GridFTP client - programmer guide 

terminology


purpose

This document describes the Java package org.globus.ftp, developed at the DSL of Argonne National Lab. The package is a client-side interface to the FTP and GridFTP protocols.
Class FTPClient is the main interface to the FTP client-side functionality and implements the following features:

Class GridFTPClient is the main interface to the GridFTP client-side functionality. In addition to the above characteristics of FTPClient, it implements the following features:

org.globus.ftp package overview

layered structure

Package ftp provides a low-level interface to FTP and GridFTP. Instead of hiding the protocol features, the package exposes them to a certain extent, so that the user can achieve everything that would otherwise be possible by working with the raw protocol. Certain low-level concepts are hidden, however, and the user does not have to deal with them. For instance, the user does not need to send FTP commands or parse replies, but still has to call methods that have the effect of issuing a single command.

Conceptually, the client is composed of three layers.

Layer 1

This highest layer provides the user interface, represented by class FTPClient, its subclass GridFTPClient, and the other classes present in org.globus.ftp. Layer 1 classes are intended for direct use.

Layer 2

Layer 2 implements the basic control protocol concepts (control channel, command, response, etc.) and provides an interface to data channel management (FTPServerFacade). It is called the "server facade" because its functionality is common to the server and the client: in active transfer mode, the client acts as a server and vice versa. So FTPServerFacade is really an internal FTP server.

Layer 3

Data channel management is more complex than control channel management. Inside FTPServerFacade there is a third layer of objects representing low-level data channel functions: data pathways (class DataChannel), and readers and writers associated with the various transfer types and modes. GridFTPServerFacade also handles parallelism and striping by servicing multiple data pathways at the same time. Note that Layer 3 classes are only needed if the data channel is in use, which happens during a client-server transfer or a LIST command.
[Figure: layers]


Only Layer 1 classes (belonging to org.globus.ftp) are intended for direct use. Layer 2 and 3 classes are designed as internal classes. However, since Layer 2 conveniently abstracts the control and data channels, it might be useful for advanced development that needs direct access to them.

package structure

threading 

FTP transfers are handled in a threaded manner. Although thread operations are hidden from the user, it may be useful to have an idea of how they work. Here are the basic concepts:
The user interface of FTPClient and GridFTPClient is not thread safe.

exceptions

In package ftp, the programmer is most likely to encounter the following exceptions:
Apart from the last two, which belong to the standard Java library, all exceptions used by package ftp are subclasses of FTPException and inherit its features:
  1. The exception code can be used to identify the problem more precisely. Exception codes are defined within each exception class (look at the source code). For example, in ClientException, code 8 (ClientException.BAD_MODE) indicates that the client refused the operation because of a bad transfer mode, while code 13 (ClientException.BAD_TYPE) indicates that the refusal was caused by a bad transfer type. To retrieve the exception code programmatically, use exception.getCode().
  2. Exception nesting can be used to track the root of exceptions that come from lower software layers. It is explained below.
The following example illustrates the concept of exception nesting (embedding, chaining).
FTPClient contains two lower-layer entities: FTPControlChannel and FTPServerFacade. The user retrieves a remote file by calling FTPClient.get(). This method calls FTPControlChannel.execute() to execute control channel commands, and FTPServerFacade.store() to receive the data on the data connection.
[Figure: exceptions]

If the server refuses the operation and sends a negative reply, execute() will throw UnexpectedFTPReplyException. This exception will be intercepted inside FTPClient.get() and embedded inside a ServerException, so eventually the user receives a ServerException. Without going into details, the user can tell that it was the remote party that caused the problem. If needed, the user can retrieve the embedded exception by calling serverException.getCause(), see that the problem was caused by a negative server reply, and even inspect the reply content.
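
The nesting pattern described above can be illustrated with a minimal, self-contained sketch. It does not use package ftp: SketchReplyException and SketchServerException are hypothetical stand-ins for UnexpectedFTPReplyException and ServerException, and the reply code and text are invented for illustration. Only the standard Java cause chaining is shown.

```java
// Hypothetical stand-ins; these are NOT the classes of package ftp.
final class ExceptionNestingSketch {

    /** Stand-in for the low-level exception raised on a negative reply. */
    static final class SketchReplyException extends Exception {
        private final int replyCode;

        SketchReplyException(int replyCode, String replyText) {
            super(replyCode + " " + replyText);
            this.replyCode = replyCode;
        }

        int getReplyCode() {
            return replyCode;
        }
    }

    /** Stand-in for the high-level exception the user eventually receives. */
    static final class SketchServerException extends Exception {
        SketchServerException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Lower layer: pretends the server answered with a negative reply. */
    static void execute(String command) throws SketchReplyException {
        throw new SketchReplyException(550, "Refused: " + command);
    }

    /** Upper layer: intercepts the low-level exception and embeds it. */
    static void get(String remoteFile) throws SketchServerException {
        try {
            execute("RETR " + remoteFile);
        } catch (SketchReplyException e) {
            throw new SketchServerException("Server refused the request", e);
        }
    }

    public static void main(String[] args) {
        try {
            get("missing.txt");
        } catch (SketchServerException e) {
            // The outer exception tells which party caused the problem ...
            System.out.println("Caught: " + e.getMessage());
            // ... and the embedded one tells what exactly happened.
            System.out.println("Cause:  " + e.getCause().getMessage());
        }
    }
}
```

The caller can stop at the outer exception or, when more detail is needed, cast the cause to the lower-level type and read the reply code from it.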

The problem of delivering meaningful error information has not yet been fully resolved in package ftp and may be improved in the future. For the moment, at least ServerException consistently embeds either an UnexpectedFTPReplyException or an FTPReplyParseException.


FTP operations

For this type of operation, class FTPClient serves as the programmer interface.

FTP third party transfers

In a third party transfer, the typical sequence of actions is the following:
If any of the optional actions is omitted, the default values (as defined by FTP) will be assumed. If the server modes are undefined, the source will be implicitly set to active and the destination to passive. The transfer() method will check that the settings are correct and throw an exception if the desired combination is invalid. For instance, you cannot set both servers to active server mode.
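
The default and the validity check described above can be pictured with a small stand-alone sketch. It is not the code of transfer(): the Mode enum and the resolve() method are hypothetical names, and the choice to reject a pair in which only one side is defined is an assumption of the sketch, since that case is not described here.

```java
// Hypothetical sketch of the server mode check; not the package's own code.
final class ServerModeCheckSketch {

    enum Mode { UNDEFINED, ACTIVE, PASSIVE }

    /**
     * Applies the default and validates the combination.
     * Returns a two-element array: {source mode, destination mode}.
     */
    static Mode[] resolve(Mode source, Mode dest) {
        if (source == Mode.UNDEFINED && dest == Mode.UNDEFINED) {
            // Nothing was set: source becomes active, destination passive.
            return new Mode[] {Mode.ACTIVE, Mode.PASSIVE};
        }
        boolean oneActiveOnePassive =
                (source == Mode.ACTIVE && dest == Mode.PASSIVE)
                        || (source == Mode.PASSIVE && dest == Mode.ACTIVE);
        if (!oneActiveOnePassive) {
            // Assumption of this sketch: every other combination is rejected.
            throw new IllegalStateException(
                    "invalid server modes: source=" + source + ", dest=" + dest);
        }
        return new Mode[] {source, dest};
    }

    public static void main(String[] args) {
        Mode[] defaults = resolve(Mode.UNDEFINED, Mode.UNDEFINED);
        System.out.println(defaults[0] + " -> " + defaults[1]);
        try {
            resolve(Mode.ACTIVE, Mode.ACTIVE);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```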

Example: FTPClientTest class. Look at test3Party(...) method.

FTP client-server transfers

To perform a client-server transfer, you need to supply an interface to the local data storage. If you are sending data, you need to provide a DataSource implementation that reads your file. If you are receiving data, you need to provide a DataSink implementation that saves the received data to a file. Package ftp provides simple implementations of the two: DataSourceStream and DataSinkStream. They should be suitable for most purposes, but users can also supply their own implementations.
This is the typical sequence of actions for an incoming transfer (sometimes referred to as "get"):
The procedure for outgoing ("put") transfers is similar. Instead of a DataSink you have to provide a DataSource, and instead of FTPClient.get() you call FTPClient.put().
When setting the server mode (active/passive), it is important to also set the local server mode to the opposite. The typical syntax is:

client.setPassive();
client.setLocalActive();

It might be argued that the second call is unnecessary and could be made implicitly: because the server is set to passive, the client could automatically set itself to active.
This is true, but note that FTPClient can also be used for third party transfers. A third party operation is fairly trivial from the client's point of view, since the client does not have to operate the data channels. So by default the client assumes third party functionality and stays lightweight. By calling setLocalActive() or setLocalPassive(), you notify the client that you are going to switch to client-server mode. Only then will the client launch the data channel management module (the internal server).

Example: FTPClient2PartyTest class. Look at the testGet(...) and testPut() methods.

Sending and receiving data other than files

In some situations you may want to store the received data in memory or on a device other than a disk file. To achieve this, provide a DataSink implementation that accesses the data destination of your choice. Similarly, if you want to send data from a non-standard source, provide your own DataSource implementation.
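
As a rough illustration of the idea, the following self-contained sketch collects received data in memory. The SketchSink interface is a hypothetical, simplified stand-in for the library's DataSink; the real interface has its own method signatures, which should be checked in the javadoc before adapting the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class InMemorySinkSketch {

    /** Hypothetical, simplified stand-in for the library's DataSink. */
    interface SketchSink {
        void write(byte[] data, int length) throws IOException;

        void close() throws IOException;
    }

    /** Appends every chunk that is written to a growing byte array. */
    static final class MemorySink implements SketchSink {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        private boolean closed;

        @Override
        public synchronized void write(byte[] data, int length) throws IOException {
            if (closed) {
                throw new IOException("sink already closed");
            }
            out.write(data, 0, length);
        }

        @Override
        public synchronized void close() {
            closed = true;
        }

        synchronized byte[] toByteArray() {
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        MemorySink sink = new MemorySink();
        byte[] chunk1 = "hello, ".getBytes(StandardCharsets.US_ASCII);
        byte[] chunk2 = "world".getBytes(StandardCharsets.US_ASCII);
        // A transfer would deliver the data to the sink chunk by chunk.
        sink.write(chunk1, chunk1.length);
        sink.write(chunk2, chunk2.length);
        sink.close();
        System.out.println(new String(sink.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

Because the sketch appends chunks in arrival order, a sink of this kind only fits transfers in which data arrives sequentially.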

transferring multiple files

To transfer multiple files, it is not enough to call FTPClient.get() several times. In FTP stream mode, the data channel is automatically closed after each file transfer. Afterwards, either the FTP PORT command must be issued again, or the client must wait for a certain timeout period. At the level of the FTPClient interface, the first option means that the server mode should be reset after each transfer:

HostPort hp2 = client.setPassive();
client.setLocalActive();

restart markers

Vanilla FTP defines a way to transmit restart markers over the data channel in block or compressed transfer mode. Package ftp does not support these modes and hence does not support restart markers in vanilla FTP. Although it is possible to construct a StreamModeRestartMarker object and invoke FTP REST via the FTPClient.setRestartMarker() method, there is no way to obtain the restart marker from the data channel. GridFTP restart markers are supported, though.

directory listing

The FTP command LIST, unlike most other commands, transfers its data over the data channel. In that respect it is very similar to the file transfer commands. FTPClient.list() requires the same sequence of actions as put() or get(), including setting the transfer and server modes if necessary. The transfer type must be ASCII. FTPClient provides two versions of list(). The one that takes no arguments issues the most common command, "LIST -d *", and returns the parsed data as a vector of FileInfo objects, each of which represents one file. It is possible, however, that some servers will send back LIST data in another format that cannot be parsed into a FileInfo. In this case you have to use the parameterized FTPClient.list(...) and intercept the input to the DataSink interface, just as you would with get().

Example:
FTPClientListTest class. To see how to implement a DataSink, also look at the code of FTPClient.list() without parameters.

advanced file and directory listing with MLST and MLSD

MLST and MLSD (together also known as MLSx) are proposed extensions to FTP that standardize the output of file listing. All servers have to follow the same output format, therefore parsing of the reply is OS independent. MLST allows discovering properties of a single remote file, such as creation date, size, permissions, etc. MLSD is similar to LIST, and also provides these parameters for each listed file.

The Jftp library (CVS version of Oct 2003, to be released) supports the MLSx extensions. Note, however, that to use them you also need an FTP or GridFTP server that supports MLSx.

To obtain programmatic access to information about a single remote file, use the FTPClient.mlst() method. The resulting MlsxEntry object will contain all the information returned by the server.

Example:  MlsxTest.testMlst()

To obtain a remote directory listing with similar properties, use the FTPClient.mlsd() method, which returns a Vector of MlsxEntry objects, each corresponding to one file in the directory.

Example: MlsxTest.test3()

Example of extracting information from MlsxEntry: MlsxTest.testMlsxEntry()

aborting the running transfer

FTPClient provides the abort() command, which unfortunately has problems. RFC 959 requires that ABOR be sent with a Telnet interrupt sequence, which requires the TCP urgent notification flag. Unfortunately, the package has to remain compatible with Java 1.3, which does not support urgent out-of-band TCP messages. For this reason FTPClient does not set the urgent flag as required. As a consequence, the server may see the ABOR only after the transfer completes.

An orthogonal issue worth noting is that if you abort the transfer, the server will most probably return a negative reply to the transfer command, which on the API level will be translated to a ServerException thrown by the transfer command that you were calling. You will have to handle this exception.

We currently do not advise using abort().

other FTP commands

Equivalents of other FTP operations are accessible through methods of FTPClient. They do not use the data channel and so are simple to understand. They mostly implement remote file system operations and set various transfer options.

Example:
FTPClientTest class. Look at all public methods.

GridFTP operations

GridFTPClient is the interface for this type of actions.

extended block mode

Extended block mode is one of the focal concepts of GridFTP. Mode E allows for striped and parallel transfers. Package ftp supports Mode E only with transfer type IMAGE. It can be turned on like this:

client.setType(GridFTPSession.TYPE_IMAGE);

client.setMode(GridFTPSession.MODE_EBLOCK);

security

RFC 2228 [3] defines control channel authentication and data channel protection mechanisms. GridFTP [1] also provides the DCAU command for data channel authentication. By default, the data channel is authenticated.
Package ftp supports GSI-based authentication of both channels, and protection of the data channel. Two data channel authentication modes are supported: SELF and NONE. Typically, setting the security modes requires the following actions:

To authenticate to GridFTP server (using CoG 0.9.13/1.0):

client.authenticate(GlobusProxy.getDefaultUserProxy());

To authenticate to GridFTP server (using CoG 1.1):

client.authenticate(null);

To enable data channel security with integrity protection:

client.setProtectionBufferSize(16384);

client.setDataChannelAuthentication(DataChannelAuthentication.SELF);

client.setDataChannelProtection(GridFTPSession.PROTECTION_SAFE);

Example:
FTPClient2PartyTest class. Look at testPut() and testGet().

third party and client-server transfers in GridFTP

The procedure is similar to that of FTPClient, with the following differences:

Example:
GridFTPClientTest.test3PartyModeE() demonstrates third party transfer, and GridFTPClient2PartyTest.get() and put() do client-server transfers.

parallel transfers

One of the extensions of GridFTP is parallel transfer. If mode E is used, the active side can form several connections to the passive side. The data is then transferred in parallel streams. GridFTP allows the desired number of parallel streams to be declared by sending options to the RETR command, containing three values: starting, maximum and minimum parallelism. Parallelism in this context designates the number of data channel pathways that can be used at the same time.
Package ftp supports parallel transfers in both third party and client-server modes. Starting, maximum and minimum parallelism must be equal. For instance, to declare a parallelism of 5, use the following notation:

client.setOptions(new RetrieveOptions(5));

To parallelize your GridFTP transfer, you must also use an implementation of DataSink or DataSource that supports random data access. DataSinkStream and DataSourceStream are not suitable here. Use FileRandomIO or supply your own implementation.
Also bear in mind that you will have to use mode E.
In the case of a two-party transfer, parallelism should be chosen with caution. The advantage of having multiple streams has mostly to do with low-level TCP procedures and is also related to the TCP window size. Doubling the number of parallel streams will not necessarily double the performance; beyond a certain point, performance will actually decrease. The current implementation of package ftp handles each data pathway in a separate thread, so unless your machine has multiple CPUs, increasing parallelism only adds computing overhead.

Example: GridFTPClientTest.test3PartyModeE() demonstrates third party transfer, and GridFTPClient2PartyTest.get() and put() do client-server transfers.

striped client-server transfers

GridFTP supports striped transfers. In vanilla FTP, the passive side listens on one server socket and the active side connects to it. In GridFTP striped mode, the server can listen on more than one socket, and these sockets can be distributed across several machines. What's more, if parallelism is used, the client forms several connections to each of the listening sockets. For example, if the server listens on 2 sockets and parallelism is set to 5, then 2 x 5 = 10 data pathways will eventually be formed.
Package ftp supports striped transfers in both third party and client-server modes. However, a word of caution is necessary here. Package ftp has been tested against the GridFTP server available with Globus Toolkit 2.0. This server implements striping in a rudimentary form and always returns only one socket address in the response to the SPAS command. Thus, although every effort has been made to ensure the correctness of the code, it cannot be guaranteed to have been tested thoroughly.
The procedure for a striped transfer is very similar to that of a non-striped one, with three exceptions.
  1. To set the server mode, you have to use the striped counterparts of the mode setting methods, distinguishable by "Striped" in their names (for example setStripedPassive()).
  2. You are using mode E, so for the transfer use the methods with the extended prefix (for example extendedTransfer()).
  3. Remember to use FileRandomIO instead of DataSinkStream or DataSourceStream (look at the section "parallel transfers").
To summarize, the typical procedure for a striped client-server transfer is:
Example: GridFTPClient2PartyStripingTest demonstrates client-server file storage and retrieval. Look at methods put() and get().

striped third party transfer

A typical striped third party transfer looks similar to a striped client-server transfer. The same initial settings need to be issued to both servers. Then you would write:

HostPortList hpl = dest.setStripedPassive();

source.setStripedActive(hpl);

source.extendedTransfer(sourceFile, dest, destFile, null);

Finally you would close the servers.
You have to explicitly call setStripedPassive() and setStripedActive().

Example:
GridFTPClientTest.test3PartyModeE() demonstrates third party striped transfer.

transferring multiple files in Mode E

Opening a data connection is a time-consuming operation, so GridFTP allows for data connection caching. Data pathways can be opened only once and handle many transfers before being closed. However, all data pathways will be cleared when PASV, PORT, SPAS or SPOR is subsequently issued.
Package ftp supports connection caching for both third party and client-server transfers. It is done automatically by the server and the client whenever Mode E is in use. However, to ensure that the data channel is not torn down after each transfer, make sure that between the transfer requests (such as get(), put(), transfer(), etc.) you do not issue any of the commands that alter the server mode, such as setActive(), setPassive(), setStripedActive() or setStripedPassive(). Any such call would tear down the existing data channels.

This sequence will transfer the two files over the same reused data pathway:

source.extendedTransfer(sourceFile1, dest, destFile1, null);

source.extendedTransfer(sourceFile2, dest, destFile2, null);


To explicitly reset the data channels, insert one of the commands altering the server mode between the transfer requests, like this:

source.extendedTransfer(sourceFile1, dest, destFile1, null);

source.setStripedActive(dest.setStripedPassive());

source.extendedTransfer(sourceFile2, dest, destFile2, null);

Examples: MultipleTransferTest.test3PartyMultipleTransfersModeE() demonstrates data channel reuse for third party transfers. MultipleTransferTest.test2PartyMultipleTransfersModeE() and MultipleTransferTest.test3PartyMultipleTransfers() demonstrate the explicit tearing down of data connections. DataChannelReuseTest and DataChannelReuseVarParTest demonstrate data channel reuse in 2-party mode.


restarting a third party transfer

GridFTP deprecates the FTP mechanism of block mode data channel restart markers. Instead it introduces control channel restart markers periodically sent by the server as 111 replies. Currently there is no mechanism for the client to control the frequency of restart markers.
Package ftp supports the GridFTP restart mechanism. The program needs to take the following actions:

byteRangeList.merge(marker.toVector());


client.setRestartMarker(byteRangeList);

source.extendedTransfer(sourceFile, dest, destFile, listener);

The difference between GridFTPRestartMarker and ByteRangeList needs explanation. GridFTP allows the server to send the restart byte ranges in any order, possibly even overlapping each other. They can arrive in that shape inside one restart marker or in many restart markers, and it is the client's responsibility to merge them into one. GridFTPRestartMarker stores the byte ranges that arrived in one marker, without processing them in any way. ByteRangeList can be used to process and merge all the received byte ranges from one or many markers.
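
The kind of merging that the client is responsible for can be sketched independently of the library. The code below is not the implementation of ByteRangeList. It assumes that a range is an inclusive pair of byte offsets [from, to] with non-negative values, and that ranges which overlap or directly touch should be combined; RangeMergeSketch and merge() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RangeMergeSketch {

    /** Merges inclusive [from, to] ranges that overlap or touch each other. */
    static List<long[]> merge(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>();
        for (long[] r : ranges) {
            sorted.add(new long[] {r[0], r[1]}); // copy, so the input stays intact
        }
        sorted.sort(Comparator.comparingLong(r -> r[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] r : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] - 1 <= last[1]) {
                last[1] = Math.max(last[1], r[1]); // extend the previous range
            } else {
                merged.add(r); // gap before this range: start a new one
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<long[]> received = new ArrayList<>();
        received.add(new long[] {50, 79}); // ranges may arrive out of order
        received.add(new long[] {0, 29});
        received.add(new long[] {20, 49}); // and may overlap each other
        for (long[] r : merge(received)) {
            System.out.println(r[0] + "-" + r[1]); // prints "0-79"
        }
    }
}
```

Sorting by start offset first keeps the merge a single pass: each range either extends the last merged range or starts a new one.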

Example: MarkerTest demonstrates how to implement MarkerListener, process restart markers, merge them into ByteRangeList and retrieve the final information from it. For more information on byte range merging, also look at ByteRangeListTest and ByteRangeList javadoc and code.

restarting a client-server transfer

If using mode E, restart of a client-server transfer can be done in the following way.

client.setRestartMarker(byteRangeList);

source.extendedTransfer(sourceFile, dest, destFile, listener);

performance monitoring

GridFTP defines performance markers that can periodically be sent by the server. Package ftp supports GridFTP performance monitoring and can parse the following tags of performance markers:
The way to get hold of performance marker information is analogous to obtaining restart information. You need to implement a MarkerListener that would intercept PerfMarker objects.

Example:
MarkerTest demonstrates how to do it. You can also look at PerfMarkerTest that tests PerfMarker methods that are pretty trivial.

other GridFTP features

Other guidelines

Use separate client instances for third party and client-server transfers

Third party transfers and client-server transfers are usually not used together, and most users will need only one of them. In theory it is possible to use the same FTPClient or GridFTPClient instance for both, but this is discouraged. Although client-server and third party transfer procedures look alike from the user's point of view, their underlying functionality differs and they use different resources (TCP sockets, memory, threads). Using the same client for both functions mixes up its internal state and can potentially cause errors. There should be no need to do this. Just as you have a separate Client instance for each server, you should also use one Client instance for transfers between server X and other servers, and another instance for transfers between server X and your local machine.

do not forget to close

When you are finished with the transfers, you must call close(). This will release the resources, including the threads. Package ftp uses an aggressive threading policy, with a separate thread for each data channel path. It is easy to run out of memory quickly by creating several Client instances and forgetting to close them.

transfers with old GSIFTP servers

GSIFTP is a predecessor of GridFTP. The GSIFTP server was released in early versions of Globus Toolkit (2001). GridFTP is not fully interoperable with GSIFTP in one respect: in GSIFTP there is no data channel authentication.
Package ftp is compatible with both GridFTP and GSIFTP. To communicate successfully with a GSIFTP server, before starting data channel operations you need to inform the GridFTPClient that data channel authentication (which is the default in GridFTP) should not be used:

client.setLocalNoDataChannelAuthentication();

This is a local command and is not sent to the server. The server would not understand it.

transfer timeout

At the beginning of the transfer, a critical moment is the period of waiting for the initial server reply. If it does not arrive within a certain time, the client assumes that the transfer could not start for some reason and aborts the operation. If necessary, the timeout parameters can be changed by calling FTPClient.setClientWaitParams().
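
The waiting behaviour can be pictured as a polling loop with a deadline. The sketch below is a generic illustration and not the package's implementation: waitForReply(), maxWaitMillis and pollDelayMillis are hypothetical names, and the actual parameters of setClientWaitParams() should be looked up in the javadoc.

```java
import java.util.function.BooleanSupplier;

final class ReplyWaitSketch {

    /**
     * Polls until replyArrived reports true or maxWaitMillis has elapsed.
     * Returns true if the reply arrived in time, false on timeout.
     */
    static boolean waitForReply(BooleanSupplier replyArrived,
                                long maxWaitMillis,
                                long pollDelayMillis) throws InterruptedException {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (!replyArrived.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up: the transfer apparently did not start
            }
            Thread.sleep(pollDelayMillis);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Simulated server that answers after roughly 200 ms.
        BooleanSupplier reply = () -> System.currentTimeMillis() - start >= 200;
        System.out.println("reply in time: " + waitForReply(reply, 1000, 20));
        // A reply that never comes runs into the timeout.
        System.out.println("reply in time: " + waitForReply(() -> false, 100, 20));
    }
}
```

A longer maximum wait tolerates slow servers at the price of a later failure report, while a shorter polling delay reacts faster but wakes the waiting thread more often.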

Developer notes

testing

Package ftp has been unit tested with junit (http://www.junit.org/index.htm). Comprehensive package testing requires several resources, and the setup is not trivial. You need two GridFTP servers, at least one of which allows data writing, and two FTP servers, ideally one of which also allows writing. Here is the testing setup in short:
Some notes on configuring FTP and GridFTP servers are here.
It is also possible to perform partial tests of individual classes and methods by manually running classes from the test package.

debugging

The package uses the log4j logging utility (http://jakarta.apache.org/log4j/docs/index.html). Most classes have their own loggers.
To suppress debug and informational messages, create a file log4j.properties and set the root log level to WARN. Read more about configuring logging in the log4j web manual.

Hint: for solving most problems that require debugging, it is a good idea to enable DEBUG level for  org.globus.ftp.vanilla.FTPControlChannel and org.globus.ftp.vanilla.Reply. You will then see the communication to and from the server.

your comments and contact list

If you have read this far, you most likely need to use package ftp for some purpose. We are always glad to hear about new users and their needs. Please tell us about:
Java CoG Kit, developed in Argonne National Lab:  http://www.globus.org/cog/java/index.html
Mailing list: http://www.globus.org/cog/mailinglists.html
Main contacts for Java CoG Kit are Gregor von Laszewski and Jarek Gawor.
Package ftp has been created by Pawel Plaszczak, Jarek Gawor and Peter Lane.
Java CoG Kit is part of the Globus Project: http://www.globus.org


references

[1] GridFTP: Protocol Extensions to FTP for the Grid (draft of April 2002)
[2] Postel, J. and Reynolds, J. "FILE TRANSFER PROTOCOL (FTP)", RFC 959
[3] Horowitz, M. and Lunt, S. "FTP Security Extensions", RFC 2228
[4] Hethmon, P. and Elz, R. "Feature negotiation mechanism for the File Transfer Protocol", RFC 2389