This document includes recommendations for increasing the scalability and performance of WS GRAM in a Grid.
The GRAM4 service or its container can run out of memory with lower heap settings. For this reason, set the maximum container heap size to 1 GB:
GLOBUS_OPTIONS="-Xms256M -Xmx1024M"
The account the container runs under, typically "globus", can run out of open file descriptors. For this reason, set the open file descriptors to 16,384.
Specific settings can vary per operating system; for a "globus" user on Red Hat / RHEL based distributions, add the following to /etc/security/limits.conf:
    globus hard nofile 16384
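To confirm that the new limit is in effect, something like the following could be run as the container account in a fresh login session (limits.conf is normally applied at login). This is a minimal sketch: the helper name `limit_ok` is hypothetical, and only the standard `ulimit` shell builtin is assumed.

```shell
#!/bin/sh
# limit_ok CURRENT REQUIRED (hypothetical helper): succeeds when CURRENT,
# as printed by ulimit, is "unlimited" or numerically at least REQUIRED.
limit_ok() {
    [ "$1" = "unlimited" ] && return 0
    [ "$1" -ge "$2" ]
}

required=16384                # value recommended above
hard_limit="$(ulimit -Hn)"    # hard limit on open files for this shell

if limit_ok "$hard_limit" "$required"; then
    echo "hard nofile limit OK: $hard_limit"
else
    echo "hard nofile limit too low: $hard_limit (want $required)"
fi
```

The limits.conf entry above sets the hard limit, so the sketch inspects `ulimit -Hn`; `ulimit -Sn` shows the soft limit that processes actually start with.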
The GRAM4 service stores the per-job metadata used for crash recovery in files on disk. By default, the container account's home directory is used, specifically ~/.globus/persisted/. Often this home directory is not located on a local disk, but on NFS. NFS is not needed for this purpose and can negatively affect performance. For this reason, configure the container to use a local disk:
    GLOBUS_OPTIONS="-Dorg.globus.wsrf.container.persistence.dir=/use/this/path"
Make sure this does not overwrite the memory settings above. Both settings can be provided in the same GLOBUS_OPTIONS variable:
GLOBUS_OPTIONS="-Xms256M -Xmx1024M -Dorg.globus.wsrf.container.persistence.dir=/use/this/path"
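If the options are set in a shell startup or wrapper script, one way to avoid clobbering earlier settings is to append to the variable instead of reassigning it. The sketch below assumes a POSIX shell; the variable name `persist_opt` is only illustrative, and /use/this/path is the placeholder path from above.

```shell
#!/bin/sh
# Start from the memory settings recommended above.
GLOBUS_OPTIONS="-Xms256M -Xmx1024M"

# Append rather than reassign. The ${VAR:+...} expansion adds the
# separating space only when GLOBUS_OPTIONS is already non-empty.
persist_opt="-Dorg.globus.wsrf.container.persistence.dir=/use/this/path"
GLOBUS_OPTIONS="${GLOBUS_OPTIONS:+$GLOBUS_OPTIONS }$persist_opt"
export GLOBUS_OPTIONS

echo "$GLOBUS_OPTIONS"
```

Because the append form works whether or not the variable was set before, the same line can be reused for any further JVM options.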
The container can run out of container threads, resulting in client-side timeouts. The default is too low in GT 4.0.5 and earlier releases. We recommend these settings as part of the global configuration in $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd:
    <globalConfiguration>
        ...
        <parameter name="containerThreads" value="20"/>
        <parameter name="containerThreadsMax" value="50"/>
        <parameter name="containerThreadsHighWaterMark" value="10"/>
        ...
    </globalConfiguration>
For more information, see the global configuration section of the Java WS Core documentation.
4.0.5+ only: To significantly improve the performance of jobs that include file staging, set the value of the parameter enableLocalInvocations to true in the homeConfiguration of the ManagedJobFactoryService in $GLOBUS_LOCATION/etc/gram-service/jndi-config.xml. However, this can only be done if WS-GRAM and RFT, which is used for staging, are co-located in the same container:
    ...
    <parameter>
        <name>enableLocalInvocations</name>
        <value>true</value>
    </parameter>
    ...
To avoid client-side timeouts in job submissions to a GT4 server under heavy load, the default timeout of 2 minutes can be increased. To do this with globusrun-ws, add the -T <milliseconds> option to the job submission command; see globusrun-ws -help for further information. To do this with the Java API, call the method setTimeOut() on the GramJob object:
    ...
    GramJob gramJob = new GramJob(...);
    gramJob.setTimeOut(300000); // value in milliseconds
    ...
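Since -T takes milliseconds, it can help to compute the value from minutes in a wrapper script. The following is a sketch under stated assumptions: `minutes_to_ms` is a hypothetical helper, /bin/true stands in for a real job, and the command is only printed so that the snippet also runs on a machine without a Globus client.

```shell
#!/bin/sh
# minutes_to_ms MINUTES (hypothetical helper): print the equivalent
# number of milliseconds.
minutes_to_ms() {
    echo $(( $1 * 60 * 1000 ))
}

# Five minutes, matching the 300000 ms used in the Java snippet above.
timeout_ms="$(minutes_to_ms 5)"

# Print the submission command instead of running it.
# /bin/true is a placeholder job.
echo "globusrun-ws -submit -T $timeout_ms -c /bin/true"
```

Check the exact option placement against globusrun-ws -help before using the printed command.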
If Condor-G is used as the client, make sure to use Condor version 6.9.3 or higher. In this version, communication with the WS-GRAM service has been improved by consolidating multiple web service operations into one. Also, Condor-G includes its own version of the GT-related Java archives. Versions 6.9.2 and earlier contain an old version of these archives that causes occasional security errors when submitting to a 4.0.5 WS GRAM service.
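A deployment script could guard against an outdated Condor-G with a simple version comparison. The sketch below is not part of Condor or Globus: `version_ge` is a hypothetical helper, and it assumes the version string has exactly three numeric fields (for example, as copied from the output of condor_version).

```shell
#!/bin/sh
# version_ge A B (hypothetical helper): succeeds when the dotted version A
# is greater than or equal to B. Three numeric fields are assumed.
version_ge() {
    IFS=. read -r a1 a2 a3 <<EOF
$1
EOF
    IFS=. read -r b1 b2 b3 <<EOF
$2
EOF
    [ "$a1" -gt "$b1" ] && return 0
    [ "$a1" -lt "$b1" ] && return 1
    [ "$a2" -gt "$b2" ] && return 0
    [ "$a2" -lt "$b2" ] && return 1
    [ "$a3" -ge "$b3" ]
}

installed="6.9.2"    # example value; substitute the installed version

if version_ge "$installed" "6.9.3"; then
    echo "Condor $installed is recent enough"
else
    echo "Condor $installed is older than 6.9.3; upgrade recommended"
fi
```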
