GT 5.2.1 Release Notes: GRAM5


1. Component Overview

The Grid Resource Allocation and Management (GRAM5) component is used to locate, submit, monitor, and cancel jobs on Grid computing resources. GRAM5 is not a Local Resource Manager, but rather a set of services and clients for communicating with a range of different batch/cluster job schedulers using a common protocol. GRAM5 is meant to address a range of jobs where reliable operation, stateful monitoring, credential management, and file staging are important.

2. Feature summary

New features since 5.2.0:

  • Better integration with Linux operating systems with native RPM and Debian packages
  • Improved logging and integration with system log rotation tools
  • Improved scalability and reliability

Other Standard Supported Features

  • Remote job execution and management
  • Uniform and flexible interface to local resource managers
  • File staging before and after job execution
  • File and directory clean up after job termination
  • Service auditing for each submitted job

Removed Features

  • Condor SEG module is no longer included. Its functionality has been moved into the core of the job manager program.

3. Summary of Changes in GRAM5

3.1. New Features: GRAM5

  • GRAM-294: New RSL Attribute "expiration" to clean up files for old jobs
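
As a rough illustration of how the new attribute might appear in a job description, the sketch below builds an RSL string that sets "expiration" alongside an executable. The helper names (rsl_quote, build_rsl), the executable path, and the value 3600 are hypothetical; in particular, the unit and accepted range of "expiration" are assumptions that should be checked against the GRAM5 RSL documentation for this release.

```python
def rsl_quote(value):
    """Render a value as a double-quoted RSL literal.

    An embedded double quote is escaped by doubling it.
    """
    return '"' + str(value).replace('"', '""') + '"'


def build_rsl(attributes):
    """Render a dict of attribute -> value as one RSL conjunction (&...)."""
    relations = "".join(
        "({}={})".format(name, rsl_quote(value))
        for name, value in attributes.items()
    )
    return "&" + relations


job = build_rsl({
    "executable": "/bin/date",
    # Hypothetical value: the unit and valid range of "expiration" are
    # assumptions, not taken from these release notes.
    "expiration": 3600,
})
print(job)
```

The resulting string could then be passed to a GRAM client in place of a hand-written job description.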

3.2. Improvements: GRAM5

  • GRAM-272: Allow site-specific RVF entries

4. Fixed Bugs for GRAM5

  • GRAM-321: globus-job-manager emits warning about all jobs on restart
  • GRAM-230: globus-gatekeeper does not reap children in threaded mode
  • GRAM-232: Incorrect directory permissions cause an infinite loop
  • RIC-205: Missing directories $GLOBUS_LOCATION/var/lock and $GLOBUS_LOCATION/var/run
  • GRAM-296: Compile Failure on Solaris
  • GRAM-297: job manager service definitions contain unresolved variables
  • GRAM-299: Not all job log messages obey loglevel RSL attribute
  • GRAM-300: GRAM job manager doxygen refers to obsolete command-line options
  • GRAM-301: GRAM validation file parser doesn't handle empty quoted values correctly
  • GRAM-302: Incorrect error when state file write fails
  • GRAM-303: Gatekeeper's syslog output cannot be controlled
  • GRAM-305: Jobmanager reporting DONE status when stage-out failed
  • RIC-226: Some dependencies are missing in GPT metadata
  • GRAM-306: Job Manager stdio_size query logging crash
  • GRAM-309: GRAM5 doesn't work with IPv4 only gatekeepers
  • GRAM-310: sge configure script error
  • GRAM-311: Undefined variable defaults in shell scripts
  • GRAM-312: Make crontab not fail if the package is uninstalled
  • GRAM-315: Job locking doesn't handle ENOENT gracefully
  • RIC-239: GSSAPI Token inspection fails when using TLS 1.2
  • GRAM-317: job manager fails transferring job between processes if the proxy is larger than the socket buffer
  • GRAM-318: Periodic lockup of SEG
  • GRAM-323: RVF parser leaks file descriptors
  • GRAM-325: job manager crashes when reading empty condor log
  • GRAM-326: Can't renew job proxy after GLOBUS_GRAM_PROTOCOL_ERROR_COMMIT_TIMED_OUT error
  • GRAM-328: job manager waits for two-phase delay when stopping
  • GRAM-330: Buffer overflow in globus_gram_job_manager_seg_parse_condor_id
  • GRAM-334: job manager doesn't work if unix socket path is too long
  • GRAM-335: init scripts fail on solaris because of stop alias
  • GRAM-336: Job manager can't guess osname on some operating systems
  • GRAM-337: GRAM job manager config file has unresolved paths
  • GRAM-338: GRAM job manager mishandles peer name when proxying messages through the gatekeeper
  • GRAM-339: globus-job-run and globus-job-submit can't always handle "-e" as an argument
  • GRAM-340: job manager crashes during stdio size query
  • GRAM-341: globusrun ignores state callbacks that occur too early
  • GRAM-342: intra-job manager protocol doesn't do signal-safe reads
  • GRAM-343: lrm packages grid-service files aren't in CLEANFILES

5. Known Problems in GRAM5

  • GRAM-320: globus-gatekeeper leaks logfile to globus-job-manager
  • GRAM-105: Held Condor jobs should be reported as SUSPENDED
  • GRAM-138: GRAM5 job manager uses a lot of memory when SEG is pointed to incorrect log path
  • GRAM-231: audit not working when proxy expires
  • GRAM-237: Fork LRM doesn't include softenv RSL attribute in rvf file
  • GRAM-238: GRAM Fork LRM's softenv implementation doesn't work without SEG
  • GRAM-291: RSL eval doesn't indicate what symbol was not found

6. Technology dependencies

GRAM depends on the following GT components:

  • Globus Common
  • GSI C
  • GridFTP server

7. Tested platforms

Tested platforms for GRAM5:

  • Linux

    • CentOS 5, 6 x86_64, i386
    • Fedora 15, 16 x86_64, i386
    • Red Hat Enterprise Linux 5, 6 x86_64, i386
    • Scientific Linux 5, 6 x86_64, i386
    • Debian 6, 7 (testing) x86_64, i386
    • Ubuntu 10.04LTS, 10.10, 11.04, 11.10, 12.04LTS (testing) x86_64, i386

  • Mac OS X

    • Mac OS X 10.7 (Lion)

  • Solaris

    • Solaris 11

8. Backward compatibility summary

Protocol changes in GRAM since the GT4 series:

  • The GRAM5 service uses a superset of the GRAM2 protocol for communication between the client and service. The extensions supported in GRAM5 are implemented in such a way that they are ignored by GRAM2 services or clients. These extensions provide improved error messages and version detection.
  • GRAM5 does not support task coallocation using DUROC and its related protocols. Jobs submitted using DUROC directives will fail.
  • GRAM5 does not support file streaming. The standard output and standard error streams are sent after the job completes instead of during execution. As a special case, to support the Condor grid monitor program, GRAM5 implements a small subset of the streaming capabilities of GRAM2 in GT 4.2.x.

9. Associated Standards

None

10. For More Information

See GRAM5 for more information about this component.

Glossary

L

Local Resource Manager (LRM)

A system that controls access to a compute resource, such as a compute cluster or parallel computer. Such systems provide batch execution interfaces, which GRAM uses to execute jobs. Condor, PBS, and GridEngine are examples of local resource managers.