WS GRAM: Developer's Guide

Fault Tolerance Testing Scenarios

Definitions
------------
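MHE = Master Hosting Environment (the "Master")
UHE = User Hosting Environment
GSH = Grid Service Handle
Kill Master = ctrl-c in Master terminal
Kill UHE = either of 2 methods:
                call admin service shutdown
                kill -9 UHE process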
job completes = client receives DONE notification
submit a job = managed-job-globusrun -factory xxx -file yyy
submit a batch job = managed-job-globusrun -batch -factory xxx -file yyy
FSD = FindServiceData call; managed-job-globusrun -status 
job contact = GSH returned from managed-job-globusrun -batch
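
The command lines defined above can be wrapped in small shell functions so that the test cases are easier to script. This is only a sketch: the function names and the GLOBUSRUN override variable are hypothetical, and passing the job contact as the argument to -status is an assumption, since the definition above does not show the argument.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the commands from the Definitions list.
# GLOBUSRUN may be overridden (for example with a stub) for a dry run.
GLOBUSRUN=${GLOBUSRUN:-managed-job-globusrun}

# "submit a job"
submit_job() {          # usage: submit_job FACTORY RSL_FILE
    "$GLOBUSRUN" -factory "$1" -file "$2"
}

# "submit a batch job"; the job contact (GSH) is returned by this call
submit_batch_job() {    # usage: submit_batch_job FACTORY RSL_FILE
    "$GLOBUSRUN" -batch -factory "$1" -file "$2"
}

# "FSD"; ASSUMPTION: the job contact is passed right after -status
job_status() {          # usage: job_status JOB_CONTACT
    "$GLOBUSRUN" -status "$1"
}
```

With these in place, a test step such as "Submit a job" becomes submit_job xxx yyy, where xxx and yyy are the same placeholders used in the definitions.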

test case 1 - Only MHE crash:
----------------------------------
  - Master is up
  - Submit a job (UHE starts up)
  - Job completes
  - Kill Master
  - Restart Master (UHE still active)
      (Ping should not fail while the UHE is active. An earlier bug made
      ping fail here even though the UHE was up; the fix is committed
      to trunk.)
  - Submit a job
  - Job completes

test case 2 - Both MHE and UHE crash:
----------------------------------
  - Master is up
  - Submit a job (UHE starts up)
  - Job completes
  - Kill Master
  - Kill UHE
  - Restart Master (UHE is restarted)
  - Submit a job
  - Job completes

test case 3 - UHE crash; no jobs:
----------------------------------
  - Master is up
  - UHE is killed (no jobs at the time)
  - Restart handler in MHE restarts UHE
  - Submit a job (UHE starts up)
  - Job completes

test case 4 - UHE crash, long job, subscribe:
----------------------------------
I configured the UHESweeperTask (which periodically sweeps all the UHEs and
pings each one to check whether it is active) to sweep every 5 seconds. I
killed the UHE after finding its pid in the output of ps -efwww | grep for
the UHE process.

  - Master is up
  - Submit a long running sleep job
    (client waits for DONE notification)
  - UHE is killed (1 active job)
  - Restart handler in MHE restarts UHE
  - Job completes on service side
  - Client receives DONE
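
The UHE kill step used in this test case could be scripted roughly as follows. This is a minimal sketch: the function names are hypothetical, and the grep pattern is left as a parameter because the exact pattern used to identify the UHE process is not given here.

```shell
#!/bin/sh
# Read `ps -ef` style lines on stdin and print the PID (second column)
# of every line matching PATTERN, skipping the grep process itself.
pids_matching() {       # usage: ps -efwww | pids_matching PATTERN
    grep -- "$1" | grep -v grep | awk '{print $2}'
}

# Kill every process whose ps line matches PATTERN with kill -9,
# which is the second "Kill UHE" method from the Definitions.
kill_uhe() {            # usage: kill_uhe PATTERN
    for pid in $(ps -efwww | pids_matching "$1"); do
        kill -9 "$pid"
    done
}
```

Splitting the PID lookup from the kill makes it possible to check which processes would be hit (ps -efwww | pids_matching PATTERN) before actually killing anything.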

test case 5 - UHE crash, long job, FSD: 
----------------------------------
  - Master is up
  - Submit a long running sleep job - batch
  - UHE is killed (1 active job)
  - Restart handler in MHE restarts UHE
  - Client does an FSD with job contact (GSH) and gets ACTIVE
  - Job completes on service side
  - Client does an FSD with job contact (GSH) and gets DONE
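
The two FSD checks in this test case can be automated with a polling loop. The sketch below assumes that managed-job-globusrun -status takes the job contact as its argument and prints output containing the state name (ACTIVE, DONE); neither is confirmed by this page. The function name, retry count, and interval are hypothetical.

```shell
#!/bin/sh
# GLOBUSRUN may be overridden (for example with a stub) for a dry run.
GLOBUSRUN=${GLOBUSRUN:-managed-job-globusrun}

# Poll the job state until the status output contains WANTED.
# Returns 0 on a match, 1 if TRIES attempts pass without one.
# ASSUMPTION: -status accepts the job contact and prints the state name.
wait_for_state() {      # usage: wait_for_state JOB_CONTACT WANTED [TRIES] [INTERVAL_SECONDS]
    contact=$1
    wanted=$2
    tries=${3:-60}
    interval=${4:-5}
    while [ "$tries" -gt 0 ]; do
        state=$("$GLOBUSRUN" -status "$contact" 2>&1)
        case "$state" in
            *"$wanted"*) return 0 ;;
        esac
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

After the UHE restart, wait_for_state with the job contact and ACTIVE covers the first FSD step, and a second call with DONE covers the last one.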

test case 6 - UHE inactivity shutdown 1:
----------------------------------
  - Master is up
  - Submit a simple /bin/date job
  - Job completes
  - UHEActivitySweeper shuts down UHE
  - Submit a simple /bin/date job
    + verifies UHE startup works after shutdown

test case 7 - UHE inactivity shutdown 2:
----------------------------------
  - Master is up
  - Submit a simple /bin/date job
  - Job completes
  - Do FSD calls to the UHE MJFS to keep the UHE up even though there are no more instances
  - Stop doing FSD calls, allowing the UHEActivitySweeper to shut down the UHE
  - Submit a simple /bin/date job
    + verifies UHE startup works after shutdown

Test Results

GT 3.2 release results for MMJFS/MJS fork
  • Test case 1: Success
  • Test case 2: Success
  • Test case 3: Success
  • Test case 4: Success
  • Test case 5: Success
  • Test case 6: Success
  • Test case 7: Success