WS GRAM: Developer's Guide
Fault Tolerance Testing Scenarios
Definitions
------------
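MHE = Master Hosting Environment (the "Master" in the steps below)
UHE = User Hosting Environment
Kill Master = ctrl-c in Master terminal
Kill UHE = 2 methods are possible:
    - call admin service shutdown
    - kill -9 UHE process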
job completes = client receives DONE notification
submit a job = managed-job-globusrun -factory xxx -file yyy
submit a batch job = managed-job-globusrun -batch -factory xxx -file yyy
FSD = FindServiceData call; managed-job-globusrun -status
job contact = GSH (Grid Service Handle) returned from managed-job-globusrun -batch
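
The commands in the definitions above can be wrapped in small shell helpers so the test cases below are easier to script. This is only a sketch: the function names and the GLOBUSRUN variable are hypothetical, xxx and yyy remain placeholders for a real factory and RSL file, and passing the job contact as the argument of -status is an assumption not spelled out in the definitions.

```shell
#!/bin/sh
# Command name from the definitions. Set GLOBUSRUN="echo managed-job-globusrun"
# to print the command lines instead of running them (dry run).
GLOBUSRUN="${GLOBUSRUN:-managed-job-globusrun}"

# "submit a job" (hypothetical helper name)
submit_job() {          # usage: submit_job <factory> <file>
    $GLOBUSRUN -factory "$1" -file "$2"
}

# "submit a batch job": per the definitions, the job contact (GSH) is returned
submit_batch_job() {    # usage: submit_batch_job <factory> <file>
    $GLOBUSRUN -batch -factory "$1" -file "$2"
}

# "FSD": assumption, the job contact is given as the argument of -status
job_status() {          # usage: job_status <job-contact>
    $GLOBUSRUN -status "$1"
}
```

With GLOBUSRUN set to the dry-run value, submit_job xxx yyy prints the same command line as the "submit a job" definition.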
test case 1 - Only MHE crash:
----------------------------------
- Master is up
- Submit a job (UHE starts up)
- Job completes
- Kill Master
- Restart Master (UHE still active)
(Note: Ping used to fail at this point even though the UHE was up. Ravi
investigated; the fix is committed to trunk. Ping should not fail if the
UHE is active.)
- Submit a job
- Job completes
test case 2 - Both MHE and UHE crash:
----------------------------------
- Master is up
- Submit a job (UHE starts up)
- Job completes
- Kill Master
- Kill UHE
- Restart Master (UHE is restarted)
- Submit a job
- Job completes
test case 3 - UHE crash; no jobs:
----------------------------------
- Master is up
- UHE is killed (no jobs at the time)
- Restart handler in MHE restarts UHE
- Submit a job (UHE starts up)
- Job completes
test case 4 - UHE crash, long job, subscribe:
----------------------------------
I configured the UHESweeperTask (which periodically sweeps all the UHEs and
pings them to check whether they are active) to sweep the UHEs every 5 seconds.
I killed the UHE by fishing its pid out of the output of ps -efwww | grep for
the UHE process.
- Master is up
- Submit a long running sleep job
  (client waits for DONE notification)
- UHE is killed (1 active job)
- Restart handler in MHE restarts UHE
- job completes on service side
- Client receives DONE
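
The "fish out the pid and kill -9" step used in this test case could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the pattern that identifies the UHE process in the ps output is not given in this document, so it has to be supplied by the tester.

```shell
#!/bin/sh
# Read "ps -ef" style lines on stdin and print the PID (2nd column) of every
# line matching the given pattern. Lines containing "grep" or "awk" are
# skipped so the filter does not report itself.
uhe_pids() {            # usage: ps -efwww | uhe_pids <pattern>
    awk -v pat="$1" '$0 ~ pat && $0 !~ /grep|awk/ { print $2 }'
}

# Kill every matching process with kill -9, as in the "Kill UHE" definition.
# The pattern is an assumption: choose one that matches only the UHE process.
kill_uhe() {            # usage: kill_uhe <pattern>
    for pid in $(ps -efwww | uhe_pids "$1"); do
        kill -9 "$pid"
    done
}
```

Running ps -efwww | uhe_pids <pattern> on its own first is a cheap way to check that only the intended process matches before calling kill_uhe.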
test case 5 - UHE crash, long job, FSD:
----------------------------------
- Master is up
- Submit a long running sleep job - batch
- UHE is killed (1 active job)
- Restart handler in MHE restarts UHE
- client does a FSD with job contact (GSH) and gets ACTIVE
- job completes on service side
- client does a FSD with job contact (GSH) and gets DONE
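
The two FSD checks in this test case can be automated by polling until the expected state shows up. The sketch below makes several assumptions: wait_for_state and STATUS_CMD are hypothetical names, the job contact is assumed to be passed as the argument of -status, and the status output is assumed to contain the state name (ACTIVE, DONE) somewhere in its text.

```shell
#!/bin/sh
# Status command from the definitions; override STATUS_CMD to stub it out.
STATUS_CMD="${STATUS_CMD:-managed-job-globusrun -status}"

# Poll the job status until its output contains the wanted state.
# Returns 0 once the state is seen, 1 if it never appears within <tries> polls.
wait_for_state() {      # usage: wait_for_state <job-contact> <state> [tries] [interval-seconds]
    contact=$1
    want=$2
    tries=${3:-30}
    interval=${4:-5}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Assumption: the state name appears verbatim in the status output.
        if $STATUS_CMD "$contact" 2>/dev/null | grep -q "$want"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

For this test case the calls would be wait_for_state <job contact> ACTIVE after the UHE restart, then wait_for_state <job contact> DONE once the sleep job should have finished.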
test case 6 - UHE inactivity shutdown 1:
----------------------------------
- Master is up
- Submit a simple /bin/date job
- Job completes
- UHEActivitySweeper shuts down UHE
- Submit a simple /bin/date job
+ verifies UHE startup works after shutdown
test case 7 - UHE inactivity shutdown 2:
----------------------------------
- Master is up
- Submit a simple /bin/date job
- Job completes
- do FSD to UHE MJFS to keep UHE up even though there are no more instances
- stop doing FSD allowing the UHEActivitySweeper to shut down UHE
- Submit a simple /bin/date job
+ verifies UHE startup works after shutdown
Test Results
------------
GT 3.2 release results for MMJFS/MJS fork:
- Test case 1: Success
- Test case 2: Success
- Test case 3: Success
- Test case 4: Success
- Test case 5: Success
- Test case 6: Success
- Test case 7: Success