
Re: problems with remote staging of mpi pgms



Dear Nick,

First of all, thanks for your reply.

We tried the example at www.globus.org/mpi (First
MPICH-G2 Application - ring.c) and followed your
instructions. When the machines file contains only one
machine (localhost), running 'ring' gives the following
output:

$ <MPICH-Install-dir>/bin/mpirun -np 4 ring
Master: end of trip 1 of 1: after receiving
passed_num=4 (should be =trip*numprocs=4) from
source=3

(The website says this output is correct.)

However, when we add a machine other than localhost
(i.e. another valid host on our grid) to the
<MPICH-Install-dir>/bin/machines file, the program
produces no output and the following log file is
generated:
>>>>>Job RSL (post-validation-eval)
3/11 14:15:31 JMI: Getting RSL output value
3/11 14:15:31 JMI: Processing output positions
3/11 14:15:31 JMI: Getting RSL output value
3/11 14:15:31 JMI: Processing output positions
3/11 14:15:31 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
3/11 14:15:31 JM: Opening output destinations
3/11 14:15:31 JM: stdout goes to
x-gass-cache://btts2.stp.cdac.ernet.in/21120.1110530731/dev/stdout
3/11 14:15:31 JM: stderr goes to
x-gass-cache://btts2.stp.cdac.ernet.in/21120.1110530731/dev/stderr
3/11 14:15:31 JM: Opening
https://btts2.stp.cdac.ernet.in:3440/dev/stdout
3/11 14:15:31 JM: Opened GASS handle 1.
3/11 14:15:31 JM: exiting
globus_l_gram_job_manager_output_destination_open()
3/11 14:15:31 JM: Opening
https://btts2.stp.cdac.ernet.in:3440/dev/stderr
3/11 14:15:31 JM: Opened GASS handle 2.
3/11 14:15:31 JM: exiting
globus_l_gram_job_manager_output_destination_open()
3/11 14:15:31 stdout or stderr is being used, starting
to poll
3/11 14:15:31 JM: Finished opening output destinations
3/11 14:15:32 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_OPEN_OUTPUT
3/11 14:15:32 JM: GSSAPI type is GSI.. relocating
proxy
3/11 14:15:32 JMI: testing job manager scripts for
type fork exist and permissions are ok.
3/11 14:15:32 JMI: completed script validation: job
manager type is fork.
3/11 14:15:32 JMI: in
globus_gram_job_manager_script_proxy_relocate()
3/11 14:15:32 JMI: cmd = proxy_relocate
Fri Mar 11 14:15:32 2005 JM_SCRIPT: New Perl
JobManager created.
Fri Mar 11 14:15:32 2005 JM_SCRIPT:
proxy_relocate(enter)
3/11 14:15:32 JMI: while return_buf =
GRAM_SCRIPT_X509_USER_PROXY =
/home/arunz87/.globus/.gass_cache/local/md5/07/55dfc6ee0957ad94adfa3d101487d6/md5/f2/413c4cad96e3f320ef74c720716d82/data
3/11 14:15:32 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_PROXY_RELOCATE
3/11 14:15:32 JM: Relocated Proxy to
/home/arunz87/.globus/.gass_cache/local/md5/07/55dfc6ee0957ad94adfa3d101487d6/md5/f2/413c4cad96e3f320ef74c720716d82/data
3/11 14:15:33 JM: before sending to client: rc=0
(Success)
3/11 14:15:33 Job Manager State Machine (exiting):
GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE
3/11 14:15:33 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE
3/11 14:15:33 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE_COMMITTED
3/11 14:15:33 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_STAGE_IN
3/11 14:15:33 JMI: testing job manager scripts for
type fork exist and permissions are ok.
3/11 14:15:33 JMI: completed script validation: job
manager type is fork.
3/11 14:15:33 JMI: in globus_gram_job_manager_submit()
3/11 14:15:33 JMI: local stdout filename =
/home/arunz87/.globus/.gass_cache/local/md5/07/55dfc6ee0957ad94adfa3d101487d6/md5/6d/ffcc344a201b811f5fac8c72230b35/data.
3/11 14:15:33 JMI: local stderr filename =
/home/arunz87/.globus/.gass_cache/local/md5/07/55dfc6ee0957ad94adfa3d101487d6/md5/64/e7e66b76ea6fcc4b60c4424c19f53d/data.
3/11 14:15:33 JMI: cmd = submit
3/11 14:15:33 JMI: returning with success
Fri Mar 11 14:15:33 2005 JM_SCRIPT: New Perl
JobManager created.
3/11 14:15:33 JMI: while return_buf =
GRAM_SCRIPT_JOB_ID = 21123
3/11 14:15:33 JMI: while return_buf =
GRAM_SCRIPT_JOB_STATE = 2
3/11 14:15:33 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT
3/11 14:15:33 JM: in
globus_gram_job_manager_reporting_file_create()
3/11 14:15:33 JM: not reporting job information
3/11 14:15:33 JM: in
globus_gram_job_manager_history_file_create()
3/11 14:15:33 JM: NOT empty client callback list.
3/11 14:15:33 JM: sending callback of status 2
(failure code 0) to
https://btts2.stp.cdac.ernet.in:3441/.
3/11 14:15:33 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
3/11 14:15:33 JMI: testing job manager scripts for
type fork exist and permissions are ok.
3/11 14:15:33 JMI: completed script validation: job
manager type is fork.
3/11 14:15:33 JMI: in globus_gram_job_manager_poll()
3/11 14:15:33 JMI: local stdout filename =
/home/arunz87/.globus/.gass_cache/local/md5/07/55dfc6ee0957ad94adfa3d101487d6/md5/6d/ffcc344a201b811f5fac8c72230b35/data.
3/11 14:15:33 JMI: local stderr filename =
/home/arunz87/.globus/.gass_cache/local/md5/07/55dfc6ee0957ad94adfa3d101487d6/md5/64/e7e66b76ea6fcc4b60c4424c19f53d/data.
3/11 14:15:33 JMI: cmd = poll
3/11 14:15:33 JMI: returning with success
Fri Mar 11 14:15:34 2005 JM_SCRIPT: New Perl
JobManager created.
Fri Mar 11 14:15:34 2005 JM_SCRIPT: polling job 21123
3/11 14:15:34 JMI: while return_buf =
GRAM_SCRIPT_JOB_STATE = 2
3/11 14:15:34 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
3/11 14:15:40 globus_gram_job_manager_query_callback()
not a literal URI match
3/11 14:15:40 JM : in
globus_l_gram_job_manager_query_callback, query=cancel
3/11 14:15:40 JM : reply: (status=2 failure code=0
(Success))
3/11 14:15:40 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL_QUERY1
3/11 14:15:40 JMI: testing job manager scripts for
type fork exist and permissions are ok.
3/11 14:15:40 JMI: completed script validation: job
manager type is fork.
3/11 14:15:40 JMI: in
globus_gram_job_manager_script_cancel()
3/11 14:15:40 JMI: cmd = cancel
3/11 14:15:40 JMI: returning with success
Fri Mar 11 14:15:40 2005 JM_SCRIPT: New Perl
JobManager created.
Fri Mar 11 14:15:40 2005 JM_SCRIPT: cancel job 21123
3/11 14:15:45 JMI: while return_buf =
GRAM_SCRIPT_JOB_STATE = 4
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL_QUERY2
3/11 14:15:45 JM : sending reply:
protocol-version: 2^M
status: 4^M
failure-code: 0^M
job-failure-code: 8^M
^@3/11 14:15:45 -------------------
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
3/11 14:15:45 JM: in
globus_gram_job_manager_history_file_create()
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
3/11 14:15:45 closing destination
https://btts2.stp.cdac.ernet.in:3440/dev/stdout
3/11 14:15:45 JM: exiting
globus_l_gram_job_manager_output_destination_close()
3/11 14:15:45 closing destination
https://btts2.stp.cdac.ernet.in:3440/dev/stderr
3/11 14:15:45 JM: exiting
globus_l_gram_job_manager_output_destination_close()
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT
3/11 14:15:45 JM: NOT empty client callback list.
3/11 14:15:45 JM: sending callback of status 4
(failure code 8) to
https://btts2.stp.cdac.ernet.in:3441/.
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
3/11 14:15:45 JMI: testing job manager scripts for
type fork exist and permissions are ok.
3/11 14:15:45 JMI: completed script validation: job
manager type is fork.
3/11 14:15:45 JMI: cmd = cache_cleanup
Fri Mar 11 14:15:47 2005 JM_SCRIPT: New Perl
JobManager created.
Fri Mar 11 14:15:47 2005 JM_SCRIPT:
cache_cleanup(enter)
Fri Mar 11 14:15:47 2005 JM_SCRIPT:
cache_cleanup(exit)
3/11 14:15:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CACHE_CLEAN_UP
3/11 14:15:47 JM: in
globus_gram_job_manager_reporting_file_remove()
3/11 14:15:47 JM: exiting globus_gram_job_manager.
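
The trace above is long, so a short script can help pick out the parts that matter: which states the job manager entered, which status callbacks it sent, and whether a cancel query arrived. The following is a minimal Python sketch, not part of Globus. The function name summarize and the JOB_STATUS table are illustrative assumptions; the numeric status names are the GRAM job-state values as best I recall them and should be checked against the GRAM protocol headers. Failure code 8 appears to correspond to a user-requested cancel, which would match the query=cancel entry in the log.

```python
import re

# Assumed mapping of GRAM job status numbers to names (verify against
# the GRAM protocol constants for your Globus version).
JOB_STATUS = {1: "PENDING", 2: "ACTIVE", 4: "FAILED", 8: "DONE", 64: "STAGE_IN"}

STATE_RE = re.compile(r"\(entering\):\s*GLOBUS_GRAM_JOB_MANAGER_STATE_(\w+)")
CALLBACK_RE = re.compile(r"sending callback of status (\d+)\s*\(failure code (\d+)\)")
QUERY_RE = re.compile(r"query=(\w+)")


def summarize(log_text):
    """Extract entered states, status callbacks, and client queries."""
    # Undo the mail client's hard wrapping so that entries which were
    # split across lines can be matched by a single pattern.
    flat = " ".join(log_text.split())
    return {
        "states": STATE_RE.findall(flat),
        "callbacks": [(int(s), int(f)) for s, f in CALLBACK_RE.findall(flat)],
        "queries": QUERY_RE.findall(flat),
    }


# A few entries copied from the log above, wrapped exactly as mailed.
SAMPLE = """\
3/11 14:15:33 JM: sending callback of status 2
(failure code 0) to
https://btts2.stp.cdac.ernet.in:3441/.
3/11 14:15:40 JM : in
globus_l_gram_job_manager_query_callback, query=cancel
3/11 14:15:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
3/11 14:15:45 JM: sending callback of status 4
(failure code 8) to
https://btts2.stp.cdac.ernet.in:3441/.
"""

if __name__ == "__main__":
    summary = summarize(SAMPLE)
    for status, failure in summary["callbacks"]:
        name = JOB_STATUS.get(status, "UNKNOWN")
        print(f"callback: status={status} ({name}) failure_code={failure}")
    print("queries:", summary["queries"])
    print("states entered:", summary["states"])
```

Run against the full log, this should show the job going active (status 2), a cancel query arriving from the client side, and only then the FAILED states, which suggests the failure was triggered by the cancel and not by the job manager itself.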

I appreciate your help.
Guna





 --- Karonis Nicholas <karonis@niu.edu> wrote: 
> Dear Guna,
> 
> The problem is in the way you are trying to launch the job.
> First, you should be using MPICH-G2's mpirun command like this:
> <MPICH-Install-dir>/bin/mpirun -globusrsl <your rsl filename>
> While it's true that MPICH-G2's mpirun command is a rather
> thin wrapper around globusrun you are not calling globusrun
> in a proper manner for MPICH-G2.
> Second, and perhaps the more important point, the RSL
> that you have provided is not valid for MPICH-G2. All
> RSL's that are used to launch applications compiled/linked
> with MPICH-G2 must be "multirequest". This means it must
> start with a "+" sign which, in turn, triggers a new set
> of Globus functionality (called DUROC) that is required
> by MPICH-G2.
> If you haven't already done so, I encourage you to take
> a look at the MPICH-G2 web page www.globus.org/mpi paying
> particular attention to the section on how to run your first
> MPI program. There you will find instructions with examples.
> 
> Nick
> 
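
For reference, a "multirequest" RSL of the kind described above starts with "+" and contains one "&(...)" clause per subjob. The Python sketch below generates such a file for two hosts. It is only an illustration under stated assumptions: the attribute names (resourceManagerContact, count, jobtype, label, environment with GLOBUS_DUROC_SUBJOB_INDEX, directory, executable) follow the layout I recall from the MPICH-G2 documentation and should be checked against the examples at www.globus.org/mpi. The host names are taken from the logs in this thread, and the directory and executable paths are hypothetical placeholders.

```python
def subjob(index, contact, count, directory, executable):
    """Render one '&(...)' subjob clause of a multirequest RSL."""
    return (
        f'( &(resourceManagerContact="{contact}")\n'
        f'   (count={count})\n'
        f'   (jobtype=mpi)\n'
        f'   (label="subjob {index}")\n'
        f'   (environment=(GLOBUS_DUROC_SUBJOB_INDEX {index}))\n'
        f'   (directory="{directory}")\n'
        f'   (executable="{executable}")\n'
        f')'
    )


def multirequest(subjobs):
    """Join subjob clauses under the leading '+' that marks a multirequest."""
    clauses = [subjob(index=i, **sj) for i, sj in enumerate(subjobs)]
    return "+\n" + "\n".join(clauses) + "\n"


# Hosts from this thread; paths are hypothetical placeholders.
SUBJOBS = [
    {"contact": "btts2.stp.cdac.ernet.in", "count": 2,
     "directory": "/home/user/mpi", "executable": "/home/user/mpi/ring"},
    {"contact": "btts4.stp.cdac.ernet.in", "count": 2,
     "directory": "/home/user/mpi", "executable": "/home/user/mpi/ring"},
]

if __name__ == "__main__":
    print(multirequest(SUBJOBS), end="")
```

The printed text can be saved to a file and passed to mpirun with the -globusrsl option mentioned above.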
> On Mar 10, 2005, at 12:04 AM, guna kosh wrote:
> 
> > We are facing a small problem regarding running an MPI
> > program in a remote machine and displaying its results
> > in the job submitting node.
> > All other programs (script files and normal C
> > programs) are working fine with remote staging of
> > executables.
> > The procedure followed by us for running mpi pgm is:
> > 1. writing the code with mpi statements.(say ex.c)
> > 2. compiling using <MPICH-Install-dir>/bin/mpicc ex.c -o ex
> > 3. globusrun -r a1 '&(executable=<valid path of ex>)
> > (stdout="<FQDN>/dev/stdout")
> > The following log file is created in machine a1.
> > 3/10 11:11:49 JMI: Getting RSL output value
> > 3/10 11:11:49 JMI: Processing output positions
> > 3/10 11:11:49 JMI: Getting RSL output value
> > 3/10 11:11:49 JMI: Processing output positions
> > 3/10 11:11:49 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
> > 3/10 11:11:49 JM: Opening output destinations
> > 3/10 11:11:49 JM: stdout goes to
> > x-gass-cache://btts4.stp.cdac.ernet.in/2115.1110433309/dev/stdout
> > 3/10 11:11:49 JM: stderr goes to
> > x-gass-cache://btts4.stp.cdac.ernet.in/2115.1110433309/dev/stderr
> > 3/10 11:11:49 JM: Opening
> > https://btts2.stp.cdac.ernet.in:2001/dev/stdout
> > 3/10 11:11:49 JM: Opened GASS handle 1.
> > 3/10 11:11:49 JM: exiting
> > globus_l_gram_job_manager_output_destination_open()
> > 3/10 11:11:49 JM: Opening
> > https://btts2.stp.cdac.ernet.in:2001/dev/stderr
> > 3/10 11:11:49 JM: Opened GASS handle 2.
> > 3/10 11:11:49 JM: exiting
> > globus_l_gram_job_manager_output_destination_open()
> > 3/10 11:11:49 stdout or stderr is being used, starting
> > to poll
> > 3/10 11:11:49 JM: Finished opening output destinations
> > 3/10 11:11:49 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_OPEN_OUTPUT
> > 3/10 11:11:49 JM: GSSAPI type is GSI.. relocating
> > proxy
> > 3/10 11:11:49 JMI: testing job manager scripts for
> > type fork exist and permissions are ok.
> > 3/10 11:11:49 JMI: completed script validation: job
> > manager type is fork.
> > 3/10 11:11:49 JMI: in
> > globus_gram_job_manager_script_proxy_relocate()
> > 3/10 11:11:49 JMI: cmd = proxy_relocate
> > Thu Mar 10 11:11:50 2005 JM_SCRIPT: New Perl
> > JobManager created.
> > Thu Mar 10 11:11:50 2005 JM_SCRIPT:
> > proxy_relocate(enter)
> > 3/10 11:11:50 JMI: while return_buf =
> > GRAM_SCRIPT_X509_USER_PROXY =
> > /home/budania/.globus/.gass_cache/local/md5/fc/902bcb6cbe1d0c66880fd95d800ed7/md5/65/e2c681db1feaf32e44b3c4a7ad117c/data
> > 3/10 11:11:50 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_PROXY_RELOCATE
> > 3/10 11:11:50 JM: Relocated Proxy to
> > /home/budania/.globus/.gass_cache/local/md5/fc/902bcb6cbe1d0c66880fd95d800ed7/md5/65/e2c681db1feaf32e44b3c4a7ad117c/data
> > 3/10 11:11:50 JM: before sending to client: rc=0
> > (Success)
> > 3/10 11:11:50 Job Manager State Machine (exiting):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE
> > 3/10 11:11:50 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE
> > 3/10 11:11:50 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE_COMMITTED
> > 3/10 11:11:50 JM: NOT empty client callback list.
> > 3/10 11:11:50 JM: sending callback of status 64
> > (failure code 0) to
> > https://btts2.stp.cdac.ernet.in:1743/.
> > 3/10 11:11:50 JMI: testing job manager scripts for
> > type fork exist and permissions are ok.
> > 3/10 11:11:50 JMI: completed script validation: job
> > manager type is fork.
> > 3/10 11:11:50 JMI: in
> > globus_gram_job_manager_script_stage_in()
> > 3/10 11:11:50 JMI: cmd = stage_in
> > 3/10 11:11:51 JMI: returning with success
> > 3/10 11:11:51 globus_gram_job_manager_query_callback()
> > not a literal URI match
> > 3/10 11:11:51 JM : in
> > globus_l_gram_job_manager_query_callback, query=cancel
> > 3/10 11:11:51 JM : reply: (status=64 failure code=0
> > (Success))
> > 3/10 11:11:51 JM : sending reply:
> > protocol-version: 2^M
> > status: 64^M
> > failure-code: 0^M
> > job-failure-code: 0^M
> > ^@3/10 11:11:51 -------------------
> > Thu Mar 10 11:11:52 2005 JM_SCRIPT: New Perl
> > JobManager created.
> > Thu Mar 10 11:11:52 2005 JM_SCRIPT: stage_in(enter)
> > Thu Mar 10 11:11:53 2005 JM_SCRIPT: stage_in(exit)
> > 3/10 11:11:53 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
> > 3/10 11:11:53 closing destination
> > https://btts2.stp.cdac.ernet.in:2001/dev/stdout
> > 3/10 11:11:53 JM: exiting
> > globus_l_gram_job_manager_output_destination_close()
> > 3/10 11:11:53 closing destination
> > https://btts2.stp.cdac.ernet.in:2001/dev/stderr
> > 3/10 11:11:53 JM: exiting
> > globus_l_gram_job_manager_output_destination_close()
> > 3/10 11:11:53 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT
> > 3/10 11:11:53 JM: in
> > globus_gram_job_manager_history_file_create()
> > 3/10 11:11:53 JM: NOT empty client callback list.
> > 3/10 11:11:53 JM: sending callback of status 4
> > (failure code 8) to
> > https://btts2.stp.cdac.ernet.in:1743/.
> > 3/10 11:11:53 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE
> > 3/10 11:11:53 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
> > 3/10 11:11:53 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
> > 3/10 11:11:53 Job Manager State Machine (entering):
> > GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
> 
=== message truncated === 
