It is possible to run AMUSE on multiple machines simultaneously. The AMUSE script itself always runs on the user’s local machine, while workers for codes can be “sent out” to remote machines such as workstations, clusters, etc.
Deploying workers on remote machines requires a full installation of AMUSE on each machine. For each code, “sockets” support needs to be present. This is built by default and should be available for all codes. Note that older versions of Fortran lack the features necessary to support sockets workers.
On each machine, the distributed code also needs to be built. Distributed AMUSE requires a Java Development Kit (JDK), preferably Oracle Java version 7 or 8. The configure script tries to locate the JDK, but you may need to specify it by hand. For details, see:
> ./configure --help
To build Distributed AMUSE, run the following in the AMUSE root directory:
> make distributed.code
To check if the installation is set up properly, run all the tests related to the worker interface:
> cd $AMUSE_DIR
> nosetests -v test/codes_tests/test*implementation.py
Note that Distributed AMUSE is mostly tested with the version of MPI included in the AMUSE “prerequisites”. If you get MPI errors while running remote (parallel) workers, try using the install.py script included in AMUSE to install the prerequisites.
Usage of Distributed AMUSE is (by design) very close to that of any other code in AMUSE. The main difference is that it contains resources, pilots, and jobs instead of particles.
In general, a user will first define resources, then deploy pilots on these resources, and finally create codes that make use of the machines offered by the pilots.
Distributed AMUSE can be initialized like any other code:
>>> from amuse.community.distributed.interface import DistributedAmuseInterface, DistributedAmuse
>>> from amuse.community.distributed.interface import Resource, Resources, Pilot, Pilots
>>>
>>> # initialize code, print output of code to console
>>> instance = DistributedAmuse(redirection='none')
Distributed AMUSE supports a few parameters to adjust settings. All parameters need to be set before any resource, pilot, or job is created in order to take effect.
Overview of settings:
One relevant setting is the worker startup timeout. The distributed code starts AMUSE workers that run the actual codes, which can take a while on some machines; if needed, this timeout can be increased.
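For instance, the startup timeout could be raised before committing the parameters. The parameter name “worker_startup_timeout” used below is an assumption; check instance.parameters for the exact name in your installation.

>>> from amuse.units import units
>>>
>>> # assumed parameter name: allow remote workers more time to start
>>> instance.parameters.worker_startup_timeout = 300 | units.s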
>>> instance.parameters.debug = True
>>> instance.parameters.webinterface_port = 5555
>>> instance.commit_parameters()
>>>
>>> print instance.parameters.webinterface_port
Distributed AMUSE has a small built-in web interface for monitoring. A utility function is available to get the URL:
>>> import webbrowser
>>>
>>> webbrowser.open(instance.get_webinterface_url())
In order to use a remote machine, AMUSE needs some information about this resource, such as the host name, the type of machine, the username to gain access, etc. This can be specified by creating a “Resource” in Distributed AMUSE. As a side effect, a communication hub is also started on the (frontend of the) resource.
>>> resource = Resource()
>>> resource.name = "some.resource"
>>> resource.location = "email@example.com"
>>> resource.scheduler = "ssh"
>>> resource.amuse_dir = "/home/user/amuse"
>>>
>>> instance.resources.add_resource(resource)
Overview of all options:
The next step in running jobs remotely is to start a so-called pilot job on the resource specified previously. Creating a pilot submits a job to the resource, sets up the necessary communication channels with the main AMUSE application, and then waits for jobs (currently mostly workers) to be started.
Note that pilots may take a while to start. A function is available to wait until all created pilots have started.
>>> pilot = Pilot()
>>> pilot.resource_name = 'local'
>>> pilot.node_count = 1
>>> pilot.time = 2 | units.hour
>>> pilot.slots_per_node = 22
>>> pilot.label = 'local'
>>>
>>> instance.pilots.add_pilot(pilot)
>>>
>>> print "Pilots:"
>>> print instance.pilots
>>>
>>> print "Waiting for pilots"
>>> instance.wait_for_pilots()
Overview of all options:
When running remote workers, they can be started as normal. However, AMUSE needs to be signalled to use the distributed code to start them, instead of the normal process. A function is available to enable and disable this.
>>> print "starting all workers using the distributed code" >>> instance.use_for_all_workers()
>>> print "not using distributed workers any longer" >>> instance.use_for_all_workers(enable=False)
Alternatively, you can explicitly enable the distributed code per worker:
>>> print "using this distributed instance for all distributed workers" >>> instance.use_for_all_distributed_workers(enable=True) >>> worker = Hermite(channel_type='distributed')
Or even pass the instance of the distributed code you would like to use, in the rare case that you have multiple distributed codes:
>>> worker = Hermite(channel_type='distributed', distributed_instance=instance)
This section lists all the relevant worker options for Distributed AMUSE. Most are new; some are also supported by the other channel implementations. You are normally not required to use any of these options.
By default, workers are started on any available pilot with enough free slots. However, sometimes you would like more control over which worker is started where, for instance when special hardware is present on some machines.
The concept of labels can be used within Distributed AMUSE to get this functionality. If a label is attached to a worker (one of the parameters when starting a worker, see above), only pilots with exactly the same label (specified when the pilot is started) are considered candidates for running the worker. The naming of labels is completely up to the user.
For instance, say a simulation uses a number of CPU workers and a single GPU worker. The following code puts all the CPU workers on one machine and the single GPU worker on another.
>>> cpu_pilot = Pilot()
>>> cpu_pilot.resource_name = 'machine1'
>>> cpu_pilot.node_count = 1
>>> cpu_pilot.time = 2 | units.hour
>>> cpu_pilot.slots_per_node = 30
>>> cpu_pilot.label = 'CPU'
>>> instance.pilots.add_pilot(cpu_pilot)
>>>
>>> gpu_pilot = Pilot()
>>> gpu_pilot.resource_name = 'machine2'
>>> gpu_pilot.node_count = 1
>>> gpu_pilot.time = 2 | units.hour
>>> gpu_pilot.slots_per_node = 1
>>> gpu_pilot.label = 'GPU'
>>> instance.pilots.add_pilot(gpu_pilot)
>>>
>>> ...
>>>
>>> worker1 = Hermite(label='CPU')
>>> worker2 = Bonsai(label='GPU')
>>>
>>> # will not start due to a lack of slots.
>>> worker3 = Bonsai(label='GPU')
AMUSE contains a number of examples for the distributed code. See examples/applications/
Gateways can be used in case of connectivity problems between machines, such as firewalls and private IP addresses. This is for instance the case at the LGM. A gateway is started like any other resource (and thus requires a valid installation of AMUSE on each gateway). This resource can then be specified as a “gateway” to another resource. In that case all ssh connections will be made via the gateway, so make sure you can log in from the gateway to the target machine without using a password, as well as from your local machine to the gateway.
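As an illustrative sketch of such a setup: the attribute name “gateway” and the host names below are assumptions, so check the resource options overview for the exact attribute name on your installation.

>>> # the gateway itself is defined as a regular resource
>>> gateway = Resource()
>>> gateway.name = "gateway.machine"
>>> gateway.location = "user@gateway.example.com"
>>> gateway.amuse_dir = "/home/user/amuse"
>>> instance.resources.add_resource(gateway)
>>>
>>> # a resource that is only reachable through the gateway
>>> # (the "gateway" attribute name is an assumption)
>>> cluster = Resource()
>>> cluster.name = "hidden.cluster"
>>> cluster.location = "user@cluster.internal"
>>> cluster.amuse_dir = "/home/user/amuse"
>>> cluster.gateway = "gateway.machine"
>>> instance.resources.add_resource(cluster)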
Most initial setup problems with the Distributed AMUSE code can be solved by checking:
Can you log in to each machine you plan to use via ssh without a password? See for instance this guide on how to set it up: http://www.thegeekstuff.com/2008/11/3-steps-to-perform-ssh-login-without-password-using-ssh-keygen-ssh-copy-id/
Did you configure a Java JDK version 1.7 or higher using ./configure? Check the contents of config.mk to see which Java is used and what version was detected. Make sure to do a “make clean” and “make” in case you make any changes. This should also be done on all machines.
Is AMUSE configured properly on each and every machine? Running the code implementation tests is a good way of spotting issues:
> nosetests -v test/codes_tests/test_*_implementation.py
Are the settings provided for each resource correct (username, AMUSE location, etc.)?
Have you set the correct mpiexec in ./configure? This setting is normally not used by AMUSE, so you may only notice now that it is misconfigured.
In case this does not help, it is probably best to check the output for any errors. Normally, worker output is discarded by most scripts. Use redirection='none' to see the output of the workers; a lot of errors show up only in this output. There is also a “debug” parameter in Distributed AMUSE. If enabled, the output for each pilot will be written to a “distributed-amuse-logs” folder in the home directory of each remote machine used, and additional information is printed to the log of the local AMUSE script.
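As a minimal sketch combining both options (Hermite is used purely as an example worker; as noted above, parameters must be set before any resources or pilots are created):

>>> from amuse.lab import Hermite
>>>
>>> # write pilot logs to "distributed-amuse-logs" on each remote machine
>>> instance.parameters.debug = True
>>>
>>> # show the worker output on the local console instead of discarding it
>>> worker = Hermite(redirection='none')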