Creating an AWS ParallelCluster and Running EFDC+ on It
The aim of this page is to describe how to configure an AWS cluster and run EFDC+ on it.
Setting up a cluster on AWS
Steps to set up the AWS ParallelCluster command line interface. The goal here is to install the tools needed to create a cluster from the command line and to access it from the command line.
Prerequisites
AWS Account
AWS vCPU limit must be high enough to handle your configuration. You can calculate your requirements and see your limits here: https://console.aws.amazon.com/ec2/home?#LimitsCalculator:
Python 3.6+ Installed
Pip Installed
These instructions assume Ubuntu 18.04 with Python 3.6+ and pip already installed (a quick version check is shown below).
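To confirm the Python and pip prerequisites, you can run a quick version check (the exact versions reported will differ on your system):
$ python3 --version
$ pip3 --version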
Follow the guide to install the AWS command line tools:
https://docs.aws.amazon.com/cli/latest/userguide/install-virtualenv.html
I installed in a virtual environment, which proved to be a good move for managing dependencies.
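As a rough sketch of the virtual environment approach (the linked guide is the authority; the environment name and paths below are just examples):
$ sudo apt install python3-venv        # on Ubuntu 18.04, provides the venv module
$ python3 -m venv ~/aws-env            # create a virtual environment (name is arbitrary)
$ source ~/aws-env/bin/activate        # activate it
$ pip install awscli                   # install the AWS CLI inside the environment
$ aws --version                        # verify the install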
Additionally, install the aws-parallelcluster command line tools. Instructions found at:
https://docs.aws.amazon.com/parallelcluster/latest/ug/install.html
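With the virtual environment still active, the ParallelCluster CLI itself can be installed with pip (again, the linked instructions are the reference):
$ pip install aws-parallelcluster      # installs the pcluster command
$ pcluster version                     # verify the install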
If you want a reference for the parallel cluster interface:
https://docs.aws.amazon.com/parallelcluster/latest/ug/aws-parallelcluster-ug.pdf
Prior to running anything, it is helpful to have the AWS console open and logged in to your account. The console can be accessed at:
https://console.aws.amazon.com/
You will need some information to connect your command line interface to your AWS account. To find this info, click on your username in the top right-hand corner of the page and select “My Security Credentials”.
Then click “Access keys (access key ID and secret access key)” in the center of the page and a drop-down will appear. Next, click the blue “Create New Access Key” button. You will need this info to configure AWS ParallelCluster, which we will look at next.
The first step is to connect your AWS account information. To do so, run:
$ aws configure
Enter the information you just generated for the new access key:
AWS access key ID
Secret access key (a single AWS user can have at most two access keys at a time)
Default region: use us-east-1
Default output format: none
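A typical session looks roughly like this (the key values shown are placeholders, not real credentials):
$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: us-east-1
Default output format [None]: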
Next, set up the ParallelCluster configuration (this is the config file that determines how the cluster will look):
$ pcluster configure
The prompts ask for a number of options, but they can all be changed later anyway; what matters is that running this command creates the ParallelCluster config file. I have attached a sample config that should be used. I believe the sample is set up with 3 nodes; we might want to set this system up with 4.
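For reference, here is a minimal sketch of what the config file (written to ~/.parallelcluster/config by the 2.x CLI) might look like for this setup. It assumes 3 compute nodes, Ubuntu 18.04, a Torque scheduler (to match the qnodes command used later), and an FSx for Lustre file system mounted at /fsx; the key pair name, instance types, VPC and subnet IDs are placeholders that must come from your own account:

[aws]
aws_region_name = us-east-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
key_name = my-key-pair
base_os = ubuntu1804
scheduler = torque
master_instance_type = c5.xlarge
compute_instance_type = c5.18xlarge
initial_queue_size = 3
max_queue_size = 3
maintain_initial_size = true
vpc_settings = default
fsx_settings = fsxshared

[vpc default]
vpc_id = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx

[fsx fsxshared]
shared_dir = /fsx
storage_capacity = 1200

Once the config is in place, the cluster is created and logged into with (the cluster name here is just an example):
$ pcluster create efdc-cluster
$ pcluster ssh efdc-cluster -i ~/my-key-pair.pem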
Setting Up Intel MPI
https://docs.aws.amazon.com/parallelcluster/latest/ug/intelmpi.html
Go to the Cluster_Distribution folder and extract the Intel MPI archive:
tar -xvf l_mpi_2019.6.166.tgz
cd l_mpi_2019.6.166
Run the install script as:
./install.sh
Press Enter, read the agreement, and type ‘accept’.
Install using the default option (Single Node).
The default install location will be:
/home/ubuntu/intel
Let's change that to /fsx/intel so the installation lives on the shared file system and is visible to every node.
At the installation options prompt, enter 2 to choose “Customize installation”.
At the customization menu, enter 2 to change the install directory, then type:
/fsx/intel
You should see the install location updated in the prompt above. Now press Enter to continue.
The installation should then complete, and you will see the installed files under:
/fsx/intel
Now, to ensure the MPI-related executables are available in your PATH, you need to ‘source’ the relevant Intel setup scripts. An example of this is given in the sample_bashrc file under Cluster_Distribution:
source /fsx/intel/bin/compilervars.sh -arch intel64
source /fsx/intel/impi/2019.6.166/intel64/bin/mpivars.sh
To ensure these commands are executed every time you log into the cluster add the two lines to your .bashrc file.
If you make these changes to your .bashrc file, go ahead and source it:
source ~/.bashrc
Now, let's see if that put the new executables in your path. Enter:
which mpiexec
You should see:
/fsx/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin/mpiexec
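As a further sanity check, you can launch a trivial MPI job on the master node (no hosts specified, so it runs locally); you should see the master node's host name printed twice:
mpiexec -n 2 hostname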
Setting the Path for the OpenMP Runtime libraries
OpenMP requires shared runtime libraries that are accessed by each thread. These .so libraries are not shipped with the MPI distribution. They are available in the Cluster_Distribution folder under intel64_lin
To make these .so libraries available at run time you need to modify the LD_LIBRARY_PATH. This is best accomplished by appending to the .bashrc file. Simply add:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/fsx/Cluster_Distribution/intel64_lin
Remember to source the .bashrc
Verify that the LD_LIBRARY_PATH was modified:
echo $LD_LIBRARY_PATH
You should see /fsx/Cluster_Distribution/intel64_lin at the end of the path.
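Once the path is set, you can also check that the EFDC+ executable resolves its OpenMP runtime from that directory (the path to efdc+.exe below is a placeholder; for Intel-built executables the runtime library is libiomp5.so):
ldd /fsx/path/to/efdc+.exe | grep -i omp
If LD_LIBRARY_PATH is set correctly, the library should resolve to /fsx/Cluster_Distribution/intel64_lin rather than showing "not found".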
Running EFDC+ from the Command Line on Cluster Systems
Some notes on executing a run from the command line
Generally:
mpiexec -n (# processes) -ppn (# processes per node) -hosts host1,host2 -genv I_MPI_DEBUG=5 -genv I_MPI_PIN_DOMAIN=omp -genv OMP_NUM_THREADS=(# threads) efdc+.exe -NT(# threads)
For example, to run a domain decomposed into 32 subdomains across 2 nodes with 2 threads per process, I could execute the following command:
mpiexec -n 32 -ppn 16 -hosts node1,node2 -genv I_MPI_DEBUG=5 -genv I_MPI_PIN_DOMAIN=omp -genv OMP_NUM_THREADS=2 efdc+.exe -NT2
This would run 16 processes on node1 and 16 processes on node2 with 2 threads per process. So each node would be utilizing 32 cores during a run.
I think it's easiest to think of a process as the “master” thread; any additional threads work with that master thread to provide additional calculation capability.
NOTE - to get the node names on AWS, enter the command:
$ qnodes
It should list each node's name along with other information about that node.
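To avoid retyping the long mpiexec line, the run can be wrapped in a small shell script. This is only a sketch; the node names, process counts, and location of efdc+.exe must be adjusted for your cluster:
#!/bin/bash
# run_efdc.sh - example wrapper for launching EFDC+ with Intel MPI
NPROCS=32                # total MPI processes (number of subdomains)
PPN=16                   # processes per node
NTHREADS=2               # OpenMP threads per process
HOSTS=node1,node2        # node names as reported by qnodes
mpiexec -n $NPROCS -ppn $PPN -hosts $HOSTS \
        -genv I_MPI_DEBUG=5 \
        -genv I_MPI_PIN_DOMAIN=omp \
        -genv OMP_NUM_THREADS=$NTHREADS \
        ./efdc+.exe -NT$NTHREADS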