Peaceful Co-existence: A High Performance Research Computing Cluster Serving Both Traditional Batch Jobs and Jobs Requiring Kubernetes Containers
About a year ago, our campus stood up a high performance computing (HPC) cluster by networking together the research data centers of several departments. This consolidated compute nodes and storage into a single, much larger cluster that any lab could use. We followed a co-operative model in which a department could donate hardware to the cluster, and its jobs would be “niced” at a level proportionate to the value of its contribution. Researchers could also access the cluster for free; their jobs run at the lowest priority, but they still have access to a significant level of computing power and storage.
Nearly all of these jobs are batch computing jobs: a user submits a job, the job is queued at a set priority level, and when it is ready to run it is parallelized as much as possible across the available nodes. After some time, the job completes and produces its results.
Over time, however, new research computing use cases started to emerge. The current environment is not approved for computing with PHI (or P3/P4) data, yet a number of use cases require batch computing with PHI data. The data de-identification jobs themselves could also run much more efficiently if they had access to the HPC cluster. Users also want to host applications that require continuous use of computing and storage, with the ability to dial their usage up or down based on demand. For use cases such as Machine Learning, Artificial Intelligence, and Deep Learning, the HPC cluster needs to provide access to GPU nodes in the same way as it does to CPU nodes. This in turn leads to interactive Data Science use cases, where users need an interactive session with the HPC cluster.
For the vast majority of these new cases, a Container-based architecture like Kubernetes is a much better fit than submitting batch jobs. But how do you take an HPC cluster that was built from the ground up for batch computing and make it support Containers? Do you spin up an entirely new cluster? When you add security controls like encryption and audit logging to support PHI hosting requirements, how do you prevent them from imposing a significant performance hit on non-PHI jobs? And lastly, how would the co-op model work in a new environment that includes GPUs, Containers, and shared security responsibilities for working with PHI?
These are the questions our group worked through in designing an approach that lets our HPC environment move from a batch-only model to one that also accommodates a container-based model. This presentation will walk through how our team analyzed the requirements and solution options, and how we have outlined our approach to supporting these new research use cases as well as research computing with PHI. It will be more of an interactive discussion, so that audience members can learn from each other’s experiences and hopefully arrive at some common themes on how best to support these research use cases within our computing environments.
Basic awareness of HPC, batch computing, and container-based computing will be helpful.