Chris Nardi from HPC Support discusses the challenges of high memory usage in an HPC environment.
Everything tends to be more difficult when running applications in a High-Performance Computing (HPC) environment. A task only needs an HPC environment when it requires far more computational resources than a personal laptop or workstation can provide. HPC applications face countless challenges, but this post focuses on memory usage, which has been a roadblock for two projects that we’ve worked on in the past year. We tend to classify applications as limited (or “bound”) by one of the three main components of a computer: CPU, memory, or I/O.
An application is CPU-bound if it would be sped up by adding cores to the server, memory-bound if its processing is limited by the amount of data that can be held in RAM and swap, and I/O-bound if the major hurdle is reading and writing data, whether from persistent storage or RAM. An application can be a mix of these types: a process that reads a lot of data and needs all of it at the same time could be both I/O-bound and memory-bound. Broadly classifying applications in this way helps identify which components of the system (or even the code) need to be improved to optimize performance.
While an application that is CPU-bound or I/O-bound can typically still run on a less powerful machine (though possibly slowly), a memory-bound application needs a certain amount of memory available just to run at all. This is further complicated by the price of RAM, which can be one of the most expensive components of a computer or server. Applications that need an HPC environment are often memory-bound, as they typically analyze or create a large amount of data that cannot (easily) be processed in independent pieces. Many of these applications are also CPU-bound, but long-running computations are acceptable to researchers. What’s not acceptable is a computation that cannot finish at all due to a lack of memory.
We’ve encountered memory-bound applications in two projects that I’ve worked on in the past year. The first involved a dataset of several terabytes on which we were attempting to run a Latent Dirichlet Allocation (LDA) model, so that topics could be discovered from hundreds of thousands of text posts without manual tagging. The most common implementation of an LDA model requires all of the data to be loaded into memory at the same time, something that wasn’t feasible locally, as the most RAM we currently have on an individual server is 768 GB.
We looked into using AWS to find a cloud node with over 1 TB of RAM, but costs add up quickly: x1e.16xlarge and x1e.32xlarge instances with 1-3 TB of RAM are $14-$25 an hour. We also found some implementations of LDA that incorporate batching, meaning the algorithm is run on subsets of the data and the intermediate results are combined to generate a final output, but using them would require additional coding and implementation work.
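To make the batching idea concrete, here is a minimal sketch of the general "process subsets, then combine the intermediate results" pattern. It is not LDA and not the implementation we evaluated: it only accumulates term counts, and the function names (`batches`, `count_terms`, `batched_term_counts`) are hypothetical. The point is that only one batch needs to be in memory at a time.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def batches(items: Iterable, size: int) -> Iterator[List]:
    """Yield lists of at most `size` items without materializing the input."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def count_terms(posts: List[str]) -> Counter:
    """Intermediate result for one batch: lowercase term frequencies."""
    counts: Counter = Counter()
    for post in posts:
        counts.update(post.lower().split())
    return counts


def batched_term_counts(posts: Iterable[str], batch_size: int) -> Counter:
    """Combine per-batch counts; only one batch is held in memory at a time."""
    total: Counter = Counter()
    for batch in batches(posts, batch_size):
        total.update(count_terms(batch))
    return total


if __name__ == "__main__":
    # Illustrative data only; a real corpus would be streamed from disk.
    sample = ["memory is limited", "Memory is expensive", "batching helps"]
    print(batched_term_counts(iter(sample), batch_size=2))
```

A batched LDA implementation follows the same shape, but the per-batch result is an update to the topic model rather than a simple count, which is where the additional implementation work comes in.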
The second application performed calculations on only a moderately sized dataset in R, but it required an exponential expansion of probabilities to reach a final output. As a result, it too would require more than 768 GB of RAM to complete. Finding a solution for this application proved trickier, as the work of its algorithm is hard to split into batches in the way that can be done with LDA.
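A rough back-of-the-envelope calculation shows how quickly exponential growth overtakes a fixed amount of RAM. The sketch below assumes, purely for illustration, that the expansion stores one 8-byte double (the size of an R numeric value) for every combination of n binary variables. The real application's structure may differ, and the function names are hypothetical.

```python
def bytes_for_full_table(n_binary_vars: int, bytes_per_value: int = 8) -> int:
    """Memory needed to store one value per outcome of n binary variables."""
    return (2 ** n_binary_vars) * bytes_per_value


def max_vars_that_fit(ram_bytes: int, bytes_per_value: int = 8) -> int:
    """Largest n whose full 2**n table still fits in `ram_bytes`."""
    n = 0
    while bytes_for_full_table(n + 1, bytes_per_value) <= ram_bytes:
        n += 1
    return n


if __name__ == "__main__":
    ram = 768 * 2**30  # 768 GB server, treated here as 768 GiB
    n = max_vars_that_fit(ram)
    print(f"largest n that fits: {n}")
    print(f"table size at n:     {bytes_for_full_table(n) / 2**30:.0f} GiB")
    print(f"table size at n + 1: {bytes_for_full_table(n + 1) / 2**30:.0f} GiB")
```

Under these assumptions, 36 variables need 512 GiB and fit, while 37 need 1,024 GiB and do not. Each additional variable doubles the requirement, so adding RAM only ever buys a few more variables.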
So, what can be done when you find an application that is limited by memory? Generally speaking, there are a few possible solutions. As discussed, a simple one is going to the cloud, though this can be expensive. For academic use, XSEDE (a partnership of colleges and universities that allows sharing of computing resources) can be a great resource, but submitting a job on it requires that the data and code be shareable with no restrictions. More complicated solutions likely require a rework of the application itself.
In some instances, data can be compressed without losing any (or much) information, meaning that a similar output could be generated without needing as much memory. If the data can be processed independently, tools like Apache Spark can also help by dispatching tasks to worker nodes and bringing the data together at the end to create a final output.
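As one small illustration of trading a little information for memory, the sketch below stores the same values as 4-byte floats instead of 8-byte doubles using Python's standard `array` module. The helper names are hypothetical, and whether the resulting precision loss is acceptable depends entirely on the application.

```python
from array import array


def payload_bytes(values: array) -> int:
    """Bytes used by the array's element buffer (excluding object overhead)."""
    return len(values) * values.itemsize


def downcast(values: array) -> array:
    """Copy a double-precision array into single precision (lossy)."""
    return array("f", iter(values))


def max_abs_error(original: array, reduced: array) -> float:
    """Largest absolute difference introduced by the downcast."""
    return max(abs(x - y) for x, y in zip(original, reduced))


if __name__ == "__main__":
    # Illustrative data only.
    data = array("d", [0.1 * i for i in range(1_000_000)])
    small = downcast(data)
    print(f"double precision: {payload_bytes(data)} bytes")
    print(f"single precision: {payload_bytes(small)} bytes")
    print(f"max abs error:    {max_abs_error(data, small)}")
```

The same idea applies in other environments, for example by choosing narrower numeric types or sparse representations, but the memory saved always has to be weighed against the accuracy the analysis needs.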
Of course, buying more memory is also a possibility. However, at $1000+ per 128 GB stick, money can limit the feasibility of this option. As with our two examples, it’s usually possible to find ways to rework the application if you search hard enough (and have people with the expertise to implement the required changes).
Memory usage is a pitfall in HPC, but it is surmountable with the appropriate tools.
By Chris Nardi