[H-GEN] Linux and high-performance computing

Greg.Lehmann at csiro.au
Tue Sep 1 21:09:47 EDT 2015


I work in the Scientific Computing Platforms team in CSIRO. Presently I work in the data area; around 12 months ago I was in the Systems area. As such I am more familiar with HPC architecture than HPC applications. I'd start with the Top500 and Green500 lists and look at the systems towards the top. That will give you an idea of what is out there now.

http://top500.org/
http://www.green500.org/

GPUs and Xeon Phi are the current best bang per watt, so they figure prominently in the Green500. There are some nasty limitations with GPUs at present. GPUs have many cores but not much memory for those cores. They generally sit on a PCIe bus, so have slow access to the host's memory. Many codes are not suited to running on GPUs because of the degree of parallelism required and the difficulty of parallelising the algorithm and/or existing code. Architecture-wise, things coming in this space include NVLink, a faster replacement for PCIe; Intel is not adopting it, but ARM and IBM Power are. In the Phi space, a future generation will fit into a motherboard socket, so it will have access to RAM at normal memory speeds.
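
To make the PCIe point concrete, here is a minimal CUDA sketch (my own illustration, nothing CSIRO-specific; the array size and the scale() kernel are arbitrary) that times the host-to-device copy against a trivial kernel on the same data:

// Times a host-to-device copy over the PCIe bus against a trivial kernel,
// to show how data movement can dominate the GPU runtime.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int n = 1 << 26;                 // ~256 MB of floats
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // crosses the PCIe bus
    cudaEventRecord(t1);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);       // compute on the GPU
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copy_ms, kern_ms;
    cudaEventElapsedTime(&copy_ms, t0, t1);
    cudaEventElapsedTime(&kern_ms, t1, t2);
    printf("copy %.1f ms, kernel %.1f ms\n", copy_ms, kern_ms);

    cudaFree(d); free(h);
    return 0;
}

For a simple operation like this the copy time usually dwarfs the kernel time, which is exactly why so many codes don't see a win from the GPU.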

CSIRO has a GPU cluster called bragg. It's relatively old now, but still in both lists. It's around 130 1RU dual-socket Intel servers with 3 GPUs per host. The GPUs tend to be underutilised because of the lack of codes that exploit them, so bragg also runs more conventional jobs on its Intel CPUs. When the GPUs do get used it gets a lot of work done quickly, and can cook you if you are standing behind it! Probably the most interesting code that can utilise the entire cluster is the Victor Chang Cardiac Research Institute heart model. They pre-test drugs in this model to see what possible effects there could be on the heart.

We also have 4 large-memory nodes with 3TB each, which are useful in the gene sequencing area. There is our new Haswell-based conventional Beowulf cluster called pearcey, with around 260 nodes. We have a small 16-node Xeon Phi cluster still in development. We also have an SGI UV 3000 being installed, probably this month. The UV 3000 is very similar to pearcey architecture-wise, but uses a custom interconnect instead of InfiniBand (used on bragg and pearcey) and a customised Linux (SLES) that treats the system as a single entity rather than a bunch of nodes (6TB of RAM I think; not sure of the core count). We also have a small 16-node Hadoop cluster in development. There is other HPC equipment as well, but this is the main kit. I think we cover most of the major current architectures with these.

Pearcey and bragg nodes can run either Windows HPC or Linux. Most work is done on Linux, with maybe 32 nodes running Windows.

HPC is about wringing the most out of your hardware, so understanding the architecture is critical to algorithm and code development. Architecture evolution is generally about fixing bottlenecks: as soon as you fix one, something else becomes the next bottleneck. One principle you should keep in mind is to keep the expensive CPU/GPU computing as much of the time as possible. CPU and RAM are the major cost components in a computer, so don't let the CPU spin waiting for data. This applies at all sorts of levels. Even the 3TB RAM servers can slow down if a core needs to access memory attached to a different socket via QPI. At the other end of the scale, disk IO is also critical. If you can avoid doing it, other than at start and completion, great. If you do lots of things in parallel you need to understand the filesystem architecture too. Hopefully it is a parallel filesystem like Lustre, GPFS or HDFS, and you can exploit the parallelism.
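
As a rough illustration of keeping the expensive processor fed (again my own sketch, not anything we run here), the usual trick on GPUs is pinned host memory plus a couple of streams so PCIe copies overlap with compute. The chunk count, sizes and scale() kernel below are arbitrary:

// Keeps the GPU busy by overlapping transfers with compute: while one
// stream copies chunk k, the other stream is computing on chunk k-1.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int chunks = 8, n = 1 << 22;           // 8 chunks of ~16 MB each
    size_t bytes = n * sizeof(float);

    float *h, *d[2];
    cudaMallocHost(&h, chunks * bytes);          // pinned memory allows async copies
    cudaMalloc(&d[0], bytes);
    cudaMalloc(&d[1], bytes);
    for (size_t i = 0; i < (size_t)chunks * n; i++) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < chunks; k++) {
        int j = k & 1;                           // alternate buffers and streams
        cudaMemcpyAsync(d[j], h + (size_t)k * n, bytes,
                        cudaMemcpyHostToDevice, s[j]);
        scale<<<(n + 255) / 256, 256, 0, s[j]>>>(d[j], n, 2.0f);
        cudaMemcpyAsync(h + (size_t)k * n, d[j], bytes,
                        cudaMemcpyDeviceToHost, s[j]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);                 // expect 2.0

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d[0]); cudaFree(d[1]); cudaFreeHost(h);
    return 0;
}

The same "hide the data movement behind the compute" idea applies on the CPU side with NUMA placement and with disk IO; the hardware changes but the principle doesn't.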

There is also a myriad of existing applications out there. Before writing your own code, investigate what's there and whether it can be used or modified. There are new workflow-oriented packages now that allow code-illiterate scientists to plug together pre-coded modules to build a workflow to process their data. Building your own modules is also possible. I have heard Moose talked about in the physics space, but as I've said, I'm not an apps guy; we have a separate team that handles that. The first talk Russell mentioned on OpenCL also mentions some of its competitors, which are worth looking at too. Looking at what is available to you in the university sector is a good suggestion. Check out their support and ask about the codes they have available. Find fellow HPC users and talk to them. eResearch 2015 is in Brisbane this year. Maybe go to that?


Cheers,

Greg

-----Original Message-----
From: General [mailto:general-bounces at lists.humbug.org.au] On Behalf Of Benjamin Fowler
Sent: Tuesday, 1 September 2015 6:56 AM
To: general <general at lists.humbug.org.au>
Subject: [H-GEN] Linux and high-performance computing

[ Humbug *General* list - semi-serious discussions about Humbug and     ]
[ Unix-related topics. Posts from non-subscribed addresses will vanish. ]
