[Q] Hardware requirements for head node?

There are a few distinct roles you seem to be folding into this "master node":

  • Login node for users including interactive use

  • System node (dhcp, provisioning, configuration management, queue system, etc.)

  • I/O node (serve file systems for software, user files etc.)

Each has different requirements, but the first bullet point could easily end up dominating (especially if it includes virtual desktops with Jupyter notebooks, MATLAB, etc.).

I would get a reasonable base config and then double (or more) the RAM. If it is going to handle I/O serving you'll have to sort that out too.

You do not need much spec-wise if all you plan to do with the head node is use it as an SSH entry point and launch jobs with Slurm. A VM could handle that. You run into problems, though, when people run programs or build software on the head node, which will happen even if you do not want them to. As others have said, the more you need the head node to do, the more hardware it needs.
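
To give a sense of how little the head node does in that setup: a user submits a batch script and Slurm runs it on the compute nodes; the head node just hosts the controller. A minimal sketch (the job name, partition, and executable are made up for illustration):

    #!/bin/bash
    #SBATCH --job-name=example      # hypothetical job name
    #SBATCH --partition=compute     # assumes a partition named "compute"
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --time=01:00:00         # 1-hour wall-clock limit

    # The actual work runs on a compute node, not the head node.
    srun ./my_program               # hypothetical executable

Submitted from the head node with: sbatch job.sh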

Will you do any network shares from the head node or do you have a separate fs?
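
If the head node does end up exporting home directories, a plain NFS export is the usual minimal approach. A sketch, assuming /home lives on the head node and the compute nodes sit on 10.0.0.0/24 (both are assumptions):

    # /etc/exports on the head node (subnet is an assumption)
    /home  10.0.0.0/24(rw,sync,no_subtree_check)

Then apply the export table and verify it with: exportfs -ra and showmount -e localhost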

Is this just for hosting the batch controller, or also a user-facing login node?

The latter tends to require a bit more resources, particularly if you expect users to build software or do light pre/post-processing on it. Network connectivity would also matter more in that case, for staging data from the outside.
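
Staging data usually just means pulling it onto the cluster's shared storage over the login node's external link. A sketch (hostnames and paths are invented):

    # Pull a dataset from an external host onto shared storage
    # (host and paths are hypothetical)
    rsync -avP user@data.example.org:/archive/dataset/ /home/user/dataset/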

You don't schedule via SSH; you log in via SSH and schedule via Slurm in the OpenHPC recipe and most others. Provisioning, monitoring, and scheduling take minimal resources in a small cluster. File serving can take a lot of resources, and login serving will have users taking whatever resources you give them, since new users will run compute jobs on the login node. So you need to limit the resources available to interactive login users, using VMs, cgroups, or whatever method you like. If one node does all of these jobs, it needs to be fairly substantial. If you separate file serving and login onto other servers, not so much.
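
One common way to cap interactive users on a login node is a systemd drop-in for the per-user slices, which puts every user session under cgroup limits. A sketch only, assuming a reasonably recent systemd; the specific limits are arbitrary examples, not recommendations:

    # /etc/systemd/system/user-.slice.d/50-limits.conf
    # Applies to every user's slice (values chosen arbitrarily for illustration)
    [Slice]
    CPUQuota=200%        # at most two cores' worth of CPU per user
    MemoryMax=8G         # hard memory cap per user
    TasksMax=500         # cap on processes/threads per user

After editing, run systemctl daemon-reload so new sessions pick up the limits.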

For single-digit nodes and users, without I/O duties, a node with 8 cores, the highest single-threaded performance you can get, 2x 1 TB RAID1 flash, 16-32 GB RAM, and a 10-gigabit NIC would be generous. You could prototype it as a fairly small VM on the compute host with virsh. (This would make provisioning that node tricky.)
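
If you go the VM route, something like virt-install under libvirt is enough to stand up a prototype head node. A sketch only; the name, sizes, bridge, ISO path, and os-variant are placeholders that depend on your host and osinfo database:

    # Prototype head-node VM on the compute host (all values are placeholders)
    virt-install \
      --name headnode-test \
      --memory 8192 \
      --vcpus 4 \
      --disk size=100 \
      --os-variant rockylinux9 \
      --network bridge=br0 \
      --cdrom /var/lib/libvirt/images/rocky9.iso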

In general practice I strongly recommend not putting data or login on the same equipment as the head node, but for a small-scale deployment it's not the end of the world for small home-directory or login usage, as long as you don't do any substantial I/O (more than ~1 GB).

I run a small cluster with an HP DL360 G5 as the master node. It is equipped with 1x Xeon L5630 @ 2.13 GHz, 12 GB of DDR3 memory, and 4x 72 GB HDDs. It manages a cluster of 15 nodes. It works, although don't take it as a reference.

I haven't seen any needs posted yet that would prevent you from using CloudyCluster and running scalable HPC/HTC jobs in the cloud. Unless you really want to build something to maintain and outlay a bunch of cash up front... it might be worth looking into. cloudycluster.com
