Core and Memory Binding of Jobs in Univa Grid Engine 8.1

Execution nodes in Grid Engine clusters usually have multiple sockets and multiple cores with a hierarchy of different caches. If it is handled correctly, this hardware architecture provides a performance benefit for jobs and therefore improves the overall throughput of a cluster.

Univa Grid Engine is not only aware of the underlying hardware architecture of compute resources. It also provides the necessary semantics to give managers and users of a cluster full control over where jobs are executed and how they are handled.

Univa Grid Engine 8.1 in particular is extremely powerful. In this version the scheduler component is completely responsible for socket and core selection, which makes it possible to guarantee specific core binding requests. This was different in UGE 8.0, and it is still different in the other available Grid Engine versions.

The scheduler is also aware of the memory allocation capabilities of the underlying hardware. As a result, particular memory allocation strategies can be selected so that jobs and the underlying applications get accelerated access to the available memory. This feature is also new in UGE 8.1.

After the installation of UGE 8.1 it is possible to retrieve several parameters of the underlying execution nodes with the qhost and loadcheck commands:

> qhost
HOSTNAME ARCH   NCPU NSOC NCOR NTHR LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
------------------------------------------------------------------------
global   -        -    -    -    -     -       -       -       -       -
vm00     lx-amd64 1    1    1    1  0.01  237.9M   62.9M    2.0G   31.8M
vm01     lx-amd64 4    1    4    1  0.01  111.9M   47.4M    2.0G   24.6M

You can see that host vm01 has one socket (NSOC = 1) with four cores (NCOR = 4).

vm01 > $SGE_ROOT/utilbin/lx-amd64/loadcheck
arch            lx-amd64
num_proc        4
m_socket        1
m_core          4
m_thread        1
m_topology      SCCCC
load_short      0.00
load_medium     0.02
load_long       0.05
mem_free        64.000000M
swap_free       2029.355469M
virtual_free    2093.355469M
mem_total       111.925781M
swap_total      2053.996094M
virtual_total   2165.921875M
mem_used        47.925781M
swap_used       24.640625M
virtual_used    72.566406M
cpu             0.0%

The m_topology string of the loadcheck command also shows one socket with four cores. Each S represents one socket and each C that follows represents a core of that socket. If a machine also supported hyper-threading, a C might be followed by T's, where each T represents a supported hardware thread.

SCTTCTTSCTTCTT would therefore represent a two-socket system with dual-core CPUs where each core supports up to two threads.
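
If you want to see the topology of every execution host without logging in to it, the same values can also be queried as host resources. A small sketch (the exact output layout depends on your installation):

> qhost -F m_socket,m_core,m_topology

For each host this lists the hl:m_socket, hl:m_core and hl:m_topology values, e.g. hl:m_topology=SCCCC for vm01.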

The scheduler in a Univa Grid Engine system is able to see even more information. You can use the qstat -F command to get an impression of what this is:

queuename               qtype resv/used/tot. load_avg arch   states
-----------------------------------------------------------------------
all.q@vm01.localnet     BIPC  0/0/40         0.03     lx-amd64      
	…
	hl:num_proc=4
	…       
	hl:m_topology=SCCCC
	hl:m_topology_inuse=SCcCc
	hl:m_socket=1
	hl:m_core=4
	hl:m_thread=4
	…
	hl:m_cache_l1=32.000K
	hl:m_cache_l2=256.000K
	hl:m_cache_l3=4.000M
	hl:m_mem_total=111.000M
	hl:m_mem_used=101.000M
	hc:m_mem_free=10.000M
	hl:m_numa_nodes=1
	hl:m_topology_numa=[SCCCC]
	…
 

Especially important for the scheduler is the m_topology_inuse attribute. It shows whether cores are in use by Univa Grid Engine; if this is the case, the corresponding cores are shown in lower case. SCcCc shows that the second and the fourth core of the first socket are in use.
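
You can watch this attribute change yourself. As a small illustration (which core is actually chosen is up to the scheduler), submit a job with a simple linear core binding and query the attribute while it runs:

> qsub -binding linear:1 job.sh
> qstat -F m_topology_inuse

While the job occupies its core, one of the C characters for the chosen host appears in lower case, e.g. hl:m_topology_inuse=ScCCC.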

m_topology_numa is an enhanced topology string. In addition to the S, C, and T characters there are [ and ] brackets which mark a specific NUMA node on the execution host. A NUMA (non-uniform memory access) node is a particular memory area for which the memory access latency is the same; usually it is the memory attached to one socket.
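
On Linux hosts you can cross-check the NUMA layout reported by Univa Grid Engine against the view of the operating system, assuming the numactl package is installed on the execution host:

vm01 > numactl --hardware

On a single-socket machine like vm01 this reports exactly one node (node 0), which matches hl:m_numa_nodes=1 and the single [SCCCC] group in m_topology_numa.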

To achieve more stable and faster job run times and better job isolation it is necessary to consider the memory allocation strategy of a job. Memory-specific information is available through the following complex attributes (depending on the underlying hardware):

> qconf -sc | grep "^m_[cm][ae]"
m_cache_l1        mcache1    MEMORY   <=    YES         NO      0     0
m_cache_l2        mcache2    MEMORY   <=    YES         NO      0     0
m_cache_l3        mcache3    MEMORY   <=    YES         NO      0     0
m_mem_free        mfree      MEMORY   <=    YES         YES     0     0
m_mem_free_n0     mfree0     MEMORY   <=    YES         YES     0     0
m_mem_free_n1     mfree1     MEMORY   <=    YES         YES     0     0
m_mem_free_n2     mfree2     MEMORY   <=    YES         YES     0     0
m_mem_free_n3     mfree3     MEMORY   <=    YES         YES     0     0
m_mem_total       mtotal     MEMORY   <=    YES         YES     0     0
m_mem_total_n0    mmem0      MEMORY   <=    YES         YES     0     0
m_mem_total_n1    mmem1      MEMORY   <=    YES         YES     0     0
m_mem_total_n2    mmem2      MEMORY   <=    YES         YES     0     0
m_mem_total_n3    mmem3      MEMORY   <=    YES         YES     0     0
m_mem_used        mused      MEMORY   >=    YES         YES     0     0
m_mem_used_n0     mused0     MEMORY   >=    YES         YES     0     0
m_mem_used_n1     mused1     MEMORY   >=    YES         YES     0     0
m_mem_used_n2     mused2     MEMORY   >=    YES         YES     0     0
m_mem_used_n3     mused3     MEMORY   >=    YES         YES     0     0

m_cache_l1, m_cache_l2 and m_cache_l3 show the different cache sizes. The *_n0, *_n1, *_n2 and *_n3 attributes represent the total, free and used amount of memory for each NUMA node. The Univa Grid Engine scheduler receives these values from all execution nodes and does the internal accounting so that it is always aware of how much memory is available on each node.
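
Since these are ordinary host complexes, they can be queried and requested like any other resource. A small sketch (the per-node *_nX attributes only exist on hosts that actually have that many NUMA nodes):

> qhost -F m_mem_free,m_mem_free_n0,m_mem_free_n1
> qsub -l m_mem_free=1G job.sh

The first command lists the host-wide and per-node free memory values; the second submits a job that only gets dispatched to hosts where at least 1 GB of m_mem_free is left in the scheduler's accounting.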

Now let's look at some simple examples of how all this information can be used. The submit commands provide the -binding and -mbind switches to request a specific core and memory binding, and several additional parameters specify the binding type and strategy. Please note that I only show a few examples to give you an idea; this is by far not complete. The UGE 8.1 documentation explains all available binding types and strategies in detail.

> qsub -mbind round_robin -binding striding:2:4 job.sh

This job is bound to two cores with a step size of four cores between them; on a host with two quad-core sockets this means one core on each socket. The memory affinity is set to interleaved (across all memory banks of the host), which gives the best possible memory throughput for certain job types.
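
If you want to verify from inside the job which cores and memory policy were actually applied, the job script can simply print what the operating system sees. This is only a sketch and assumes that taskset and numactl are installed on the execution host:

#!/bin/sh
# job.sh - report the core and memory binding of this job (illustrative sketch)
echo "core binding  : $(taskset -cp $$)"                    # affinity list of the job shell, e.g. cores 0 and 4
echo "memory policy : $(numactl --show | grep '^policy')"   # shows 'interleave' when -mbind round_robin was applied
sleep 60

For the striding example above the affinity list should contain two cores, one on each socket, and the memory policy should be reported as interleaved.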

> qsub -mbind cores:strict -binding striding:2:4 \
       -pe pe_name 2 -l m_mem_free=2G job.sh

The parallel job requests 2 GB * 2 slots (4 GB) of memory and two cores spread over two sockets (quad-core processors). The job only gets scheduled to a particular host if both NUMA nodes (here both sockets) each offer 2 GB of m_mem_free_nX; the corresponding consumables are then decremented by that amount.
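
While such a job is running, you can watch the scheduler's accounting of the host-wide and per-node memory consumables, for example:

> qstat -F m_mem_free,m_mem_free_n0,m_mem_free_n1

The hc: lines in the output show the decremented free values, just like the hc:m_mem_free line in the qstat -F output shown earlier.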

When there is more time I will try to explain a more detailed example. Also watch the blog of my colleague Daniel. He implemented core binding for Sun Grid Engine / Oracle Grid Engine and he was also responsible for the enhancement and implementation of memory binding in Univa Grid Engine 8.1.
