OS and BIOS Settings Descriptions for Cisco UCS AMD-based systems

Operating System Tuning Parameters

Operating System and Software Tuning Parameters

sched_cfs_bandwidth_slice_us
This OS setting controls the amount of run time (bandwidth) transferred to a run queue from the task's control group bandwidth pool. Small values allow the global bandwidth to be shared in a fine-grained manner among tasks; larger values reduce transfer overhead. The default value is 5000 (us).
sched_latency_ns
This OS setting configures targeted preemption latency for CPU bound tasks. The default value is 24000000 (ns).
sched_migration_cost_ns
Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated to another CPU, so increasing this variable reduces task migrations. The default value is 500000 (ns).
sched_min_granularity_ns
This OS setting controls the minimal preemption granularity for CPU bound tasks. As the number of runnable tasks increases, CFS (Completely Fair Scheduler), the scheduler of the Linux kernel, decreases the timeslices of tasks. If the number of runnable tasks exceeds sched_latency_ns/sched_min_granularity_ns, the scheduling period is extended to number_of_running_tasks * sched_min_granularity_ns, so that each task still receives roughly sched_min_granularity_ns. The default value is 8000000 (ns).
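The threshold rule in the description above can be sketched with shell arithmetic. This is only an illustration of the rule as stated here, not the kernel's implementation; the function name sched_period_ns is hypothetical, and the inputs are the default values quoted in this document.

```shell
# sched_period_ns LATENCY_NS MIN_GRANULARITY_NS NR_RUNNING
# Prints the length of the scheduling window, in nanoseconds, that results
# from the rule described above. (Hypothetical helper, for illustration only.)
sched_period_ns() {
    latency=$1
    min_gran=$2
    nr=$3
    threshold=$((latency / min_gran))   # number of tasks that fit into one latency window
    if [ "$nr" -gt "$threshold" ]; then
        echo $((nr * min_gran))         # too many tasks: the window is stretched
    else
        echo "$latency"                 # window stays at sched_latency_ns
    fi
}

sched_period_ns 24000000 8000000 2   # prints 24000000
sched_period_ns 24000000 8000000 5   # prints 40000000
```

With the defaults quoted here, the threshold is 24000000 / 8000000 = 3 runnable tasks.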
sched_wakeup_granularity_ns
This OS setting controls the wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute bound tasks. Lowering it improves wake-up latency and throughput for latency critical tasks, particularly when a short duty cycle load component must compete with CPU bound components. The default value is 10000000 (ns).
numa_balancing
This OS setting controls automatic NUMA balancing on memory mapping and process placement. NUMA balancing incurs overhead for no benefit on workloads that are already bound to NUMA nodes. Possible settings: 0 disables automatic NUMA balancing; 1 enables it. For more information see the numa_balancing entry in the Linux sysctl documentation.
kernel.randomize_va_space (ASLR)
This setting can be used to select the type of process address space randomization. Defaults differ based on whether the architecture supports ASLR, whether the kernel was built with the CONFIG_COMPAT_BRK option or not, or the kernel boot options used.
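Possible settings: 0 turns process address space randomization off; 1 randomizes the addresses of the mmap base, stack, and VDSO pages; 2 additionally randomizes the heap. Disabling ASLR can make process execution more deterministic and runtimes more consistent. For more information see the randomize_va_space entry in the Linux sysctl documentation.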
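Before changing a sysctl it can help to record its current value. The sketch below reads values through /proc/sys, which does not require root; the helper names sysctl_path and show_sysctl are hypothetical, and the simple dot-to-slash mapping is an assumption that holds for the two names used here.

```shell
# Map a sysctl name such as kernel.randomize_va_space to its /proc/sys path.
# (Hypothetical helper; assumes no dots inside individual name components.)
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# Print "name = value", or a note if the running kernel does not expose it.
show_sysctl() {
    path=$(sysctl_path "$1")
    if [ -r "$path" ]; then
        echo "$1 = $(cat "$path")"
    else
        echo "$1 is not available on this kernel"
    fi
}

show_sysctl kernel.randomize_va_space
show_sysctl kernel.numa_balancing
```

Changing a value requires root, for example with "sysctl -w kernel.randomize_va_space=0".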
ulimit -s <n>

Sets the stack size to n kbytes, or unlimited to allow the stack size to grow without limit.
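Because the limit is inherited by child processes, it can be set for a single command without affecting the calling shell. The following is a minimal sketch under that assumption; with_max_stack is a hypothetical helper that raises the soft stack limit to whatever the hard limit allows, which may or may not be "unlimited" on a given system.

```shell
# with_max_stack COMMAND [ARGS...]
# Runs COMMAND in a subshell whose soft stack limit is raised to the hard limit.
# The calling shell keeps its own limit. (Hypothetical helper, for illustration.)
with_max_stack() (
    ulimit -S -s "$(ulimit -H -s)"
    "$@"
)

echo "calling shell: $(ulimit -S -s)"
echo "child process: $(with_max_stack sh -c 'ulimit -s')"
```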

numactl --interleave=all "runspec command"

Launching a process with numactl --interleave=all sets the memory interleave policy so that memory is allocated round robin across all nodes. When memory cannot be allocated on the current interleave target, allocation falls back to other nodes.

CPUfreq governor
The CPUfreq subsystem offers several tuning options for P-states: you can switch between the different governors, influence the minimum or maximum CPU frequency to be used, or change individual governor parameters. To switch to another governor at runtime, use "cpupower frequency-set" with the -g option.
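Possible settings depend on the CPUfreq driver in use; commonly available governors include performance, powersave, and ondemand.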
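The active governor can be read from sysfs without root. The sketch below assumes the usual cpufreq sysfs layout; read_governor is a hypothetical helper, and its optional second argument exists only so it can be tried against a scratch directory.

```shell
# read_governor CPU [SYSFS_ROOT]
# Prints the active cpufreq governor of one logical CPU, or "unknown" if the
# file is missing (for example, no cpufreq driver loaded or inside a VM).
read_governor() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    file="$root/cpu$cpu/cpufreq/scaling_governor"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unknown"
        return 1
    fi
}

read_governor 0 || true

# Switching governors requires root, for example:
#   cpupower frequency-set -g performance
```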
Free the file system page cache

The command "echo 1 > /proc/sys/vm/drop_caches" is used to free up the file system page cache.

Using numactl to bind processes and memory to cores

For multi-copy runs or single copy runs on systems with multiple sockets, it is advantageous to bind a process to a particular core. Otherwise, the OS may arbitrarily move your process from one core to another. This can affect performance. To help, SPEC allows the use of a "submit" command where users can specify a utility to use to bind processes. We have found the utility 'numactl' to be the best choice.

numactl runs processes with a specific NUMA scheduling or memory placement policy. The policy is set for a command and inherited by all of its children. The numactl flag "--physcpubind" specifies which core(s) to bind the process to. "-l" instructs numactl to keep a process's memory on the local node, while "-m" specifies the node(s) on which to place a process's memory. For full details on using numactl, please refer to your Linux documentation, 'man numactl'.
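As an illustration of the binding described above, the sketch below only prints the numactl command line for a given copy; it does not run it. The helper name bind_cmd, the benchmark name, and the mapping of copy N to logical CPU N are all assumptions made for this example.

```shell
# bind_cmd COPY_NUMBER COMMAND [ARGS...]
# Prints (does not execute) a numactl command line that pins one copy to the
# logical CPU with the same number and keeps its memory on the local node.
# The copy-to-CPU mapping is an assumption for illustration only.
bind_cmd() {
    copy=$1
    shift
    echo "numactl --physcpubind=$copy --localalloc $*"
}

bind_cmd 0 ./benchmark input.dat
bind_cmd 1 ./benchmark input.dat
```

Here "--localalloc" is the long form of "-l". A real mapping should follow the system's topology, which "numactl --hardware" reports.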

dirty_background_ratio

This is the percentage of the total amount of free and reclaimable memory. When the amount of dirty pagecache exceeds this percentage, writeback threads start writing back dirty memory. This setting can help Linux disk caching and performance by setting the percentage of system memory that can be filled with dirty pages. This can be set through a command like "echo 40 > /proc/sys/vm/dirty_background_ratio".
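To get a feel for what a given percentage means on a particular machine, the ratio can be converted to an approximate amount of memory. This is a rough sketch: it uses MemAvailable from /proc/meminfo as a stand-in for "free and reclaimable memory", which is an approximation rather than the kernel's exact accounting, and pct_of_kib is a hypothetical helper.

```shell
# pct_of_kib PERCENT BASE_KIB
# Prints PERCENT percent of BASE_KIB, in KiB (integer arithmetic).
pct_of_kib() {
    echo $(( $2 * $1 / 100 ))
}

# MemAvailable approximates "free and reclaimable memory" (an assumption).
avail_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null || true)
ratio=$(cat /proc/sys/vm/dirty_background_ratio 2>/dev/null || true)

if [ -n "$avail_kib" ] && [ -n "$ratio" ]; then
    echo "dirty_background_ratio=$ratio: writeback starts near $(pct_of_kib "$ratio" "$avail_kib") KiB of dirty page cache"
fi
```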

swappiness

This control is used to define how aggressively the kernel swaps out anonymous memory relative to pagecache and other caches. Increasing the value increases the amount of swapping. The default value is 60. A value of 1 tells the kernel to only swap processes to disk if absolutely necessary. This can be set through a command like "echo 1 > /proc/sys/vm/swappiness".

Zone Reclaim Mode

This parameter controls whether memory reclaim is performed on a local NUMA node even if there is plenty of memory free on other nodes. This parameter is automatically turned on on machines with more pronounced NUMA characteristics. To tell the kernel to free local node memory rather than grabbing free memory from remote nodes, use a command like "echo 1 > /proc/sys/vm/zone_reclaim_mode".

dirty_ratio

A percentage value. When this percentage of total system memory is modified, the system begins writing the modifications to disk with the pdflush operation. The default value is 20 percent. This can be set through a command like "echo 8 > /proc/sys/vm/dirty_ratio".

Linux Huge Page settings

In order to take advantage of large pages, your system must be configured to use large pages. To configure your system for huge pages perform the following steps:

1. Create a mount point for the huge pages: "mkdir /mnt/hugepages"
2. The huge page file system needs to be mounted each time the system reboots. Add the following to a system boot configuration file before any services are started: "mount -t hugetlbfs nodev /mnt/hugepages"
3. Set vm/nr_hugepages=N in your /etc/sysctl.conf file, where N is the maximum number of pages the system may allocate.
4. Reboot to have the changes take effect. (Not necessary on some operating systems like RedHat Enterprise Linux 5.5.)
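The value N for vm/nr_hugepages depends on how much memory should be backed by huge pages and on the huge page size of the running kernel. The sketch below computes N; hugepages_needed is a hypothetical helper, the 4096 MiB target is an arbitrary example, and the 2048 KiB fallback is an assumption used only if the kernel does not report a size.

```shell
# hugepages_needed SIZE_MIB PAGE_KIB
# Prints the number of huge pages needed to back SIZE_MIB of memory, rounded up.
hugepages_needed() {
    size_kib=$(( $1 * 1024 ))
    echo $(( (size_kib + $2 - 1) / $2 ))
}

# Huge page size reported by the running kernel, in KiB.
page_kib=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo 2>/dev/null || true)
page_kib=${page_kib:-2048}   # fallback value is an assumption

# Line that could be placed in /etc/sysctl.conf for a 4096 MiB pool (example size).
echo "vm.nr_hugepages = $(hugepages_needed 4096 "$page_kib")"
```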

Note that further information about huge pages may be found in your Linux documentation file: /usr/src/linux/Documentation/vm/hugetlbpage.txt

Transparent Huge Pages

On RedHat EL 6 and later, Transparent Hugepages increases the memory page size from 4 kilobytes to 2 megabytes. Transparent Hugepages provides significant performance advantages on systems with highly contended resources and large memory workloads. If memory utilization is too high or memory is badly fragmented which prevents hugepages being allocated, the kernel will assign smaller 4k pages instead. Hugepages are used by default if /sys/kernel/mm/redhat_transparent_hugepage/enabled is set to always.
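The enabled file lists all modes and marks the active one with square brackets, for example "[always] madvise never". The sketch below extracts the active mode; thp_selected is a hypothetical helper, and the second path is the location used by upstream kernels, checked here in addition to the RedHat EL 6 path named above.

```shell
# thp_selected reads a line such as "always [madvise] never" on stdin
# and prints the bracketed (active) choice.
thp_selected() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/enabled; do
    if [ -r "$f" ]; then
        echo "$f: $(thp_selected < "$f")"
    fi
done
```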

HUGETLB_MORECORE

Set this environment variable to "yes" to enable applications to use large pages.

KMP_STACKSIZE

Specify stack size to be allocated for each thread.

KMP_AFFINITY

KMP_AFFINITY = < physical | logical >, starting-core-id specifies the static mapping of user threads to physical cores. For example, if you have a system configured with 8 cores, OMP_NUM_THREADS=8 and KMP_AFFINITY=physical,0, then thread 0 will be mapped to core 0, thread 1 will be mapped to core 1, and so on in a round-robin fashion.

KMP_AFFINITY = granularity=fine,scatter The value for the environment variable KMP_AFFINITY affects how the threads from an auto-parallelized program are scheduled across processors. Specifying granularity=fine selects the finest granularity level and causes each OpenMP thread to be bound to a single thread context. This ensures that there is only one thread per core on cores supporting HyperThreading Technology. Specifying scatter distributes the threads as evenly as possible across the entire system. Hence a combination of these two options will spread the threads evenly across sockets, with one thread per physical core.

OMP_NUM_THREADS

Sets the maximum number of threads to use for OpenMP* parallel regions if no other value is specified in the application. This environment variable applies to both -openmp and -parallel (Linux and Mac OS X) or /Qopenmp and /Qparallel (Windows). Example syntax on a Linux system with 8 cores: export OMP_NUM_THREADS=8
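The environment variables described above are typically exported together before a run. The following sketch shows one possible combination for the 8-core example; the KMP_STACKSIZE value of 200M is an arbitrary illustration, not a recommendation.

```shell
export OMP_NUM_THREADS=8                      # one thread per core on the 8-core example system
export KMP_AFFINITY=granularity=fine,scatter  # bind each thread, spread across sockets and cores
export KMP_STACKSIZE=200M                     # per-thread stack size; 200M is an arbitrary example
export HUGETLB_MORECORE=yes                   # enable large pages for the application

# Show what child processes will inherit.
env | grep -E '^(OMP_NUM_THREADS|KMP_AFFINITY|KMP_STACKSIZE|HUGETLB_MORECORE)=' | sort
```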


Firmware / BIOS / Microcode Settings

Determinism Slider:

This option allows the processor to use a given performance level as the max cap, or to let the processor operate as close to the thermal design point (TDP) as possible. Values for this BIOS option can be: Power: Processor operates as close to the TDP as possible. Performance: Processor operates at a capped performance level as the max operating state.

NUMA Nodes Per Socket:

The NUMA nodes per socket (NPS) field configures the number of memory NUMA domains per socket. The configuration can consist of one whole domain (NPS1), two domains (NPS2), or four domains (NPS4). On a two-socket platform, an additional NPS profile (NPS0) maps the whole system memory as a single NUMA domain.
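After changing this BIOS option, the number of NUMA nodes visible to Linux can be checked from sysfs. The sketch below assumes the standard /sys/devices/system/node layout; count_numa_nodes is a hypothetical helper, and its optional argument exists only so it can be tried against a scratch directory. Other options, such as declaring the L3 cache as a NUMA domain, also change the count.

```shell
# count_numa_nodes [SYSFS_NODE_DIR]
# Prints the number of nodeN directories found under SYSFS_NODE_DIR.
count_numa_nodes() {
    dir=${1:-/sys/devices/system/node}
    n=0
    for d in "$dir"/node[0-9]*; do
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

echo "NUMA nodes visible to the OS: $(count_numa_nodes)"
```

For example, a two-socket system set to NPS4 would be expected to show 8 nodes, assuming no other option adds NUMA domains.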

4-Link xGMI Max Speed:
Setting this to a lower speed can save uncore power that can be used to increase core frequency or reduce overall power. It will also decrease cross socket bandwidth and increase cross socket latency. Available settings are 20 Gbps, 25 Gbps and 32 Gbps. Default is 25 Gbps.
xGMI Link Config:
Sets the number of interconnect links between processor sockets. Available settings are Auto, 1, 2, 3, and 4. Default is Auto.
TDP Control:
Supports Manual and Auto configuration. Manual: Set customized configurable TDP. Auto: Use platform and OPN default TDP.
TDP:
Sets the maximum power consumption for CPU. Configurable Thermal Design Power (TDP) allows the user to modify the platform CPU cooling limit and the Package Power Limit (PPL) allows the user to modify the CPU Power Dissipation Limit. The CPU will control CPU boost to keep socket power dissipation at or below the specified Package Power Limit.
PPT Control:
Supports Manual and Auto configuration. Manual: Specify a custom PPL (Package Power Limit). Auto: Automatically set PPL in watts.
PPT:
This option appears once the user sets the PPT Control to Manual. Values 70-225: Set configurable PPT, in watts.
EDC:
Electrical Design Current (EDC): Indicates the maximum current the voltage rail can demand for a short, thermally insignificant time. EPYC models in infrastructure group X support configurable EDC up to 300 A. By default, the EDC limit is set to 255 A for this infrastructure group. Raising it can add additional frequency headroom for these models, at the cost of additional power consumption.
SMT Mode:
Allows enabling or disabling simultaneous multithreading (SMT). Available options are Auto and Disable. Default is Auto.
ACPI SRAT L3 Cache as NUMA Domain:
Controls automatic or manual generation of distance information in the ACPI System Locality Information Table (SLIT) and NUMA proximity domains in the System Resource Affinity Table (SRAT). Some operating systems and hypervisors do not perform L3 aware scheduling, and some workloads will benefit from having the L3 declared as a NUMA domain. When enabled, the last level cache in each CCX in the system will be declared as a separate NUMA domain. It can improve performance for highly NUMA optimized workloads if workloads or components of workloads can be pinned to cores in a CCX and if they can benefit from sharing an L3 cache.
L1 Stream HW Prefetcher:
Enable/Disable L1 Stream HW Prefetcher. Most workloads will benefit from the L1 Stream Hardware prefetchers gathering data and keeping the core pipeline busy. By default, L1 Stream HW Prefetcher is enabled.
L2 Stream HW Prefetcher:
Enable/Disable L2 Stream HW Prefetcher. Most workloads will benefit from the L2 Stream Hardware prefetchers gathering data and keeping the core pipeline busy. By default, L2 Stream HW Prefetcher is enabled.
APBDIS:
Enable or disable Algorithm Performance Boost (APB). In the default state, the Infinity Fabric selects between a full-power and low-power fabric clock and memory clock based on fabric and memory usage. However, under certain scenarios, involving low bandwidth but latency-sensitive traffic (and memory latency checkers), the transition from low power to full power can adversely impact latency. Setting APBDIS to 1 (to disable APB) and specifying a fixed Infinity Fabric P-state of 0 will force the Infinity Fabric and memory controllers into full-power mode, eliminating any such latency jitter. Available settings are:
IOMMU:
The IOMMU provides several benefits and is required when using x2APIC. Enabling the IOMMU allows devices (such as the EPYC integrated SATA controller) to present separate IRQs for each attached device instead of one IRQ for the subsystem. The IOMMU also allows operating systems to provide additional protection for DMA capable I/O devices. IOMMU also helps filter and remap interrupts from peripheral devices. Available settings are:
DRAM Scrub Time:
This option sets the period of time between successive DRAM scrub events. Performance may be reduced with more frequent DRAM scrub events.
DF C-States:
Much like CPU cores, the AMD Infinity Fabric can enter lower-power states while idle, but a delay occurs when transitioning back to full-power mode that causes some latency jitter. Disabling this feature for workloads requiring low latency and/or bursty I/O will increase both performance and power consumption. This option only applies to dual-socket systems. Available settings are:
DLWM Support:
This feature reduces xGMI lane width from x16 to x8 or x2 if xGMI links have limited traffic. DLWM feature is optimized to trade power between CPU core intensive workloads (SPECCPU) and I/O bandwidth intensive workloads (Kernel IP Forward or iPerf). Available options are Disable and Auto. Default value is Auto.
Memory interleaving:
Memory interleaving is a technique that CPUs use to increase the memory bandwidth available for an application. With memory interleaving enabled, consecutive memory blocks are placed in different banks and can all contribute to the overall memory bandwidth, thus increasing throughput and lowering latency. Available options are Disable and Auto. Default value is Auto (Enable).
High Bandwidth:

Enabling this option allows the chipset to defer memory transactions and process them out of order for optimal performance.

submit= MYMASK=`printf '0x%x' \$((1<< \$SPECCOPYNUM))`; /usr/bin/taskset \$MYMASK $command

When running multiple copies of benchmarks, the SPEC config file feature submit is sometimes used to cause individual jobs to be bound to specific processors. This specific submit command is used for Linux. The elements of the command are described below:

/usr/bin/taskset [options] [mask] [pid | command [arg] ... ] :
taskset is used to set or retrieve the CPU affinity of a running process given its PID or to launch a new COMMAND with a given CPU affinity. The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and highest order bit corresponding to the last logical CPU. When the taskset returns, it is guaranteed that the given program has been scheduled to a legal CPU.
The default behaviour of taskset is to run a new command with a given affinity mask:
taskset [mask] [command] [arguments]
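The mask in the submit line above is built by shifting 1 left by the copy number, so each copy gets a mask with exactly one bit set. The sketch below reproduces that arithmetic in plain shell, without the backslash escapes the config file needs; copy_mask is a hypothetical helper whose argument plays the role of SPECCOPYNUM.

```shell
# copy_mask N
# Prints a hexadecimal affinity mask with only bit N set, as used by taskset.
copy_mask() {
    printf '0x%x\n' $((1 << $1))
}

copy_mask 0    # prints 0x1  (first logical CPU)
copy_mask 5    # prints 0x20 (sixth logical CPU)

# A copy could then be launched with, for example:
#   taskset "$(copy_mask 5)" ./benchmark
```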