Thursday, 24 March 2016

NUMA Nodes and SQL Scheduler Configuration Performance Impact


I recently built a new SQL server for one of my client with 72 cores, It was an SQL 2014 enterprise edition and was an upgrade from Amazon Web services to Physical infrastructure with higher processing power, Memory and faster disk. The SQL Version was the same (SQL 2014 SP1 on AWS and Physical Infrastructure) but surprising the physical infrastructure was slower than AWS in some instances. Some queries were taking long time in the new server and waits pointed to CPU Schedulers.

Thanks to our Performance Auditors(Mike Walsh from Linchpin People) who found out the following message from the SQL error log,

“SQL Server detected 2 sockets with 18 cores per socket and 36 logical processors per socket, 72 total logical processors; using 40 logical processors based on SQL Server licensing. This is an informational message; no user action is required.”

We found the client was licenced SQL for only 40 cores even though the hardware had 72 cores SQL can use only the licenced 40 cores. This was still an non-issue as this was still more processing power to AWS and that’s when Mike pointed me to Glenn Berry's Blog http://www.sqlskills.com/blogs/glenn/balancing-your-available-sql-server-core-licenses-evenly-across-numa-nodes/

As mention in the above blog post, I ran

-- SQL Server NUMA node information
SELECT node_id, node_state_desc, memory_node_id, processor_group, online_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC' OPTION (RECOMPILE);


node_id
node_state_desc
memory_node_id
processor_group
online_scheduler_count
0
ONLINE
0
0
36
1
ONLINE
1
1
4

As you can see in the above result,  36 Schedulers are in use on the first NUMA node with the first 36 logical processors and then 4 are in use on the second NUMA node. This is not an optimal configuration If we have had SQL licensed for all the 72 cores we would have had equal balance with the schedulers on the NUMA nodes. This might be a performance bottleneck.

I followed Glenn's blog again and ran

-- SQL Server NUMA node and cpu_id information
SELECT parent_node_id, scheduler_id, cpu_id
FROM sys.dm_os_schedulers WITH (NOLOCK)
WHERE [status] = N'VISIBLE ONLINE';

From the above query result I found the cpu id’s in use were from 0 to 19 and 64 to 83 after which I ran the below ALTER query as suggested in the blog with right CPU id.

ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 0 TO 19, 64 TO 83;

Now I ran first query again to find NUMA node information,

-- SQL Server NUMA node information
SELECT node_id, node_state_desc, memory_node_id, processor_group, online_scheduler_count
FROM sys.dm_os_nodes WITH (NOLOCK)
WHERE node_state_desc <> N'ONLINE DAC' OPTION (RECOMPILE);

node_id
node_state_desc
memory_node_id
processor_group
online_scheduler_count
active_worker_count
0
ONLINE
0
0
20
41
1
ONLINE
1
1
20
56


As you can see now we reached a balance now with CPU schedulers and the NUMA nodes. This change resolved our slowness issue and the server is now much faster than our lower hardware system.