2007/12/05

Hyper-Threading speeds Linux

An IBM report says Hyper-Threading speeds up Linux, but based on my own Oracle database benchmarks, the improvement is neither obvious nor stable.

------
Hyper-Threading speeds Linux

Multiprocessor performance on a single processor

Level: Introductory

Duc Vianney (dvianney@us.ibm.com), Linux Kernel Performance Group, Linux Technology Center, IBM

01 Jan 2003
The Intel Xeon processor introduces a new technology called Hyper-Threading (HT) that, to the operating system, makes a single processor behave like two logical processors. When enabled, the technology allows the processor to execute multiple threads simultaneously, in parallel within each processor, which can yield significant performance improvement. We set out to quantify just how much improvement you can expect to see.

The current Linux symmetric multiprocessing (SMP) kernel, in both its 2.4 and 2.5 versions, has been made aware of Hyper-Threading, and performance speed-ups have been observed in multithreaded benchmarks (see Resources later in this article for articles with more details).

This article gives the results of our investigation into the effects of Hyper-Threading (HT) on the Linux SMP kernel. It compares the performance of a Linux SMP kernel that was aware of Hyper-Threading to one that was not. The system under test was a multithreading-enabled, single-CPU Xeon. The benchmarks used in the study covered areas within the kernel that could be affected by Hyper-Threading, such as the scheduler, low-level kernel primitives, the file server, the network, and threaded support.

The results on Linux kernel 2.4.19 show that Hyper-Threading technology can improve multithreaded application performance by up to 30%. Current work on Linux kernel 2.5.32 may provide a performance speed-up of as much as 51%.

Introduction

Intel's Hyper-Threading Technology enables two logical processors on a single physical processor by replicating, partitioning, and sharing the resources within the Intel NetBurst microarchitecture pipeline.

Replicated resources create copies of the resources for the two threads:
All per-CPU architectural states
Instruction pointers, renaming logic
Some smaller resources (such as return stack predictor, ITLB, etc.)

Partitioned resources divide the resources between the executing threads:
Several buffers (Re-Order Buffer, Load/Store Buffers, queues, etc.)

Shared resources make use of the resources as needed between the two executing threads:
Out-of-Order execution engine
Caches

Typically, each physical processor has a single architectural state on a single processor core to service threads. With HT, each physical processor has two architectural states on a single core, making the physical processor appear as two logical processors to service threads. The system BIOS enumerates each architectural state on the physical processor. Since Hyper-Threading-aware operating systems take advantage of logical processors, those operating systems have twice as many resources to service threads.

Hyper-Threading support in the Xeon processor

The Xeon processor is the first to implement Simultaneous Multi-Threading (SMT) in a general-purpose processor. (See Resources for more information on the Xeon family of processors.) To achieve the goal of executing two threads on a single physical processor, the processor simultaneously maintains the context of multiple threads, which allows the scheduler to dispatch two potentially independent threads concurrently.

The operating system (OS) schedules and dispatches threads of code to each logical processor as it would in an SMP system. When a thread is not dispatched, the associated logical processor is kept idle.

When a thread is scheduled and dispatched to a logical processor, LP0, the Hyper-Threading technology utilizes the necessary processor resources to execute the thread.

When a second thread is scheduled and dispatched on the second logical processor, LP1, resources are replicated, divided, or shared as necessary in order to execute the second thread. Each processor makes selections at points in the pipeline to control and process the threads. As each thread finishes, the operating system idles the unused processor, freeing resources for the running processor.

The OS schedules and dispatches threads to each logical processor, just as it would in a dual-processor or multi-processor system. As the system schedules and introduces threads into the pipeline, resources are utilized as necessary to process two threads.

Hyper-Threading support in Linux kernel 2.4

Under the Linux kernel, a Hyper-Threaded processor with two virtual processors is treated as a pair of real physical processors. As a result, the scheduler that handles SMP should be able to handle Hyper-Threading as well. The support for Hyper-Threading in Linux kernel 2.4.x began with 2.4.17 and includes the following enhancements:
128-byte lock alignment
Spin-wait loop optimization (see the sketch after this list)
Non-execution-based delay loops
Detection of a Hyper-Threading-enabled processor, and starting the logical processor as if the machine were SMP
Serialization in the MTRR and Microcode Update drivers, as they affect shared state
Optimization to the scheduler when the system is idle, to prioritize scheduling on a physical processor before scheduling on a logical processor
Offset user stack to avoid 64K aliasing
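
As an aside on the spin-wait item above, the usual Hyper-Threading-friendly idiom is to execute the PAUSE instruction (encoded as rep; nop on IA-32) inside the wait loop, so the spinning logical processor yields execution resources to its sibling. The following is only an illustrative sketch in modern user-space C, not the kernel's actual code:

/* Illustrative sketch (modern user-space C, not the 2.4 kernel's code). */
#include <stdatomic.h>

static inline void cpu_relax(void)
{
    /* "rep; nop" is the PAUSE instruction: it hints to the core that we are
     * spinning, so execution resources can go to the sibling logical CPU. */
    __asm__ __volatile__("rep; nop" ::: "memory");
}

static void spin_until_set(atomic_int *flag)
{
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        cpu_relax();    /* avoid starving the sibling logical processor */
}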

Kernel performance measurement

To assess the effects of Hyper-Threading on the Linux kernel, we measured the performance of kernel benchmarks on a system containing the Intel Xeon processor with HT. The hardware was a single-CPU, 1.6 GHz Xeon MP processor with SMT, 2.5 GB of RAM, and two 9.2 GB SCSI disk drives. The kernel under measurement was stock version 2.4.19 configured and built with SMP enabled. The kernel Hyper-Threading support was specified by the boot option acpismp=force for Hyper-Threading and noht for no Hyper-Threading. The existence of Hyper-Threading support can be seen by using the command cat /proc/cpuinfo to show the presence of two processors, processor 0 and processor 1. Note the ht flag in Listing 1 for CPUs 0 and 1. In the case of no Hyper-Threading support, the data will be displayed for processor 0 only.

Listing 1. Output from cat /proc/cpuinfo showing Hyper-Threading support
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 1
model name : Intel(R) Genuine CPU 1.60GHz
stepping : 1
cpu MHz : 1600.382
cache size : 256 KB
. . .
fpu : yes
fpu_exception: yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
tm
bogomips : 3191.60
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 1
model name : Intel(R) Genuine CPU 1.60GHz
stepping : 1
cpu MHz : 1600.382
cache size : 256 KB
. . .
fpu : yes
fpu_exception: yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht
tm
bogomips : 3198.15
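
For illustration only (this is not part of the original article), a small C program can perform the same check programmatically by counting the processor entries in /proc/cpuinfo and looking for the ht flag. Note that the ht flag indicates the capability; it does not by itself prove that the second logical processor has been enabled.

/* Sketch: count logical CPUs and look for the "ht" flag in /proc/cpuinfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[1024];
    int cpus = 0, ht = 0;

    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "processor", 9) == 0)
            cpus++;                              /* one block per logical CPU */
        if (strncmp(line, "flags", 5) == 0 && strstr(line, " ht"))
            ht = 1;                              /* rough check for the flag */
    }
    fclose(f);
    printf("logical processors: %d, ht flag: %s\n", cpus, ht ? "yes" : "no");
    return 0;
}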



Linux kernel benchmarks

To measure Linux kernel performance, five benchmarks were used: LMbench, AIM Benchmark Suite IX (AIM9), chat, dbench, and tbench. The LMbench benchmark times various Linux application programming interfaces (APIs), such as basic system calls, context switching latency, and memory bandwidth. The AIM9 benchmark provides measurements of user application workload. The chat benchmark is a client-server workload modeled after a chat room. The dbench benchmark is a file server workload, and tbench is a TCP workload. Chat, dbench, and tbench are multithreaded benchmarks, while the others are single-threaded benchmarks.

Effects of Hyper-Threading on Linux APIs

The effects of Hyper-Threading on Linux APIs were measured by LMbench, which is a microbenchmark containing a suite of bandwidth and latency measurements. Among these are cached file read, memory copy (bcopy), memory read/write (and latency), pipe, context switching, networking, filesystem creates and deletes, process creation, signal handling, and processor clock latency. LMbench stresses the following kernel components: scheduler, process management, communication, networking, memory map, and filesystem. The low level kernel primitives provide a good indicator of the underlying hardware capabilities and performance.

To study the effects of Hyper-Threading, we focused on the latency measurements, which capture how long a given control operation takes (in other words, how fast the system can perform some operation). The latency numbers are reported in microseconds per operation.

Table 1 shows a partial list of kernel functions tested by LMbench. Each data point is the average of three runs, and the data have been tested for their convergence to assure that they are repeatable when subjected to the same test environment. In general, there is no performance difference between Hyper-Threading and no Hyper-Threading for those functions that are running as a single thread. However, for those tests that require two threads to run, such as the pipe latency test and the three process latency tests, Hyper-Threading seems to degrade their latency times. The configured stock SMP kernel is denoted as 2419s. If the kernel was configured without Hyper-Threading support, it is denoted as 2419s-noht. With Hyper-Threading support, the kernel is listed as 2419s-ht.

Table 1. Effects of Hyper-Threading on Linux APIs
Kernel function 2419s-noht 2419s-ht Speed-up
Simple syscall 1.10 1.10 0%
Simple read 1.49 1.49 0%
Simple write 1.40 1.40 0%
Simple stat 5.12 5.14 0%
Simple fstat 1.50 1.50 0%
Simple open/close 7.38 7.38 0%
Select on 10 fd's 5.41 5.41 0%
Select on 10 tcp fd's 5.69 5.70 0%
Signal handler installation 1.56 1.55 0%
Signal handler overhead 4.29 4.27 0%
Pipe latency 11.16 11.31 -1%
Process fork+exit 190.75 198.84 -4%
Process fork+execve 581.55 617.11 -6%
Process fork+/bin/sh -c 3051.28 3118.08 -2%
Note: Data are in microseconds: smaller is better.


The pipe latency test uses two processes communicating through a UNIX pipe to measure interprocess communication latency. The benchmark passes a token back and forth between the two processes. The degradation is 1%, which is small to the point of being insignificant.
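
A rough sketch of that token-passing idea, written in plain C rather than taken from the LMbench sources, looks like this: bounce a one-byte token between parent and child through a pair of pipes and divide the elapsed time by the number of round trips.

/* Rough sketch of the token-passing idea (not the LMbench source). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERS 100000

int main(void)
{
    int p2c[2], c2p[2];                /* parent-to-child and child-to-parent */
    char tok = 'x';
    struct timeval t0, t1;

    if (pipe(p2c) < 0 || pipe(c2p) < 0) {
        perror("pipe");
        exit(1);
    }

    if (fork() == 0) {                 /* child: echo every token back */
        for (;;) {
            if (read(p2c[0], &tok, 1) != 1)
                _exit(0);
            write(c2p[1], &tok, 1);
        }
    }

    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {  /* parent: send token, wait for echo */
        write(p2c[1], &tok, 1);
        read(c2p[0], &tok, 1);
    }
    gettimeofday(&t1, NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("pipe round-trip latency: %.2f usec\n", us / ITERS);
    return 0;
}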

The three process tests involve process creation and execution under Linux. The purpose is to measure the time taken to create a basic thread of control. For the process fork+exit test, the data represents the latency time taken to split a process into two (nearly) identical copies and have one exit. This is how new processes are created -- but it is not very useful since both processes are doing the same thing. In this test, Hyper-Threading causes a 4% degradation.

In the process fork+execve test, the data represents the time it takes to create a new process and have that new process run a new program. This is the inner loop of all shells (command interpreters). This test sees a 6% degradation due to Hyper-Threading.

In the process fork+/bin/sh -c test, the data represents the time taken to create a new process and have that new process run a new program by asking the system shell to find that program and run it. This is how the C library interface called system is implemented. This call is the most general and the most expensive. Under Hyper-Threading, this test runs 2% slower compared to non-Hyper-Threading.
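
To make the three process tests concrete, here is a hedged, self-contained sketch (again, not the LMbench code; /bin/true is just an assumed trivial target program) that times fork+exit, fork+execve, and fork+/bin/sh -c in the same spirit:

/* Hedged sketch of the three process-creation paths (not the LMbench code). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static double time_usec(void (*fn)(void), int iters)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++)
        fn();
    gettimeofday(&t1, NULL);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / iters;
}

static void do_fork_exit(void)          /* fork + exit */
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);
    waitpid(pid, NULL, 0);
}

static void do_fork_execve(void)        /* fork + execve of a trivial program */
{
    pid_t pid = fork();
    if (pid == 0) {
        execl("/bin/true", "true", (char *)NULL);
        _exit(1);
    }
    waitpid(pid, NULL, 0);
}

static void do_fork_sh(void)            /* fork + /bin/sh -c, i.e. system(3) */
{
    system("/bin/true");
}

int main(void)
{
    printf("fork+exit      : %8.2f usec\n", time_usec(do_fork_exit, 200));
    printf("fork+execve    : %8.2f usec\n", time_usec(do_fork_execve, 200));
    printf("fork+/bin/sh -c: %8.2f usec\n", time_usec(do_fork_sh, 200));
    return 0;
}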

Effects of Hyper-Threading on Linux single-user application workload

The AIM9 benchmark is a single-user workload designed to measure the performance of hardware and operating systems. The results are shown in Table 2. Most of the tests in the benchmark performed identically with and without Hyper-Threading, except for the sync file operations and Integer Sieves. The three operations Sync Random Disk Writes, Sync Sequential Disk Writes, and Sync Disk Copies are approximately 35% slower with Hyper-Threading. On the other hand, Hyper-Threading provided a 60% improvement over non-Hyper-Threading in the case of Integer Sieves.

Table 2. Effects of Hyper-Threading on AIM9 workload
Test Description 2419s-noht 2419s-ht Speed-up
add_double Thousand Double Precision Additions per second 638361 637724 0%
add_float Thousand Single Precision Additions per second 638400 637762 0%
add_long Thousand Long Integer Additions per second 1479041 1479041 0%
add_int Thousand Integer Additions per second 1483549 1491017 1%
add_short Thousand Short Integer Additions per second 1480800 1478400 0%
creat-clo File Creations and Closes per second 129100 139700 8%
page_test System Allocations & Pages per second 161330 161840 0%
brk_test System Memory Allocations per second 633466 635800 0%
jmp_test Non-local gotos per second 8666900 8694800 0%
signal_test Signal Traps per second 142300 142900 0%
exec_test Program Loads per second 387 387 0%
fork_test Task Creations per second 2365 2447 3%
link_test Link/Unlink Pairs per second 54142 59169 9%
disk_rr Random Disk Reads (K) per second 85758 89510 4%
disk_rw Random Disk Writes (K) per second 76800 78455 2%
disk_rd Sequential Disk Reads (K) per second 351904 356864 1%
disk_wrt Sequential Disk Writes (K) per second 154112 156359 1%
disk_cp Disk Copies (K) per second 104343 106283 2%
sync_disk_rw Sync Random Disk Writes (K) per second 239 155 -35%
sync_disk_wrt Sync Sequential Disk Writes (K) per second 97 60 -38%
sync_disk_cp Sync Disk Copies (K) per second 97 60 -38%
disk_src Directory Searches per second 48915 48195 -1%
div_double Thousand Double Precision Divides per second 37162 37202 0%
div_float Thousand Single Precision Divides per second 37125 37202 0%
div_long Thousand Long Integer Divides per second 27305 27360 0%
div_int Thousand Integer Divides per second 27305 27332 0%
div_short Thousand Short Integer Divides per second 27305 27360 0%
fun_cal Function Calls (no arguments) per second 30331268 30105600 -1%
fun_cal1 Function Calls (1 argument) per second 112435200 112844800 0%
fun_cal2 Function Calls (2 arguments) per second 97587200 97843200 0%
fun_cal15 Function Calls (15 arguments) per second 44748800 44800000 0%
sieve Integer Sieves per second 15 24 60%
mul_double Thousand Double Precision Multiplies per second 456287 456743 0%
mul_float Thousand Single Precision Multiplies per second 456000 456743 0%
mul_long Thousand Long Integer Multiplies per second 167904 168168 0%
mul_int Thousand Integer Multiplies per second 167976 168216 0%
mul_short Thousand Short Integer Multiplies per second 155730 155910 0%
num_rtns_1 Numeric Functions per second 92740 92920 0%
trig_rtns Trigonometric Functions per second 404000 405000 0%
matrix_rtns Point Transformations per second 875140 891300 2%
array_rtns Linear Systems Solved per second 579 578 0%
string_rtns String Manipulations per second 2560 2564 0%
mem_rtns_1 Dynamic Memory Operations per second 982035 980019 0%
mem_rtns_2 Block Memory Operations per second 214590 215390 0%
sort_rtns_1 Sort Operations per second 481 472 -2%
misc_rtns_1 Auxiliary Loops per second 7916 7864 -1%
dir_rtns_1 Directory Operations per second 2002000 2001000 0%
shell_rtns_1 Shell Scripts per second 95 97 2%
shell_rtns_2 Shell Scripts per second 95 96 1%
shell_rtns_3 Shell Scripts per second 95 97 2%
series_1 Series Evaluations per second 3165270 3189630 1%
shared_memory Shared Memory Operations per second 174080 174220 0%
tcp_test TCP/IP Messages per second 65835 66231 1%
udp_test UDP/IP DataGrams per second 111880 112150 0%
fifo_test FIFO Messages per second 228920 228900 0%
stream_pipe Stream Pipe Messages per second 170210 171060 0%
dgram_pipe DataGram Pipe Messages per second 168310 170560 1%
pipe_cpy Pipe Messages per second 245090 243440 -1%
ram_copy Memory to Memory Copy per second 490026708 492478668 1%


Effects of Hyper-Threading on Linux multithreaded application workload

To measure the effects of Hyper-Threading on Linux multithreaded applications, we used the chat benchmark, which is modeled after a chat room. The benchmark includes both a client and a server. The client side of the benchmark reports the number of messages sent per second; the number of chat rooms and messages controls the workload. The workload creates many threads and TCP/IP connections, and sends and receives many messages. It uses the following default parameters:
Number of chat rooms = 10
Number of messages = 100
Message size = 100 bytes
Number of users = 20

By default, each chat room has 20 users. A total of 10 chat rooms will have 20x10 = 200 users. For each user in the chat room, the client will make a connection to the server. So since we have 200 users, we will have 200 connections to the server. Now, for each user (or connection) in the chat room, a "send" thread and a "receive" thread are created. Thus, a 10-chat-room scenario will create 10x20x2 = 400 client threads and 400 server threads, for a total of 800 threads. But there's more.

Each client "send" thread will send the specified number of messages to the server. For 10 chat rooms and 100 messages, the client will send 10x20x100 = 20,000 messages. The server "receive" thread will receive the corresponding number of messages. The chat room server will echo each of the messages back to the other users in the chat room. Thus, for 10 chat rooms and 100 messages, the server "send" thread will send 10x20x100x19 or 380,000 messages. The client "receive" thread will receive the corresponding number of messages.

The test starts by starting the chat server in a command-line session and the client in another command-line session. The client simulates the workload and the results represent the number of messages sent by the client. When the client ends its test, the server loops and accepts another start message from the client. In our measurement, we ran the benchmark with 20, 30, 40, and 50 chat rooms. The corresponding number of connections and threads are shown in Table 3.

Table 3. Number of chat rooms and threads tested
Number of chat rooms  Number of connections  Number of threads  Number of messages sent  Number of messages received  Total number of messages
20  400  1,600  40,000  760,000  800,000
30  600  2,400  60,000  1,140,000  1,200,000
40  800  3,200  80,000  1,520,000  1,600,000
50  1,000  4,000  100,000  1,900,000  2,000,000


Table 4 shows the performance impact of Hyper-Threading on the chat workload. Each data point represents the geometric mean of five runs. The data clearly indicate that Hyper-Threading improves the workload throughput by 22% to 28%, depending on the number of chat rooms. Overall, Hyper-Threading boosts chat performance by 24% based on the geometric mean of the four chat-room samples.

Table 4. Effects of Hyper-Threading on chat throughput
Number of chat rooms 2419s-noht 2419s-ht Speed-up
20 164,071 202,809 24%
30 151,530 184,803 22%
40 140,301 171,187 22%
50 123,842 158,543 28%
Geometric Mean 144,167 178,589 24%
Note: Data is the number of messages sent by client: higher is better.


Figure 1. Effects of Hyper-Threading on the chat workload


Effects of Hyper-Threading on Linux multithreaded file server workload

The effect of Hyper-Threading on the file server was measured with dbench and its companion test, tbench. dbench is similar to the well-known NetBench benchmark from Ziff-Davis Media, which lets you measure the performance of file servers as they handle network file requests from clients. However, while NetBench requires an elaborate setup of actual physical clients, dbench simulates the roughly 90,000 operations typically run by a NetBench client by replaying a 4 MB file called client.txt (captured from a real NetBench run) to produce the same workload. The contents of this file are file operation directives such as SMBopenx, SMBclose, SMBwritebraw, SMBgetatr, and so on. These I/O calls correspond to the Server Message Block (SMB) protocol calls that the SMBD server in Samba would produce in a NetBench run. The SMB protocol is used by Microsoft Windows 3.11, NT, and 95/98 to share disks and printers.

In our tests, a total of 18 different types of I/O calls were used including open file, read, write, lock, unlock, get file attribute, set file attribute, close, get disk free space, get file time, set file time, find open, find next, find close, rename file, delete file, create new file, and flush file buffer.

dbench can simulate any number of clients without going through the expense of a physical setup. dbench produces only the filesystem load, and it does no networking calls. During a run, each client records the number of bytes of data moved and divides this number by the amount of time required to move the data. All client throughput scores are then added up to determine the overall throughput for the server. The overall I/O throughput score represents the number of megabytes per second transferred during the test. This is a measurement of how well the server can handle file requests from clients.
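
The scoring described above can be summarized in a few lines of C. The structure and field names below are illustrative assumptions, not dbench's actual code: each client's throughput is the bytes it moved divided by its elapsed time, and the server score is the sum over all clients, reported in MB/sec.

/* Sketch of the scoring scheme (structure and field names are assumptions). */
#include <stdio.h>

struct client_result {
    double bytes_moved;       /* bytes of data the client moved */
    double elapsed_sec;       /* wall-clock time the client needed */
};

static double server_throughput_mb(const struct client_result *c, int n)
{
    double total = 0.0;                                /* bytes per second */
    for (int i = 0; i < n; i++)
        total += c[i].bytes_moved / c[i].elapsed_sec;  /* add each client's rate */
    return total / (1024.0 * 1024.0);                  /* report in MB/sec */
}

int main(void)
{
    struct client_result clients[] = {
        { 420.0 * 1024 * 1024, 60.0 },                 /* ~420 MB moved in 60 s */
        { 400.0 * 1024 * 1024, 60.0 },
    };
    printf("overall throughput: %.2f MB/sec\n", server_throughput_mb(clients, 2));
    return 0;
}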

dbench is a good test for Hyper-Threading because it creates a high load and activity on the CPU and I/O schedulers. The ability of Hyper-Threading to support multithreaded file serving is severely tested by dbench because many files are created and accessed simultaneously by the clients. Each client has to create about 21 megabytes worth of test data files. For a test run with 20 clients, about 420 megabytes of data are expected. dbench is considered a good test to measure the performance of the elevator algorithm used in the Linux filesystem. dbench is used to test the working correctness of the algorithm, and whether the elevator is aggressive enough. It is also an interesting test for page replacement.

Table 5 shows the impact of HT on the dbench workload. Each data point represents the geometric mean of five runs. The data indicate that Hyper-Threading improves dbench throughput by as little as 9% to as much as 29%. The overall improvement is 18% based on the geometric mean of the five test scenarios.

Table 5. Effects of Hyper-Threading on dbench throughput
Number of clients 2419s-noht 2419s-ht Speed-up
20 132.82 171.23 29%
30 131.43 169.55 29%
60 119.95 133.77 12%
90 111.89 121.81 9%
120 99.31 114.92 16%
Geometric Mean 118.4 140.3 18%
Note: Data are throughput in MB/sec: higher is better.


Figure 2. Effects of Hyper-Threading on the dbench workload


tbench

tbench is another file server workload similar to dbench. However, tbench produces only the TCP and process load. tbench does the same socket calls that SMBD would do under a netbench load, but tbench does no filesystem calls. The idea behind tbench is to eliminate SMBD from the netbench test, as though the SMBD code could be made fast. The throughput results of tbench tell us how fast a netbench run could go if we eliminated all filesystem I/O and SMB packet processing. tbench is built as part of the dbench package.

Table 6 depicts the impact of Hyper-Threading on the tbench workload. As before, each data point represents the geometric mean of five runs. Hyper-Threading clearly improves tbench throughput, by 22% to 31%. The overall improvement is 27% based on the geometric mean of the five test scenarios.

Table 6. Effects of Hyper-Threading on tbench throughput
Number of clients 2419s-noht 2419s-ht Speed-up
20 60.98 79.86 31%
30 59.94 77.82 30%
60 55.85 70.19 26%
90 48.45 58.88 22%
120 37.85 47.92 27%
Geometric Mean 51.84 65.77 27%
Note: Data are throughput in MB/sec: higher is better.


Figure 3. Effects of Hyper-Threading on the tbench workload


Hyper-Threading support in Linux kernel 2.5.x

Linux kernel 2.4.x has been aware of HT since the release of 2.4.17. The 2.4.17 kernel knows about the logical processors and treats a Hyper-Threaded processor as two physical processors. However, the scheduler used in the stock 2.4.x kernel is still considered naive, in that it cannot distinguish resource contention between two logical processors on the same package from that between two separate physical processors.

Ingo Molnar has pointed out scenarios in which the current scheduler gets things wrong (see Resources for a link). Consider a system with two physical CPUs, each of which provides two virtual processors. If there are two tasks running, the current scheduler would let them both run on a single physical processor, even though far better performance would result from migrating one process to the other physical CPU. The scheduler also doesn't understand that migrating a process from one virtual processor to its sibling (a logical CPU on the same physical CPU) is cheaper (due to cache loading) than migrating it across physical processors.

The solution is to change the way the run queues work. The 2.5 scheduler maintains one run queue per processor and attempts to avoid moving tasks between queues. The change is to have one run queue per physical processor that is able to feed tasks into all of the virtual processors. Throw in a smarter sense of what makes an idle CPU (all virtual processors must be idle), and the resulting code "magically fulfills" the needs of scheduling on a Hyper-Threading system.
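
The "smarter sense of what makes an idle CPU" can be sketched as follows; the data structures are assumptions for illustration, not the 2.5 scheduler's actual types. The key point is that a physical package counts as idle only when every sibling logical CPU on it is idle.

/* Illustration only: a physical CPU is idle only if all its siblings are. */
#include <stdbool.h>

#define MAX_SIBLINGS 2

struct logical_cpu {
    bool idle;
};

struct physical_cpu {
    struct logical_cpu sibling[MAX_SIBLINGS];
    int nr_siblings;
};

static bool physical_cpu_idle(const struct physical_cpu *p)
{
    for (int i = 0; i < p->nr_siblings; i++)
        if (!p->sibling[i].idle)
            return false;       /* a busy sibling keeps the package busy */
    return true;                /* every logical CPU on this package is idle */
}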

In addition to the run queue change in the 2.5 scheduler, there are other changes needed to give the Linux kernel the ability to leverage HT for optimal performance. Those changes were discussed by Molnar (again, please see Resources for more on that) as follows.
HT-aware passive load-balancing:
The IRQ-driven balancing has to be per-physical-CPU, not per-logical-CPU. Otherwise, it might happen that one physical CPU runs two tasks while another physical CPU runs no task; the stock scheduler does not recognize this condition as "imbalance." To the scheduler, it appears as if the first two CPUs have 1-1 task running while the second two CPUs have 0-0 tasks running. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU.

"Active" load-balancing:
This is when a logical CPU goes idle and causes a physical CPU imbalance. This is a mechanism that simply does not exist in the stock 1:1 scheduler. The imbalance caused by an idle CPU can be solved via the normal load-balancer. In the case of HT, the situation is special because the source physical CPU might have just two tasks running, both runnable. This is a situation that the stock load-balancer is unable to handle, because running tasks are hard to migrate away. This migration is essential -- otherwise a physical CPU can get stuck running two tasks while another physical CPU stays idle.

HT-aware task pickup:
When the scheduler picks a new task, it should prefer all tasks that share the same physical CPU before trying to pull in tasks from other CPUs. The stock scheduler only picks tasks that were scheduled to that particular logical CPU.

HT-aware affinity:
Tasks should attempt to "stick" to physical CPUs, not logical CPUs.

HT-aware wakeup:
The stock scheduler only knows about the "current" CPU, it does not know about any sibling. On HT, if a thread is woken up on a logical CPU that is already executing a task, and if a sibling CPU is idle, then the sibling CPU has to be woken up and has to execute the newly woken-up task immediately.

At this writing, Molnar has provided a patch to stock kernel 2.5.32 implementing all the above changes by introducing the concept of a shared runqueue: multiple CPUs can share the same runqueue. A shared, per-physical-CPU runqueue fulfills all of the HT-scheduling needs listed above. Obviously this complicates scheduling and load-balancing, and the effects on the SMP and uniprocessor scheduler are still unknown.
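
A minimal sketch of the shared-runqueue concept appears below. The types and fields are illustrative assumptions, not Molnar's actual patch: sibling logical CPUs point at a single runqueue owned by their physical package, so load balancing and idle detection naturally operate per physical CPU.

/* Illustration of the shared-runqueue idea (not the actual 2.5.32 patch). */
struct task;                        /* opaque here; stands in for the real type */

struct runqueue {
    struct task *head;              /* tasks runnable on this physical CPU */
    int nr_running;
};

struct cpu {
    int logical_id;
    int physical_id;
    struct runqueue *rq;            /* shared with the sibling logical CPU */
};

/* Both siblings of one physical package point at a single runqueue, so load
 * balancing and idle detection naturally happen per physical CPU. */
static void attach_sibling(struct cpu *primary, struct cpu *sibling,
                           struct runqueue *rq)
{
    primary->rq = rq;
    sibling->rq = rq;
    sibling->physical_id = primary->physical_id;
}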

The change in Linux kernel 2.5.32 was designed to affect Xeon systems with more than two CPUs, especially in the load-balancing and thread affinity arenas. Due to hardware resource constraints, we were only able to measure its effects in our one-CPU test environment. Using the same testing process employed in 2.4.19, we ran the three workloads, chat, dbench, and tbench, on 2.5.32. For chat, HT could bring as much as a 60% speed-up in the case of 40 chat rooms. The overall improvement was about 45%. For dbench, 27% was the high speed-up mark, with the overall improvement about 12%. For tbench, the overall improvement was about 35%.

Table 7. Effects of Hyper-Threading on Linux kernel 2.5.32
chat workload
Number of chat rooms 2532s-noht 2532s-ht Speed-up
20 137,792 207,788 51%
30 138,832 195,765 41%
40 144,454 231,509 47%
50 137,745 191,834 39%
Geometric Mean 139,678 202,034 45%
dbench workload
Number of clients 2532s-noht 2532s-ht Speed-up
20 142.02 180.87 27%
30 129.63 141.19 9%
60 84.76 86.02 1%
90 67.89 70.37 4%
120 57.44 70.59 23%
Geometric Mean 90.54 101.76 12%
tbench workload
Number of clients 2532s-noht 2532s-ht Speed-up
20 60.28 82.23 36%
30 60.12 81.72 36%
60 59.73 81.2 36%
90 59.71 80.79 35%
120 59.73 79.45 33%
Geometric Mean 59.91 81.07 35%
Note: chat data is the number of messages sent by the client/sec; dbench and tbench data are in MB/sec.


Conclusion

Intel Xeon Hyper-Threading definitely has a positive impact on the Linux kernel and on multithreaded applications. The speed-up from Hyper-Threading was as high as 30% with stock kernel 2.4.19, and as high as 51% with kernel 2.5.32, thanks to drastic changes in the scheduler's run queue support and Hyper-Threading awareness.
ACKNOWLEDGMENTS:
The author would like to thank Intel's Sunil Saxena for invaluable information gleaned at the LinuxWorld Conference session "Performance tuning for threaded applications -- with a look at Hyper-Threading," San Francisco, August 2002.



Resources

You can download the chat benchmark from the Linux Benchmark Suite Homepage.

The README file from dbench is courtesy of SAMBA.

More information on LMbench can be found at the LMbench home page.

The home of the Ziff-Davis NetBench benchmarking test gives more details of their test suite.

The Linux elevator algorithm is discussed in the November 23, 2000 edition of the Linux Weekly News Kernel Development section.

An August 2002 note on Hyper-Threading posted by Ingo Molnar to the kernel list is reprinted in the Linux Weekly News.

Another August 2002 LWN article also discusses the scheduler and Hyper-Threading (among other things).

Learn about IBM's developer contributions to Linux at the IBM Linux Technology Center.

Find more resources for Linux developers in the developerWorks Linux zone.



About the author


Duc Vianney works in operating system performance evaluation and measurement in computer architectures and Java. He is with the Linux Kernel Performance Group at the IBM Linux Technology Center. Duc has written several articles on Java performance for IBM developerWorks and PartnerWorld. You can contact Duc at dvianney@us.ibm.com.
