Lbench User Guide  v.3.7
------------------------

Parameters Dialog - adjustable parameter definitions
-----------------
parallel threads           number of parallel threads to run (1-10)
benchmark run time         duration of each benchmark test, in seconds
memory bench size          memory benchmark region sizes, in KB
fibonacci number           Nth Fibonacci number to compute
disk I/O file size         total file size for the disk I/O benchmark, in MB
disk I/O record size       transfer (record) size for the disk I/O benchmark, in KB
memory test size           memory test region size


Menu Buttons
------------
bench    popup menu to select and run a benchmark
clear    erase the outputs in the main window
kill     stop any benchmark in progress
parms    start a dialog to edit the parameters
quit     exit the program
help     show the user guide (this document)


Bench Menu
----------
all                     do all except global lock, rpm, and mem test
cpu                     CPU performance, various math functions
mem speed               memory performance, cache and main memory
func call               call/return rate for minimal function
matrix math             matrix math calculation rate
smp1                    thread switch rate using shared mutex lock
smp2                    thread creation/termination rate
smp3                    process creation/termination rate
global lock             thread switch rate using global resource lock
fibo                    Fibonacci compute time (recursive method)
whetstone               Whetstone MIPS (double precision)
linpack                 Linpack Mflops (double precision)
disk                    disk I/O rates, sequential/random, by block size
rpm                     CPU temperature and thermal throttling test
mem test                memory test and burn-in
   
All of the above benchmarks can be run with 1-10 threads in parallel,
as determined by the 'parallel threads' parameter. The linpack
benchmark is limited to one thread.


Benchmark Descriptions
----------------------

ALL
The following benchmarks are run in sequence: 
cpu, mem speed, func call, matrix math, smp1, smp2, smp3, 
fibo, whetstone, linpack, disk.

CPU
The MOPS rate (millions of operations per second) is calculated for 
several different mixes of arithmetic instructions and engineering
functions. Simple loops are used. Compiler optimization is disabled 
to ensure that the actual calculations are the same as the apparent 
calculations in the source code. Total MOPS per thread are reported.

MEM SPEED
The memory speed is measured and reported for the five region sizes
defined by the 'memory bench size' parameter (in KB). The smallest
should be much smaller than the processor cache, and the largest much
larger (10x or more), so that main memory performance is measured
rather than cache performance. The benchmark loop itself is small and
runs entirely in the cache.
Performance is measured for three types of memory access: 
   * block moves using the C-lib memmove() function
   * int-32 (4 byte) get/put moves for sequential locations
   * int-32 (4 byte) get/put moves for random locations
A pre-generated set of random locations is used: this overhead is 
not included in the measurement.

FUNC CALL
The time required to call and return from a minimal function is
measured. This is output as a call rate in millions per second.

MATRIX MATH
The time to perform an add + multiply on a 1000 x 1000 matrix
(64 bit floating point) is reported as iterations per second.

SMP1
All threads contend for the same mutex lock, increment a private
counter when the lock is granted, and release the lock. The rate that 
each thread can increment its counter is reported along with the total 
rate. With 1 thread the rate is highest, since there is no contention 
for the lock. With 2 or more contending threads, the benchmark measures 
how fast the operating system can pass control to a waiting thread when 
a locking thread releases the lock. Since the threads are doing almost 
no work, the OS overhead for lock management is being measured.

SMP2
All threads are started in parallel. Each thread immediately exits. 
As soon as the main program detects that a thread has exited, another 
thread is started to take its place, so that the number of running 
threads is sustained at the target level. With 2 or more threads, the 
benchmark measures how fast threads can be started and completed when 
they do no work, effectively the OS overhead for threads.

SMP3
Each thread does the following: start a sub-process using the 'system'
command, wait for its completion, loop. The sub-process does nothing 
but exit. Thus the overhead for sub-process creation and completion is 
being measured. The total sub-process creation rate for all threads is 
reported.

GLOBAL LOCK
This is similar to the SMP1 benchmark, except that the mutex is
replaced by a global resource lock: a shared file lock, a system-wide
locking mechanism that can be used across multiple jobs, processes,
and threads.

FIBO
The Nth Fibonacci number is computed using the recursive method. This
primarily measures the speed of calling a trivial function with two 
input arguments and a returned value. The compiled C-code overhead for 
function calls is being measured (stack allocation, argument passing, 
cleanup). 

WHETSTONE
This is the classic Whetstone benchmark for double precision float.
The code was adapted for the GCC compiler and made thread-safe. Some
global parameters were moved into stack storage and became arguments
for the functions needing them. The result is that the measured MIPS 
rate is about 11% less than before.

LINPACK
This is the classic Linpack benchmark for double precision float.
The run time is fixed and the 'run time' parameter has no effect.
The run time is about 10 seconds for a 3 GHz CPU. Only one thread is
used.

DISK
A file is written using disk writes with a defined block size and 
defined total file size. The same file is read back and then deleted. 
I/O speeds are reported in ops/sec and MB/sec for sequential write, 
sequential read, and random read. These are measured using the four 
I/O block sizes defined by the parameter 'disk I/O record size' (in 
KB) and for a total file size defined by the parameter 'disk I/O file 
size' (in MB). If multiple threads are run, seek contention will reduce 
the overall throughput. Direct I/O is used to avoid OS memory caching, 
so that true disk speed is being measured, even if the file size is 
fairly small. The parameter 'disk I/O file' specifies the scratch file 
(default: /tmp/lbench-scratch). Change this name to use a disk other 
than the one where /tmp resides. Each thread adds a thread number 0-9 
to the file name (e.g. lbench-scratch-0).

RPM
The time to run a mix of arithmetic operations is measured repeatedly.
This time is continuously reported as a value scaled to 100% = the 
original time = full processor speed. Run this function with a thread 
count matching the processor 'core' count, so that the CPU is fully 
loaded. The reported data can be used to monitor processor overheating
and throttling (automatic CPU clock slowdown to prevent overheating), 
and to check if system cooling is adequate to sustain a full processor 
load over an extended period. If the reported values fall below 90% 
for more than a few samples, then the CPU is likely being throttled to 
prevent overheating. The temperature of each processor core is read 
once per cycle, and the maximum and minimum values are reported. 
NOTE: The initial speed measurement is made with one thread, and may be 
inflated by the "turbo boost" feature which increases the clock speed 
when only one processor is running. If this happens, all the values 
reported later will be lower. If the CPU overheats, these values will 
drop further.

MEM TEST
A block of memory is allocated with size set by the 'memory test size'
parameter. Random values are written, read back, and checked. The 
memory is then released and the process repeats indefinitely. If the 
block size is large enough, swapping will occur and the test will run 
very slowly. Run this for hours as a burn-in test for new memory, or 
if you suspect a problem.


