BASM - Fastcode project::Documentation for Move BVTool

You are here: Home > Fastcode project > Documentation for Move BVTool

The Fastcode Project

Documentation for Move Benchmark & Validation Tool

Author: Dennis Christensen

Validation

Philosophy

The validation code must ensure that Move functions handle all possible moves correctly. The Move function has three parameters

procedure Move (const Source; var Dest; Count : Integer);

Source and dest are pointers, Count is an integer. Source and dest can point anywhere in the 32 bit Win 32 addressspace. There is no restriction on their relationship to each other. Count is 32 bit signed and can hold all integer values in the interval [-2G;2G]. Negative values are meaningless in the Move context. A Win32 application can allocate a memory block of all sizes up to 2G, and the max range of Count is sufficient. If Count was a Cardinal the test for zero or negative would translate into a test for Count bigger than 2G. Calling Move with parameter combinations through the full parameterspace could make a complete validation of Move. These 3 nested loops would do the job

for Source = 1 to 2G do
for Dest = 1 to 2G do
for Count = -2G to 2G do
Move

This code would run nearly forever and we need to do an incomplete validation by carefully selecting a subset of the parameter space. Looking at the function implementation under test can help us do this. As a minimum all code blocks selected by conditionals should be validated.A special case is Source=Dest. I would not consider it an error calling Move this way eventhough moving data to it self seems meaningless at first sigth, but we could actually implement a cache flush this way. However the original Move tests this and exit if Source equals Dest and so must we. Just exiting is bad error handling and raising an exception instead is recommended. The original Move does not test that count is above zero and therefore we do not have to do it either. I recommend however that we do it, thus making our Moves more robust than the RTL Move.

Overlapped Move

Create an array of size 2->Max, delete one or more elements and move the right or left "half" into place. Control that overlapped Moves from 1->Max/2 is done properly. Max is currently 254 byte. I consider this size far too small. Adding a validation on an array of Cardinal it is possible to have a huge array where all elements have a different value. The runtime will limit the upper array size. An array size of 5000 gives a runtime of a few minutes on a P4 1600A. This subvalidation validates all cases of alignment.

Array Copy

Having two static arrays of byte, one is source and the other is destination. Initialize source with values 1, 2, 3….256 and destination with zeroes. Call Move with count 1->Max (600000 or so). Control that destination equals source through all elements up to count. Control that destination is not changed from count to max. Source must not change either. This ensures that Move can copy data blocks of "all" sizes and do not change memory out of range.

The const

ARRAYSIZEMAX : Cardinal = 600000;

defines the max size of arrays. Be prepared to wait for a long time if you use the default value. On P4 1600A it takes nearly 2 hours.

This validation only validates the alignment case chosen by the compiler, which is 4, 8, 12, 16…..

Selecting the Function to Validate

Selecting the function to validate is done by naming it MoveFUT (Function Under Test).
MoveFUT is not called directly by validation code. Validation code calls MoveValidate, which is a wrapper around calls to MoveFUT. The RTL Move is called by defining the compiler directive MOVE.
The "MoveValidate" function adds another test. This test validates that registers edx, edi and esi is preserved. It is implemented like this

case TestNumber of
1 :
   begin
    asm
     mov EBXRegisterBefore, ebx
     mov EDIRegisterBefore, edi
     mov ESIRegisterBefore, esi
    end;
    Move(Source, Dest, Count);
    asm
     mov EBXRegisterAfter, ebx
     mov EDIRegisterAfter, edi
     mov ESIRegisterAfter, esi
    end;
   end;

if ((EBXRegisterAfter <> EBXRegisterBefore) or
(EDIRegisterAfter <> EDIRegisterBefore) or
(ESIRegisterAfter <> ESIRegisterBefore)) then
raise Exception.Create('EBX, EDI or ESI register not preserved');

The line

Move (Source, Dest, Count);

is compiled into

mov eax, edx
mov edx, ecx
mov ecx, [ebp+$08]
Call Move

If the compiler chooses to use one or more of the ebx, edi, esi registers to fill parameters in eax, edx and ecx prior to the call we will get a false error notification. Coding the function call in BASM would remedy the situation.

Benchmark

Philosophy

In general a benchmark must mimic real world use of Move. One way to go (probably the best) is to analyze many (hundreds) applications for their use of Move. The benchmark should be designed to include calls of Move with parameter value combinations as the applications. Creating a benchmark this way would be a very time consuming task and we must proceed in another way.
Our benchmark must include as much of the parameter space as possible. It must include overlapped moves with varying overlap, many different source and destination alignment combinations and many different sizes of moves. The benchmark must not focus in particular on any special moves.
It is very difficult to create a benchmark where all subbenchmarks contribute evenly, but we must carefully adjust subbenchmark values so that they contribute as equally as possible. If we sum up move speeds as MB/sec, big moves will contribute too much because they will bee considerably faster than small moves. This can be corrected by doing more small moves than big moves. The biggest moves will on the other hand be slow again because they will not benefit from caching.
Caching is a big issue. We do not have information on cache hitrates from real world applications. Because of this we must construct the benchmark to include moves with all possible hit rates from zero to a hundred percent. Moves with high hit rates will obtain larger MB/sec values than moves with low or zero hitrates and this must be counter weighed by including more low hit rate moves than high hitrate moves.

Subbenchmark1

Copy N elements from static source array to static destination array. Let source and destination bee 1, 2, 4, 8, 16 byte aligned. N grows exponentially from 2 bytes to 2 Mbytes. This way we have more small moves than big ones.
Before doing these calls SourceArray and DestinationArray are 16 byte aligned.

Move(SourceArray[0], DestinationArray[0], ArraySize);//16, 16 byte aligned move
Move(SourceArray[0], DestinationArray[1], ArraySize);//16, 1 byte aligned move
Move(SourceArray[0], DestinationArray[2], ArraySize);//16, 2 byte aligned move
Move(SourceArray[0], DestinationArray[4], ArraySize);//16, 4 byte aligned move
Move(SourceArray[0], DestinationArray[8], ArraySize);//16, 8 byte aligned move

Move(SourceArray[1], DestinationArray[0], ArraySize);//1, 16 byte aligned move
Move(SourceArray[1], DestinationArray[1], ArraySize);//1, 1 byte aligned move
Move(SourceArray[1], DestinationArray[2], ArraySize);//1, 2 byte aligned move
Move(SourceArray[1], DestinationArray[4], ArraySize);//1, 4 byte aligned move
Move(SourceArray[1], DestinationArray[8], ArraySize);//1, 8 byte aligned move

Move(SourceArray[2], DestinationArray[0], ArraySize);//1, 16 byte aligned move
Move(SourceArray[2], DestinationArray[1], ArraySize);//1, 1 byte aligned move
Move(SourceArray[2], DestinationArray[2], ArraySize);//1, 2 byte aligned move
Move(SourceArray[2], DestinationArray[4], ArraySize);//1, 4 byte aligned move
Move(SourceArray[2], DestinationArray[8], ArraySize);//1, 8 byte aligned move

……..

Move(SourceArray[8], DestinationArray[0], ArraySize);//8, 16 byte aligned move
Move(SourceArray[8], DestinationArray[1], ArraySize);//8, 1 byte aligned move
Move(SourceArray[8], DestinationArray[2], ArraySize);//8, 2 byte aligned move
Move(SourceArray[8], DestinationArray[4], ArraySize);//8, 4 byte aligned move
Move(SourceArray[8], DestinationArray[8], ArraySize);//8, 8 byte aligned move

all 25 combinations of source and destination alignment are benchmarked.

Subbenchmark2

This benchmark is like the first one but includes cache flushing between moves and excludes different alignment combinations. Alignment is 4 byte. The cache hierarchy is flushed/invalidated by running this function

procedure InvalidateCaches(Step : Cardinal);
var
PolluteArray : array[1..512000] of Byte;
I : Integer;
const
ArraySize : Integer = 512000;//512K

begin
I := 1;
repeat
PolluteArray[I] := 6;
Inc(I, Step*32);
until(I >= ArraySize);
end;

It declares an array and fills it with garbage forcing memory destinations being loaded into L1 and L2. This evicts data used in the benchmark. The array size must be bigger than the biggest cache of target processors. This is currently 512 KB.
The function has one parameter that controls the degree of flushing. If Step = 1 (Step is a bad name) every 32'Th byte is evicted which includes all cachelines on P2 and P3. P4 has 64 byte L1 cache lines, and 128 byte L2 cache lines. A Step value of 2 will flush half of the cache on P2/P3 but all on P4. Increasing Step will decrease the number of cache lines evicted. The only Step value that is processor neutral is 1.

The calls to

InvalidateCaches(CACHEFLUSHADJUST);

are timed and this time is subtracted from the overall runtime. Calling

QueryPerformanceCounter(lpPerformanceCount);

before and after " InvalidateCaches" has some overhead and the timing of Move is not exact. But it is the best we can do and overhead is equal for all functions.

CACHEFLUSHADJUST = 6 is used.

Subbenchmark3

This benchmark is constructed to run on non cached moves. Letting two subbenchmarks evict each other's data from the cache hierarchy destroys caching.

Move(SourceArray1, DestinationArray1, ArraySize);
Move(SourceArray2, DestinationArray2, ARRAYSIZEMAX - ArraySize);

The benchmark is like the first one with the difference that one move run through different arraysizes from 1 to max and another goes in the opposite direction. This way most moves are RAM to RAM. ("Most" is a very imprecise term and must be tidied up).
The compiler controls source and destination alignment as nothing has been done to control it. The compiler 4 byte aligns stack frames and this will sometimes equal 8, 16 or 32 byte alignment by accident. Therefore this benchmark is not guarantied to be stable across compilations.
From version 4.0 alignment is forced to be on 4 byte boundaries. It is guaranteed that it never is 8, 16 or 32 byte.

Subbenchmark4

This subbenchmark benchmarks tiny moves size 1-50. These small moves will fit into 1, 2 or 3 L1 cachelines depending on processor type and alignment for source and destination blocks respectively. PIII has 32 byte L1 cachelines and moves that are 32 byte aligned will fit into 1 cacheline for sizes up to 32 byte. Moves of size 32 to 50 will fit into 2 cachelines. If data are misaligned the line split(s) will be placed according to the degree of misalignment. A 34 byte move misaligned 31 bytes will span 3 cachelines and split after byte 1 and byte 33.

The same history counts for P4 except for the fact that L1 cachelines are 64 byte and the number of lines split will be less.

For the benchmark to cover all cases it runs on size 1 to 50 moves that are 64 byte aligned. Misalignment in the range 0 to 63 byte is added to this alignment in turn.

The caches are never flushed.

Subbenchmark5

This benchmark is the overlapped move validation algorithms stripped for all validation code. Array sizes are 1000 byte.

SubBenchmark5 := BenchmarkOverlappedLeft + BenchmarkOverlappedRigth;

Total Benchmark

The total benchmark value is calculated by summing subbenchmarks that are scaled.

SUBBENCHMARKH1ADJUST : Double = X;
SUBBENCHMARKH2ADJUST : Double = Y;
SUBBENCHMARKH3ADJUST : Double = Z;
SUBBENCHMARKH4ADJUST : Double = V;
SUBBENCHMARKH5ADJUST : Double = W;

Benchmarking the RTL Move, and then calculating the scale factors such that all subbenchmark values are the same could somewhat arbitrarily choose these factors. The RTL move is not very fast in subbenchmark 4 (tiny moves) and the adjustment value would be too large.
It would perhaps bee more correct to find the maximum possible value for subbenchmarks on different processors and scale subbenchmarks to represent the percentage of maximum possible speed. This would however be a big task and less must suffice.
All subbenchmarks are scaled such that the fastest function in every subbenchmark will have a value of 1000 on P4 1600A. This is done using the example functions in version 4.0.

This way it is necessary to recalibrate the benchmark when a faster function pops up. This is no good and the values are frozen after a short start up phase.

Selecting the Function to Benchmark

Selecting which function to benchmark and validate is done via the ComboBox on the upper panel.

Adding a New Function

Adding a new function is done by entering a line in the FunctionSelectComboBox.Items property. The OnChange eventhandler calls a function named SelectFunction with parameter FunctionID. In this function there is a case. Enter a new index in this too.

Optimization

The Optimizebutton is one I have used while developing functions. It calls the benchmark repeatedly while changing the value of a global variable of type Cardinal. This variable can be used in a function to select code blocks according to the value of Count or whatever you figure out.

The best benchmark value and the variable value leading to this is printed on the GUI.