You are here: Home > Fastcode project > Documentation for Move BVTool
Documentation for Move Benchmark & Validation Tool
Author: Dennis ChristensenValidation
Philosophy
The validation code must ensure that Move functions handle all possible moves correctly. The Move function has three parameters
procedure Move (const Source; var Dest; Count : Integer);
Source and dest are pointers, Count is an integer. Source and dest can point anywhere in the 32 bit Win 32 addressspace. There is no restriction on their relationship to each other. Count is 32 bit signed and can hold all integer values in the interval [-2G;2G]. Negative values are meaningless in the Move context. A Win32 application can allocate a memory block of all sizes up to 2G, and the max range of Count is sufficient. If Count was a Cardinal the test for zero or negative would translate into a test for Count bigger than 2G. Calling Move with parameter combinations through the full parameterspace could make a complete validation of Move. These 3 nested loops would do the job
for Source = 1 to 2G dofor Dest = 1 to 2G do
for Count =
Move This code would run nearly forever and we need to do an incomplete validation by carefully selecting a subset of the parameter space. Looking at the function implementation under test can help us do this. As a minimum all code blocks selected by conditionals should be validated.A special case is Source=Dest. I would not consider it an error calling Move this way eventhough moving data to it self seems meaningless at first sigth, but we could actually implement a cache flush this way. However the original Move tests this and exit if Source equals Dest and so must we. Just exiting is bad error handling and raising an exception instead is recommended. The original Move does not test that count is above zero and therefore we do not have to do it either. I recommend however that we do it, thus making our Moves more robust than the RTL Move.
Overlapped Move
Create an array of size 2->Max, delete one or more elements and move the right or left "half" into place. Control that overlapped Moves from 1->Max/2 is done properly. Max is currently 254 byte. I consider this size far too small. Adding a validation on an array of Cardinal it is possible to have a huge array where all elements have a different value. The runtime will limit the upper array size. An array size of 5000 gives a runtime of a few minutes on a P4 1600A. This subvalidation validates all cases of alignment.Array Copy
Having two static arrays of byte, one is source and the other is destination. Initialize source with values 1, 2, 3….256 and destination with zeroes. Call Move with count 1->Max (600000 or so). Control that destination equals source through all elements up to count. Control that destination is not changed from count to max. Source must not change either. This ensures that Move can copy data blocks of "all" sizes and do not change memory out of range. The const ARRAYSIZEMAX : Cardinal = 600000; defines the max size of arrays. Be prepared to wait for a long time if you use the default value. On P4 1600A it takes nearly 2 hours. This validation only validates the alignment case chosen by the compiler, which is 4, 8, 12, 16…..Selecting the Function to Validate
Selecting the function to validate is done by naming it MoveFUT (Function Under Test).MoveFUT is not called directly by validation code. Validation code calls MoveValidate, which is a wrapper around calls to MoveFUT. The RTL Move is called by defining the compiler directive MOVE.
The "MoveValidate" function adds another test. This test validates that registers edx, edi and esi is preserved. It is implemented like this case TestNumber of
1 :
begin
asm
mov EBXRegisterBefore, ebx
mov EDIRegisterBefore, edi
mov ESIRegisterBefore, esi
end;
Move(Source, Dest, Count);
asm
mov EBXRegisterAfter, ebx
mov EDIRegisterAfter, edi
mov ESIRegisterAfter, esi
end;
end;
if ((EBXRegisterAfter <> EBXRegisterBefore) or
(EDIRegisterAfter
<> EDIRegisterBefore) or
(ESIRegisterAfter
<> ESIRegisterBefore)) then
raise Exception.Create('EBX, EDI or ESI register not preserved');
The line
Move (Source, Dest, Count);
is compiled into
mov eax, edx
mov edx, ecx
mov ecx, [ebp+$08]
Call
Move
If the compiler chooses to use one or more of the ebx, edi, esi registers to fill parameters in eax, edx and ecx prior to the call we will get a false error notification. Coding the function call in BASM would remedy the situation.
Benchmark
Philosophy
In general a benchmark must mimic real world use of Move. One
way to go (probably the best) is to analyze many (hundreds) applications for
their use of Move. The benchmark should be designed to include calls of Move
with parameter value combinations as the applications. Creating a benchmark this
way would be a very time consuming task and we must proceed in another
way.
Our benchmark must include as much of the parameter space as possible.
It must include overlapped moves with varying overlap, many different source and
destination alignment combinations and many different sizes of moves. The
benchmark must not focus in particular on any special moves.
It is very
difficult to create a benchmark where all subbenchmarks contribute evenly, but
we must carefully adjust subbenchmark values so that they contribute as equally
as possible. If we sum up move speeds as MB/sec, big moves will contribute too
much because they will bee considerably faster than small moves. This can be
corrected by doing more small moves than big moves. The biggest moves will on
the other hand be slow again because they will not benefit from
caching.
Caching is a big issue. We do not have information on cache hitrates
from real world applications. Because of this we must construct the benchmark to
include moves with all possible hit rates from zero to a hundred percent. Moves
with high hit rates will obtain larger MB/sec values than moves with low or zero
hitrates and this must be counter weighed by including more low hit rate moves
than high hitrate moves.
Subbenchmark1
Copy N elements from static source array to static
destination array. Let source and destination bee 1, 2, 4, 8, 16 byte aligned. N
grows exponentially from 2 bytes to 2 Mbytes. This way we have more small moves
than big ones.
Before doing these calls SourceArray and DestinationArray are
16 byte aligned.
Move(SourceArray[0], DestinationArray[0], ArraySize);//16, 16
byte aligned move
Move(SourceArray[0], DestinationArray[1], ArraySize);//16,
1 byte aligned move
Move(SourceArray[0], DestinationArray[2],
ArraySize);//16, 2 byte aligned move
Move(SourceArray[0],
DestinationArray[4], ArraySize);//16, 4 byte aligned
move
Move(SourceArray[0], DestinationArray[8], ArraySize);//16, 8 byte
aligned move
Move(SourceArray[1], DestinationArray[0], ArraySize);//1, 16
byte aligned move
Move(SourceArray[1], DestinationArray[1], ArraySize);//1, 1
byte aligned move
Move(SourceArray[1], DestinationArray[2], ArraySize);//1, 2
byte aligned move
Move(SourceArray[1], DestinationArray[4], ArraySize);//1, 4
byte aligned move
Move(SourceArray[1], DestinationArray[8], ArraySize);//1, 8
byte aligned move
Move(SourceArray[2], DestinationArray[0], ArraySize);//1, 16
byte aligned move
Move(SourceArray[2], DestinationArray[1], ArraySize);//1, 1
byte aligned move
Move(SourceArray[2], DestinationArray[2], ArraySize);//1, 2
byte aligned move
Move(SourceArray[2], DestinationArray[4], ArraySize);//1, 4
byte aligned move
Move(SourceArray[2], DestinationArray[8], ArraySize);//1, 8
byte aligned move
……..
……..
……..
Move(SourceArray[8], DestinationArray[0], ArraySize);//8, 16
byte aligned move
Move(SourceArray[8], DestinationArray[1], ArraySize);//8, 1
byte aligned move
Move(SourceArray[8], DestinationArray[2], ArraySize);//8, 2
byte aligned move
Move(SourceArray[8], DestinationArray[4], ArraySize);//8, 4
byte aligned move
Move(SourceArray[8], DestinationArray[8], ArraySize);//8, 8
byte aligned move
all 25 combinations of source and destination alignment are benchmarked.
Subbenchmark2
This benchmark is like the first one but includes cache flushing between moves and excludes different alignment combinations. Alignment is 4 byte. The cache hierarchy is flushed/invalidated by running this function
procedure InvalidateCaches(Step : Cardinal);
var
PolluteArray : array[1..512000] of
Byte;
I :
Integer;
const
ArraySize :
Integer = 512000;//512K
begin
I :=
1;
repeat
PolluteArray[I] := 6;
Inc(I, Step*32);
until(I >= ArraySize);
end;
It declares an array and fills it with garbage forcing memory
destinations being loaded into L1 and L2. This evicts data used in the
benchmark. The array size must be bigger than the biggest cache of target
processors. This is currently 512 KB.
The function has one parameter that
controls the degree of flushing. If Step = 1 (Step is a bad name) every 32'Th
byte is evicted which includes all cachelines on P2 and P3. P4 has 64 byte L1
cache lines, and 128 byte L2 cache lines. A Step value of 2 will flush half of
the cache on P2/P3 but all on P4. Increasing Step will decrease the number of
cache lines evicted. The only Step value that is processor neutral is 1.
The calls to
InvalidateCaches(CACHEFLUSHADJUST);
are timed and this time is subtracted from the overall runtime. Calling
QueryPerformanceCounter(lpPerformanceCount);
before and after " InvalidateCaches" has some overhead and the timing of Move is not exact. But it is the best we can do and overhead is equal for all functions.
CACHEFLUSHADJUST = 6 is used.
Subbenchmark3
This benchmark is constructed to run on non cached moves. Letting two subbenchmarks evict each other's data from the cache hierarchy destroys caching.
Move(SourceArray1, DestinationArray1,
ArraySize);
Move(SourceArray2, DestinationArray2, ARRAYSIZEMAX -
ArraySize);
The benchmark is like the first one with the difference that
one move run through different arraysizes from 1 to max and another goes in the
opposite direction. This way most moves are RAM to RAM. ("Most" is a very
imprecise term and must be tidied up).
The compiler controls source and
destination alignment as nothing has been done to control it. The compiler 4
byte aligns stack frames and this will sometimes equal 8, 16 or 32 byte
alignment by accident. Therefore this benchmark is not guarantied to be stable
across compilations.
From version 4.0 alignment is forced to be on 4 byte
boundaries. It is guaranteed that it never is 8, 16 or 32 byte.
Subbenchmark4
This subbenchmark benchmarks tiny moves size 1-50. These small moves will fit into 1, 2 or 3 L1 cachelines depending on processor type and alignment for source and destination blocks respectively. PIII has 32 byte L1 cachelines and moves that are 32 byte aligned will fit into 1 cacheline for sizes up to 32 byte. Moves of size 32 to 50 will fit into 2 cachelines. If data are misaligned the line split(s) will be placed according to the degree of misalignment. A 34 byte move misaligned 31 bytes will span 3 cachelines and split after byte 1 and byte 33.
The same history counts for P4 except for the fact that L1 cachelines are 64 byte and the number of lines split will be less.
For the benchmark to cover all cases it runs on size 1 to 50 moves that are 64 byte aligned. Misalignment in the range 0 to 63 byte is added to this alignment in turn.
The caches are never flushed.
Subbenchmark5
This benchmark is the overlapped move validation algorithms stripped for all validation code. Array sizes are 1000 byte.
SubBenchmark5 := BenchmarkOverlappedLeft + BenchmarkOverlappedRigth;
Total Benchmark
The total benchmark value is calculated by summing subbenchmarks that are scaled.
SUBBENCHMARKH1ADJUST : Double = X;
SUBBENCHMARKH2ADJUST :
Double = Y;
SUBBENCHMARKH3ADJUST : Double = Z;
SUBBENCHMARKH4ADJUST :
Double = V;
SUBBENCHMARKH5ADJUST : Double = W;
Benchmarking the RTL Move, and then calculating the scale
factors such that all subbenchmark values are the same could somewhat
arbitrarily choose these factors. The RTL move is not very fast in subbenchmark
4 (tiny moves) and the adjustment value would be too large.
It would perhaps
bee more correct to find the maximum possible value for subbenchmarks on
different processors and scale subbenchmarks to represent the percentage of
maximum possible speed. This would however be a big task and less must
suffice.
All subbenchmarks are scaled such that the fastest function in every
subbenchmark will have a value of 1000 on P4 1600A. This is done using the
example functions in version 4.0.
This way it is necessary to recalibrate the benchmark when a faster function pops up. This is no good and the values are frozen after a short start up phase.
Selecting the Function to Benchmark
Selecting which function to benchmark and validate is done via the ComboBox on the upper panel.
Adding a New Function
Adding a new function is done by entering a line in the
FunctionSelectComboBox.Items property. The OnChange eventhandler calls a
function named SelectFunction with parameter FunctionID. In this function there
is a case. Enter a new index in this
too.
Optimization
The Optimizebutton is one I have used while developing functions. It calls the benchmark repeatedly while changing the value of a global variable of type Cardinal. This variable can be used in a function to select code blocks according to the value of Count or whatever you figure out.
The best benchmark value and the variable value leading to this is printed on the GUI.