Libraries

The Fastcode library consists of three or four units per challenge. Each unit contains the winning functions for one challenge. Every challenge spans at least these three units: direct calling, CPUID-based function selection and conditional compilation. If the challenge function exists in the RTL or VCL, there is also a library unit that supports the patching principle.

Presently the challenges target the following architectures; the last three entries are generic targets rather than specific processors:
Pentium 4 Prescott
Pentium 4 Northwood
Pentium M Dothan
Pentium M Banias
AMD 64
Athlon XP
Blended
RTL Replacement
Pascal

Direct calling

All functions can be called directly via these function interfaces:

function XXXFastcodeP4P;
function XXXFastcodeP4N;
function XXXFastcodePMD;
function XXXFastcodePMB;
function XXXFastcodeAMD64;
function XXXFastcodeXP;
function XXXFastcodeBlended;
function XXXFastcodeRTL;
function XXXFastcodePas;

Sometimes the same function is reachable through two or more interfaces, because it was the fastest on more than one target.
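A minimal Delphi Pascal sketch of direct calling. The interfaces above only show the naming scheme, so the signature `(A, B: Integer): Integer` and the max-of-two bodies are assumptions for illustration; `SharedMaxImpl` is a hypothetical name. It also shows one routine sitting behind two interfaces, here by simple delegation, which is just one possible way to do it.

```pascal
program DirectCallingSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical winner routine, assumed to be fastest on both P4P and P4N.
function SharedMaxImpl(A, B: Integer): Integer;
begin
  if A > B then
    Result := A
  else
    Result := B;
end;

// Two direct-calling interfaces backed by the same implementation.
function XXXFastcodeP4P(A, B: Integer): Integer;
begin
  Result := SharedMaxImpl(A, B);
end;

function XXXFastcodeP4N(A, B: Integer): Integer;
begin
  Result := SharedMaxImpl(A, B);
end;

// A separate portable implementation, in the role of XXXFastcodePas.
function XXXFastcodePas(A, B: Integer): Integer;
begin
  Result := B;
  if A > B then
    Result := A;
end;

begin
  // The caller names the target explicitly; nothing is chosen at run time.
  WriteLn(XXXFastcodeP4P(3, 7));   // prints 7
  WriteLn(XXXFastcodeP4N(-2, -9)); // prints -2
  WriteLn(XXXFastcodePas(5, 5));   // prints 5
end.
```

Direct calling has no dispatch overhead, but the choice of target is fixed in the calling source code.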

Conditional compilation

Conditional compilation is also supported. One of the nine functions listed above is compiled in as the implementation behind this function interface:

function XXXFastcode;

The conditional symbols are named:
P4N, P4P, PMD, PMB, ATHLONXP, AMD64, BLENDED, PASCAL, RTLREPLACEMENT

Only one of these may be defined at a time.
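A minimal sketch of how the single `XXXFastcode` interface could be bound to one implementation at compile time. The signature, the bodies and the fallback used when no symbol is defined are assumptions; `{$IF Defined(...)}`, `{$ELSEIF}` and `{$IFEND}` are standard Delphi conditional directives that Free Pascal also accepts. Only the BLENDED and PASCAL branches are shown.

```pascal
program ConditionalCompilationSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Exactly one target symbol should be defined, normally in the project
// options or on the command line (for example dcc32 -DBLENDED). It is
// defined here only so that the sketch compiles on its own.
{$DEFINE BLENDED}

// Hypothetical implementations with an assumed signature.
function XXXFastcodePas(A, B: Integer): Integer;
begin
  if A > B then
    Result := A
  else
    Result := B;
end;

function XXXFastcodeBlended(A, B: Integer): Integer;
begin
  // Stand-in body; a real Blended winner would usually be assembler.
  Result := B;
  if A > B then
    Result := A;
end;

// The single public interface. Only one branch survives compilation; a
// full unit would have one branch per symbol (P4N, P4P, PMD, ...).
function XXXFastcode(A, B: Integer): Integer;
begin
{$IF Defined(BLENDED)}
  Result := XXXFastcodeBlended(A, B);
{$ELSEIF Defined(PASCAL)}
  Result := XXXFastcodePas(A, B);
{$ELSE}
  // Assumption: fall back to the Pascal version if no symbol is defined.
  Result := XXXFastcodePas(A, B);
{$IFEND}
end;

begin
  WriteLn(XXXFastcode(4, 9)); // prints 9
end.
```

Because the branch is resolved by the compiler, the binary contains a fixed choice and needs no CPU detection at run time.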

CPU ID based function selection

When the library is initialized, a function pointer is set to point at the fastest function for the detected processor. Call through the function pointer

function XXXFastcodeCPUID

and the call is redirected to one of these functions:

function XXXFastcodeP4N;
function XXXFastcodeP4P;
function XXXFastcodePMD;
function XXXFastcodePMB;
function XXXFastcodeAthlonXP;
function XXXFastcodeAMD64;

If the processor is none of these, this function is called:

function XXXFastcodeBlended;

but only if the processor supports the IA32 extensions and MMX. The IA32 extensions consist of instructions such as CMOVcc, FCMOVcc and FCOMI.

Otherwise, this function is called:

function XXXFastcodeRTLReplacement
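The selection rules above can be sketched in Delphi Pascal as a procedural variable plus a function that picks its target. Processor identification is left out: `TCpuTarget`, `SelectXXX` and the two Boolean flags are hypothetical stand-ins for whatever the library's CPU ID detection unit actually provides, and the stand-in bodies simply return their own names so that the dispatch is visible.

```pascal
program CpuidSelectionSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
  // Hypothetical summary of what CPU detection found.
  TCpuTarget = (ctP4N, ctP4P, ctPMD, ctPMB, ctAthlonXP, ctAMD64, ctOther);

  // Assumed signature shared by every implementation.
  TXXXFunction = function: string;

// Stand-in implementations that just report which version ran.
function XXXFastcodeP4N: string; begin Result := 'P4N'; end;
function XXXFastcodeP4P: string; begin Result := 'P4P'; end;
function XXXFastcodePMD: string; begin Result := 'PMD'; end;
function XXXFastcodePMB: string; begin Result := 'PMB'; end;
function XXXFastcodeAthlonXP: string; begin Result := 'AthlonXP'; end;
function XXXFastcodeAMD64: string; begin Result := 'AMD64'; end;
function XXXFastcodeBlended: string; begin Result := 'Blended'; end;
function XXXFastcodeRTLReplacement: string; begin Result := 'RTLReplacement'; end;

var
  // The pointer that callers go through.
  XXXFastcodeCPUID: TXXXFunction;

// A dedicated version if the processor is recognized, otherwise Blended
// when the IA32 extensions and MMX are present, otherwise the RTL
// replacement. Assigning a routine name to a procedural-type Result
// stores its address; it does not call it.
function SelectXXX(Target: TCpuTarget;
  HasIA32Extensions, HasMMX: Boolean): TXXXFunction;
begin
  case Target of
    ctP4N: Result := XXXFastcodeP4N;
    ctP4P: Result := XXXFastcodeP4P;
    ctPMD: Result := XXXFastcodePMD;
    ctPMB: Result := XXXFastcodePMB;
    ctAthlonXP: Result := XXXFastcodeAthlonXP;
    ctAMD64: Result := XXXFastcodeAMD64;
  else
    if HasIA32Extensions and HasMMX then
      Result := XXXFastcodeBlended
    else
      Result := XXXFastcodeRTLReplacement;
  end;
end;

begin
  // The library does this once, at unit initialization, using the
  // detected processor; the arguments here are hard-coded.
  XXXFastcodeCPUID := SelectXXX(ctOther, True, True);
  WriteLn(XXXFastcodeCPUID()); // prints Blended
end.
```

Every call through `XXXFastcodeCPUID` is an indirect call; that small cost buys a choice made at run time instead of at compile time.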


The Patching Principle

Each unit contains the winning functions, just like the direct calling unit, but it also contains patching code. This code iterates through the executable image and patches all calls to the RTL function so that they are redirected to the Fastcode versions.

Patching, in its simplest form, is relatively straightforward. It is just a matter of finding the address of the system function to be patched and inserting, at that address, a jump instruction to the replacement function. There are, however, a few important things to take into account:

  1. If the function being patched is less than 5 bytes in size (the length of a near JMP with a 32-bit displacement), a jump cannot be inserted without possibly overwriting another system function.
  2. If the system function to be patched is already small or fast, then unless packages are being used, inserting a jump to a replacement function is very unlikely to produce any performance gain (from the calling program's viewpoint, every call now passes through an extra jump before reaching the new function). For this reason Fastcode functions such as MaxInt, Round, etc. are unlikely to see any improvement from patching.
  3. When packages are being used, the inserted jump is simply a replacement for an existing jump.
  4. A few API calls (VirtualProtect, FlushInstructionCache) are needed while performing the actual patching.
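The basic jump insertion can be sketched in Delphi Pascal as below. `MakeJump` and `PatchWithJump` are hypothetical names; `VirtualProtect`, `FlushInstructionCache`, `GetCurrentProcess` and `PAGE_EXECUTE_READWRITE` are the real Win32 APIs mentioned in the list. The encoding part is portable; the patching part compiles only on Windows. `NativeInt` assumes Delphi 2009 or later, or Free Pascal. This is a sketch of the simplest form only, not the Fastcode patching code.

```pascal
program JumpPatchSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

{$IFDEF MSWINDOWS}
uses
  Windows;
{$ENDIF}

type
  // A near JMP with a 32-bit displacement: opcode $E9 followed by rel32.
  // This 5-byte size is why routines shorter than 5 bytes cannot be patched.
  TJumpInstruction = packed record
    OpCode: Byte;
    Offset: Integer; // relative to the address just after the JMP
  end;

// Encodes a JMP placed at Source that lands on Target. The displacement
// is a signed 32-bit value, so on 64-bit targets it only reaches +/-2 GB.
function MakeJump(Source, Target: NativeInt): TJumpInstruction;
begin
  Result.OpCode := $E9;
  Result.Offset := Integer(Target - (Source + SizeOf(TJumpInstruction)));
end;

{$IFDEF MSWINDOWS}
// Overwrites the first 5 bytes of OldRoutine with a jump to NewRoutine.
// The caller must make sure OldRoutine is at least 5 bytes long.
function PatchWithJump(OldRoutine, NewRoutine: Pointer): Boolean;
var
  Jump: TJumpInstruction;
  OldProtect: DWORD;
begin
  Jump := MakeJump(NativeInt(OldRoutine), NativeInt(NewRoutine));
  // Code pages are normally not writable, so change the protection first.
  Result := VirtualProtect(OldRoutine, SizeOf(Jump), PAGE_EXECUTE_READWRITE,
    OldProtect);
  if Result then
  begin
    Move(Jump, OldRoutine^, SizeOf(Jump));
    // Restore the original protection and discard stale cached code.
    VirtualProtect(OldRoutine, SizeOf(Jump), OldProtect, OldProtect);
    FlushInstructionCache(GetCurrentProcess, OldRoutine, SizeOf(Jump));
  end;
end;
{$ENDIF}

begin
  WriteLn(MakeJump($1000, $2000).Offset); // prints 4091 ($2000 - $1005)
end.
```

In a patching unit the two pointers would be the addresses of the RTL routine and its Fastcode replacement. When runtime packages are used, the address obtained for an RTL routine may be that of an import jump rather than the routine body, which is the situation point 3 describes.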

The unofficial FastMove unit by John O’Harrow uses patching to select the IA32, MMX or SSE replacement, but with an additional performance tweak: when not using packages, rather than just inserting a jump at the original System.Move location, John actually patches 58 bytes (of the original 64 bytes used by Move). Within these 58 bytes, he can handle all small moves (fewer than 36 bytes) more efficiently.
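A conceptual Pascal sketch of the size-based split, assuming only what the text states (moves below 36 bytes take a dedicated path). It is not the FastMove code, which works in assembler within the 58 patched bytes; `MoveSketch` and `ShiftRightByOne` are hypothetical names, `System.Move` stands in for the large-move routine, and `NativeUInt` assumes Delphi 2009 or later, or Free Pascal.

```pascal
program SmallMoveSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

const
  // Moves of fewer bytes than this take the small-move path.
  SmallMoveLimit = 36;

type
  TSmallBuffer = array[0..SmallMoveLimit - 1] of Byte;
  PSmallBuffer = ^TSmallBuffer;

// Hypothetical Move replacement showing only the dispatch shape.
procedure MoveSketch(const Source; var Dest; Count: Integer);
var
  S, D: PSmallBuffer;
  I: Integer;
begin
  if Count <= 0 then
    Exit;
  if Count >= SmallMoveLimit then
  begin
    System.Move(Source, Dest, Count); // stands in for the IA32/MMX/SSE path
    Exit;
  end;
  S := PSmallBuffer(@Source);
  D := PSmallBuffer(@Dest);
  // Like Move, copy in the direction that is safe for overlapping buffers.
  if NativeUInt(D) > NativeUInt(S) then
  begin
    for I := Count - 1 downto 0 do
      D^[I] := S^[I];
  end
  else
  begin
    for I := 0 to Count - 1 do
      D^[I] := S^[I];
  end;
end;

// Shifts the bytes of S one position to the right (an overlapping move).
function ShiftRightByOne(const S: AnsiString): AnsiString;
begin
  Result := S;
  UniqueString(Result);
  if Length(Result) > 1 then
    MoveSketch(Result[1], Result[2], Length(Result) - 1);
end;

begin
  WriteLn(ShiftRightByOne('abcdef')); // prints aabcde
end.
```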

How to modify and recompile an RTL/VCL unit

Directly inserting the RTL replacement function into the Delphi/C++ Builder library is probably the best option.

Recompiling the RTL units (apart from SYSTEM.PAS) is also very straightforward:

  1. Edit the source code (\program files\borland\delphiX\source\RTL\sys directory).
  2. Ensure that MAKE.EXE (make utility) and DCC.EXE (command line compiler) are in the search path.
  3. In a DOS shell, go to the RTL directory (\program files\borland\delphiX\source\RTL) and type MAKE. This will create new DCU files in a subdirectory called LIB (you may need to create this directory).
  4. Copy the required DCU files from that subdirectory to the real LIB directory (\program files\borland\delphiX\LIB).
  5. Run or restart Delphi.

Patching SYSTEM.PAS can get more complicated. If you are just directly replacing a function in SYSTEM.PAS with another, no problems should occur. If, however, you are adding code to detect the CPU type and assign a function pointer to replace a system function, etc., then virtually all of the DCUs in the RTL (and in most cases also the VCL) will need to be replaced, because such changes typically alter the interface of the System unit and every unit compiled against the old interface must be recompiled.

How to build Libraries

Direct Calling

Template

ANSI StringReplace Library
ArcCos Library
ArcSin Library
Ceil Library
CharPos Library
CharPosEY Library
CompareMem Library
CompareStr
CompareText Library
FillChar Library
Floor Library Preliminary version 0.1
Int64Div Library version 1.0
IsPrime Library
LowerCase Library
MaxInt Library Preliminary version 0.1
MaxInt64
MaxFP Library Preliminary version 0.1
MinFP Library Preliminary version 0.1
MinInt Library Preliminary version 0.1
MinInt64
Move Library
PosEX Library
Pos Library
Power Library Preliminary version 0.1
RGB To BGR
Round Library Preliminary version 0.1
RoundToEX
StrCopy
StrComp Library
Trunch32

Conditional compilation

Template

CompareStr

CPU ID based function selection

Move Library

CPUID Detection Unit (used by all the library units in this section)

 

Unofficial Versions