From - Fri Jan 31 09:12:01 1997
Path: nntp.farm.idt.net!news.idt.net!cam-news-hub1.bbnplanet.com!news.bbnplanet.com!cpk-news-hub1.bbnplanet.com!news.maxwell.syr.edu!news-feed.inet.tele.dk!news1.tele.dk!cph-1.news.DK.net!dkuug!dknet!usenet
From: Agner@login.dknet.dk (Agner Fog)
Newsgroups: sci.math,sci.math.num-analysis,sci.stat.math,comp.programming,comp.lang.c++,comp.lang.asm.x86
Subject: Excellent pseudo random number generator
Date: Fri, 31 Jan 1997 15:20:02 +0100
Organization: Customer at DKnet
Lines: 625
Distribution: inet
Message-ID: <S+f8y0SlSbyf092yn@login.dknet.dk>
Reply-To: agner@login.dknet.dk
NNTP-Posting-Host: login.dknet.dk
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Xref: nntp.farm.idt.net sci.math:11896 sci.math.num-analysis:2025 sci.stat.math:1533 comp.programming:6008 comp.lang.c++:25917 comp.lang.asm.x86:12838

[copyright (c) 1997 by Agner Fog, noncommercial use allowed]



I couldn't find a random number generator that was good enough for my
very high requirements, so I made one myself. The result exceeded all
my expectations, and therefore I am going to share my findings with
others and hear your opinions.

The contents of this long posting are:

1. Introduction
2. Algorithm
3. C++ source code
4. Optimized assembly language source code
5. Theoretical analysis and discussion
6. Request for feedback
7. Conclusion 


1. Introduction
---------------
I need a good pseudo random number generator for some very heavy
Monte Carlo simulations. My requirements are:

- Floating point output with a flat distribution in the interval
  between 0 and 1.
- Very high resolution
- Extremely high cycle length
- No correlation between consequtive numbers
- The sequence that follows after an extreme number must be perfectly
  random
- Fast execution

Unsatisfied with the codes I could find elsewhere I developed the code
that follows, which more than fulfills my requirements.


2. Algorithm
------------
My method is inspired by the widely used additive generator (ADDGEN)
which has the following simple algorithm:

X[n] = X[n-j] + X[n-k]

The X-values are binary numbers of b bits, where the overflow from
the addition is ignored. So to be exact, we have:

X[n] = (X[n-j] + X[n-k]) mod 2^b

This algorithm is easily implemented with a circular buffer of k b-bit
numbers.

The cycle length of the ADDGEN generator is far smaller than the size
of the circular buffer permits because there is no flow of information
from the high bits to the low bits, as explained in section 5. I have
therefore improved this algorithm by shuffling the bits.

The improved algorithm is:

Hi(X[n]) = (Lo(X[n-j]) + Lo(X[n-k])rotr r) mod 2^(b-1)

Lo(X[n]) = (Hi(X[n-j]) + Hi(X[n-k])) mod 2^(b-1)

Where Hi(X) means the upper b/2 bits of X, and Lo(X) means the lower
half of X. Y rotr r means the bits of Y rotated r places to the right:
00001111(bin) rotr 3 = 11100001(bin)

The choise of b is determined by the desired resolution and the
microprocessor word size. r must be relatively prime to b, j must
be relatively prime to k, and k-j must be odd. The best choise is
when k, j/2, and k-j are all primes, and j is between 0.5*k and 0.7*k.

This algorithm is called SHUFFLADD in the following.

The source code below uses b=64, r=7, j=10, k=17.


3. C++ source code
------------------

#include <math.h>
// define parameters
#define KK 17
#define JJ 10
#define RR 7

typedef class {
  public:
  void RandomInit (double seed);
  long double Random ();
  protected:
  union {
    long double randp1;
    long randbits[3];};
  int p1, p2;
  long randbuffer[KK+1][2];
  } TRandomGenerator;

long double TRandomGenerator::Random() {
  // returns a random number between 0 and 1.
  long a, b;
  // generate next number
  b = _lrotr(randbuffer[p1][0], RR) + randbuffer[p2][0];
  a = randbuffer[p1][1] + randbuffer[p2][1];
  randbuffer[p1][0] = a; randbuffer[p1][1] = b;
  // rotate list pointers
  if (--p1 < 0) p1 = KK - 1;
  if (--p2 < 0) p2 = KK - 1;
  // convert to float
  randbits[0] = a;
#if sizeof(long double) == 10
  randbits[1] = b | 0x80000000;
#elif sizeof(long double) == 8
  randbits[1] = b & 0x000FFFFF | 0x3FF00000;
#else
#error incompatible floating point format
#endif
  // normalize floating point number
  return randp1 - 1.;}

void TRandomGenerator::RandomInit (double seed) {
  // this function initializes the random number generator.
  // must be called before the first call to Random!
  int i, j; long double s;
  // get seed into an acceptable range
  if (seed < 0) seed = 0.147328926107 - seed;
  while (seed >= 1) seed *= 0.0817835148503;
  s = seed;
  // make random numbers and put them into the buffer
  for (i=0; i<KK; i++) {
    for (j=0; j<2; j++) {
      s = (sqrtl(s) + 0.147328926107) * 46.1063140045;
      s = s - floorl(s);
      *(double*)&randbuffer[i][j] = s;}}
  // set exponent of randp1
  randp1 = 1.5;
  // initialize pointers to circular buffer
  p2 = KK - 1; p1 = p2 - JJ;}


// Example showing how to use the random number generator:

TRandomGenerator R1;
double seed; int i;

seed = time(NULL);
R1.RandomInit(seed);
for (i=0; i<100; i++) {
  printf ("\n%14.10f", (double)R1.Random());}


comments:
This code requires that floating point numbers be stored in the
64bit or 80bit IEEE standard format used in Intel processors and
modern compilers.

The conversion of integer bits to a floating point number uses the
following method: The exponent of the floating point number is set
to such a value that it generates a number between 1 and 2, and the
mantissa is set to random bits. Then we subtract 1.0 to get a random
number in the interval 0 <= X < 1, and let the floating point unit
normalize the exponent. This method is much faster than using bit
scan or division for normalizing the floating point exponent.

If the output is rounded to lower precision then it may produce X = 1.

If your system doesn't have the sqrtl and floorl functions, then use
sqrt and floor. If you don't have the _lrotr function then define it
thus:

long _lrotr (unsigned long x, int r) {
  return (x >> r) | (x << (sizeof(x)-r));}



4. Optimized assembly language source code
------------------------------------------

; The random function is optimized for the Pentium microprocessor.
; The RandomInit function does not need optimizing because it is
; only called once.


; 32-bit mode:
; ------------
.386
; define parameters
KK      equ     17            ; size of circular buffer
JJ      equ     10            ; second pointer
RR      equ      7            ; rotate count

RAND_DATA SEGMENT PARA USE32 PUBLIC 'DATA'
align   16
randp1  DT      1.5           ; used for conversion to float
        DW      0             ; alignment
p1      DD      0             ; pointer in circular buffer
p2      DD      0             ; pointer in circular buffer
rbufend DD      (KK-1)*8      ; pointer to end of buffer
randbuf DD      (2*KK) dup(?) ; circular buffer
        DD      ?             ; receives junk
one     DD      1.0           ; constant
RandAdd DQ      0.147328926107; arbitrary constants used by RandomInit
RandFactor DQ   46.1063140045
RandDiv DQ      0.0817835148503
RAND_DATA ENDS
DGROUP GROUP RAND_DATA

_TEXT   SEGMENT DWORD USE32 PUBLIC 'CODE'
ASSUME CS:_TEXT,DS:DGROUP

PUBLIC Random1,_Random1,Random2,_Random2,RandomInit,_RandomInit
Random1 PROC NEAR
_Random1 LABEL NEAR
        FLD     [randp1]        ; load random number
        FSUB    [one]           ; normalize
Random2 LABEL NEAR              ; second entry for integer output
_Random2 LABEL NEAR
        MOV     ECX, [p1]       ; ring buffer pointers
        MOV     EDX, [p2]
        PUSH    EBX
        NOP                     ; avoid imperfect pairing due to AGI
        MOV     EBX, [randbuf][ECX]     ; shuffle-and-add algorithm
        MOV     EAX, [randbuf][ECX+4]
        ROR     EBX, RR         ; unpaired
        ADD     EBX, [randbuf][EDX]
        ADD     EAX, [randbuf][EDX+4]
        MOV     [randbuf][ECX], EAX
        MOV     [randbuf][ECX+4], EBX
        OR      EBX, 80000000H  ; bit 63 must be set
        MOV     DWORD PTR [randp1],EAX   ; low 32 bits of next random number
        SUB     ECX, 8                   ; decrement p1
        JNC     SHORT R1
        MOV     ECX, [rbufend]
R1:     MOV     DWORD PTR [randp1+4],EBX ; bits 32-62 of next random number
        POP     EBX
        SUB     EDX, 8                   ; decrement p2
        JNC     SHORT R2
        MOV     EDX, [rbufend]
R2:     MOV     [p1], ECX
        MOV     [p2], EDX
        RET
random1 ENDP

RandomInit PROC NEAR
_RandomInit LABEL NEAR
        FLD1
        FLD     QWORD PTR [ESP+4]  ; seed
        FTST                    ; check if seed is > 0
        FSTSW   AX
        SAHF
        JNC     SHORT R20
        FSUBR   [RandAdd]
        JMP     SHORT R20
R10:    FMUL    [RandDiv]       ; make seed < 1
R20:    FCOM    ST(1)
        FSTSW   AX
        SAHF
        JAE     R10
        MOV     ECX, OFFSET DS:[randbuf]  ; set up for loop
        MOV     EDX, KK*2
R30:    FSQRT                             ; make random number
        FADD    [RandAdd]
        FMUL    [RandFactor]
        FPREM                             ; get decimals
        FST     QWORD PTR [ECX]  ; only lower 32 bits used, rest overwritten next time
        ADD     ECX, 4
        DEC     EDX
        JNZ     R30
        FSTP    ST(0)           ; discard seed
        FSTP    [randp1]        ; initialize bit 64-79
        MOV     [p1], (KK-JJ-1)*8  ; initialize buffer pointers
        MOV     [p2], (KK-1)*8
        CALL    Random2         ; prepare first random number
        RET     8               ; RET 0 if _cdecl calling
RandomInit ENDP

_TEXT   ENDS
        END


; To optimize for PentiumPro processor: Remove the NOP and replace
; the conditional jumps to R1 and R2 with conditional moves.


; 16-bit mode (this is slower):
; -----------------------------

.386
; define parameters
KK      equ     17            ; size of circular buffer
JJ      equ     10            ; second pointer
RR      equ      7            ; rotate count

RAND_DATA SEGMENT PARA USE16 PUBLIC 'DATA'
align   16
randp1  DT      1.5           ; used for conversion to float
p1      DW      0             ; pointer in circular buffer
p2      DW      0             ; pointer in circular buffer
rbufend DW      (KK-1)*8      ; pointer to end of buffer
align   8
randbuf DD      (2*KK) dup(?) ; circular buffer
        DD      ?             ; receives junk
one     DD      1.0           ; constant
RandAdd DQ      0.147328926107; arbitrary constants used by RandomInit
RandFactor DQ   46.1063140045
RandDiv DQ      0.0817835148503
RAND_DATA ENDS
DGROUP GROUP RAND_DATA

_TEXT   SEGMENT BYTE USE16 PUBLIC 'CODE'
ASSUME CS:_TEXT,DS:DGROUP

PUBLIC Random1,_Random1,Random2,_Random2,RandomInit,_RandomInit

Random1 PROC NEAR               ; FAR if model is medium or large
_Random1 LABEL NEAR
        FLD     [randp1]        ; load random number
        FSUB    [one]           ; normalize
Random2 LABEL NEAR              ; second entry for integer output
_Random2 LABEL NEAR
        MOV     BX, [p1]       ; ring buffer pointers
        PUSH    SI
        MOV     SI, [p2]
        NOP                     ; avoid imperfect pairing due to AGI
        MOV     EDX, [randbuf][BX]     ; shuffle-and-add algorithm
        MOV     EAX, [randbuf][BX+4]
        ROR     EDX, RR
        ADD     EDX, [randbuf][SI]
        ADD     EAX, [randbuf][SI+4]
        MOV     [randbuf][BX], EAX
        MOV     [randbuf][BX+4], EDX
        OR      EDX, 80000000H  ; bit 63 must be set
        MOV     DWORD PTR [randp1],EAX   ; low 32 bits of next random number
        SUB     SI, 8                   ; decrement p2
        JNC     SHORT R2
        MOV     SI, [rbufend]
R2:     MOV     DWORD PTR [randp1+4],EDX ; bits 32-62 of next random number
        SUB     BX, 8                   ; decrement p1
        JNC     SHORT R1
        MOV     BX, [rbufend]
R1:     MOV     [p1], BX
        MOV     [p2], SI
        POP     SI
        RET
random1 ENDP

RandomInit PROC NEAR            ; FAR if model is medium or large
_RandomInit LABEL NEAR
        MOV     BX, SP
        FLD1
        FLD     QWORD PTR SS:[BX-2*(TYPE RandomInit)] ; seed
        FTST                    ; check if seed is > 0
        FSTSW   AX
        SAHF
        JNC     SHORT R20
        FSUBR   [RandAdd]
        JMP     SHORT R20
R10:    FMUL    [RandDiv]       ; make seed < 1
R20:    FCOM    ST(1)
        FSTSW   AX
        SAHF
        JAE     R10
        MOV     BX, OFFSET DS:[randbuf]  ; set up for loop
        MOV     DX, KK*2
R30:    FSQRT                             ; make random number
        FADD    [RandAdd]
        FMUL    [RandFactor]
        FPREM                             ; get decimals
        FST     QWORD PTR [BX]  ; only lower 32 bits used, rest overwritten next time
        ADD     BX, 4
        DEC     DX
        JNZ     R30
        FSTP    ST(0)           ; discard seed
        FSTP    [randp1]        ; initialize bit 64-79
        MOV     [p1], (KK-JJ-1)*8  ; initialize buffer pointers
        MOV     [p2], (KK-1)*8
        CALL    Random2         ; prepare first random number
        RET     8               ; RET 0 if _cdecl calling
RandomInit ENDP

_TEXT   ENDS
        END

; -----------

comment %
Example of calling from C++ program:

   extern "C" void   _stdcall RandomInit (double);
   extern "C" long double _stdcall Random1 (void);
   int i;
   RandomInit (time(NULL));
   for (i=0; i<100; i++) {
     printf ("\n%14.10f", (double)Random1());}


// if _stdcall not supported, use _cdecl in stead and replace 
// RET 8  with  RET 0  in asm code.
%


An important design goal has been to make the code fast. Thanks to
the shuffling of bits, it has been possible to dispense with the
carry from the lower 32 bits to the upper 32 bits and still get a
good performance. This makes it possible to do the two ADD instructions
simultaneously in the two pipelines of the Pentium. This saves 2 clock
cycles which more than compensates for the one clock cycle used by the
(unpairable) rotate instruction. With great joy I have found that the
improved algorithm actually is slightly faster than the ADDGEN algorithm!



5. Theoretical analysis and discussion
--------------------------------------
The additive generator (ADDGEN) was chosen as a starting point because
it is fast and because its state is defined by more than one variable
which gives the potential for a high cycle length. The cycle length is
defined as the number of times the random function can be called before
the sequence is repeated. Obviously, the cycle length cannot exceed the
number of possible states. In many random number generators it is much
less than that. Since cycle lengths ideally should be too long to
measure experimentally, a theoretical analysis is needed.

I will start my discussion of cycle lengths with the case of a system
with N possible states where it can go from one state to another in a
totally chaotic fashion without following any rules, except that it is
deterministic.

Assume that we have called this chaotic function i times going through
i different states. The probability that the next state will be one
that we have seen before is i/N. We continue the process until this
happens, when i=c. The most likely value of the cycle length c will
be close to sqrt(2*N) because the summation of i/N will exceed 1/2 at
this value.

I have found that a system based on the decimals of a squareroot
behaves in this manner (used in the RandomInit function above).
Starting this system with different initial seeds, I found different
trajectories which all sooner or later joined together and ended up
into the same cycle with a length of the expected order of magnitude.
It is very possible that this system has other much smaller cycles
which I didn't find because the probability of getting into them was
small.

In the following, I am using the value of c = sqrt(2*N) as a reference
for evaluating how good other systems are.

Now, let's analyze the ADDGEN system. It consists of a circular
buffer of k binary numbers of b bits each. The number of possible
states is N = 2^(k*b).

This system is based on addition operations, where the carries trickle
from the low bits up through the high bits. There is no information
flowing the other way, from the high bits to the low bits. For this
reason, the least significant bits of the k numbers constitute an
independent system which cannot be influenced by the higher bits.
I will call this system ring 0, i.e. the least significant bits of
the k numbers.

I will now analyze ring 0 alone. The state where it contains all
zeroes will go to itself because adding zeroes can produce nothing
but zeroes. This gives a cycle length of one. The remaining 2^k-1
states may provide one or more cycles. The sum of the lengths of all
the cycles is of course equal to the number of possible states.
I have found experimentally, that the cycles either all have the
same length, or there is a simple relationship between the cycle
lengths. The length of the biggest cycle is always equal to the
least common multiple of all other cycles in the system.

Ring 1, the ring of the least significant but one bits, forms an
identical system to ring 0, except for the fact that the carries
from additions in ring 0 are added to ring 1. If we turn off the
carries for a moment and analyze the combined system of ring 0 and
ring 1, without carries, then we have two identical systems. The two
rings may happen to be in different cycles, so the cycle length of
the combined system should be the least common multiple of the cycle
lengths of the two systems. But since the highest cycle length of a
single ring is allready equal to the least common multiple of all
other cycles in that ring, then the maximum cycle length of the
two-ring system (without carries) can be no longer than the maximum
cycle length of the single-ring system. If we allow carries from
ring 0 to ring 1, then the cycle lengths will be doubled. When we
finally expand the system to b rings, then we have b identical ring
systems where the carries from each ring is added to the next ring
in a manner where the contributions of carries from each ring are
purely additive. The maximum cycle length is doubled for each extra
ring that is added.

Example: 
An ADDGEN system with k=5, j=1, b=1 has the following cycles:
3, 7, 21 (ignoring the trivial state of all zeroes). Expanding this
to two rings (b=2) gives cycle lengths 3, 6, 7, 14, 21, 42. b=3 gives:
3, 6, 12, 7, 14, 28, 21, 42, 84.

The conclusion is that the maximum possible cycle length of an ADDGEN
system is (2^k-1)*2^(b-1) ~= 2^(k+b-1). This is much lower than
sqrt(2*N) = 2^(k*b/2), due to all the regularities in the system.

In order to improve this situation I have developed the SHUFFLADD
algorithm defined above. I am shuffling the bits in such a way that
information goes not only from the low bits to the high bits through
the carries, but also the other way. In fact, I have made the design
in such a way that information can go from any bit in the buffer to
any other bit. This can be verified experimentally by changing a
single bit anywhere in the buffer, and after a sufficient number
of calls you will get a completely different sequence from what you
would if you hadn't changed that single bit. No information is lost.

Testing the SHUFFLADD system experimentally, I found that the cycle
length usually was bigger than sqrt(2*N). The reason for this is as
follows: In the chaotic system described above there is no rule
preventing two different states from going to the same successor state.
But the SHUFFLADD algorithm is a reversible function. Each state has
one and only one precursor state. It is no problem to apply the
algorithm backwards and recall previous random numbers in the sequence.
(The ADDGEN algorithm is also reversible). We therefore have to modify
the argument we used above to calculate the cycle length of the chaotic
system. After having passed through i different states in the SHUFFLADD
algorithm we want to calculate the probability that the next state will
be one that we have seen before. This probability is no longer i/N,
but 1/(N-i) because only our initial state can close the circle, the
i-1 states in between are not possible successors. Summing this
probability, we find that we most likely have to try more than N/2
times before we hit the initial state again. In other words, the cycle
length is likely to be somewhat longer than N/2. (Forgive me for the
laziness of not calculating the exact formula for the sum of 1/(N-i)).

Now, assuming that we have found a cycle of length c1 = N/2, we have
to analyze the remaining states. We can divide the set of possible
states into two subsets: one subset comprised of the states in the
c1 cycle, and another set containing all other states. This second
subset has the same quality as the whole set of states: namely that
all states in this subset must go to a state in the same subset.
Searching for a cycle in the second subset the same way we did before,
we will most likely find a cycle length c2 slightly larger than half
the size of the subset, or N/4. Using this argument recursively, we
should expect to find cycles of lengths approximately N/2, N/4, N/8,
N/16, etc. down to 1.

Verifying this experimentally is only possible for unrealistically
small systems. For example in a SHUFFLADD system with b=6, r=1, k=4, 
j=1, I found the following cycle lengths: 
13,053,066; 2,590,080; 562,305; 247,197; 101,212; 94,527; 90,601; 
16,503; 7,485; 6,739; 3,829; 2,094; 915; 359; 288; 14; 1; 1.

The total number of cycles in SHUFFLADD systems not too big to
analyze experimentally was found to be 0.8*k*b on average. Plotting
the binary logarithms of the cycle lengths, I found that these were
randomly scattered in the interval between 0 and k*b. Unlike the
ADDGEN system there was no evidence of any relationship between
cycle lengths. Only very rarely did I find two cycles of the same
length (except for length 1), and a prime factorization of the cycle
lengths showed that they share no more prime factors than you would
expect from completely random numbers. This, of course, holds only
when the design rules are obeyed, i.e. r must be relatively prime to
b, j must be relatively prime to k, and k-j must be odd.

The only regularity I found was that there was always two cycles of
length one. One of these was the obvious case where the buffer
contained all zeroes. The other one was less obvious.

If the distribution of cycle lengths sketched here also holds for
systems too big to investigate experimentally, then we can calculate
the probability of getting into an unacceptably small cycle as
approximately cmin/2^(k*b-1). For k=17 and b=64 we find that the
probability of getting into a cycle smaller than Cmin = 10^12 is
10^(-315). I think we can live with this risk!

We must conclude that the shuffling of bits has increased the cycle
length tremendously. The only problem is that the more random we make
the system, the more difficult it becomes to analyze theoretically.
We have made a system which seems to be much better, but we cannot
prove theoretically that it is.

A further design goal was to minimize the correllation between
subsequent numbers in a sequence. The shuffling of bits has probably
also improved the performance in this respect, although both the
ADDGEN and SHUFFLADD generators have passed all my tests.


6. Request for feedback
-----------------------
If there is any math genius out there who can refine my analysis then
I would be grateful to hear from you. My heuristic arguments about
cycle lengths rely on the assumption that the SHUFFLADD algorithm has
no regularities which can reduce cycle lengths, such as the ADDGEN has.
Proving this theoretically is beyond my powers. And an experimental
proof is only possible for systems too small to be useful for the
intended purposes.

Also, if anybody can prove, theoretically or statistically, that the
random numbers produced by my algorithm are less than perfectly random,
then please let me know.



7. Conclusion
-------------
The SHUFFLADD algorithm defined here produces pseudo random numbers
with a flat distribution in the interval 0 <= X < 1.

The resolution is 10^(-19). The cycle length is virtually infinite.
I have found no measureable deviations from perfect randomness.

The shuffling of bits also means that we do not need as big a buffer
as for the ADDGEN to obtain a satisfactory performance.

The optimized code executes in just 20 clock cycles on a Pentium
processor. This means that it can produce 10 million random numbers
per second on a 200MHz processor.



This document is also published at my www site together with a list
of 10000 random numbers. You will find it at:

http://announce.com/agner/random

This URL will be updated late this year with more stochastic functions.


===============================================================================
Agner Fog, Ph.D.                            Pentium optimation manual at
agner@login.dknet.dk                        http://announce.com/agner/assem
