I wanted to discuss VectorizedRNG.jl a little. This library provides two vectorized permuted congruential generators (PCG): XSH-RR and RXS-M-XS.

These are used for producing vectors of random bits, which can then be transformed to uniform random floating point numbers, which in turn can be used for normal or exponential random samples.

To start with, what is a PCG? A PCG is built up of two parts: an updating state, and an output transformation.

The updating state is a linear congruential generator (LCG). A valid LCG requires:

$s_{n+1} = \left( a * s_{n} + b \right) \% m$

where $\%$ is the modulus operator (ie, the remainder), and $b$ and $m$ are relatively prime (ie, they have no common factors). Given this condition, for some values of $a$, the LCG will traverse all values from $0,\ldots,m-1$.

A very convenient choice for $m$ is $2^{64}$. This is because 64-bit integer operations automatically roll over, mod $2^{64}$, meaning

- We get the modulus operation for free.
- A good LCG will hit every single 64 bit integer, meaning we get the most bang possible for our 64-bit buck.
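As a quick check that the rollover really is a mod-$2^{64}$ reduction, adding one to the largest `UInt64` wraps around to zero:

```
typemax(UInt64) + one(UInt64) # 0x0000000000000000
```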

Given that $m = 2^{64}$, the prime factorization of $m$ consists of 64 factors, all of which are 2. Thus, $b$ may not be divisible by $2$; that is, $b$ must be odd.
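We can see the consequence with a small helper (written for this post, not part of the library) that counts how many steps an LCG takes to return to its seed:

```
# Count the steps until the LCG returns to its starting state.
function lcg_period(a, b, m; seed = 1)
    s = (a * seed + b) % m
    n = 1
    while s != seed
        s = (a * s + b) % m
        n += 1
    end
    n
end

lcg_period(5, 1, 16) # odd b: full period of 16
lcg_period(5, 2, 16) # even b: the period collapses to 8
```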

Let's use $m = 2^4$ for demonstration purposes. It would take a little too long to cycle through $2^{64}$ values.

```
struct LCG{T}
    a::T
    b::T
    m::T
end
function Base.iterate(lcg::LCG)
    state = 1
    next_state = (state * lcg.a + lcg.b) % lcg.m
    state, next_state
end
function Base.iterate(lcg::LCG, state)
    state == 1 && return nothing # terminate once the cycle returns to the seed
    next_state = (state * lcg.a + lcg.b) % lcg.m
    state, next_state
end
Base.length(lcg::LCG) = lcg.m
for i ∈ LCG(9, 1, 16)
    @show i
end
```

We hit all 16 numbers, but they did not look very random. In fact, we got 8 increasing (even, odd) pairs. Note that the pattern of increments is $+9, +1, +9, +1,\ldots$ (mod 16).
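We can compute those increments directly. The helper below (illustrative and self-contained, not part of the library) collects one full cycle and takes differences mod 16:

```
# Collect one full cycle of an LCG starting from `seed`.
function lcg_cycle(a, b, m; seed = 1)
    seq = [seed]
    s = (a * seed + b) % m
    while s != seed
        push!(seq, s)
        s = (a * s + b) % m
    end
    seq
end

seq = lcg_cycle(9, 1, 16)
increments = [mod(seq[mod1(i + 1, 16)] - seq[i], 16) for i in 1:16]
# increments == [9, 1, 9, 1, 9, 1, 9, 1, 9, 1, 9, 1, 9, 1, 9, 1]
```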

For a multiplier, 5 does much better than 9:

```
for i ∈ LCG(5, 1, 16)
    @show i
end
```

Note that the pattern is $+5, +9, +13, +1, +5, +9, +13, +1, \ldots$ That is, the sequence of increments repeats every four steps, instead of the two we had with multiplier 9.

Other multipliers, like 7, fail to traverse all 16 numbers:

```
for i ∈ LCG(7, 1, 16)
    @show i
end
```

```
for i ∈ LCG(7, 7, 16)
    @show i
end
```

We’re much more interested in the multipliers, because changing the increment ($b$) only shifts the pattern we saw:

```
for i ∈ LCG(9, 3, 16)
    @show i
end
```

Now the sequence is $+11, +3, +11, +3, \ldots$.

```
for i ∈ LCG(9, 5, 16)
    @show i
end
```

Here we see the pattern $+13, +5, +13, +5,\ldots$. Using $b = 7$ results in a sequence of $+15,+7,+15,+7,\ldots$, and so forth.

Under the idea that the longer the sequence of increments before we hit a repeat, the more random, we would prefer `5` as the multiplier.

For the VectorizedRNG library, I generated thousands of random integers, subjected them to a few statistical tests for randomness, and picked the ones that did the best to be the defaults.

All of these LCG sequences look fairly predictable. The predictability is even worse when we look at the actual series of bits:

```
last_four_bits(x) = bitstring(x)[end-3:end]
for i ∈ LCG(5, 1, 16)
    @show last_four_bits(i)
end
```

The last bit flips between a 0 and a 1 (a period of 2), the second-to-last bit has a period of 4, $\ldots$. Only the first of the four bits has the full period of 16. This is a pattern we see consistently:

```
for i ∈ LCG(9, 7, 16)
    @show last_four_bits(i)
end
```
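To make this concrete, we can measure each bit's period over one full cycle. This is a throwaway check written for this post, assuming the $a=5$, $b=1$ generator from above:

```
# One full 16-long cycle of the a = 5, b = 1 LCG mod 16.
seq = let s = 1, v = Int[]
    for _ in 1:16
        push!(v, s)
        s = (5s + 1) % 16
    end
    v
end

# Smallest p in {2, 4, 8, 16} at which bit k of the cyclic sequence repeats.
bitperiod(seq, k) = minimum(p for p in (2, 4, 8, 16)
    if all(((seq[i] >> k) & 1) == ((seq[mod1(i + p, 16)] >> k) & 1) for i in 1:16))

[bitperiod(seq, k) for k in 0:3] # [2, 4, 8, 16]
```

Each bit's period doubles as we move up: only the top bit of the four achieves the full period.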

Therefore, these bits really don’t look random at all, especially the trailing bits. The LCG is still useful, because it lets us represent 64 bits of state (eventually reaching all of it) with 64 bits of memory, and a very computationally cheap way to update it.

But, we can’t just return the state as a “random” number, because it doesn’t look random at all.

What can we do about it? That’s where the output transformation comes in.

The two key tools are the bitshift, and the xor.

The bitshift shifts bits:

```
@show bitstring(UInt8(25))
@show bitstring(UInt8(25) << 2)
@show bitstring(UInt8(25) >> 2);
```

That is, the bits “00011001” were shifted to the left by 2, and then to the right by 2. Bits shifted in are all zero.

The xor, or exclusive or, operation returns “1” if and only if exactly one of its two inputs is 1.

```
@show bitstring(0xaa)
@show bitstring(0x0f)
@show bitstring(0xaa ⊻ 0x0f);
```

So the trick is, if we “xor” the lower bits with the upper bits, we can randomize them.

```
"""
Assumes a 4-bit integer proxy (ie, Julia doesn't actually have 4 bit integers,
so merely treats a higher-bit integer as though it were 4 bits).
"""
function rxs_m_xs4(i)
count = i >> 3
i = i ⊻ i >> (count + 1)
i = 5i % 16
i ⊻ (i >> 2)
end
for i ∈ LCG(5, 1, 16)
@show rxs_m_xs4(i)
end
```

We hit all 16 numbers, from 0 through 15.

The incrementing sequence is now $+7,+4,+10,+14,+5,+2,+4,+4,+11,+4,+8,+6,+7,+14,+8$. While a little $+4$-heavy, there is no short repeating pattern in these differences: the cycle of increments is also the full 16 long.

The bit patterns also look much more random:

```
for i ∈ LCG(5, 1, 16)
    @show last_four_bits(rxs_m_xs4(i))
end
```

Perhaps with more work I could get the period of the last bit up to 16, but this is a substantial improvement.

So then, now that we have sequences of random-looking bits, how do we turn those into floating point numbers? The Wikipedia article on the double-precision format discusses the layout of 64-bit floating point values.

The key is that floating point numbers are made up of three components:

- a sign bit
- an exponent
- a fraction

For normal numbers, the final value equals $(-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent}-\text{bias}}$, where the bias is 1023 for 64-bit floats.
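As a sketch, we can pull those three fields out of a `Float64` directly; the widths (1 sign bit, 11 exponent bits, 52 fraction bits) come from the IEEE 754 double format, and `float_fields` is just an illustrative helper:

```
# Split a Float64 into its sign, exponent, and fraction bit fields.
function float_fields(x::Float64)
    bits = reinterpret(UInt64, x)
    sign     = bits >> 63
    exponent = (bits >> 52) & 0x7ff
    fraction = bits & 0x000fffffffffffff
    (sign = sign, exponent = exponent, fraction = fraction)
end

float_fields(1.0) # sign 0, exponent 0x3ff (the bias, 1023), fraction 0
```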

Increasing the exponent doubles the value:

```
bitstring(1.0)
```

```
bitstring(2.0)
```

If the value contributed by the exponent is $x$, the fraction component decides where in the range $[x, 2x)$ the number lies. For example, the biggest 64-bit floating point number smaller than 2.0 has the same exponent as 1.0, but all of its fraction bits are 1:

```
bitstring(prevfloat(2.0))
```

What this means is that we can get a random floating point number between $1$ and $2$ by generating a bunch of random bits, and then setting the sign and exponent bits correctly (eg, for 64-bit floats, to $001111111111$).

We can do that via the bitwise “and” (`&`) and “or” (`|`) operators: we bitwise “and” with a mask to zero out the sign and exponent bits, and then bitwise “or” to set the exponent bits to the pattern above.

```
bitstring(0x000fffffffffffff)
```

```
bitstring(0x3ff0000000000000)
```

```
bitstring(typemax(UInt))
```

```
bitstring(typemax(UInt) & 0x000fffffffffffff)
```

```
bitstring((typemax(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)
```

```
reinterpret(Float64, (typemax(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)
```

This is indeed the largest 64-bit floating point value smaller than 2: applying `nextfloat` yields exactly 2.0:

```
reinterpret(Float64, (typemax(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000) |> nextfloat
```

Similarly, this operation will convert the smallest unsigned integer (zero) to $1.0$:

```
bitstring(typemin(UInt))
```

```
bitstring(typemin(UInt) & 0x000fffffffffffff)
```

```
bitstring((typemin(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)
```

```
reinterpret(Float64, (typemin(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)
```

It is noteworthy that this means we are generating random numbers on the interval $[1,2)$. That is, the interval is closed-open. To get a random number between 0 and 1, we now have two choices, either:

\begin{align*}
x &\sim \text{uniform}[1,2)\\
u_1 &= x - 1.0\\
u_1 &\sim \text{uniform}[0,1)\\
u_2 &= 2.0 - x\\
u_2 &\sim \text{uniform}(0,1]
\end{align*}

Because it is more common for functions to be undefined on $0$ than on $1$ (eg, `log`), I choose $u_2$.
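Putting the pieces together, a minimal sketch of the bits-to-uniform conversion (not the library's actual code) looks like:

```
# Map 64 random bits to a uniform sample on (0, 1].
function bits_to_uniform(u::UInt64)
    x = reinterpret(Float64, (u & 0x000fffffffffffff) | 0x3ff0000000000000) # x ∈ [1, 2)
    2.0 - x # ∈ (0, 1]
end

bits_to_uniform(typemin(UInt64)) # 1.0
bits_to_uniform(typemax(UInt64)) # 2.0^-52, the smallest value this scheme produces
```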

Now, the simplest ways to convert the uniform samples into exponential and normal samples are the probability integral transform and the Box-Muller transform, respectively.
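As a sketch of those two transforms (assuming inputs uniform on $(0,1]$ as produced above; this is not the library's exact implementation):

```
# Probability integral transform: -log(u) ~ Exponential(1) when u ~ uniform(0, 1].
randexp_from_uniform(u) = -log(u)

# Box-Muller: two independent uniforms yield two independent standard normals.
function boxmuller(u1, u2)
    r = sqrt(-2 * log(u1))
    s, c = sincos(2π * u2)
    (r * c, r * s)
end
```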

The Box-Muller transform in particular has been criticized for being slow (for example, here). However, this slow speed owes to Box-Muller requiring the calculation of `sqrt`, `log`, `sin`, and `cos`. I made comparisons with that library here.

To make our implementation fast, we rely on the fact that all of the operations we’ve discussed (including elementary functions like `log`) are highly vectorizable. With a CPU that has 512-bit vectors, that means it doesn’t take much longer to generate 8 double precision or 16 single precision numbers than it takes to generate just 1.

To vectorize them, all I do is have multiple LCGs, and sample from each of them. By using different multipliers, each stream follows a different permutation with a different pattern, so their sequences are not simply shifted copies of one another.
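A toy scalar version of that idea is sketched below; the actual library advances the streams with SIMD vectors, and the multipliers here are arbitrary placeholders rather than the tuned constants:

```
# Four LCG streams advanced in lockstep; wrapping UInt64 arithmetic gives mod 2^64 for free.
mutable struct MultiLCG
    states::NTuple{4,UInt64}
    multipliers::NTuple{4,UInt64}
    increment::UInt64
end

function step!(rngs::MultiLCG)
    rngs.states = map((s, a) -> a * s + rngs.increment, rngs.states, rngs.multipliers)
    rngs.states
end
```

With SIMD, the four multiplies and four adds above each collapse into a single vector instruction.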

It is also slightly faster to generate 2 or even 4 vectors’ worth of LCGs at the same time, thanks to out-of-order execution / instruction-level parallelism. That is, we can draw samples of 1024 or 2048 random bits at a time on this machine.

```
using Random, VectorizedRNG, BenchmarkTools
@benchmark rand(Float64) # random uniform from base Julia
```

```
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{16,Core.VecElement{Float64}}) # random uniforms with PCG
```

It took 3 times longer to generate 16 times as many floating point numbers.

```
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float64}}) # random uniforms with PCG
```

and less than 4 times longer to produce 32 times as many floating point numbers. The unvectorized `rand` sees the same type of improvement we do beyond our vector width, but is slower to produce 8 numbers than the vectorized routine is to produce 32:

```
@benchmark (rand(Float64),rand(Float64),rand(Float64),rand(Float64),
rand(Float64),rand(Float64),rand(Float64),rand(Float64))
```

The vectorized routine’s performance doubles when we switch to single precision, but the unvectorized routine does not benefit:

```
@benchmark (rand(Float32),rand(Float32),rand(Float32),rand(Float32),
rand(Float32),rand(Float32),rand(Float32),rand(Float32))
```

```
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float32}}) # random uniforms with PCG
```

```
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{64,Core.VecElement{Float32}}) # random uniforms with PCG
```

The case is less dramatic with random normals, but the vectorization still gives it a substantial lead in throughput:

```
@benchmark (randn(Float64),randn(Float64),randn(Float64),randn(Float64),
randn(Float64),randn(Float64),randn(Float64),randn(Float64))
```

```
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{16,Core.VecElement{Float64}}) # random normals with PCG
```

```
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float64}}) # random normals with PCG
```

The single precision vectorized RNG, though, is much faster:

```
@benchmark (randn(Float32),randn(Float32),randn(Float32),randn(Float32),
randn(Float32),randn(Float32),randn(Float32),randn(Float32))
```

```
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float32}}) # random normals with PCG
```

```
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{64,Core.VecElement{Float32}}) # random normals with PCG
```

The single precision version is faster because the single precision `log` and `sincos` functions (`sincos` calculates both `sin` and `cos`) are less accurate (single precision, after all), and therefore take fewer steps to calculate.

Julia’s uniform random number generator actually is vectorized for large arrays, so this gives us a fairer basis for comparison. Here, the VectorizedRNG library is probably going to be slower for 64-bit floating point on computers without AVX-512:

```
x64 = Vector{Float64}(undef, 1024)
@benchmark rand!($x64)
```

```
@benchmark rand!(VectorizedRNG.GLOBAL_vPCG, $x64)
```

```
x32 = Vector{Float32}(undef, 1024)
@benchmark rand!($x32)
```

```
@benchmark rand!(VectorizedRNG.GLOBAL_vPCG, $x32)
```

We see a similar pattern with the normal and exponential samplers, which base Julia does not vectorize; in single precision, the comparison is not close:

```
@btime randn!($x64);
```

```
@btime randn!(VectorizedRNG.GLOBAL_vPCG, $x64);
```

```
@btime randn!($x32);
```

```
@btime randn!(VectorizedRNG.GLOBAL_vPCG, $x32);
```

```
@btime randexp!($x64);
```

```
@btime randexp!(VectorizedRNG.GLOBAL_vPCG, $x64);
```

```
@btime randexp!($x32);
```

```
@btime randexp!(VectorizedRNG.GLOBAL_vPCG, $x32);
```

PCGs have good statistical properties, and they can take advantage of modern hardware to be blisteringly fast; or, rather, to let us generate huge numbers of samples at a time.