VectorizedRNG

I wanted to discuss VectorizedRNG.jl a little. This library provides two vectorized permuted congruential generators (PCG): XSH-RR and RXS-M-XS.

These are used for producing vectors of random bits, which can then be transformed to uniform random floating point numbers, which in turn can be used for normal or exponential random samples.

To start with, what is a PCG? A PCG is built up of two parts: an updating state, and an output transformation.

The updating state is a linear congruential generator (LCG). A valid LCG requires:

$s_{n+1} = \left( a * s_{n} + b \right) \% m$

where $\%$ is the modulus operator (ie, the remainder). By the Hull–Dobell theorem, the LCG attains its full period, traversing all values $0,\ldots,m-1$, when $b$ and $m$ are relatively prime (ie, they have no common factors), $a - 1$ is divisible by every prime factor of $m$, and $a - 1$ is divisible by 4 if $m$ is.

A very convenient choice for $m$ is $2^{64}$. This is because 64-bit integer operations automatically roll over, mod $2^{64}$, meaning

1. We get the modulus operation for free.
2. A good LCG will hit every single 64 bit integer, meaning we get the most bang possible for our 64-bit buck.
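
For instance, unsigned 64-bit arithmetic in Julia wraps around silently. A quick sketch (the multiplier below is just an illustrative odd constant, not one of the library's):

```julia
# UInt64 arithmetic wraps modulo 2^64, so the "% m" in the LCG update is free.
a = 0x5851f42d4c957f2d          # an odd 64-bit multiplier (illustrative constant)
s = typemax(UInt64)             # the largest possible state
t = a * s + 0x0000000000000001  # wraps silently; no overflow error
t == UInt64(mod(big(a) * big(s) + 1, big(2)^64))  # true: same value mod 2^64
```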

Given that $m = 2^{64}$, the only prime factor of $m$ is 2. Thus, $b$ may not be divisible by $2$. That is, $b$ must be odd.
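
We can sanity-check this with a small hypothetical helper (not part of VectorizedRNG) that counts how many distinct states a recurrence visits:

```julia
# Count how many distinct states the recurrence s -> (a*s + b) % m visits
# before it returns to the starting state.
function cycle_length(a, b, m)
    seen = Set{Int}()
    s = 1
    while s ∉ seen
        push!(seen, s)
        s = (a * s + b) % m
    end
    length(seen)
end
cycle_length(5, 1, 16)  # 16: odd b gives the full period
cycle_length(5, 2, 16)  # 8: even b cuts the cycle short
```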

Let's use $m = 2^4 = 16$ for demonstration purposes; it would take a little too long to cycle through $2^{64}$.

In [1]:
mutable struct LCG{T}
    a::T
    b::T
    m::T
end
function Base.iterate(lcg::LCG)
    state = 1
    next_state = (state * lcg.a + lcg.b) % lcg.m
    state, next_state
end
function Base.iterate(lcg::LCG, state)
    state == 1 && return nothing # terminate once we cycle back to the seed
    next_state = (state * lcg.a + lcg.b) % lcg.m
    state, next_state
end
Base.length(lcg::LCG) = lcg.m # only correct when the LCG has full period
for i ∈ LCG(9, 1, 16)
    @show i
end

i = 1
i = 10
i = 11
i = 4
i = 5
i = 14
i = 15
i = 8
i = 9
i = 2
i = 3
i = 12
i = 13
i = 6
i = 7
i = 0


We hit all 16 numbers, but they did not look very random. In fact, we got 8 (odd, even) pairs. Note that the pattern of increments is $+9, +1, +9, +1,\ldots$
For a multiplier, 5 does much better than 9:

In [2]:
for i ∈ LCG(5, 1, 16)
    @show i
end

i = 1
i = 6
i = 15
i = 12
i = 13
i = 2
i = 11
i = 8
i = 9
i = 14
i = 7
i = 4
i = 5
i = 10
i = 3
i = 0


Note that the pattern is $+5, +9, +13, +1, +5, +9, +13, +1, \ldots$ That is, the increments repeat with period four, instead of the period two we had with 9.

While others, like 7, fail to traverse all 16 numbers:

In [3]:
for i ∈ LCG(7, 1, 16)
    @show i
end

i = 1
i = 8
i = 9
i = 0

In [4]:
for i ∈ LCG(7, 7, 16)
    @show i
end

i = 1
i = 14
i = 9
i = 6


We’re much more interested in the multipliers, because changing the increment ($b$) only shifts the pattern we saw:

In [5]:
for i ∈ LCG(9, 3, 16)
    @show i
end

i = 1
i = 12
i = 15
i = 10
i = 13
i = 8
i = 11
i = 6
i = 9
i = 4
i = 7
i = 2
i = 5
i = 0
i = 3
i = 14


Now the sequence is $+11, +3, +11, +3, \ldots$.

In [6]:
for i ∈ LCG(9, 5, 16)
    @show i
end

i = 1
i = 14
i = 3
i = 0
i = 5
i = 2
i = 7
i = 4
i = 9
i = 6
i = 11
i = 8
i = 13
i = 10
i = 15
i = 12


Here we see the pattern $+13, +5, +13, +5,\ldots$. Using $7$ results in a sequence of $+15,+7,+15,+7,\ldots$, and so forth.

Under the idea that the longer the sequence of increments runs before repeating, the more random the output, we would prefer 5 as the multiplier.
For the VectorizedRNG library, I generated thousands of random integers, subjected them to a few statistical tests for randomness, and picked the ones that did the best to be the defaults.

All of these LCG sequences look fairly predictable. This is worse when we look at the actual series of bits:

In [7]:
last_four_bits(x) = bitstring(x)[end-3:end]
for i ∈ LCG(5, 1, 16)
    @show last_four_bits(i)
end

last_four_bits(i) = "0001"
last_four_bits(i) = "0110"
last_four_bits(i) = "1111"
last_four_bits(i) = "1100"
last_four_bits(i) = "1101"
last_four_bits(i) = "0010"
last_four_bits(i) = "1011"
last_four_bits(i) = "1000"
last_four_bits(i) = "1001"
last_four_bits(i) = "1110"
last_four_bits(i) = "0111"
last_four_bits(i) = "0100"
last_four_bits(i) = "0101"
last_four_bits(i) = "1010"
last_four_bits(i) = "0011"
last_four_bits(i) = "0000"


The last bit flips between a 0 and a 1 (period of 2), the second-to-last bit has a period of 4, $\ldots$. Only the first (most significant) bit has the full period. This is a pattern we see consistently:

In [8]:
for i ∈ LCG(9, 7, 16)
    @show last_four_bits(i)
end

last_four_bits(i) = "0001"
last_four_bits(i) = "0000"
last_four_bits(i) = "0111"
last_four_bits(i) = "0110"
last_four_bits(i) = "1101"
last_four_bits(i) = "1100"
last_four_bits(i) = "0011"
last_four_bits(i) = "0010"
last_four_bits(i) = "1001"
last_four_bits(i) = "1000"
last_four_bits(i) = "1111"
last_four_bits(i) = "1110"
last_four_bits(i) = "0101"
last_four_bits(i) = "0100"
last_four_bits(i) = "1011"
last_four_bits(i) = "1010"
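
We can measure those per-bit periods directly. This standalone sketch re-runs the $a=5, b=1$ recurrence itself rather than reusing the `LCG` iterator above:

```julia
# Collect the 16 states of the a = 5, b = 1, m = 16 LCG, then find the
# period of each individual bit within that cyclic sequence.
function lcg_states(a, b, m, s0 = 1)
    states = Int[]
    s = s0
    for _ in 1:m
        push!(states, s)
        s = (a * s + b) % m
    end
    states
end

states = lcg_states(5, 1, 16)
periods = map(0:3) do bit
    seq = [(x >> bit) & 1 for x in states]
    # the period is the smallest cyclic shift mapping the sequence onto itself
    findfirst(p -> all(seq[i] == seq[mod1(i + p, 16)] for i in 1:16), 1:16)
end
@show periods  # [2, 4, 8, 16]: bit k has period 2^(k+1)
```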


Therefore, these bits really don’t look random at all, especially the trailing bits. The LCG is still useful, because it lets us represent 64 bits of state (eventually reaching all of it) with 64 bits of memory, and updating it is computationally very simple.
But we can’t just return the state as a “random” number, because it doesn’t look random at all.

What can we do about it? That’s where the output transformation comes in.
The two key tools are the bitshift, and the xor.

The bitshift shifts bits:

In [9]:
@show bitstring(UInt8(25))
@show bitstring(UInt8(25) << 2)
@show bitstring(UInt8(25) >> 2);

bitstring(UInt8(25)) = "00011001"
bitstring(UInt8(25) << 2) = "01100100"
bitstring(UInt8(25) >> 2) = "00000110"


That is, the bits “00011001” were shifted to the left by 2, and then to the right by 2. Bits shifted in are all zero.

The xor, or exclusive or, operation returns “1” if and only if one input was true, and the other false.

In [10]:
@show bitstring(0xaa)
@show bitstring(0x0f)
@show bitstring(0xaa ⊻ 0x0f);

bitstring(0xaa) = "10101010"
bitstring(0x0f) = "00001111"
bitstring(0xaa ⊻ 0x0f) = "10100101"


So the trick is, if we “xor” the lower bits with the upper bits, we can randomize them.

In [11]:
"""
Assumes a 4-bit integer proxy (ie, Julia doesn't actually have 4-bit integers,
so this merely treats a wider integer as though it were 4 bits).
"""
function rxs_m_xs4(i)
    count = i >> 3
    i = i ⊻ (i >> (count + 1))
    i = 5i % 16
    i ⊻ (i >> 2)
end

for i ∈ LCG(5, 1, 16)
    @show rxs_m_xs4(i)
end

rxs_m_xs4(i) = 4
rxs_m_xs4(i) = 11
rxs_m_xs4(i) = 15
rxs_m_xs4(i) = 9
rxs_m_xs4(i) = 7
rxs_m_xs4(i) = 12
rxs_m_xs4(i) = 14
rxs_m_xs4(i) = 2
rxs_m_xs4(i) = 6
rxs_m_xs4(i) = 1
rxs_m_xs4(i) = 5
rxs_m_xs4(i) = 13
rxs_m_xs4(i) = 3
rxs_m_xs4(i) = 10
rxs_m_xs4(i) = 8
rxs_m_xs4(i) = 0


We hit all 16 numbers, from 0 through 15.
The incrementing sequence is now $+7,+4,+10,+14,+5,+2,+4,+4,+11,+4,+8,+6,+7,+14,+8$. While a little $4$-heavy, there is no obvious repeating pattern in these differences: the cycle of increments (counting the wrap-around) is also 16 long.
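
Hitting all 16 values is no accident: the output transformation is a bijection on the 4-bit integers, so it permutes the full-period state sequence rather than collapsing it. A standalone check (re-defining `rxs_m_xs4` so the snippet runs on its own):

```julia
# The RXS-M-XS transform on a 4-bit state, re-defined so this snippet stands alone.
function rxs_m_xs4(i)
    count = i >> 3                # state-dependent ("random") xorshift amount
    i = i ⊻ (i >> (count + 1))    # random xorshift
    i = 5i % 16                   # multiply, mod 16
    i ⊻ (i >> 2)                  # final xorshift
end
sort(rxs_m_xs4.(0:15)) == 0:15    # true: the transform permutes 0:15
```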

The bit patterns also look much more random:

In [12]:
for i ∈ LCG(5, 1, 16)
    @show last_four_bits(rxs_m_xs4(i))
end

last_four_bits(rxs_m_xs4(i)) = "0100"
last_four_bits(rxs_m_xs4(i)) = "1011"
last_four_bits(rxs_m_xs4(i)) = "1111"
last_four_bits(rxs_m_xs4(i)) = "1001"
last_four_bits(rxs_m_xs4(i)) = "0111"
last_four_bits(rxs_m_xs4(i)) = "1100"
last_four_bits(rxs_m_xs4(i)) = "1110"
last_four_bits(rxs_m_xs4(i)) = "0010"
last_four_bits(rxs_m_xs4(i)) = "0110"
last_four_bits(rxs_m_xs4(i)) = "0001"
last_four_bits(rxs_m_xs4(i)) = "0101"
last_four_bits(rxs_m_xs4(i)) = "1101"
last_four_bits(rxs_m_xs4(i)) = "0011"
last_four_bits(rxs_m_xs4(i)) = "1010"
last_four_bits(rxs_m_xs4(i)) = "1000"
last_four_bits(rxs_m_xs4(i)) = "0000"


Perhaps with more work I could get the last bit's period up to 16 as well, but this is a substantial improvement.

So then, now that we have sequences of random-looking bits, how do we turn those into floating point numbers? Wikipedia's article on the double-precision floating-point format discusses the 64-bit layout.

The key is that floating point numbers are made up of three components:

1. a sign bit
2. an exponent
3. a fraction

For normal numbers, the value equals $(-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent} - 1023}$: the stored exponent is biased by 1023, and the fraction carries an implicit leading 1.
Incrementing the exponent doubles the value:

In [13]:
bitstring(1.0)

Out[13]:
"0011111111110000000000000000000000000000000000000000000000000000"
In [14]:
bitstring(2.0)

Out[14]:
"0100000000000000000000000000000000000000000000000000000000000000"

If the exponent contributes a factor $x = 2^e$, the fraction component decides where in the range $[x, 2x)$ the number lies. For example, the biggest 64-bit floating point number smaller than 2.0 has the same exponent as 1.0, but all the fraction bits are now 1:

In [15]:
bitstring(prevfloat(2.0))

Out[15]:
"0011111111111111111111111111111111111111111111111111111111111111"

What this means is that we can get a random floating point number between $1$ and $2$ by generating a bunch of random bits, and then setting the sign and exponent bits correctly (eg, for 64-bit floats, to $001111111111$).

We can do that via the bitwise “and” (&) and “or” (|) operators: we bitwise “and” with a mask to zero the sign and exponent bits (the first 12), and then bitwise “or” to set the exponent bits to $01111111111$.

In [16]:
bitstring(0x000fffffffffffff)

Out[16]:
"0000000000001111111111111111111111111111111111111111111111111111"
In [17]:
bitstring(0x3ff0000000000000)

Out[17]:
"0011111111110000000000000000000000000000000000000000000000000000"
In [18]:
bitstring(typemax(UInt))

Out[18]:
"1111111111111111111111111111111111111111111111111111111111111111"
In [19]:
bitstring(typemax(UInt) & 0x000fffffffffffff)

Out[19]:
"0000000000001111111111111111111111111111111111111111111111111111"
In [20]:
bitstring((typemax(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)

Out[20]:
"0011111111111111111111111111111111111111111111111111111111111111"
In [21]:
reinterpret(Float64, (typemax(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)

Out[21]:
1.9999999999999998

This is the largest 64-bit floating point value smaller than 2:

In [22]:
reinterpret(Float64, (typemax(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000) |> nextfloat

Out[22]:
2.0

Similarly, this operation will convert the smallest unsigned integer (zero) to $1.0$:

In [23]:
bitstring(typemin(UInt))

Out[23]:
"0000000000000000000000000000000000000000000000000000000000000000"
In [24]:
bitstring(typemin(UInt) & 0x000fffffffffffff)

Out[24]:
"0000000000000000000000000000000000000000000000000000000000000000"
In [25]:
bitstring((typemin(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)

Out[25]:
"0011111111110000000000000000000000000000000000000000000000000000"
In [26]:
reinterpret(Float64, (typemin(UInt) & 0x000fffffffffffff) | 0x3ff0000000000000)

Out[26]:
1.0

It is noteworthy that this means we are generating random numbers on the interval $[1,2)$. That is, the interval is closed-open. To get a random number between 0 and 1, we now have two choices, either:
\begin{align*}
x &\sim \text{uniform}[1,2)\\
u_1 &= x - 1.0\\
u_1 &\sim \text{uniform}[0,1)\\
u_2 &= 2.0 - x\\
u_2 &\sim \text{uniform}(0,1]\\
\end{align*}

Because it is more common for functions to be undefined on $0$ than on $1$ (eg, log), I choose $u_2$.
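
Putting the pieces together, here is a minimal sketch of the bits-to-uniform conversion, feeding it `rand(UInt64)` as a stand-in for the PCG's output:

```julia
# Minimal sketch: 64 random bits -> a uniform on (0, 1], via the [1, 2) trick.
bits = rand(UInt64)  # stand-in for the generator's random bits
x = reinterpret(Float64, (bits & 0x000fffffffffffff) | 0x3ff0000000000000)
u2 = 2.0 - x         # x ∈ [1, 2), so u2 ∈ (0, 1]
@assert 1.0 <= x < 2.0
@assert 0.0 < u2 <= 1.0
log(u2)              # safe: u2 is never exactly zero
```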

Now, the simplest ways to convert the uniform samples into exponential and normal samples are the probability integral transform and Box-Muller, respectively.
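
Here are the textbook versions of both transforms applied to a pair of uniforms (a sketch only; the library's internals may differ):

```julia
# Probability integral transform and Box-Muller, on two uniforms in (0, 1].
u1, u2 = 0.25, 0.75        # pretend these came from the uniform generator
e = -log(u1)               # probability integral transform: Exponential(1) draw
r = sqrt(-2 * log(u1))     # Box-Muller radius
z1 = r * cos(2π * u2)      # z1, z2 are independent standard normals
z2 = r * sin(2π * u2)
```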

The Box-Muller transform in particular has been criticized for being slow (for example, here). However, this slowness owes to Box-Muller requiring the calculation of sqrt, log, sin, and cos. I made comparisons with that library here.

To make our implementation fast, we rely on the fact that all of the operations we’ve discussed (including elementary functions like log) are highly vectorizable. With a CPU that has 512-bit vectors, it doesn’t take much longer to generate 8 double-precision or 16 single-precision numbers than it takes to generate just 1.
To vectorize, all I do is run multiple LCGs and sample from each of them. By using different multipliers, the streams follow different patterns, so their sequences of random numbers will not simply be shifted copies of one another.

It is also slightly faster to generate 2 or even 4 vectors' worth of LCG output at the same time, thanks to out-of-order execution / instruction-level parallelism. That is, we can draw samples of 1024 or 2048 random bits at a time on this machine.
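
The multi-stream idea can be sketched with a plain tuple `map` over four "lanes" (the library uses actual SIMD vectors, and the multipliers below are illustrative odd constants, not VectorizedRNG's defaults):

```julia
# Sketch: four LCG "lanes" stepped in lockstep, each with its own odd multiplier.
const LANE_MULTIPLIERS = (0x5851f42d4c957f2d, 0xd1342543de82ef95,
                          0x9e3779b97f4a7c15, 0xbf58476d1ce4e5b9)
const LANE_INCREMENT = 0x0000000000000001   # one shared odd increment

# All four lanes advance with one elementwise update (the vectorizable pattern).
step_lanes(states::NTuple{4,UInt64}) =
    map((a, s) -> a * s + LANE_INCREMENT, LANE_MULTIPLIERS, states)

states = (UInt64(1), UInt64(2), UInt64(3), UInt64(4))
states = step_lanes(states)
```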

In [27]:
using Random, VectorizedRNG, BenchmarkTools
@benchmark rand(Float64) # random uniform from base Julia

Out[27]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.210 ns (0.00% GC)
median time:      2.668 ns (0.00% GC)
mean time:        2.683 ns (0.00% GC)
maximum time:     39.650 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
In [28]:
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{16,Core.VecElement{Float64}}) # random uniforms with PCG

Out[28]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  4544945648114630656
--------------
minimum time:     6.416 ns (0.00% GC)
median time:      6.442 ns (0.00% GC)
mean time:        6.471 ns (0.00% GC)
maximum time:     27.931 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000

It took about 3 times longer to generate 16 times as many floating point numbers,

In [29]:
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float64}}) # random uniforms with PCG

Out[29]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  4552210007868858368
--------------
minimum time:     8.765 ns (0.00% GC)
median time:      8.796 ns (0.00% GC)
mean time:        8.822 ns (0.00% GC)
maximum time:     31.482 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     999

and less than 4 times longer to produce 32 times as many. The unvectorized rand sees the same type of improvement we do beyond our vector width, but is slower to produce 8 numbers than the vectorized routine is to produce 32:

In [30]:
@benchmark (rand(Float64),rand(Float64),rand(Float64),rand(Float64),
rand(Float64),rand(Float64),rand(Float64),rand(Float64))

Out[30]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     9.249 ns (0.00% GC)
median time:      9.740 ns (0.00% GC)
mean time:        9.765 ns (0.00% GC)
maximum time:     32.960 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000

The vectorized routine’s performance doubles when we switch to single precision, but the unvectorized routine does not benefit:

In [31]:
@benchmark (rand(Float32),rand(Float32),rand(Float32),rand(Float32),
rand(Float32),rand(Float32),rand(Float32),rand(Float32))

Out[31]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     9.941 ns (0.00% GC)
median time:      10.430 ns (0.00% GC)
mean time:        10.453 ns (0.00% GC)
maximum time:     33.016 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
In [32]:
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float32}}) # random uniforms with PCG

Out[32]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  4130293440567860704
--------------
minimum time:     6.375 ns (0.00% GC)
median time:      6.412 ns (0.00% GC)
mean time:        6.429 ns (0.00% GC)
maximum time:     25.599 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     1000
In [33]:
@benchmark rand(VectorizedRNG.GLOBAL_vPCG, NTuple{64,Core.VecElement{Float32}}) # random uniforms with PCG

Out[33]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  4062528339927142472
--------------
minimum time:     8.766 ns (0.00% GC)
median time:      8.796 ns (0.00% GC)
mean time:        8.825 ns (0.00% GC)
maximum time:     33.727 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     999

The case is less dramatic with random normals, but the vectorization still gives it a substantial lead in throughput:

In [34]:
@benchmark (randn(Float64),randn(Float64),randn(Float64),randn(Float64),
randn(Float64),randn(Float64),randn(Float64),randn(Float64))

Out[34]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     32.820 ns (0.00% GC)
median time:      34.281 ns (0.00% GC)
mean time:        34.369 ns (0.00% GC)
maximum time:     52.728 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     995
In [35]:
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{16,Core.VecElement{Float64}}) # random uniforms with PCG

Out[35]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  -4671199877106040429
--------------
minimum time:     28.988 ns (0.00% GC)
median time:      29.503 ns (0.00% GC)
mean time:        29.524 ns (0.00% GC)
maximum time:     64.333 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     995
In [36]:
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float64}}) # random uniforms with PCG

Out[36]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  -4664769688981915162
--------------
minimum time:     44.372 ns (0.00% GC)
median time:      44.440 ns (0.00% GC)
mean time:        44.579 ns (0.00% GC)
maximum time:     146.593 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     990

Although the single precision vectorized RNG is much faster:

In [37]:
@benchmark (randn(Float32),randn(Float32),randn(Float32),randn(Float32),
randn(Float32),randn(Float32),randn(Float32),randn(Float32))

Out[37]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     33.891 ns (0.00% GC)
median time:      35.357 ns (0.00% GC)
mean time:        35.450 ns (0.00% GC)
maximum time:     62.122 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     995
In [38]:
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{32,Core.VecElement{Float32}}) # random uniforms with PCG

Out[38]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  -5048715344386131593
--------------
minimum time:     21.202 ns (0.00% GC)
median time:      21.437 ns (0.00% GC)
mean time:        21.502 ns (0.00% GC)
maximum time:     59.424 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     997
In [39]:
@benchmark randn(VectorizedRNG.GLOBAL_vPCG, NTuple{64,Core.VecElement{Float32}}) # random uniforms with PCG

Out[39]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  -5041253489401791739
--------------
minimum time:     33.324 ns (0.00% GC)
median time:      33.453 ns (0.00% GC)
mean time:        33.569 ns (0.00% GC)
maximum time:     68.455 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     993

The single precision version is faster because the single precision log and sincos functions (sincos calculates both sin and cos) are less accurate (single precision, after all), and therefore take fewer steps to calculate.

Julia’s uniform random number generator actually is vectorized for large arrays, so this gives us a fairer basis for comparison. Here, the VectorizedRNG library will probably be slower for 64-bit floats on computers without avx512:

In [40]:
x64 = Vector{Float64}(undef, 1024)

@benchmark rand!($x64)

Out[40]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     507.534 ns (0.00% GC)
median time:      510.746 ns (0.00% GC)
mean time:        512.055 ns (0.00% GC)
maximum time:     688.554 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     193
In [41]:
@benchmark rand!(VectorizedRNG.GLOBAL_vPCG,$x64)

Out[41]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     288.202 ns (0.00% GC)
median time:      288.674 ns (0.00% GC)
mean time:        289.544 ns (0.00% GC)
maximum time:     428.028 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     282
In [42]:
x32 = Vector{Float32}(undef, 1024)

@benchmark rand!($x32)

Out[42]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     573.361 ns (0.00% GC)
median time:      576.628 ns (0.00% GC)
mean time:        578.777 ns (0.00% GC)
maximum time:     826.858 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     183
In [43]:
@benchmark rand!(VectorizedRNG.GLOBAL_vPCG,$x32)

Out[43]:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     148.743 ns (0.00% GC)
median time:      148.999 ns (0.00% GC)
mean time:        149.367 ns (0.00% GC)
maximum time:     203.272 ns (0.00% GC)
--------------
samples:          10000
evals/sample:     836

We see a similar pattern with the fill routines; for single precision, the performance is not even close:

In [44]:
@btime randn!($x64);

  3.901 μs (0 allocations: 0 bytes)

In [45]:
@btime randn!(VectorizedRNG.GLOBAL_vPCG,$x64);

  1.412 μs (0 allocations: 0 bytes)

In [46]:
@btime randn!($x32);

  3.999 μs (0 allocations: 0 bytes)

In [47]:
@btime randn!(VectorizedRNG.GLOBAL_vPCG,$x32);

  537.566 ns (0 allocations: 0 bytes)

In [48]:
@btime randexp!($x64);

  3.781 μs (0 allocations: 0 bytes)

In [49]:
@btime randexp!(VectorizedRNG.GLOBAL_vPCG,$x64);

  1.046 μs (0 allocations: 0 bytes)

In [50]:
@btime randexp!($x32);

  4.096 μs (0 allocations: 0 bytes)

In [51]:
@btime randexp!(VectorizedRNG.GLOBAL_vPCG,$x32);

  417.714 ns (0 allocations: 0 bytes)


PCGs have good statistical properties, and they can take advantage of modern hardware to be blisteringly fast. Or rather, modern hardware lets us generate huge numbers of them at a time.