seander@cs.stanford.edu

- About the operation counting methodology
- Compute the sign of an integer
- Compute the integer absolute value (abs) without branching
- Compute the minimum (min) or maximum (max) of two integers without branching
- Determining if an integer is a power of 2
- Sign extending
- Conditionally set or clear bits without branching
- Merge bits from two values according to a mask
- Counting bits set
- Computing parity (1 if an odd number of bits set, 0 otherwise)
- Swapping Values
- Reversing bit sequences
- Reverse bits the obvious way
- Reverse bits in word by lookup table
- Reverse the bits in a byte with 3 operations (64-bit muliply and modulus division)
- Reverse the bits in a byte with 4 operations (64-bit multiply, no division)
- Reverse the bits in a byte with 7 operations (no 64-bit, only 32)
- Reverse an N-bit quantity in parallel with 5 * lg(N) operations

- Modulus division (aka computing
*remainders*) - Finding integer log base 2 of an integer (aka the position of the highest bit set)
- Find the log base 2 of an integer with the MSB N set in O(N) operations (the obvious way)
- Find the integer log base 2 of an integer with an 64-bit IEEE float
- Find the log base 2 of an integer with a lookup table
- Find the log base 2 of an N-bit integer in O(lg(N)) operations
- Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup

- Find integer log base 10 of an integer
- Find integer log base 2 of a 32-bit IEEE float
- Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for unsigned integer r)
- Counting consecutive trailing zero bits (or finding bit indices)
- Count the consecutive zero bits (trailing) on the right in parallel
- Count the consecutive zero bits (trailing) on the right by binary search
- Count the consecutive zero bits (trailing) on the right by casting to a float
- Count the consecutive zero bits (trailing) on the right with modulus division and lookup
- Count the consecutive zero bits (trailing) on the right with multiply and lookup

- Round up to the next highest power of 2 by float casting
- Round up to the next highest power of 2
- Interleaving bits (aka computing
*Morton Numbers*) - Testing for ranges of bytes in a word (and counting occurances found)

int v; // we want to find the sign of v int sign; // the result goes here // CHAR_BIT is the number of bits per byte (normally 8). sign = -(v < 0); // if v < 0 then -1, else 0. // or, to avoid branching on CPUs with flag registers (IA32): sign = -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1)); // or, for one less instruction (but not portable): sign = v >> (sizeof(int) * CHAR_BIT - 1);The last expression above evaluates to sign = v >> 31 for 32-bit integers. This is one operation faster than the obvious way, sign = -(v < 0). This trick works because when signed integers are shifted right, the value of the far left bit is copied to the other bits. The far left bit is 1 when the value is negative and 0 otherwise; all 1 bits gives -1. Unfortunately, this behavior is architecture-specific.

Alternatively, if you prefer the result be either -1 or +1, then use:

sign = +1 | (v >> (sizeof(int) * CHAR_BIT - 1)); // if v < 0 then -1, else +1

Alternatively, if you prefer the result be either -1, 0, or +1, then use:

sign = (v != 0) | -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1)); // Or, for more speed but less portability: sign = (v != 0) | (v >> (sizeof(int) * CHAR_BIT - 1)); // -1, 0, or +1 // Or, for portability, brevity, and (perhaps) speed: sign = (v > 0) - (v < 0); // -1, 0, or +1Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification leaves the result of signed right-shift implementation-defined, so on some systems this hack might not work. For greater portability, Toby Speight suggested on September 28, 2005 that CHAR_BIT be used here and throughout rather than assuming bytes were 8 bits long. Angus recommended the more portable versions above, involving casting on March 4, 2006.

int v; // we want to find the absolute value of v int r; // the result goes here r = (v ^ (v >> (sizeof(int) * CHAR_BIT - 1))) - (v >> (sizeof(int) * CHAR_BIT - 1));Some CPUs don't have an integer absolute value instruction (or the compiler fails to use them). On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (v < 0) ? -v : v, even though the number of operations is the same.

Caveats: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C
specification leaves the result of signed right-shift implementation-defined,
so on some systems this hack might not work. I've read that ANSI C does not
require values to be represented as two's complement, so it may not work
for that reason as well (on a diminishingly small number of old machines
that still use one's complement).
On March 14, 2004, Keith H. Duggar sent me the solution above; it is
superior to the one I initially came up with,
`r=(+1|(v>>(sizeof(int)*CHAR_BIT-1)))*v`

,
because a multiply is not used.
Unfortunately, this method has been
patented in the USA on June 6, 2000 by Vladimir Yu Volkonsky and
assigned to Sun Microsystems. (For an
unpatented non-multiplying alternative, consider
```
(v ^ (v >> (sizeof(int) * CHAR_BIT - 1))) +
((unsigned) v >> (sizeof(int) * CHAR_BIT - 1))
```

,
suggested by Thorbjørn Willoch on June 21, 2004 or my variation,
`(v ^ (v >> (sizeof(int) * CHAR_BIT - 1))) + (v < 0)`

,
both of which take one more operation than the patented one, though it may
be computed in parallel on modern CPUs.) On August 13, 2006, Yuriy
Kaminskiy told me that the patent is likely invalid because the method
was published well before the patent was even filed, such as in
How to Optimize for
the Pentium Processor by Agner Fog, dated November, 9, 1996. Yuriy also
mentioned that this document was translated to Russian in 1997, which
Vladimir could have read. Moreover, the Internet Archive also has an old
link to it.

int x; // we want to find the minimum of x and y int y; int r; // the result goes here r = y + ((x - y) & -(x < y)); // min(x, y)On machines where branching is expensive, the above expression can be faster than the obvious approach, r = (x < y) ? x : y, even though it involves two more instructions. It works because if x < y, then -(x < y) will be all ones, so r = y + (x - y) & ~0 = y + x - y = x. Otherwise, if x >= y, then -(x < y) will be all zeros, so r = y + (x - y) & 0 = y. On some machines, evaluating (x < y) as 0 or 1 requires a branch instruction, so there may be no advantage.

To find the maximum, use:

r = x - ((x - y) & -(x < y)); // max(x, y)

r = y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y) r = x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)On March 7, 2003, Angus Duggan pointed out the right-shift portability issue. On May 3, 2005, Randal E. Bryant alerted me to the need for the precondition, INT_MIN <= x - y <= INT_MAX, and suggested the non-quick and dirty version as a fix. Both of these issues concern only the quick and dirty version. Nigel Horspoon pointed out on July 6, 2005 that gcc produced the same code on a Pentium as the obvious solution because of how it evaluates (x < y).

unsigned int v; // we want to see if v is a power of 2 bool f; // the result goes here f = (v & (v - 1)) == 0;Note that 0 is incorrectly considered a power of 2 here. To remedy this, use:

f = !(v & (v - 1)) && v;

int x; // convert this from using 5 bits to a full int int r; // resulting sign extended number goes here struct {signed int x:5;} s; r = s.x = x;The following is a C++ template function that uses the same language feature to convert from B bits in one operation (though the compiler is generating more, of course).

template <typename T, unsigned B> inline T signextend(const T x) { struct {T x:B;} s; return s.x = x; } int r = signextend<signed int,5>(x); // sign extend 5 bit number x to r

John Byrd caught a typo in the code (attributed to html formatting) on May 2, 2005. On March 4, 2006, Pat Wood pointed out that the ANSI C standard requires that the bitfield have the keyword "signed" to be signed; otherwise, the sign is undefined.

unsigned b; // number of bits representing the number in x int x; // sign extend this b-bit number to r int r; // resulting sign-extended number int const m = 1 << (b - 1); // mask can be pre-computed if b is fixed r = -(x & m) | x;The code above requires five operations. Sean A. Irvine suggested that I add sign extension methods to this page on June 13, 2004, and he provided

`m = (1 << (b - 1)) - 1; r = -(x & ~m) | x;`

as a starting point from which I optimized.

unsigned b; // number of bits representing the number in x int x; // sign extend this b-bit number to r int r; // resulting sign-extended number #define M(B) (1 << ((sizeof(x) * CHAR_BIT) - B)) // CHAR_BIT=bits/byte int const multipliers[] = { 0, M(1), M(2), M(3), M(4), M(5), M(6), M(7), M(8), M(9), M(10), M(11), M(12), M(13), M(14), M(15), M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23), M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31), M(32) }; // (add more if using more than 64 bits) int const divisors[] = { 1, ~M(1), M(2), M(3), M(4), M(5), M(6), M(7), M(8), M(9), M(10), M(11), M(12), M(13), M(14), M(15), M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23), M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31), M(32) }; // (add more for 64 bits) #undef M r = (x * multipliers[b]) / divisors[b];The following variation is not portable, but on architectures that employ an arithmetic right-shift, maintaining the sign, it should be fast.

const int s = -b; // OR: sizeof(x) * CHAR_BIT - b; r = (x << s) >> s;Randal E. Bryant pointed out a bug on May 3, 2005 in an earlier version (that used multipliers[] for divisors[]), where it failed on the case of x=1 and b=1.

bool f; // conditional flag unsigned int m; // the bit mask unsigned int w; // the word to modify: if (f) w |= m; else w &= ~m; w ^= (-f ^ w) & m;On some architectures, the lack of branching can more than make up for what appears to be twice as many operations. For instance, informal speed tests on an AMD Athlon™ XP 2100+ indicated it was 5-10% faster. Glenn Slayden informed me of this expression on December 11, 2003.

unsigned int a; // value to merge in non-masked bits unsigned int b; // value to merge in masked bits unsigned int mask; // 1 where bits from b should be selected; 0 where from a. unsigned int r; // result of (a & ~mask) | (b & mask) goes here r = a ^ ((a ^ b) & mask);This shaves one operation from the obvious way of combining two sets of bits according to a bit mask. If the mask is a constant, then there may be no advantage.

Ron Jeffery sent this to me on February 9, 2006.

unsigned int v; // count the number of bits set in v unsigned int c; // c accumulates the total bits set in v for (c = 0; v; v >>= 1) { c += v & 1; }The naive approach requires one iteration per bit, until no more bits are set. So on a 32-bit word with only the high set, it will go through 32 iterations.

const unsigned char BitsSetTable256[] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 }; unsigned int v; // count the number of bits set in 32-bit value v unsigned int c; // c is the total bits set in v // Option 1: c = BitsSetTable256[v & 0xff] + BitsSetTable256[(v >> 8) & 0xff] + BitsSetTable256[(v >> 16) & 0xff] + BitsSetTable256[v >> 24]; // Option 2: unsigned char * p = (unsigned char *) &v; c = BitsSetTable256[p[0]] + BitsSetTable256[p[1]] + BitsSetTable256[p[2]] + BitsSetTable256[p[3]]; // To initially generate the table algorithmically: BitsSetTable256[0] = 0; for (int i = 0; i < 256; i++) { BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2]; }

unsigned int v; // count the number of bits set in v unsigned int c; // c accumulates the total bits set in v for (c = 0; v; c++) { v &= v - 1; // clear the least significant bit set }Brian Kernighan's method goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

unsigned int v; // count the number of bits set in v unsigned int c; // c accumulates the total bits set in v // option 1, for at most 12-bit values in v: c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; // option 2, for at most 24-bit values in v: c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; // option 3, for at most 32-bit values in v: c = ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.

Rich Schroeppel originally created a 9-bit version, similiar to option 1; see the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29, 1972. His method was the inspiration for the variants above, devised by Sean Anderson. Randal E. Bryant offered a couple bug fixes on May 3, 2005.

unsigned int v; // count bits set in this (32-bit value) unsigned int c; // store the total here const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF}; c = v; c = v - ((v >> 1) & B[0]); c = ((c >> S[1]) & B[1]) + (c & B[1]); c = ((c >> S[2]) + c) & B[2]; c = ((c >> S[3]) + c) & B[3]; c = ((c >> S[4]) + c) & B[4];The B array, expressed as binary, is:

B[0] = 0x55555555 = 01010101 01010101 01010101 01010101 B[1] = 0x33333333 = 00110011 00110011 00110011 00110011 B[2] = 0x0F0F0F0F = 00001111 00001111 00001111 00001111 B[3] = 0x00FF00FF = 00000000 11111111 00000000 11111111 B[4] = 0x0000FFFF = 00000000 00000000 11111111 11111111We can adjust the method for larger integer sizes by continuing with the patterns for the

The best method for counting bits in a 32-bit integer v is the following:

unsigned int const w = v - ((v >> 1) & 0x55555555); // temp unsigned int const x = (w & 0x33333333) + ((w >> 2) & 0x33333333); // temp unsigned int const c = ((x + (x >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

The best bit counting method takes only 12 operations, which is the same as the lookup-table method, but avoids the memory and potential cache misses of a table. It is a hybrid between the purely parallel method above and the earlier methods using multiplies (in the section on counting bits with 64-bit instructions), though it doesn't use 64-bit instructions. The counts of bits set in the bytes is done in parallel, and the sum total of the bits set in the bytes is computed by multiplying by 0x1010101 and shifting right 24 bits.

See
Ian Ashdown's nice newsgroup post for more information
on counting the number of bits set (also known as *sideways addition*).
The best bit counting method was brought to my attention on October 5, 2005
by Andrew Shapira;
he found it in pages 187-188 of
Software
Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors.
Charlie Gordon suggested a way to shave off one operation from the
purely parallel version on December 14, 2005, and Don Clugston trimmed three
more from it on December 30, 2005. He also pointed out that if rotate
operations were available, changing the shifts to rotates would eliminate
the need for the AND masks because the overcounted sum total could instead be
right-shifted 3 at the end. I made a typo with Don's suggestion that
Eric Cole spotted on January 8, 2006.

unsigned int v; // word value to compute the parity of bool parity = false; // parity will be the parity of b while (v) { parity = !parity; v = v & (v - 1); }The above code uses an approach like Brian Kernigan's bit counting, above. The time it takes is proportional to the number of bits set.

unsigned char b; // byte value to compute the parity of bool parity = (((b * 0x0101010101010101ULL) & 0x8040201008040201ULL) % 0x1FF) & 1;The method above takes around 4 operations, but only works on bytes.

unsigned int v; // word value to compute the parity of v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v &= 0xf; return (0x6996 >> v) & 1;The method above takes around 9 operations, and works for 32-bit words. It may be optimized to work just on bytes in 5 operations by removing the two lines immediately following "unsigned int v;". The method first shifts and XORs the eight nibbles of the 32-bit value together, leaving the result in the lowest nibble of v. Next, the binary number 0110 1001 1001 0110 (0x6996 in hex) is shifted to the right by the value represented in the lowest nibble of v. This number is like a miniature 16-bit parity-table indexed by the low four bits in v. The result has the parity of v in bit 1, which is masked and returned.

Thanks to Mathew Hendry for pointing out the shift-lookup idea at the end on Dec. 15, 2002. That optimization shaves two operations off using only shifting and XORing to find the parity.

const bool ParityTable[] = { 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 }; unsigned char b; // byte value to compute the parity of bool parity = ParityTable[b]; // OR, for 32-bit words: unsigned int v; // word value to compute the parity of bool parity = ParityTable[(v & 0x000000ff)] ^ ParityTable[(v & 0x0000ff00) >> 8] ^ ParityTable[(v & 0x00ff0000) >> 16] ^ ParityTable[(v & 0xff000000) >> 24]; // Variation: unsigned char * p = (unsigned char *) &v; bool parity = ParityTable[p[0]] ^ ParityTable[p[1]] ^ ParityTable[p[2]] ^ ParityTable[p[3]];Randal E. Byrant encouraged the addition of the (admittedly) obvious last variation with variable p on May 3, 2005. Bruce Rawles found a typo in an instance of the table variable's name on September 27, 2005, and he received a $10 bug bounty.

#define SWAP(a, b) (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b)))This is an old trick to exchange the values of the variables a and b

On January 20, 2005, Iain A. Fleming pointed out that the macro above doesn't work when you swap with the same memory location, such as SWAP(a[i], a[j]) with i == j. So if that may occur, consider defining the macro as (((a) == (b)) || (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b)))).

unsigned int i, j; // positons of bit sequences to swap unsigned int n; // number of consecutive bits in each sequence unsigned int b; // bits to swap reside in b unsigned int r; // bit-swapped result goes here int x = ((b >> i) ^ (b >> j)) & ((1 << n) - 1); // XOR temporary r = b ^ ((x << i) | (x << j));As an example of swapping ranges of bits suppose we have have b =

This method of swapping is similar to the general purpose XOR swap trick, but intended for operating on individual bits. The variable x stores the result of XORing the pairs of bit values we want to swap, and then the bits are set to the result of themselves XORed with x. Of course, the result is undefined if the sequences overlap.

unsigned int v; // reverse the bits in this unsigned int t = v << 1; // t will have the reversed bits of v int i; v >>= 1; for (i = sizeof(v) * CHAR_BIT - 2; i; i--) { t |= v & 1; t <<= 1; v >>= 1; } t |= v;

On October 15, 2004, Michael Hoisie pointed out a bug in the original version. Randal E. Byrant suggested removing an extra operation on May 3, 2005. Behdad Esfabod suggested a slight change that eliminated one iteration of the loop on May 18, 2005.

const unsigned char BitReverseTable256[] = { 0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0, 0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8, 0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4, 0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC, 0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2, 0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA, 0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6, 0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE, 0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1, 0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9, 0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5, 0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD, 0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3, 0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB, 0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7, 0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF }; unsigned int v; // reverse 32-bit value, 8 bits at time unsigned int c; // c will get v reversed // Option 1: c = (BitReverseTable256[v & 0xff] << 24) | (BitReverseTable256[(v >> 8) & 0xff] << 16) | (BitReverseTable256[(v >> 16) & 0xff] << 8) | (BitReverseTable256[(v >> 24) & 0xff]); // Option 2: unsigned char * p = (unsigned char *) &v; unsigned char * q = (unsigned char *) &c; q[3] = BitReverseTable256[p[0]]; q[2] = BitReverseTable256[p[1]]; q[1] = BitReverseTable256[p[2]]; q[0] = BitReverseTable256[p[3]];The first method takes about 17 operations, and the second takes about 12, assuming your CPU can load and store bytes easily.

unsigned char b; // reverse this (8-bit) byte b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;The multiply operation creates five separate copies of the 8-bit byte pattern to fan-out into a 64-bit value. The AND operation selects the bits that are in the correct (reversed) positions, relative to each 10-bit groups of bits. The multiply and the AND operations copy the bits from the original byte so they each appear in only one of the 10-bit sets. The reversed positions of the bits from the original byte coincide with their relative positions within any 10-bit set. The last step, which involves modulus division by 2^10 - 1, has the effect of merging together each set of 10 bits (from positions 0-9, 10-19, 20-29, ...) in the 64-bit value. They do not overlap, so the addition steps underlying the modulus division behave like or operations.

This method was attributed to Rich Schroeppel in the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb. 29, 1972.

unsigned char b; // reverse this byte b = ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;The following shows the flow of the bit values with the boolean variables

`a, b, c, d, e, f, g,`

and `h`

, which
comprise an 8-bit byte. Notice how the first multiply fans out the
bit pattern to multiple copies, while the last multiply combines them
in the fifth byte from the right.
abcd efgh (-> hgfe dcba) * 1000 0000 0010 0000 0000 1000 0000 0010 (0x80200802) ------------------------------------------------------------------------------------------------- 0abc defg h00a bcde fgh0 0abc defg h00a bcde fgh0 & 0000 1000 1000 0100 0100 0010 0010 0001 0001 0000 (0x0884422110) ------------------------------------------------------------------------------------------------- 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 * 0000 0001 0000 0001 0000 0001 0000 0001 0000 0001 (0x0101010101) ------------------------------------------------------------------------------------------------- 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 0000 d000 h000 0c00 0g00 00b0 00f0 000a 000e 0000 ------------------------------------------------------------------------------------------------- 0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba hgfe 0cba 0gfe 00ba 00fe 000a 000e 0000 >> 32 ------------------------------------------------------------------------------------------------- 0000 d000 h000 dc00 hg00 dcb0 hgf0 dcba hgfe dcba & 1111 1111 ------------------------------------------------------------------------------------------------- hgfe dcbaNote that the last two steps can be combined on some processors because the registers can be accessed as bytes; just multiply so that a register stores the upper 32 bits of the result and the take the low byte. Thus, it may take only 6 operations.

Devised by Sean Anderson, July 13, 2001.

b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16Devised by Sean Anderson, July 13, 2001. Typo spotted and correction supplied by Mike Keith, January 3, 2002.

unsigned int v; // 32 bit word to reverse bit order // swap odd and even bits v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1); // swap consecutive pairs v = ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2); // swap nibbles ... v = ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4); // swap bytes v = ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8); // swap 2-byte long pairs v = ( v >> 16 ) | ( v << 16);The following variation is also O(lg(N)), however it requires more operations to reverse v. Its virtue is in taking less slightly memory by computing the constants on the fly.

unsigned int s = sizeof(v) * CHAR_BIT; // bit size; must be power of 2 unsigned int mask = ~0; while ((s >>= 1) > 0) { mask ^= (mask << s); v = ((v >> s) & mask) | ((v << s) & ~mask); }These methods above are best suited to situations where N is large.

See Dr. Dobb's Journal 1983, Edwin Freed's article on Binary Magic Numbers for more information. The second variation was suggested by Ken Raeburn on September 13, 2005. Veldmeijer mentioned that the first version could do without ANDS in the last line on March 19, 2006.

const unsigned int n; // numerator const unsigned int s; const unsigned int d = 1 << s; // So d will be one of: 1, 2, 4, 8, 16, 32, ... unsigned int m; // m will be n % d m = n & (d - 1);Most programmers learn this trick early, but it was included for the sake of completeness.

unsigned int n; // numerator const unsigned int s; // s > 0 const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...). unsigned int m; // n % d goes here. for (m = n; n > d; n = m) { for (m = 0; n; n >>= s) { m += n & d; } } // Now m is a value from 0 to d, but since with modulus division // we want m to be 0 when it is d. m = m == d ? 0 : m;This method of modulus division by an integer that is one less than a power of 2 takes at most 5 + (4 + 5 * ceil(N / s)) * ceil(lg(N / s)) operations, where N is the number of bits in the numerator. In other words, it takes at most O(N * lg(N)) time.

Devised by Sean Anderson, August 15, 2001. Before Sean A. Irvine corrected me
on June 17, 2004, I mistakenly commented that we could alternatively assign
`m = ((m + 1) & d) - 1;`

at the end. Michael Miller spotted a
typo in the code April 25, 2005.

// The following is for a word size of 32 bits! const unsigned int M[] = { 0x00000000, 0x55555555, 0x33333333, 0xc71c71c7, 0x0f0f0f0f, 0xc1f07c1f, 0x3f03f03f, 0xf01fc07f, 0x00ff00ff, 0x07fc01ff, 0x3ff003ff, 0xffc007ff, 0xff000fff, 0xfc001fff, 0xf0003fff, 0xc0007fff, 0x0000ffff, 0x0001ffff, 0x0003ffff, 0x0007ffff, 0x000fffff, 0x001fffff, 0x003fffff, 0x007fffff, 0x00ffffff, 0x01ffffff, 0x03ffffff, 0x07ffffff, 0x0fffffff, 0x1fffffff, 0x3fffffff, 0x7fffffff }; const unsigned int Q[][6] = { { 0, 0, 0, 0, 0, 0}, {16, 8, 4, 2, 1, 1}, {16, 8, 4, 2, 2, 2}, {15, 6, 3, 3, 3, 3}, {16, 8, 4, 4, 4, 4}, {15, 5, 5, 5, 5, 5}, {12, 6, 6, 6 , 6, 6}, {14, 7, 7, 7, 7, 7}, {16, 8, 8, 8, 8, 8}, { 9, 9, 9, 9, 9, 9}, {10, 10, 10, 10, 10, 10}, {11, 11, 11, 11, 11, 11}, {12, 12, 12, 12, 12, 12}, {13, 13, 13, 13, 13, 13}, {14, 14, 14, 14, 14, 14}, {15, 15, 15, 15, 15, 15}, {16, 16, 16, 16, 16, 16}, {17, 17, 17, 17, 17, 17}, {18, 18, 18, 18, 18, 18}, {19, 19, 19, 19, 19, 19}, {20, 20, 20, 20, 20, 20}, {21, 21, 21, 21, 21, 21}, {22, 22, 22, 22, 22, 22}, {23, 23, 23, 23, 23, 23}, {24, 24, 24, 24, 24, 24}, {25, 25, 25, 25, 25, 25}, {26, 26, 26, 26, 26, 26}, {27, 27, 27, 27, 27, 27}, {28, 28, 28, 28, 28, 28}, {29, 29, 29, 29, 29, 29}, {30, 30, 30, 30, 30, 30}, {31, 31, 31, 31, 31, 31} }; const unsigned int R[][6] = { {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000}, {0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000001, 0x00000001}, {0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000003, 0x00000003}, {0x00007fff, 0x0000003f, 0x00000007, 0x00000007, 0x00000007, 0x00000007}, {0x0000ffff, 0x000000ff, 0x0000000f, 0x0000000f, 0x0000000f, 0x0000000f}, {0x00007fff, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f}, {0x00000fff, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f}, {0x00003fff, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f}, {0x0000ffff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff}, {0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff}, {0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff}, {0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff}, {0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff}, {0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff}, {0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff}, {0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff}, {0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff}, {0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff}, {0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff}, {0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff}, {0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff}, {0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff}, {0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff}, {0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff}, {0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff}, {0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff}, {0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff}, {0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff}, {0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff}, {0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff}, {0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff}, {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff} }; unsigned int n; // numerator const unsigned int s; // s > 0 const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...). unsigned int m; // n % d goes here. m = (n & M[s]) + ((n & ~M[s]) >> s); for (const unsigned int * q = &Q[s][0], * r = &R[s][0]; m > d; q++, r++) { m = (m >> *q) + (m & *r); } m = m == d ? 0 : m; // OR, less portably: m = m & -((signed)(m - d) >> s);This method of finding modulus division by an integer that is one less than a power of 2 takes at most O(lg(N)) time, where N is the number of bits in the numerator (32 bits, for the code above). The number of operations is at most 12 + 9 * ceil(lg(N)). The tables may be removed if you know the denominator at compile time; just extract the few relevent entries and unroll the loop. It may be easily extended to more bits.

It finds the result by summing the values in base (1 << s) in parallel. First every other base (1 << s) value is added to the previous one. Imagine that the result is written on a piece of paper. Cut the paper in half, so that half the values are on each cut piece. Align the values and sum them onto a new piece of paper. Repeat by cutting this paper in half (which will be a quarter of the size of the previous one) and summing, until you cannot cut further. After performing lg(N/s/2) cuts, we cut no more; just continue to add the values and put the result onto a new piece of paper as before, while there are at least two s-bit values.

Devised by Sean Anderson, August 20, 2001. A typo was spotted by
Randy E. Bryant on May 3, 2005 (after pasting the code, I had later
added "unsinged" to a variable declaration). As in the previous hack,
I mistakenly commented that we could alternatively assign
`m = ((m + 1) & d) - 1;`

at the end, and Don Knuth corrected
me on April 19, 2006 and suggested `m = m & -((signed)(m - d) >> s)`

.

unsigned int v; // 32-bit word to find the log base 2 of unsigned r = 0; // r will be lg(v) while (v >>= 1) // unroll for more speed... { r++; }The log base 2 of an integer is the same as the position of the highest bit set (or most significant bit set, MSB). The following log base 2 methods are faster than this one.

int v; // 32-bit integer to find the log base 2 of int r; // result of log_2(v) goes here union { unsigned int u[2]; double d; } t; // temp t.u[BYTE_ORDER==LITTLE_ENDIAN] = 0x43300000; t.u[BYTE_ORDER!=LITTLE_ENDIAN] = v; t.d -= 4503599627370496.0; r = (t.u[BYTE_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;The code above loads a 64-bit (IEEE-754 floating-point) double with a 32 bit integer by storing the integer in the mantissa while the exponent is set to 2

Eric Cole sent me this on January 15, 2006. Evan Felix pointed out a typo on April 4, 2006.

static const char LogTable256[] = { 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7 }; unsigned int v; // 32-bit word to find the log of unsigned r = 0; // r will be lg(v) register unsigned int t, tt; // temporaries if (tt = v >> 16) { r = (t = v >> 24) ? 24 + LogTable256[t] : 16 + LogTable256[tt & 0xFF]; } else { r = (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v]; }The lookup table method takes only about 7 operations to find the log of a 32-bit value. If extended for 64-bit quantities, it would take roughly 9 operations. Another operation can be trimmed off by using four tables, with the possible additions incorporated into each. Using int table elements may be faster, depending on your architecture.

// To initially generate the log table algorithmically: LogTable256[0] = LogTable256[1] = 0; for (int i = 2; i < 256; i++) { LogTable256[i] = 1 + LogTable256[i / 2]; }Behdad Esfahbod and I shaved off a fraction of an operation (on average) on May 18, 2005.

unsigned int v; // 32-bit value to find the log2 of const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000}; const unsigned int S[] = {1, 2, 4, 8, 16}; int i; register unsigned int r = 0; // result of log2(v) will go here for (i = 4; i >= 0; i--) // unroll for speed... { if (v & b[i]) { v >>= S[i]; r |= S[i]; } } // OR (IF YOUR CPU BRANCHES SLOWLY): register unsigned int shift; shift = ( ( v & 0xFFFF0000 ) != 0 ) << 4; v >>= shift; r |= shift; shift = ( ( v & 0xFF00 ) != 0 ) << 3; v >>= shift; r |= shift; shift = ( ( v & 0xF0 ) != 0 ) << 2; v >>= shift; r |= shift; shift = ( ( v & 0xC ) != 0 ) << 1; v >>= shift; r |= shift; shift = ( ( v & 0x2 ) != 0 ) << 0; v >>= shift; r |= shift; // OR (IF YOU KNOW v IS A POWER OF 2): const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0, 0xFF00FF00, 0xFFFF0000}; register unsigned int r = (v & b[0]) != 0; for (i = 4; i > 0; i--) // unroll for speed... { r |= ((v & b[i]) != 0) << i; }Of course, to extend the code to find the log of a 33- to 64-bit number, we would append another element, 0xFFFFFFFF00000000, to b, append 32 to S, and loop from 5 to 0. This method is much slower than the earlier table-lookup version, but if you don't want big table or your architecture is slow to access memory, it's a good choice. The second variation involves more operations, but it may be faster on machines with high branch costs (e.g. PowerPC), and it was sent to me by Eric Cole on January 7, 2006. The third variation was suggested to me by John Owens on April 24, 2002; it's faster, but

unsigned int v; // find the log base 2 of 32-bit v int r; // result goes here static const int MultiplyDeBruijnBitPosition[32] = { 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 }; v |= v >> 1; // first round down to power of 2 v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v = (v >> 1) + 1; r = MultiplyDeBruijnBitPosition[(v * 0x077CB531UL) >> 27];The code above computes the log base 2 of a 32-bit integer with a small table lookup and multiply. It requires only 15 operations, compared to (up to) 20 for the previous method. The purely table-based method requires the fewest operations, but this offers a reasonable compromise between table size and speed. If v is known to be a power of 2, then only the last line is needed (3 operations).

Eric Cole devised this January 8, 2006 after reading about the entry below to round up to a power of 2 and the method below for computing the number of trailing bits with a multiply and lookup using a DeBruijn sequence.

unsigned int v; // non-zero 32-bit integer value to compute the log base 10 of int r; // result goes here int t; // temporary static unsigned int const PowersOf10[] = {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000}; t = (IntegerLogBase2(v) + 1) * 1233 >> 12; // (use a lg2 method from above) r = t - (v < PowersOf10[t]);The integer log base 10 is computed by first using one of the techniques above for finding the log base 2. By the relationship log

This method takes 6 more operations than IntegerLogBase2. It may be sped up (on machines with fast memory access) by modifying the log base 2 table-lookup method above so that the entries hold what is computed for t (that is, pre-add, -mulitply, and -shift). Doing so would require a total of only 9 operations to find the log base 10, assuming 4 tables were used (one for each byte of v).

Eric Cole suggested I add a version of this on January 7, 2006.

const float v; // find int(log2(v)), where v > 0.0 && finite(v) && isnormal(v) int c; // 32-bit int c gets the result; c = *(const int *) &v; // OR, for portability: memcpy(&c, &v, sizeof c); c = (c >> 23) - 127;The above is fast, but IEEE 754-compliant architectures utilize

const float v; // find int(log2(v)), where v > 0.0 && finite(v) int c; // 32-bit int c gets the result; int x = *(const int *) &v; // OR, for portability: memcpy(&x, &v, sizeof x); c = x >> 23; if (c) { c -= 127; } else { // subnormal, so recompute using mantissa: c = intlog2(x) - 149; register unsigned int t; // temporary // Note that LogTable256 was defined earlier if (t = x >> 16) { c = LogTable256[t] - 133; } else { c = (t = x >> 8) ? LogTable256[t] - 141 : LogTable256[x] - 149; } }On June 20, 2004, Sean A. Irvine suggested that I include code to handle subnormal numbers. On June 11, 2005, Falk Hüffner pointed out that ISO C99 6.5/7 specified undefined behavior for the common type punning idiom *(int *)&, though it has worked on 99.9% of C compilers. He proposed using memcpy for maximum portability or a union with a float and an int for better code generation than memcpy on some compilers.

const int r; const float v; // find int(log2(pow((double) v, 1. / pow(2, r)))), // where isnormal(v) and v > 0 int c; // 32-bit int c gets the result; c = *(const int *) &v; // OR, for portability: memcpy(&c, &v, sizeof c); c = ((((c - 0x3f800000) >> r) + 0x3f800000) >> 23) - 127;So, if r is 0, for example, we have c = int(log2((double) v)). If r is 1, then we have c = int(log2(sqrt((double) v))). If r is 2, then we have c = int(log2(pow((double) v, 1./4))).

On June 11, 2005, Falk Hüffner pointed out that ISO C99 6.5/7 left the type punning idiom *(int *)& undefined, and he suggested using memcpy.

unsigned int v; // 32-bit word input to count zero bits on right unsigned int c = 32; // c will be the number of zero bits on the right, // so if v is 1101000 (base 2), then c will be 3 const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF}; const unsigned int S[] = {1, 2, 4, 8, 16}; // Our Magic Binary Numbers for (int i = 4; i >= 0; --i) // unroll for more speed { if (v & B[i]) { v <<= S[i]; c -= S[i]; } } if (v) { c--; }Here, we are basically doing the same operations as finding the log base 2 in parallel, but the values of b are inverted (in order to count from the right rather than the left), we shift v up rather than down, and c starts at the maximum and is decreased. We also have the additional step at the end, decrementing c if there is anything left in v. The number of operations is at most 4 * lg(N) + 2, roughly, for N bit words.

unsigned int v; // 32-bit word input to count zero bits on right unsigned int c = 0; // c will be the number of zero bits on the right, // so if v is 1101000 (base 2), then c will be 3 // NOTE: if 0 == v, then c = 31. if (v & 0x1) { // special case for odd v (assumed to happen half of the time) } else { if ((v & 0xffff) == 0) { v >>= 16; c += 16; } if ((v & 0xff) == 0) { v >>= 8; c += 8; } if ((v & 0xf) == 0) { v >>= 4; c += 4; } if ((v & 0x3) == 0) { v >>= 2; c += 2; } if ((v & 0x1) == 0) { c++; } }The code above is similar to the previous method, but it computes the number of trailing zeros by accumulating c in a manner akin to binary search. In the first step, it checks if the bottom 16 bits of v are zeros, and if so, shifts v right 16 bits and adds 16 to c, which reduces the number of bits in v to consider by half. Each of the subsequent conditional steps likewise halves the number of bits until there is only 1. This method is faster than the last one (by about 33%) because the bodies of the if statements are executed less often.

Matt Whitlock suggested this on January 25, 2006.

unsigned int v; // find the number of trailing zeros in v int r; // the result goes here float f = (float)(v & -v); // cast the least significant bit in v to a float r = (*(unsigned int *)&f >> 23) - 0x7f;Although this only takes about 6 operations, the time to convert an integer to a float can be high on some machines. The exponent of the 32-bit IEEE floating point representation is shifted down, and the bias is subtracted to give the position of the least significant 1 bit set in v. If v is zero, then the result is -127.

unsigned int v; // find the number of trailing zeros in v int r; // put the result in r const int Mod37BitPosition[] = // maps a bit value mod 37 to its position { 32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4, 7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5, 20, 8, 19, 18 }; r = Mod37BitPosition[(-v & v) % 37];The code above finds the number of zeros that are trailing on the right, so binary 0100 would produce 2. It makes use of the fact that the first 32 bit position values are relatively prime with 37, so performing a modulus division with 37 gives a unique number from 0 to 36 for each. These numbers may then be mapped to the number of zeros using a small lookup table. It uses only 4 operations, however indexing into a table and performing modulus division may make it unsuitable for some situations. I came up with this independently and then searched for a subsequence of the table values, and found it was invented earlier by Reiser, according to Hacker's Delight.

unsigned int v; // find the number of trailing zeros in 32-bit v int r; // result goes here const int MultiplyDeBruijnBitPosition[32] = { 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 }; r = MultiplyDeBruijnBitPosition[((v & -v) * 0x077CB531UL) >> 27];Converting bit vectors to indices of set bits is an example use for this. It requires one more operation than the earlier one involving modulus division, but the multiply may be faster. The expression (v & -v) extracts the least significant 1 bit from v. The constant 0x077CB531UL is a de Bruijn sequence, which produces a unique pattern of bits into the high 5 bits for each possible bit position that it is multiplied against. When there are no bits set, it returns 0. More information can be found by reading the paper Using de Bruijn Sequences to Index 1 in a Computer Word by Charles E. Leiserson, Harald Prokof, and Keith H. Randall.

On October 8, 2005 Andrew Shapira suggested I add this.

unsigned int const v; // Round this 32-bit value to the next highest power of 2 unsigned int r; // Put the result here. (So v=3 -> r=4; v=8 -> r=8) if (v > 1) { float f = (float)v; unsigned int const t = 1 << ((*(unsigned int *)&f >> 23) - 0x7f); r = t << (t < v); } else { r = 1; }The code above uses 8 operations, but works on all v <= (1<<31).

Quick and dirty version, for domain of 1 < v < (1<<25):

float f = (float)(v - 1); r = 1 << ((*(unsigned int*)(&f) >> 23) - 126);Although the quick and dirty version only uses around 6 operations, it is roughly three times slower than the technique below (which involves 12 operations) when benchmarked on an Athlon™ XP 2100+ CPU. Some CPUs will fare better with it, though.

On September 27, 2005 Andi Smithers suggested I include a technique for casting to floats to find the lg of a number for rounding up to a power of 2. Similar to the quick and dirty version here, his version worked with values less than (1<<25), due to mantissa rounding, but it used one more operation.

unsigned int v; // compute the next highest power of 2 of 32-bit v v--; v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v++;In 12 operations, this code computes the next highest power of 2 for a 32-bit integer. The result may be expressed by the formula 1 << (lg(v - 1) + 1). Note that in the edge case where v is 0, it returns 0, which isn't a power of 2; you might append the expression v += (v == 0) to remedy this if it matters. It would be faster by 2 operations to use the formula and the log base 2 methed that uses a lookup table, but in some situations, lookup tables are not suitable, so the above code may be best. (On a Athlon™ XP 2100+ I've found the above shift-left and then OR code is as fast as using a single BSR assembly language instruction, which scans in reverse to find the highest set bit.) It works by copying the highest set bit to all of the lower bits, and then adding one, which results in carries that set all of the lower bits to 0 and one bit beyond the highest set bit to 1. If the original number was a power of 2, then the decrement will reduce it to one less, so that we round up to the same original value.

Devised by Sean Anderson, Sepember 14, 2001.

unsigned short x; // Interleave bits of x and y, so that all of the unsigned short y; // bits of x are in the even positions and y in the odd; unsigned int z = 0; // z gets the resulting Morton Number. for (int i = 0; i < sizeof(x) * CHAR_BIT; i++) // unroll for more speed... { z |= (x & 1 << i) << i | (y & 1 << i) << (i + 1); }Interleaved bits (aka Morton numbers) are useful for linearizing 2D integer coordinates, so x and y are combined into a single number that can be compared easily and has the property that a number is usually close to another if their x and y values are close.

const unsigned short MortonTable256[] = { 0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015, 0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055, 0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115, 0x0140, 0x0141, 0x0144, 0x0145, 0x0150, 0x0151, 0x0154, 0x0155, 0x0400, 0x0401, 0x0404, 0x0405, 0x0410, 0x0411, 0x0414, 0x0415, 0x0440, 0x0441, 0x0444, 0x0445, 0x0450, 0x0451, 0x0454, 0x0455, 0x0500, 0x0501, 0x0504, 0x0505, 0x0510, 0x0511, 0x0514, 0x0515, 0x0540, 0x0541, 0x0544, 0x0545, 0x0550, 0x0551, 0x0554, 0x0555, 0x1000, 0x1001, 0x1004, 0x1005, 0x1010, 0x1011, 0x1014, 0x1015, 0x1040, 0x1041, 0x1044, 0x1045, 0x1050, 0x1051, 0x1054, 0x1055, 0x1100, 0x1101, 0x1104, 0x1105, 0x1110, 0x1111, 0x1114, 0x1115, 0x1140, 0x1141, 0x1144, 0x1145, 0x1150, 0x1151, 0x1154, 0x1155, 0x1400, 0x1401, 0x1404, 0x1405, 0x1410, 0x1411, 0x1414, 0x1415, 0x1440, 0x1441, 0x1444, 0x1445, 0x1450, 0x1451, 0x1454, 0x1455, 0x1500, 0x1501, 0x1504, 0x1505, 0x1510, 0x1511, 0x1514, 0x1515, 0x1540, 0x1541, 0x1544, 0x1545, 0x1550, 0x1551, 0x1554, 0x1555, 0x4000, 0x4001, 0x4004, 0x4005, 0x4010, 0x4011, 0x4014, 0x4015, 0x4040, 0x4041, 0x4044, 0x4045, 0x4050, 0x4051, 0x4054, 0x4055, 0x4100, 0x4101, 0x4104, 0x4105, 0x4110, 0x4111, 0x4114, 0x4115, 0x4140, 0x4141, 0x4144, 0x4145, 0x4150, 0x4151, 0x4154, 0x4155, 0x4400, 0x4401, 0x4404, 0x4405, 0x4410, 0x4411, 0x4414, 0x4415, 0x4440, 0x4441, 0x4444, 0x4445, 0x4450, 0x4451, 0x4454, 0x4455, 0x4500, 0x4501, 0x4504, 0x4505, 0x4510, 0x4511, 0x4514, 0x4515, 0x4540, 0x4541, 0x4544, 0x4545, 0x4550, 0x4551, 0x4554, 0x4555, 0x5000, 0x5001, 0x5004, 0x5005, 0x5010, 0x5011, 0x5014, 0x5015, 0x5040, 0x5041, 0x5044, 0x5045, 0x5050, 0x5051, 0x5054, 0x5055, 0x5100, 0x5101, 0x5104, 0x5105, 0x5110, 0x5111, 0x5114, 0x5115, 0x5140, 0x5141, 0x5144, 0x5145, 0x5150, 0x5151, 0x5154, 0x5155, 0x5400, 0x5401, 0x5404, 0x5405, 0x5410, 0x5411, 0x5414, 0x5415, 0x5440, 0x5441, 0x5444, 0x5445, 0x5450, 0x5451, 0x5454, 0x5455, 0x5500, 0x5501, 0x5504, 0x5505, 0x5510, 0x5511, 0x5514, 0x5515, 0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555 }; unsigned short x; // Interleave bits of x and y, so that all of the unsigned short y; // bits of x are in the even positions and y in the odd; unsigned int z; // z gets the resulting 32-bit Morton Number. z = MortonTable256[y >> 8] << 17 | MortonTable256[x >> 8] << 16 | MortonTable256[y & 0xFF] << 1 | MortonTable256[x & 0xFF];For more speed, use an additional table with values that are MortonTable256 pre-shifted one bit to the left. This second table could then be used for the y lookups, thus reducing the operations by two, but almost doubling the memory required. Extending this same idea, four tables could be used, with two of them pre-shifted by 16 to the left of the previous two, so that we would only need 11 operations total.

unsigned char x; // Interleave bits of (8-bit) x and y, so that all of the unsigned char y; // bits of x are in the even positions and y in the odd; unsigned short z; // z gets the resulting 16-bit Morton Number. z = ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) * 0x0102040810204081ULL >> 49) & 0x5555 | ((y * 0x0101010101010101ULL & 0x8040201008040201ULL) * 0x0102040810204081ULL >> 48) & 0xAAAA;Holger Bettag was inspired to suggest this technique on October 10, 2004 after reading the multiply-based bit reversals here.

const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF}; const unsigned int S[] = {1, 2, 4, 8}; unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x unsigned int y; // are in the even positions and bits from y in the odd; unsigned int z; // z gets the resulting 32-bit Morton Number. x = (x | (x << S[3])) & B[3]; x = (x | (x << S[2])) & B[2]; x = (x | (x << S[1])) & B[1]; x = (x | (x << S[0])) & B[0]; y = (y | (y << S[3])) & B[3]; y = (y | (y << S[2])) & B[2]; y = (y | (y << S[1])) & B[1]; y = (y | (y << S[0])) & B[0]; z = x | (y << 1);

// Fewer operations: unsigned int v; // 32-bit word to check if any 8-bit byte in it is 0 bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);The code above may be useful when doing a fast string copy in which a word is copied at a time; it uses 5 operations. On the other hand, testing for a null byte in the obvious ways (which follow) have at least 7 operations (when counted in the most sparing way), and at most 12.

// More operations: bool hasNoZeroByte = ((v & 0xff) && (v & 0xff00) && (v & 0xff0000) && (v & 0xff000000)) // OR: unsigned char * p = (unsigned char *) &v; bool hasNoZeroByte = *p && *(p + 1) && *(p + 2) && *(p + 3);The code at the beginning of this section (labeled "Fewer operations") works by first zeroing the high bits of the 4 bytes in the word. Subsequently, it adds a number that will result in an overflow to the high bit of a byte if any of the low bits were initialy set. Next the high bits of the original word are ORed with these values; thus, the high bit of a byte is set iff any bit in the byte was set. Finally, we determine if any of these high bits are zero by ORing with ones everywhere except the high bits and inverting the result. Extending to 64 bits is trivial; simply increase the constants to be 0x7F7F7F7F7F7F7F7F.

For an additional improvement, a fast pretest that requires only 4 operations
may be performed to determine if the word *may* have a zero byte.
The test also returns true if the high byte is 0x80, so there are
occasional false positives, but the slower and more reliable version
above may then be used on candidates for an overall increase in speed with
correct output.

bool hasZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100; if (hasZeroByte) // or may just have 0x80 in the high byte { hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F); }

There is yet a faster method —
use `hasless`

(v, 1),
which is defined below; it
works in 4 operations and requires no subsquent verification. It simplifies
to

bool hasZeroByte = (v - 0x01010101UL) & ~v & 0x80808080UL;The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.

Paul Messmer suggested the fast pretest improvement on October 2, 2004. Juha Järvi later suggested hasless(v, 1) on April 6, 2005, which he found on Paul Hsieh's Assembly Lab; previously it was written in a newsgroup post on April 27, 1987 by Alan Mycroft.

Requirements: x>=0; 0<=n<=128

#define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)To count the number of bytes in x that are less than n in 7 operations, use

#define countless(x,n) \ (((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)

Juha Järvi sent this clever technique to me on April 6, 2005. The
`countless`

macro was added by Sean Anderson on
April 10, 2005, inspired by Juha's `countmore`

, below.

Requirements: x>=0; 0<=n<=127

#define hasmore(x,n) (((x)+~0UL/255*(127-(n))|(x))&~0UL/255*128)To count the number of bytes in x that are more than n in 6 operations, use:

#define countmore(x,n) \ (((((x)&~0UL/255*127)+~0UL/255*(127-(n))|(x))&~0UL/255*128)/128%255)

The macro `hasmore`

was suggested by Juha Järvi on
April 6, 2005, and he added `countmore`

on April 8, 2005.

Note: Bytes that equal n can be reported by `likelyhasbetween`

as false positives,
so this should be checked by character if a certain result is needed.

Requirements: x>=0; 0<=m<=127; 0<=n<=128

#define likelyhasbetween(x,m,n) \ ((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128) This technique would be suitable for a fast pretest. A variation that takes one more operation (8 total for constant m and n) but provides the exact answer is:

#define hasbetween(x,m,n) \ ((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-m))&~0UL/255*128)To count the number of bytes in x that are between m and n (exclusive) in 10 operations, use:

#define countbetween(x,m,n) (hasbetween(x,m,n)/128%255)

Juha Järvi suggested `likelyhasbetween`

on April 6, 2005.
From there,
Sean Anderson created `hasbetween`

and
`countbetween`

on April 10, 2005.

Hits: |