Algorithms
# Continuous Fraction
## Algorith of Float2Fractional
There is a function
~~~
countinuousFraction :: Float -> Fractional
/*
** find rational approximation to given real number
** David Eppstein / UC Irvine / 8 Aug 1993
**
** With corrections from Arno Formella, May 2008
**
** usage: a.out r d
** r is real number to approx
** d is the maximum denominator allowed
**
** based on the theory of continued fractions
** if x = a1 + 1/(a2 + 1/(a3 + 1/(a4 + ...)))
** then best approximation is found by truncating this series
** (with some adjustments in the last term).
**
** Note the fraction can be recovered as the first column of the matrix
** ( a1 1 ) ( a2 1 ) ( a3 1 ) ...
** ( 1 0 ) ( 1 0 ) ( 1 0 )
** Instead of keeping the sequence of continued fraction terms,
** we just keep the last partial product of these matrices.
*/
#include
main(ac, av)
int ac;
char ** av;
{
double atof();
int atoi();
void exit();
long m[2][2];
double x, startx;
long maxden;
long ai;
/* read command line arguments */
if (ac != 3) {
fprintf(stderr, "usage: %s r d\n",av[0]); // AF: argument missing
exit(1);
}
startx = x = atof(av[1]);
maxden = atoi(av[2]);
/* initialize matrix */
m[0][0] = m[1][1] = 1;
m[0][1] = m[1][0] = 0;
/* loop finding terms until denom gets too big */
while (m[1][0] * ( ai = (long)x ) + m[1][1] <= maxden) {
long t;
t = m[0][0] * ai + m[0][1];
m[0][1] = m[0][0];
m[0][0] = t;
t = m[1][0] * ai + m[1][1];
m[1][1] = m[1][0];
m[1][0] = t;
if(x==(double)ai) break; // AF: division by zero
x = 1/(x - (double) ai);
if(x>(double)0x7FFFFFFF) break; // AF: representation failure
}
/* now remaining x is between 0 and 1/ai */
/* approx as either 0 or 1/m where m is max that will fit in maxden */
/* first try zero */
printf("%ld/%ld, error = %e\n", m[0][0], m[1][0],
startx - ((double) m[0][0] / (double) m[1][0]));
/* now try other possibility */
ai = (maxden - m[1][1]) / m[1][0];
m[0][0] = m[0][0] * ai + m[0][1];
m[1][0] = m[1][0] * ai + m[1][1];
printf("%ld/%ld, error = %e\n", m[0][0], m[1][0],
startx - ((double) m[0][0] / (double) m[1][0]));
}
~~~
https://stackoverflow.com/questions/95727/how-to-convert-floats-to-human-readable-fractions
## Homotopy Method
There is an alternative to Newton Method since it is difficult to find answer always, think of equations with exponentials.
# sine definition
## Basic idea
- http://stackoverflow.com/questions/26146190/how-is-sine-implemented-in-java
- https://whiteboardfox.com/60544-8397-9933
$ 6 $ 次までのテーラー展開で $ 2 ^ {-58} $ の精度、
さらに近似式により誤差をなくすように設計されている
~~~
33 * Algorithm
34 * 1. Since sin(-x) = -sin(x), we need only to consider positive x.
35 * 2. if x < 2^-27 (hx < 0x3e400000 0), return x with inexact if x!=0.
36 * 3. sin(x) is approximated by a polynomial of degree 13 on
37 * [0,pi/4]
38 * 3 13
39 * sin(x) ~ x + S1*x + ... + S6*x
40 * where
41 *
42 * |sin(x) 2 4 6 8 10 12 | -58
43 * |----- - (1+S1*x +S2*x +S3*x +S4*x +S5*x +S6*x )| <= 2
44 * | x |
45 *
46 * 4. sin(x+y) = sin(x) + sin'(x')*y
47 * ~ sin(x) + (1-x*x/2)*y
48 * For better accuracy, let
49 * 3 2 2 2 2
50 * r = x *(S2+x *(S3+x *(S4+x *(S5+x *S6))))
51 * then 3 2
52 * sin(x) = x + (S1*x + (x *(r-y/2)+y))
~~~
## In reality
[sine() depends on the domain ](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/dbl-64/s_sin.c;hb=HEAD#l281)
from [here](https://stackoverflow.com/questions/2284860/how-does-c-compute-sin-and-other-math-functions)
~~~
/*
* IBM Accurate Mathematical Library
* written by International Business Machines Corp.
* Copyright (C) 2001-2017 Free Software Foundation, Inc.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as published by
* the Free Software Foundation; either version 2.1 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program; if not, see .
*/
/****************************************************************************/
/* */
/* MODULE_NAME:usncs.c */
/* */
/* FUNCTIONS: usin */
/* ucos */
/* slow */
/* slow1 */
/* slow2 */
/* sloww */
/* sloww1 */
/* sloww2 */
/* bsloww */
/* bsloww1 */
/* bsloww2 */
/* cslow2 */
/* FILES NEEDED: dla.h endian.h mpa.h mydefs.h usncs.h */
/* branred.c sincos32.c dosincos.c mpa.c */
/* sincos.tbl */
/* */
/* An ultimate sin and routine. Given an IEEE double machine number x */
/* it computes the correctly rounded (to nearest) value of sin(x) or cos(x) */
/* Assumption: Machine arithmetic operations are performed in */
/* round to nearest mode of IEEE 754 standard. */
/* */
/****************************************************************************/
#include
#include
#include "endian.h"
#include "mydefs.h"
#include "usncs.h"
#include "MathLib.h"
#include
#include
#include
/* Helper macros to compute sin of the input values. */
#define POLYNOMIAL2(xx) ((((s5 * (xx) + s4) * (xx) + s3) * (xx) + s2) * (xx))
#define POLYNOMIAL(xx) (POLYNOMIAL2 (xx) + s1)
/* The computed polynomial is a variation of the Taylor series expansion for
sin(a):
a - a^3/3! + a^5/5! - a^7/7! + a^9/9! + (1 - a^2) * da / 2
The constants s1, s2, s3, etc. are pre-computed values of 1/3!, 1/5! and so
on. The result is returned to LHS and correction in COR. */
#define TAYLOR_SIN(xx, a, da, cor) \
({ \
double t = ((POLYNOMIAL (xx) * (a) - 0.5 * (da)) * (xx) + (da)); \
double res = (a) + t; \
(cor) = ((a) - res) + t; \
res; \
})
/* This is again a variation of the Taylor series expansion with the term
x^3/3! expanded into the following for better accuracy:
bb * x ^ 3 + 3 * aa * x * x1 * x2 + aa * x1 ^ 3 + aa * x2 ^ 3
The correction term is dx and bb + aa = -1/3!
*/
#define TAYLOR_SLOW(x0, dx, cor) \
({ \
static const double th2_36 = 206158430208.0; /* 1.5*2**37 */ \
double xx = (x0) * (x0); \
double x1 = ((x0) + th2_36) - th2_36; \
double y = aa * x1 * x1 * x1; \
double r = (x0) + y; \
double x2 = ((x0) - x1) + (dx); \
double t = (((POLYNOMIAL2 (xx) + bb) * xx + 3.0 * aa * x1 * x2) \
* (x0) + aa * x2 * x2 * x2 + (dx)); \
t = (((x0) - r) + y) + t; \
double res = r + t; \
(cor) = (r - res) + t; \
res; \
})
#define SINCOS_TABLE_LOOKUP(u, sn, ssn, cs, ccs) \
({ \
int4 k = u.i[LOW_HALF] << 2; \
sn = __sincostab.x[k]; \
ssn = __sincostab.x[k + 1]; \
cs = __sincostab.x[k + 2]; \
ccs = __sincostab.x[k + 3]; \
})
#ifndef SECTION
# define SECTION
#endif
extern const union
{
int4 i[880];
double x[440];
} __sincostab attribute_hidden;
static const double
sn3 = -1.66666666666664880952546298448555E-01,
sn5 = 8.33333214285722277379541354343671E-03,
cs2 = 4.99999999999999999999950396842453E-01,
cs4 = -4.16666666666664434524222570944589E-02,
cs6 = 1.38888874007937613028114285595617E-03;
static const double t22 = 0x1.8p22;
void __dubsin (double x, double dx, double w[]);
void __docos (double x, double dx, double w[]);
double __mpsin (double x, double dx, bool reduce_range);
double __mpcos (double x, double dx, bool reduce_range);
static double slow (double x);
static double slow1 (double x);
static double slow2 (double x);
static double sloww (double x, double dx, double orig, bool shift_quadrant);
static double sloww1 (double x, double dx, double orig, bool shift_quadrant);
static double sloww2 (double x, double dx, double orig, int n);
static double bsloww (double x, double dx, double orig, int n);
static double bsloww1 (double x, double dx, double orig, int n);
static double bsloww2 (double x, double dx, double orig, int n);
int __branred (double x, double *a, double *aa);
static double cslow2 (double x);
/* Given a number partitioned into X and DX, this function computes the cosine
of the number by combining the sin and cos of X (as computed by a variation
of the Taylor series) with the values looked up from the sin/cos table to
get the result in RES and a correction value in COR. */
static inline double
__always_inline
do_cos (double x, double dx, double *corp)
{
mynumber u;
if (x < 0)
dx = -dx;
u.x = big + fabs (x);
x = fabs (x) - (u.x - big) + dx;
double xx, s, sn, ssn, c, cs, ccs, res, cor;
xx = x * x;
s = x + x * xx * (sn3 + xx * sn5);
c = xx * (cs2 + xx * (cs4 + xx * cs6));
SINCOS_TABLE_LOOKUP (u, sn, ssn, cs, ccs);
cor = (ccs - s * ssn - cs * c) - sn * s;
res = cs + cor;
cor = (cs - res) + cor;
*corp = cor;
return res;
}
/* A more precise variant of DO_COS. EPS is the adjustment to the correction
COR. */
static inline double
__always_inline
do_cos_slow (double x, double dx, double eps, double *corp)
{
mynumber u;
if (x <= 0)
dx = -dx;
u.x = big + fabs (x);
x = fabs (x) - (u.x - big);
double xx, y, x1, x2, e1, e2, res, cor;
double s, sn, ssn, c, cs, ccs;
xx = x * x;
s = x * xx * (sn3 + xx * sn5);
c = x * dx + xx * (cs2 + xx * (cs4 + xx * cs6));
SINCOS_TABLE_LOOKUP (u, sn, ssn, cs, ccs);
x1 = (x + t22) - t22;
x2 = (x - x1) + dx;
e1 = (sn + t22) - t22;
e2 = (sn - e1) + ssn;
cor = (ccs - cs * c - e1 * x2 - e2 * x) - sn * s;
y = cs - e1 * x1;
cor = cor + ((cs - y) - e1 * x1);
res = y + cor;
cor = (y - res) + cor;
cor = 1.0005 * cor + __copysign (eps, cor);
*corp = cor;
return res;
}
/* Given a number partitioned into X and DX, this function computes the sine of
the number by combining the sin and cos of X (as computed by a variation of
the Taylor series) with the values looked up from the sin/cos table to get
the result in RES and a correction value in COR. */
static inline double
__always_inline
do_sin (double x, double dx, double *corp)
{
mynumber u;
if (x <= 0)
dx = -dx;
u.x = big + fabs (x);
x = fabs (x) - (u.x - big);
double xx, s, sn, ssn, c, cs, ccs, cor, res;
xx = x * x;
s = x + (dx + x * xx * (sn3 + xx * sn5));
c = x * dx + xx * (cs2 + xx * (cs4 + xx * cs6));
SINCOS_TABLE_LOOKUP (u, sn, ssn, cs, ccs);
cor = (ssn + s * ccs - sn * c) + cs * s;
res = sn + cor;
cor = (sn - res) + cor;
*corp = cor;
return res;
}
/* A more precise variant of DO_SIN. EPS is the adjustment to the correction
COR. */
static inline double
__always_inline
do_sin_slow (double x, double dx, double eps, double *corp)
{
mynumber u;
if (x <= 0)
dx = -dx;
u.x = big + fabs (x);
x = fabs (x) - (u.x - big);
double xx, y, x1, x2, c1, c2, res, cor;
double s, sn, ssn, c, cs, ccs;
xx = x * x;
s = x * xx * (sn3 + xx * sn5);
c = xx * (cs2 + xx * (cs4 + xx * cs6));
SINCOS_TABLE_LOOKUP (u, sn, ssn, cs, ccs);
x1 = (x + t22) - t22;
x2 = (x - x1) + dx;
c1 = (cs + t22) - t22;
c2 = (cs - c1) + ccs;
cor = (ssn + s * ccs + cs * s + c2 * x + c1 * x2 - sn * x * dx) - sn * c;
y = sn + c1 * x1;
cor = cor + ((sn - y) + c1 * x1);
res = y + cor;
cor = (y - res) + cor;
cor = 1.0005 * cor + __copysign (eps, cor);
*corp = cor;
return res;
}
/* Reduce range of X and compute sin of a + da. When SHIFT_QUADRANT is true,
the routine returns the cosine of a + da by rotating the quadrant once and
computing the sine of the result. */
static inline double
__always_inline
reduce_and_compute (double x, bool shift_quadrant)
{
double retval = 0, a, da;
unsigned int n = __branred (x, &a, &da);
int4 k = (n + shift_quadrant) % 4;
switch (k)
{
case 2:
a = -a;
da = -da;
/* Fall through. */
case 0:
if (a * a < 0.01588)
retval = bsloww (a, da, x, n);
else
retval = bsloww1 (a, da, x, n);
break;
case 1:
case 3:
retval = bsloww2 (a, da, x, n);
break;
}
return retval;
}
static inline int4
__always_inline
reduce_sincos_1 (double x, double *a, double *da)
{
mynumber v;
double t = (x * hpinv + toint);
double xn = t - toint;
v.x = t;
double y = (x - xn * mp1) - xn * mp2;
int4 n = v.i[LOW_HALF] & 3;
double db = xn * mp3;
double b = y - db;
db = (y - b) - db;
*a = b;
*da = db;
return n;
}
/* Compute sin (A + DA). cos can be computed by passing SHIFT_QUADRANT as
true, which results in shifting the quadrant N clockwise. */
static double
__always_inline
do_sincos_1 (double a, double da, double x, int4 n, bool shift_quadrant)
{
double xx, retval, res, cor;
double eps = fabs (x) * 1.2e-30;
int k1 = (n + shift_quadrant) & 3;
switch (k1)
{ /* quarter of unit circle */
case 2:
a = -a;
da = -da;
/* Fall through. */
case 0:
xx = a * a;
if (xx < 0.01588)
{
/* Taylor series. */
res = TAYLOR_SIN (xx, a, da, cor);
cor = 1.02 * cor + __copysign (eps, cor);
retval = (res == res + cor) ? res : sloww (a, da, x, shift_quadrant);
}
else
{
res = do_sin (a, da, &cor);
cor = 1.035 * cor + __copysign (eps, cor);
retval = ((res == res + cor) ? __copysign (res, a)
: sloww1 (a, da, x, shift_quadrant));
}
break;
case 1:
case 3:
res = do_cos (a, da, &cor);
cor = 1.025 * cor + __copysign (eps, cor);
retval = ((res == res + cor) ? ((n & 2) ? -res : res)
: sloww2 (a, da, x, n));
break;
}
return retval;
}
static inline int4
__always_inline
reduce_sincos_2 (double x, double *a, double *da)
{
mynumber v;
double t = (x * hpinv + toint);
double xn = t - toint;
v.x = t;
double xn1 = (xn + 8.0e22) - 8.0e22;
double xn2 = xn - xn1;
double y = ((((x - xn1 * mp1) - xn1 * mp2) - xn2 * mp1) - xn2 * mp2);
int4 n = v.i[LOW_HALF] & 3;
double db = xn1 * pp3;
t = y - db;
db = (y - t) - db;
db = (db - xn2 * pp3) - xn * pp4;
double b = t + db;
db = (t - b) + db;
*a = b;
*da = db;
return n;
}
/* Compute sin (A + DA). cos can be computed by passing SHIFT_QUADRANT as
true, which results in shifting the quadrant N clockwise. */
static double
__always_inline
do_sincos_2 (double a, double da, double x, int4 n, bool shift_quadrant)
{
double res, retval, cor, xx;
double eps = 1.0e-24;
int4 k = (n + shift_quadrant) & 3;
switch (k)
{
case 2:
a = -a;
da = -da;
/* Fall through. */
case 0:
xx = a * a;
if (xx < 0.01588)
{
/* Taylor series. */
res = TAYLOR_SIN (xx, a, da, cor);
cor = 1.02 * cor + __copysign (eps, cor);
retval = (res == res + cor) ? res : bsloww (a, da, x, n);
}
else
{
res = do_sin (a, da, &cor);
cor = 1.035 * cor + __copysign (eps, cor);
retval = ((res == res + cor) ? __copysign (res, a)
: bsloww1 (a, da, x, n));
}
break;
case 1:
case 3:
res = do_cos (a, da, &cor);
cor = 1.025 * cor + __copysign (eps, cor);
retval = ((res == res + cor) ? ((n & 2) ? -res : res)
: bsloww2 (a, da, x, n));
break;
}
return retval;
}
/*******************************************************************/
/* An ultimate sin routine. Given an IEEE double machine number x */
/* it computes the correctly rounded (to nearest) value of sin(x) */
/*******************************************************************/
#ifdef IN_SINCOS
static double
#else
double
SECTION
#endif
__sin (double x)
{
double xx, res, t, cor;
mynumber u;
int4 k, m;
double retval = 0;
#ifndef IN_SINCOS
SET_RESTORE_ROUND_53BIT (FE_TONEAREST);
#endif
u.x = x;
m = u.i[HIGH_HALF];
k = 0x7fffffff & m; /* no sign */
if (k < 0x3e500000) /* if x->0 =>sin(x)=x */
{
math_check_force_underflow (x);
retval = x;
}
/*---------------------------- 2^-26 < |x|< 0.25 ----------------------*/
else if (k < 0x3fd00000)
{
xx = x * x;
/* Taylor series. */
t = POLYNOMIAL (xx) * (xx * x);
res = x + t;
cor = (x - res) + t;
retval = (res == res + 1.07 * cor) ? res : slow (x);
} /* else if (k < 0x3fd00000) */
/*---------------------------- 0.25<|x|< 0.855469---------------------- */
else if (k < 0x3feb6000)
{
res = do_sin (x, 0, &cor);
retval = (res == res + 1.096 * cor) ? res : slow1 (x);
retval = __copysign (retval, x);
} /* else if (k < 0x3feb6000) */
/*----------------------- 0.855469 <|x|<2.426265 ----------------------*/
else if (k < 0x400368fd)
{
t = hp0 - fabs (x);
res = do_cos (t, hp1, &cor);
retval = (res == res + 1.020 * cor) ? res : slow2 (x);
retval = __copysign (retval, x);
} /* else if (k < 0x400368fd) */
#ifndef IN_SINCOS
/*-------------------------- 2.426265<|x|< 105414350 ----------------------*/
else if (k < 0x419921FB)
{
double a, da;
int4 n = reduce_sincos_1 (x, &a, &da);
retval = do_sincos_1 (a, da, x, n, false);
} /* else if (k < 0x419921FB ) */
/*---------------------105414350 <|x|< 281474976710656 --------------------*/
else if (k < 0x42F00000)
{
double a, da;
int4 n = reduce_sincos_2 (x, &a, &da);
retval = do_sincos_2 (a, da, x, n, false);
} /* else if (k < 0x42F00000 ) */
/* -----------------281474976710656 <|x| <2^1024----------------------------*/
else if (k < 0x7ff00000)
retval = reduce_and_compute (x, false);
/*--------------------- |x| > 2^1024 ----------------------------------*/
else
{
if (k == 0x7ff00000 && u.i[LOW_HALF] == 0)
__set_errno (EDOM);
retval = x / x;
}
#endif
return retval;
}
/*******************************************************************/
/* An ultimate cos routine. Given an IEEE double machine number x */
/* it computes the correctly rounded (to nearest) value of cos(x) */
/*******************************************************************/
#ifdef IN_SINCOS
static double
#else
double
SECTION
#endif
__cos (double x)
{
double y, xx, res, cor, a, da;
mynumber u;
int4 k, m;
double retval = 0;
#ifndef IN_SINCOS
SET_RESTORE_ROUND_53BIT (FE_TONEAREST);
#endif
u.x = x;
m = u.i[HIGH_HALF];
k = 0x7fffffff & m;
/* |x|<2^-27 => cos(x)=1 */
if (k < 0x3e400000)
retval = 1.0;
else if (k < 0x3feb6000)
{ /* 2^-27 < |x| < 0.855469 */
res = do_cos (x, 0, &cor);
retval = (res == res + 1.020 * cor) ? res : cslow2 (x);
} /* else if (k < 0x3feb6000) */
else if (k < 0x400368fd)
{ /* 0.855469 <|x|<2.426265 */ ;
y = hp0 - fabs (x);
a = y + hp1;
da = (y - a) + hp1;
xx = a * a;
if (xx < 0.01588)
{
res = TAYLOR_SIN (xx, a, da, cor);
cor = 1.02 * cor + __copysign (1.0e-31, cor);
retval = (res == res + cor) ? res : sloww (a, da, x, true);
}
else
{
res = do_sin (a, da, &cor);
cor = 1.035 * cor + __copysign (1.0e-31, cor);
retval = ((res == res + cor) ? __copysign (res, a)
: sloww1 (a, da, x, true));
}
} /* else if (k < 0x400368fd) */
#ifndef IN_SINCOS
else if (k < 0x419921FB)
{ /* 2.426265<|x|< 105414350 */
double a, da;
int4 n = reduce_sincos_1 (x, &a, &da);
retval = do_sincos_1 (a, da, x, n, true);
} /* else if (k < 0x419921FB ) */
else if (k < 0x42F00000)
{
double a, da;
int4 n = reduce_sincos_2 (x, &a, &da);
retval = do_sincos_2 (a, da, x, n, true);
} /* else if (k < 0x42F00000 ) */
/* 281474976710656 <|x| <2^1024 */
else if (k < 0x7ff00000)
retval = reduce_and_compute (x, true);
else
{
if (k == 0x7ff00000 && u.i[LOW_HALF] == 0)
__set_errno (EDOM);
retval = x / x; /* |x| > 2^1024 */
}
#endif
return retval;
}
/************************************************************************/
/* Routine compute sin(x) for 2^-26 < |x|< 0.25 by Taylor with more */
/* precision and if still doesn't accurate enough by mpsin or dubsin */
/************************************************************************/
static inline double
__always_inline
slow (double x)
{
double res, cor, w[2];
res = TAYLOR_SLOW (x, 0, cor);
if (res == res + 1.0007 * cor)
return res;
__dubsin (fabs (x), 0, w);
if (w[0] == w[0] + 1.000000001 * w[1])
return __copysign (w[0], x);
return __copysign (__mpsin (fabs (x), 0, false), x);
}
/*******************************************************************************/
/* Routine compute sin(x) for 0.25<|x|< 0.855469 by __sincostab.tbl and Taylor */
/* and if result still doesn't accurate enough by mpsin or dubsin */
/*******************************************************************************/
static inline double
__always_inline
slow1 (double x)
{
double w[2], cor, res;
res = do_sin_slow (x, 0, 0, &cor);
if (res == res + cor)
return res;
__dubsin (fabs (x), 0, w);
if (w[0] == w[0] + 1.000000005 * w[1])
return w[0];
return __mpsin (fabs (x), 0, false);
}
/**************************************************************************/
/* Routine compute sin(x) for 0.855469 <|x|<2.426265 by __sincostab.tbl */
/* and if result still doesn't accurate enough by mpsin or dubsin */
/**************************************************************************/
static inline double
__always_inline
slow2 (double x)
{
double w[2], y, y1, y2, cor, res;
double t = hp0 - fabs (x);
res = do_cos_slow (t, hp1, 0, &cor);
if (res == res + cor)
return res;
y = fabs (x) - hp0;
y1 = y - hp1;
y2 = (y - y1) - hp1;
__docos (y1, y2, w);
if (w[0] == w[0] + 1.000000005 * w[1])
return w[0];
return __mpsin (fabs (x), 0, false);
}
/* Compute sin(x + dx) where X is small enough to use Taylor series around zero
and (x + dx) in the first or third quarter of the unit circle. ORIG is the
original value of X for computing error of the result. If the result is not
accurate enough, the routine calls mpsin or dubsin. SHIFT_QUADRANT rotates
the unit circle by 1 to compute the cosine instead of sine. */
static inline double
__always_inline
sloww (double x, double dx, double orig, bool shift_quadrant)
{
double y, t, res, cor, w[2], a, da, xn;
mynumber v;
int4 n;
res = TAYLOR_SLOW (x, dx, cor);
double eps = fabs (orig) * 3.1e-30;
cor = 1.0005 * cor + __copysign (eps, cor);
if (res == res + cor)
return res;
a = fabs (x);
da = (x > 0) ? dx : -dx;
__dubsin (a, da, w);
eps = fabs (orig) * 1.1e-30;
cor = 1.000000001 * w[1] + __copysign (eps, w[1]);
if (w[0] == w[0] + cor)
return __copysign (w[0], x);
t = (orig * hpinv + toint);
xn = t - toint;
v.x = t;
y = (orig - xn * mp1) - xn * mp2;
n = (v.i[LOW_HALF] + shift_quadrant) & 3;
da = xn * pp3;
t = y - da;
da = (y - t) - da;
y = xn * pp4;
a = t - y;
da = ((t - a) - y) + da;
if (n & 2)
{
a = -a;
da = -da;
}
x = fabs (a);
dx = (a > 0) ? da : -da;
__dubsin (x, dx, w);
eps = fabs (orig) * 1.1e-40;
cor = 1.000000001 * w[1] + __copysign (eps, w[1]);
if (w[0] == w[0] + cor)
return __copysign (w[0], a);
return shift_quadrant ? __mpcos (orig, 0, true) : __mpsin (orig, 0, true);
}
/* Compute sin(x + dx) where X is in the first or third quarter of the unit
circle. ORIG is the original value of X for computing error of the result.
If the result is not accurate enough, the routine calls mpsin or dubsin.
SHIFT_QUADRANT rotates the unit circle by 1 to compute the cosine instead of
sine. */
static inline double
__always_inline
sloww1 (double x, double dx, double orig, bool shift_quadrant)
{
double w[2], cor, res;
res = do_sin_slow (x, dx, 3.1e-30 * fabs (orig), &cor);
if (res == res + cor)
return __copysign (res, x);
dx = (x > 0 ? dx : -dx);
__dubsin (fabs (x), dx, w);
double eps = 1.1e-30 * fabs (orig);
cor = 1.000000005 * w[1] + __copysign (eps, w[1]);
if (w[0] == w[0] + cor)
return __copysign (w[0], x);
return shift_quadrant ? __mpcos (orig, 0, true) : __mpsin (orig, 0, true);
}
/***************************************************************************/
/* Routine compute sin(x+dx) (Double-Length number) where x in second or */
/* fourth quarter of unit circle.Routine receive also the original value */
/* and quarter(n= 1or 3)of x for computing error of result.And if result not*/
/* accurate enough routine calls mpsin1 or dubsin */
/***************************************************************************/
static inline double
__always_inline
sloww2 (double x, double dx, double orig, int n)
{
double w[2], cor, res;
res = do_cos_slow (x, dx, 3.1e-30 * fabs (orig), &cor);
if (res == res + cor)
return (n & 2) ? -res : res;
dx = x > 0 ? dx : -dx;
__docos (fabs (x), dx, w);
double eps = 1.1e-30 * fabs (orig);
cor = 1.000000005 * w[1] + __copysign (eps, w[1]);
if (w[0] == w[0] + cor)
return (n & 2) ? -w[0] : w[0];
return (n & 1) ? __mpsin (orig, 0, true) : __mpcos (orig, 0, true);
}
/***************************************************************************/
/* Routine compute sin(x+dx) or cos(x+dx) (Double-Length number) where x */
/* is small enough to use Taylor series around zero and (x+dx) */
/* in first or third quarter of unit circle.Routine receive also */
/* (right argument) the original value of x for computing error of */
/* result.And if result not accurate enough routine calls other routines */
/***************************************************************************/
static inline double
__always_inline
bsloww (double x, double dx, double orig, int n)
{
double res, cor, w[2], a, da;
res = TAYLOR_SLOW (x, dx, cor);
cor = 1.0005 * cor + __copysign (1.1e-24, cor);
if (res == res + cor)
return res;
a = fabs (x);
da = (x > 0) ? dx : -dx;
__dubsin (a, da, w);
cor = 1.000000001 * w[1] + __copysign (1.1e-24, w[1]);
if (w[0] == w[0] + cor)
return __copysign (w[0], x);
return (n & 1) ? __mpcos (orig, 0, true) : __mpsin (orig, 0, true);
}
/***************************************************************************/
/* Routine compute sin(x+dx) or cos(x+dx) (Double-Length number) where x */
/* in first or third quarter of unit circle.Routine receive also */
/* (right argument) the original value of x for computing error of result.*/
/* And if result not accurate enough routine calls other routines */
/***************************************************************************/
static inline double
__always_inline
bsloww1 (double x, double dx, double orig, int n)
{
double w[2], cor, res;
res = do_sin_slow (x, dx, 1.1e-24, &cor);
if (res == res + cor)
return (x > 0) ? res : -res;
dx = (x > 0) ? dx : -dx;
__dubsin (fabs (x), dx, w);
cor = 1.000000005 * w[1] + __copysign (1.1e-24, w[1]);
if (w[0] == w[0] + cor)
return __copysign (w[0], x);
return (n & 1) ? __mpcos (orig, 0, true) : __mpsin (orig, 0, true);
}
/***************************************************************************/
/* Routine compute sin(x+dx) or cos(x+dx) (Double-Length number) where x */
/* in second or fourth quarter of unit circle.Routine receive also the */
/* original value and quarter(n= 1or 3)of x for computing error of result. */
/* And if result not accurate enough routine calls other routines */
/***************************************************************************/
static inline double
__always_inline
bsloww2 (double x, double dx, double orig, int n)
{
double w[2], cor, res;
res = do_cos_slow (x, dx, 1.1e-24, &cor);
if (res == res + cor)
return (n & 2) ? -res : res;
dx = (x > 0) ? dx : -dx;
__docos (fabs (x), dx, w);
cor = 1.000000005 * w[1] + __copysign (1.1e-24, w[1]);
if (w[0] == w[0] + cor)
return (n & 2) ? -w[0] : w[0];
return (n & 1) ? __mpsin (orig, 0, true) : __mpcos (orig, 0, true);
}
/************************************************************************/
/* Routine compute cos(x) for 2^-27 < |x|< 0.25 by Taylor with more */
/* precision and if still doesn't accurate enough by mpcos or docos */
/************************************************************************/
static inline double
__always_inline
cslow2 (double x)
{
double w[2], cor, res;
res = do_cos_slow (x, 0, 0, &cor);
if (res == res + cor)
return res;
__docos (fabs (x), 0, w);
if (w[0] == w[0] + 1.000000005 * w[1])
return w[0];
return __mpcos (x, 0, false);
}
#ifndef __cos
weak_alias (__cos, cos)
# ifdef NO_LONG_DOUBLE
strong_alias (__cos, __cosl)
weak_alias (__cos, cosl)
# endif
#endif
#ifndef __sin
weak_alias (__sin, sin)
# ifdef NO_LONG_DOUBLE
strong_alias (__sin, __sinl)
weak_alias (__sin, sinl)
# endif
#endif
~~~
# Best Algorithm for Bit Reversal
( from MSB->LSB to LSB->MSB) in C
~~~
up vote
164
down vote
favorite
128
What is the best algorithm to achieve the following:
0010 0000 => 0000 0100
The conversion is from MSB->LSB to LSB->MSB. All bits must be reversed; that is, this is not endianness-swapping.
c algorithm bit-manipulation
shareedit
edited Apr 23 '09 at 10:48
asked Apr 14 '09 at 2:48
green_t
1,0133917
1
I think the appropriate name is a bitwise operation. – Lucas Apr 14 '09 at 2:52
1
I think you meant reversal, not rotation. – Juliano Apr 14 '09 at 2:53
Most ARM processors have a built-in operation for that. The ARM Cortex-M0 doesn't, and I found using a per-byte table to swap bits is the fastest approach. – starblue Jun 19 '14 at 21:09
Also see Sean Eron Anderson's Bit Twiddling Hacks. – jww Aug 9 '15 at 16:42
Please define "best" – Lee Taylor Dec 12 '15 at 23:38
add a comment
26 Answers
active oldest votes
up vote
373
down vote
accepted
NOTE: All algorithms below are in C, but should be portable to your language of choice (just don't look at me when they're not as fast :)
~~~
Options
Low Memory (32-bit int, 32-bit machine)(from here):
~~~
unsigned int
reverse(register unsigned int x)
{
x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
return((x >> 16) | (x << 16));
}
From the famous Bit Twiddling Hacks page:
Fastest (lookup table):
static const unsigned char BitReverseTable256[] =
{
0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,
0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,
0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,
0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,
0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,
0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,
0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,
0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,
0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,
0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,
0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,
0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,
0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,
0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,
0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF
};
unsigned int v; // reverse 32-bit value, 8 bits at time
unsigned int c; // c will get v reversed
// Option 1:
c = (BitReverseTable256[v & 0xff] << 24) |
(BitReverseTable256[(v >> 8) & 0xff] << 16) |
(BitReverseTable256[(v >> 16) & 0xff] << 8) |
(BitReverseTable256[(v >> 24) & 0xff]);
// Option 2:
unsigned char * p = (unsigned char *) &v;
unsigned char * q = (unsigned char *) &c;
q[3] = BitReverseTable256[p[0]];
q[2] = BitReverseTable256[p[1]];
q[1] = BitReverseTable256[p[2]];
q[0] = BitReverseTable256[p[3]];
~~~
You can extend this idea to 64-bit ints, or trade off memory for speed (assuming your L1 Data Cache is large enough), and reverse 16-bits at a time with a 64K-entry lookup table.
Others
Simple
~~~
unsigned int v; // input bits to be reversed
unsigned int r = v; // r will be reversed bits of v; first get LSB of v
int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end
for (v >>= 1; v; v >>= 1)
{
r <<= 1;
r |= v & 1;
s--;
}
r <<= s; // shift when v's highest bits are zero
Faster (32-bit processor)
unsigned char b = x;
b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;
Faster (64-bit processor)
unsigned char b; // reverse this (8-bit) byte
b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
If you want to do this on a 32-bit int, just reverse the bits in each bytes, and reverse the order of the bytes. That is:
unsigned int toReverse;
unsigned int reversed;
unsigned char inByte0 = (toReverse & 0xFF);
unsigned char inByte1 = (toReverse & 0xFF00) >> 8;
unsigned char inByte2 = (toReverse & 0xFF0000) >> 16;
unsigned char inByte3 = (toReverse & 0xFF000000) >> 24;
reversed = (reverseBits(inByte0) << 24) | (reverseBits(inByte1) << 16) | (reverseBits(inByte2) << 8) | (reverseBits(inByte3);
~~~
Results
I benchmarked the two most promising solutions, the lookup table, and bitwise-AND (the first one). The test machine is a laptop w/ 4GB of DDR2-800 and a Core 2 Duo T7500 @ 2.4GHz, 4MB L2 Cache; YMMV. I used gcc 4.3.2 on 64-bit Linux. OpenMP (and the GCC bindings) were used for high-resolution timers.
reverse.c
~~~
#include
#include
#include
unsigned int
reverse(register unsigned int x)
{
x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
return((x >> 16) | (x << 16));
}
int main()
{
unsigned int *ints = malloc(100000000*sizeof(unsigned int));
unsigned int *ints2 = malloc(100000000*sizeof(unsigned int));
for(unsigned int i = 0; i < 100000000; i++)
ints[i] = rand();
unsigned int *inptr = ints;
unsigned int *outptr = ints2;
unsigned int *endptr = ints + 100000000;
// Starting the time measurement
double start = omp_get_wtime();
// Computations to be measured
while(inptr != endptr)
{
(*outptr) = reverse(*inptr);
inptr++;
outptr++;
}
// Measuring the elapsed time
double end = omp_get_wtime();
// Time calculation (in seconds)
printf("Time: %f seconds\n", end-start);
free(ints);
free(ints2);
return 0;
}
~~~
reverse_lookup.c
~~~
#include
#include
#include
static const unsigned char BitReverseTable256[] =
{
0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,
0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,
0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,
0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,
0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,
0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,
0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,
0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,
0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,
0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,
0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,
0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,
0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,
0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,
0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF
};
int main()
{
unsigned int *ints = malloc(100000000*sizeof(unsigned int));
unsigned int *ints2 = malloc(100000000*sizeof(unsigned int));
for(unsigned int i = 0; i < 100000000; i++)
ints[i] = rand();
unsigned int *inptr = ints;
unsigned int *outptr = ints2;
unsigned int *endptr = ints + 100000000;
// Starting the time measurement
double start = omp_get_wtime();
// Computations to be measured
while(inptr != endptr)
{
unsigned int in = *inptr;
// Option 1:
//*outptr = (BitReverseTable256[in & 0xff] << 24) |
// (BitReverseTable256[(in >> 8) & 0xff] << 16) |
// (BitReverseTable256[(in >> 16) & 0xff] << 8) |
// (BitReverseTable256[(in >> 24) & 0xff]);
// Option 2:
unsigned char * p = (unsigned char *) &(*inptr);
unsigned char * q = (unsigned char *) &(*outptr);
q[3] = BitReverseTable256[p[0]];
q[2] = BitReverseTable256[p[1]];
q[1] = BitReverseTable256[p[2]];
q[0] = BitReverseTable256[p[3]];
inptr++;
outptr++;
}
// Measuring the elapsed time
double end = omp_get_wtime();
// Time calculation (in seconds)
printf("Time: %f seconds\n", end-start);
free(ints);
free(ints2);
return 0;
}
~~~
I tried both approaches at several different optimizations, ran 3 trials at each level, and each trial reversed 100 million random unsigned ints. For the lookup table option, I tried both schemes (options 1 and 2) given on the bitwise hacks page. Results are shown below.
Bitwise AND
~~~
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse reverse.c
mrj10@mjlap:~/code$ ./reverse
Time: 2.000593 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 1.938893 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 1.936365 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse reverse.c
mrj10@mjlap:~/code$ ./reverse
Time: 0.942709 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.991104 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.947203 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse reverse.c
mrj10@mjlap:~/code$ ./reverse
Time: 0.922639 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.892372 seconds
mrj10@mjlap:~/code$ ./reverse
Time: 0.891688 seconds
Lookup Table (option 1)
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.201127 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.196129 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.235972 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.633042 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.655880 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.633390 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.652322 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.631739 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 0.652431 seconds
Lookup Table (option 2)
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.671537 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.688173 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.664662 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.049851 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.048403 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.085086 seconds
mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse_lookup reverse_lookup.c
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.082223 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.053431 seconds
mrj10@mjlap:~/code$ ./reverse_lookup
Time: 1.081224 seconds
~~~
Conclusion
Use the lookup table, with option 1 (byte addressing is unsurprisingly slow) if you're concerned about performance. If you need to squeeze every last byte of memory out of your system (and you might, if you care about the performance of bit reversal), the optimized versions of the bitwise-AND approach aren't too shabby either.
Caveat
Yes, I know the benchmark code is a complete hack. Suggestions on how to improve it are more than welcome. Things I know about:
I don't have access to ICC. This may be faster (please respond in a comment if you can test this out).
A 64K lookup table may do well on some modern microarchitectures with large L1D.
-mtune=native didn't work for -O2/-O3 (ld blew up with some crazy symbol redefinition error), so I don't believe the generated code is tuned for my microarchitecture.
There may be a way to do this slightly faster with SSE. I have no idea how, but with fast replication, packed bitwise AND, and swizzling instructions, there's got to be something there.
I know only enough x86 assembly to be dangerous; here's the code GCC generated on -O3 for option 1, so somebody more knowledgable than myself can check it out:
32-bit
~~~
.L3:
movl (%r12,%rsi), %ecx
movzbl %cl, %eax
movzbl BitReverseTable256(%rax), %edx
movl %ecx, %eax
shrl $24, %eax
mov %eax, %eax
movzbl BitReverseTable256(%rax), %eax
sall $24, %edx
orl %eax, %edx
movzbl %ch, %eax
shrl $16, %ecx
movzbl BitReverseTable256(%rax), %eax
movzbl %cl, %ecx
sall $16, %eax
orl %eax, %edx
movzbl BitReverseTable256(%rcx), %eax
sall $8, %eax
orl %eax, %edx
movl %edx, (%r13,%rsi)
addq $4, %rsi
cmpq $400000000, %rsi
jne .L3
~~~
EDIT: I also tried using uint64_t's on my machine to see if there was any performance boost. Performance was about 10% faster than 32-bit, and was nearly identical whether you were just using 64-bit types to reverse bits on two 32-bit ints at a time, or whether you were actually reversing bits in half as many 64-bit values. The assembly code is shown below (for the former case, reversing bits for 2 32-bit ints at a time):
~~~
.L3:
movq (%r12,%rsi), %rdx
movq %rdx, %rax
shrq $24, %rax
andl $255, %eax
movzbl BitReverseTable256(%rax), %ecx
movzbq %dl,%rax
movzbl BitReverseTable256(%rax), %eax
salq $24, %rax
orq %rax, %rcx
movq %rdx, %rax
shrq $56, %rax
movzbl BitReverseTable256(%rax), %eax
salq $32, %rax
orq %rax, %rcx
movzbl %dh, %eax
shrq $16, %rdx
movzbl BitReverseTable256(%rax), %eax
salq $16, %rax
orq %rax, %rcx
movzbq %dl,%rax
shrq $16, %rdx
movzbl BitReverseTable256(%rax), %eax
salq $8, %rax
orq %rax, %rcx
movzbq %dl,%rax
shrq $8, %rdx
movzbl BitReverseTable256(%rax), %eax
salq $56, %rax
orq %rax, %rcx
movzbq %dl,%rax
shrq $8, %rdx
movzbl BitReverseTable256(%rax), %eax
andl $255, %edx
salq $48, %rax
orq %rax, %rcx
movzbl BitReverseTable256(%rdx), %eax
salq $40, %rax
orq %rax, %rcx
movq %rcx, (%r13,%rsi)
addq $8, %rsi
cmpq $400000000, %rsi
jne .L3
~~~
shareedit
edited Jun 4 '12 at 15:04
Shahbaz
27k552100
answered Apr 14 '09 at 3:11
Matt J
20.8k53452
6
It was an interesting exercise, if not all that fulfilling. If nothing else, I hope seeing the process is constructive to somebody else who may want to benchmark something more meritorious :) – Matt J Apr 14 '09 at 7:09
27
Good god! What a thorough answer! – Nathan Fellman Apr 14 '09 at 8:06
2
Why can't I mark an answer as favorite? :) This might be handy, great reference. – lacop Apr 14 '09 at 11:16
5
There needs to be a "best of" on stackoverflow for answers like this. – Jay Nov 1 '11 at 14:26
3
My...God! I think I've found...what may very well be...a TRUE speciman. I shall have to consult my documents, and do further research, but something tells me (God, help me), that this is by far the greatest, most thorough and useful answer Stack Overflow has yet. Even John Skeet would be both appalled and impressed! – zeboidlund Mar 5 '12 at 2:42
show 10 more comments
# Draw Circle
~~~
// C code
void drawcircle(int x0, int y0, int radius)
{
int x = radius;
int y = 0;
int err = 0;
while (x >= y)
{
putpixel(x0 + x, y0 + y); // angle :: 0 ~ 4/pi
putpixel(x0 + y, y0 + x);
putpixel(x0 - y, y0 + x);
putpixel(x0 - x, y0 + y);
putpixel(x0 - x, y0 - y);
putpixel(x0 - y, y0 - x);
putpixel(x0 + y, y0 - x);
putpixel(x0 + x, y0 - y);
if (err <= 0)
{
y += 1;
err += 2*y + 1;
}
if (err > 0)
{
x -= 1;
err -= 2*x + 1;
}
}
}
~~~
# Paint
paint closed area = recursion
塗りつぶし = 再帰
(x,y) から始めて、色が c1 色なら c2 色に塗りつぶす
~~~
procedure Paint(x,y,c1,c2: longint);
begin
if PixelColor(x,y)=C1 then
begin
SetPixelColor(x,y,C2);
Paint(x+1,y);
Paint(x,y+1);
Paint(x-1,y);
Paint(x,y-1);
end;
end:
~~~
# naming set theory on prime number
when we name set use,
~~~
set[2], set[3], set[5], set[7], set[11], ...
set[prime(1)], set[prime(2)], ...
~~~
then we can have arbitrary product of sets
and then compress this database with radix tree or something
if (n != noprime ; set[n]) it is product of sets and decomposing to prime number gets each sets.
is there any advantage than directly pointing DB?