Archive for February, 2012

The Veltkamp-Dekker route to extended precision

Posted in Computer, multi-precision calculations, Programming Languages, Software tools on February 13, 2012 by giorgiomcmlx

Particularly in case of embedded systems, memory usage and calculation speed really matter even nowadays. Thus, if it can be proven in advance that only some steps of a long algorithm need to be carried out on a higher level of accuracy than the rest of the code, a method relying on the so-called Veltkamp-Dekker split (VDS) algorithm for floating-point numbers may be used in these very steps.

The Veltkamp – Dekker split of floating-point numbers

The idea behind the VDS algorithm is to divide the mantissa figures of a floating-point number numb (nearly) equally into those of the two numbers numb_hi and numb_lo of the same kind without overlap. Let mfigs denote the mathematical maximum of figures of the mantissa (or more exact, the significand), and let base denote the base of the number system in use. Then, the steps of the VDS algorithm can be described in pseudo code as follows:

1. expo = mfigs - Floor( mfigs / 2 )
2. fact = base ^ expo + 1
3. help = FLPTmult( fact ; numb )
4. numb_hi = FLPTsubt( help ; FLPTsubt( help ; numb ) )
5. numb_lo = FLPTsubt( numb ; numb_hi ) )

In this code, "Floor" means rounding down to the nearest integer, "FLPTmult ( a ; b )" means calculating the product of a and b to exactly the same precision that both a and be have, and "FLPTsubt ( a ; b )" means subtracting b from a on exactly the level of precision that both a and be have. Mixed-precision calculations will be described later.
In case of IEEE754-SP(DP) numbers, base equals 2, and mfigs takes a value of 24(53) inclusive the hidden leading bit, thus yielding a value of expo of 12(27). The first of the two splitted numbers, i.e. numb_hi, carries the leading 12(26) mantissa bits of the original number numb, and the second one, i.e. numb_lo, carries its trailing 12(27) bits. In other words, any reasonable value of expo determines the minimum amount of trailing empty bits of numb_hi.
numb_hi always inherits its sign from numb. Because the sign of numb_lo may be negative, the exponent of numb_hi may be larger than that of numb is. Consequently, the number of significant bits is not preserved either.

The following examples use SP numbers in order to demonstrate important characteristics to the VDS algorithm. The decimal representation of these numbers is calculated utilising an arbitrary precision library in order to enable displaying all significant decimal figures.

numbSP := sum over n from -21 to 2 of 2^n
= +7.999999523162841796875

numbSP has 24 significant bits by construction. The VDS algorithm yields the following results when applied to this number (both SP hex and bin representations are in little endian order):

numbSP = 40 FF FF FF
= + 7.999999523162841796875

numbSP_hi = 41 00 00 00
= + 8.0

numbSP_lo = B5 00 00 00
= - 0.000000476837158203125

Contrary to the hex notation, the binary notation given here does not directly correspond to the byte values in memory but separates the sign bit from the exponent bits and shows the hidden leading mantissa bit as well. Double slashes were inserted between all the three parts of any SP number in order to increase readability. As can be easily seen, numbSP has 24 significant bits, whereas both numbSP_hi and numbSP_lo have only 1.

The number Pi now serves as an example in case the decimal is not exactly representable by a FP number:

numb := Pi ≈ +3.14159265358979323846…

numbSP = 40 49 0F DB
= + 3.1415927410125732421875

numbSP_hi = 40 49 10 00
= + 3.1416015625

numbSP_lo = B7 14 00 00
= - 0.0000088214874267578125

From comparing the bit pattern of numbSP to that of numbSP_hi, it can be seen that using 12 as value of expo removes exactly the 12 trailing bits of numbSP. A closer look reveals that these bits are not simply cleared but that the value of numbSP_hi is the value of numbSP rounded to that smaller level of binary accuracy. Like in the previous example, the sum of the significant bits in both numbSP_hi and numbSP_lo is 18 and thus much lower than 24, which is the number of significant bits of numbSP.

The main purpose of such a VDS split is to enable calculations on a higher level of accuracy than provided by the data format actually used. Using any of the numerical data formats provided by a programming language, all elementary mathematical operations take two numbers as input and yield one number as output. The trick harnessed in order to achieve a higher level of accuracy is to apply the VDS method to any input number before it is processed. Thus, all the elementary mathematical operations now take four numbers as input and yield a pair of two numbers as output. This additive pair of numbers exactly represents the algebraic result of the elementary operation. The high part of these two output numbers is the closest approximation to the exact result that is possible with only one FP number. The low part gives the correction that needs to be added to the high part in order to get the exact result.
"Exact" here means correct to a higher level of floating-point accuracy; of course, it is still impossible to exactly represent Pi at any level of FP accuracy, for example. The routines for adding and multiplying two numbers that were previously subject to the VDS method will be described in a following post.


Dekker’s paper is written in English and can be accessed online. You may want to click on the Union Jack to switch the CMS output to the English language.
This archive is one of Germany’s scientific treasuries! You may access many of the original works of Carl Friedrich Gauß, Leonhard Euler, David Hilbert, and so on. Of course, many of the documents will be written in German or even in Latin… In case of hand-written medieval papers, even the notation of formulae often is quite different from that used nowadays…

"Library for Double-Double and Quad-Double Arithmetic" by Y. Hida, X. S. Li, and D. H. Bailey, December 29, 2007.


How to specify constants accurately to DP precision?

Posted in Computer, Mathematics, Programming Languages, Software on February 7, 2012 by giorgiomcmlx

I like code that is highly portable even from one programming language to another. For widely used languages, there are FOSS parsers that manage the translation of the code’s syntax. However, when it comes to specifying the value of a constant to DP precision, a variety of problems may arise. For example, the software architects might have decided not to allow you to specify as many digits as are necessary for a unique DP number or you might not know the decimal separator used at execution time.

Imho, the most reliable way to exchange DP numbers across all kinds of borders is to store the DP number as a sequence of 16 hex symbols together with the endianness used for encoding. Obviously, this requires writing as well as testing routines for encoding and decoding, resp., in all programming languages of interest. Moreover, usually the code of high-level programming languages is mainly designed for readability by humans. Most programmers will realise that the sequence "3.14159265358979324" specifies Pi, but I doubt that many of them will notice "LEDP: 40 09 21 FB 54 44 2D 18" (where LE stands in for little endian) has exactly the same meaning.

When looking for an alternative that is highly portable and readable as well, I came across the method of rational approximation, which is valuable particularly in the regime 1E-3 to 1E+3. Usually, a product or ratio of two n-digit numbers has got 2·n significant digits. 17 digits are necessary to specify a unique DP number. Thus, it might be expected that it is possible to uniquely approximate a DP number by a ratio of two integer numbers with at maximum 9 digits, at least, if its magnitude is not too different from one. The following figure demonstrates how the ratios of two integers approximate Pi the closer the more significant digits are used.

rational approximation to Pi

The ratio of the two integers having the smallest amount of significant figures yet exactly reproducing the DP representation of Pi is found to be 245850922 / 78256779 . The figure also gives the notation of the ratios as continued fractions in standard (only positive coefficients) as well as extended (negative coefficients allowed, too) notation. The deviations that can be observed between the blue and red points, resp., in the regime beyond the DP accuracy limit just show the effect of this limit on calculations. Only the values obtained by using Maxima’s data type "bigfloat" are correct.

Concluding, pseudo-code statements like

Double Precision Constant Pi = CDBLE("245850922")/CDBLE("78256779");

are my favourites in the future. It will be interesting for me to observe how long faulty statements like this one

var pi = 3.14159265358979;

survive in documents found on the web. That statement does not yield Pi but a value corresponding to "LEDP: 40 09 21 FB 54 44 2D 11" that is seven ULPs smaller than Pi!