How to calculate 1-x·x in floating-point

Posted in Computer on December 13, 2012 by giorgiomcmlx

Some routines that calculate elliptic integrals show poor overall accuracy in case the elliptic modulus gets close to one, even if they have no pole there. It turned out that this effect is not caused by a numerical instability of the implemented core algorithm, but by a numerically unstable transformation used for calculating the complementary elliptic modulus right at the beginning of these routines. Let x denote the elliptic modulus here, then the complementary modulus y is given by y = sqrt( 1 – x · x ), wherein sqrt denotes the mathematical square-root function. Because IEEE recommendations guarantee an accurate sqrt function, the reason for the accuracy problems had to be found in how the mathematical expression 1 – x·x had been implemented in floating-point arithmetic.

In a first attempt, I tried to use the mathematically equivalent form of y1_DP := 1 – x·x that results from applying the third binomial theorem, which reads y2_DP := ( 1_DP – x_DP ) · ( 1_DP + x_DP ). It turned out that this version is much more accurate than the naive implementation in the regime x_DP ≥ 1/sqrt(2). Quite recently, I happened to come across a different implementation in an old Fortran library that had been designed in the 1980ies and seemed rather strange to me at a first glance, i.e., y3_DP := ( 2_DP * ( 1_DP – x_DP ) ) – ( ( 1_DP – x_DP ) * ( 1_DP – x_DP ) ). This implementation turned out to yield the smallest errors of all three implementations mentioned so far in case abs(x_DP) ≥ 0.5 . Moreover, it yields accurate results for more than 85% of the (tested) values and limits the rel. abs. error for all (tested) values to a maximum of 1 ULP.

This implementation is most easily derived from y2_DP: because for any 0.5 ≤ x_DP ≤ 1 the subtraction 1 – x_DP is accurate, the first multiplicand of y2_DP is always accurate for x_DP ≥ 0.5; so the task is to express 1 + x_DP as a function of 1 – x_DP. Obviously, we have 1 + x = 2 – ( 1 – x ), which yields y4_DP := ( 1 – x_DP ) *( 2 – ( 1 – x_DP ) ). Finally, y3_DP is obtained by applying the distributive law of algebra, which does not hold for floating-point arithmetic and can thus lead to different results (you might want to take the burden to read through Goldberg’s paper).

The following two graphs show the abs. rel. errors obtained for y1_DP, y2_DP, and y3_DP accumulated over samples of 1025 pseudo-random numbers x_DP each. If the accumulated error is found to be larger than 1025, the maximum error for a single x_DP is definitely larger than 1 ULP. On the other hand, if the individual maximum error is known to be limited by 1 ULP, hundred times one minus the value of the accumulated error divided by 1025 gives the percentage of accurate values per sample.

A follow-up post will address the question of how the usage of different floating-point units, e.g. x87 and SSE, influences these results.


Mathematical constants in program code

Posted in binary, Compiler, Computer, decimal, hexadecimal, Mathematics, multi-precision calculations, Number systems, Software, Software tools, xNumbers on October 2, 2012 by giorgiomcmlx

This text is published under the Creative Commons’ non-commercial share-alike (CC-NC-SA) license V3.0. It was created by GiorgioMCMLX at carolomeetsbarolo on in October 2012.

In source codes concerned with numerical subjects, usually the values of all frequently used expressions that do not change during execution time are precomputed externally and assigned to variables as a sequence of decimal figures. For example, in some fictious programming language that lacks an internal constant corresponding to the mathematical constant π, such a statement could look like

Declare Constant pi_on_3 as Double Precision = 1.04719755119659775E+00_8B

The mathematically exact value of π/3 = 1.047197551196597746154214461… comprises incountably many decimal figures. Thus, it was decided to round it to a precision of 18 decimal significant figures in the so-called scientific notation. The variable pi_on_3 is defined to be of the type double precision (DP) binary floating-point (FP) that uses 8 bytes = 64 bits for each value: 1 sign bit, 11 exponent bits, and 52 plus 1 implicit significand bits. This is the most widespread FP type used in numerical code nowadays. At a first glance, the declaration seems to be quite natural, but it neglects the very property of π that led to its invention: π is a so-called irrational number; moreover, it even belongs to the set of transcendental numbers among them. Consequently, π and all of its multiples definitely cannot be represented exactly by any of the floating-point (FP) variable types used in computers, because all their values are binary rational numbers that all correspond to decimal rational numbers. Nearly all mathematical constants belong at least to the group of irrational numbers. Otherwise, their invention would not have made much sense…

From the point of view of a compiler, 1.04719755119659775E+00 is not a number but a string consisting of decimal characters. The transformation of such numeric strings found in code into values of numerical variables is an example of what is called parsing. When a compiler meets a statement like that given above it will call its parser in order to determine a DP number that will substitute the value of the decimal string in the calculations. Parsing of decimals will almost always cause round-off errors (a special form of the more general quantisation errors), because the rational DP numbers form a sparse set of isolated points within the continuum of all the real numbers. As a consequence, all the incountably many real numbers comprised by a finite interval need to be parsed onto a single DP number.

Usually, it is tried to parse a numerical string to the nearest DP. Then, the magnitude of the quantisation error is somewhere between zero and half the distance between a pair of consecutive DP numbers. This full distance is called an ULP (unit in the last place). It is a relative measure: it doubles its absolute value at all the DP numbers that are exact powers of two. Each DP number DPval represents the interval of real numbers from DPval-0.5ULPs to DPval+0.5ULPs. Disregarding exact powers of two, the absolute distance between a pair of consecutive DP numbers is equal to the length of the interval that both DP numbers stand in for. Parsing a numerical string to the nearest DP thus
limits the unavoidable quantisation error to ±0.5ULPs. The absolute value of an ULP depends on the exponent of the DP number it is associated with: 1 ULP = 2^(LPo2-52), where LPo2 denotes the true leading power of two that is encoded in biased form in the exponent. LPo2 equals zero for all the DP numbers found in the interval [1;2); in this range, 1 ULP = 1/2^52 ≅ 2.22…E-16. This special absolute value is called machine epsilon.

Consequently, all exact binary DP numbers can be uniquely identified if at least 17 decimal significant figures (SFs) are provided. Unfortunately, this theorem seems to be widely misunderstood. In particular, it does not mean that any given numerical string consisting of 17 decimal SFs uniquely identifies a binary DP number. The limit of 17 decimal SFs only holds for parsing numerical strings the values of which correspond to exact DPs, for example when printing the results of DP calculations. In the other direction, for example when parsing numerical strings to DP numbers in order to specify the input of DP calculations, at maximum about 770 (seven-seven-zero, no typo!) decimal SFs might be needed. Thus, the fictious programmer, who might even have thought that specifying 18 decimal SFs is more than sufficient, might be terribly wrong.

For any parsing-to-nearest-DP process to succeed, it is necessary that the value of a given numerical string can be discriminated from the border between the quantisation intervals of a pair of consecutive DPs, which is called a "tie". Obviously, the rule "round/parse to nearest" has to be complemented by a tie-breaking instruction. Usually, this is "ties to even", and means that in case of a tie the DP number having an even binary representation is chosen, i.e., the trailing bit of which is zero. Ties formally involve just one additional bit as compared to the pair of DP numbers they separate; if the power of two that bit corresponds to is negative, this one extra bit exactly requires one extra decimal SF. The approximate relation between the total amount of decimal significant figures SF_dec of an exact DP and its leading power of two LPo2 has already been estimated my means of linear regression in a previous post:

SF_dec ≅ - 0,699*LPo2 + 52,5  ; LPo2 ≤ 52
SF_dec ≅ + 0,301*LPo2 + 0,496 ; LPo2 ≥ 52

For the most negative normalised value of LPo2, i.e. -1022, this formula predicts SF_dec ≅ 767. Consequently, at maximum about 770 decimal SFs assure that any numerical string having a value within the range of normalised DP numbers can definitely be parsed to the nearest DP number. Unfortunately, there is no way of predicting the amount of SF_dec that are really needed for a given decimal string without parsing it. The maximum amount will only be needed if the number is extremely close to a tie. On the other hand, if the number is close to an exact DP, 17 digits would do the job. So the scaling law that governs the amount of SF_dec actually needed for parsing-to-nearest-DP of a given decimal number needs to worked out.

The distance between a given decimal number and the nearest DP tie can be calculated harnessing multiple-precision software. All of the calculations presented here were done using the FOSS XL-Addin called xNumbers. A simple VBA-script was used to output the properties of the multi-precision results obtained when dividing π by all integers between 1 and 2E+7. In order to assure that the absolute value of the relative measure ULP remained constant, all results were scaled by a power of two so that all the final results are found in the range between one and two. This kind of scaling does not change the relative position of a decimal number with respect to the quantisation interval of the nearest DP, because these intervals just scale likewise! The scaled results were then ordered by decreasing distance to the nearest DP tie. In xNumbers, there is a built-in function "dgmat" that outputs the amount of matching digits of two xNs. Adding one to the result of "dgmat" thus yields a good estimate of the amount of significant decimal figures that would allow for correctly parsing each of the scaled results to the DP nearest to it. The following table summarises some of the results.

amount of decimal significant figures for selected multiples of pi

It can be seen that 19 decimal SFs enable parsing π/3 to the nearest DP, whereas for π/59425 23 SFs are needed. Of course, you should be aware that a "stupid" parser might use not all the figures and thus yield an incorrect result! In order to find the scaling law that relates the distance dist_tie between a decimal number and its nearest DP tie to the amount SF_dec of decimal SFs that needs to be specified for enabling parsing to nearest DP, the following figure depicts the values found in the last two columns of the preceding table (orange rhombuses). The normalised quantity dist_tie takes values between zero (exact tie) and one (exact DP); its values are computed by removing the sign from the non-normalised values as well as by dividing them by 0.5ULPs.

relation between distance to tie and amount of decimal significant figures for selected multiples of pi

The line drawn in blue colour corresponds to the finally deduced scaling law:

SF_dec ≅ 17 - log10(dist_tie)

Summarising, the amount of decimal significant figures to which a mathematical constant has to be specified in order to enable parsing-to-the-nearest-DP depends on the distance between this constant and its nearest DP tie. Unfortunately, this distance is not known until the constant has been parsed, so that this has to be done in the multi-precision environment in which the constant is calculated! This solution is mathematically correct but practically unsafe and stupid as well, because all the parsing work would have to be done at least twice: once when writing the code and once again during code execution. A simple improvement is obvious though: instead of using the probably very long numerical string corresponding to the mathematically exact value of a constant, simply the numerical string comprising the 17 decimal significant figures of the nearest DP is to be used.

This method can still be greatly improved by avoiding any parsing at run-time, which can be achieved by reading/writing bit patterns directly from/to the memory of a computer. All major programming languages provide means for this kind of manipulation. It usually converts a DP number into 16 hexadecimal figures of which each corresponds to 4 bits, and vice versa. Unfortunately, two different schemes co-exist that define the ordering of the 8 bytes of a DP number: little-endian and big-endian byte order. Within a byte, the bits always follow the big-endian scheme. Thus, most humans like to read these
hexadecimal numbers in big-endian notation, and that is why it is used here. Of course, the endianness of stored hexadecimal strings should be fixed (e.g., to big-endian), and the conversion routines should account for the endianness of the machine they are executed on. Concluding, the statement given at the beginning of this post should be replaced by something like this:

Declare pi_on_3 as Double Precision
pi_on_3 = HEX2DP("3FF0C152382D7366")

where HEX2DP calls the function that converts the 16 hexadecimal figures into a DP number. The code of this function depends on the programming language as well as on the hardware platform.

Catastrophic cancellation in action

Posted in Computer, Fortran, Maxima, Microsoft Excel, mpmath, multi-precision calculations, Programming Languages, Python, Software, Software tools, Spreadsheet, VBA, xNumbers on July 20, 2012 by giorgiomcmlx


In order to obtain the reference values needed for checking results of numerical software that use data types like double precision (DP), I like to harness three multiple-precision (MP) tools, i.e., xNumbers, Maxima, and PY-mpmath. If all the MP results agree to the last decimal digit, I take them for granted because it seems very unlikely to me that the mentioned tools are mutual clones of each other. Moreover, even effects caused by routines of an OS can be identified because mpmath as well as Maxima are supported on Linux and Windows as well. When saving the MP results to CSV-formatted files, all the MP numbers need to be preceded by a guard character in order to prevent the spreadsheet software, which will later be used to carry out the comparison, from stealing most of the decimal digits. Unfortunately, even parsers of high-level programming languages fail astonishingly often when converting numerical strings into the values of DP variables. Therefore, I usually save not only DP results, but their bytes in hex notation as well. According to my experience, these hex values provide the only reliable kind of how to represent data meant for exchange between any two different computer environments. It is particularly useful in case of floating-point numbers being much smaller than one, because in decimal notation a really huge amount of decimal figures may be necessary in order to uniquely denote a specific DP value.

The tangent function near π/2

The tangent function has a so-called pole () at π/2, i.e., with x approaching π/2 from below, tan(x) tends to infinity, and with x approaching π/2 from above, tan(x) tends to minus infinity. This behaviour contradicts the maximum-smoothness paradigm frequently used when designing an algorithm for a built-in function. Near π/2, correctly approximating the tangent function can only be achieved by using a model function that has a pole at π/2, too. Though it is not uncommon for built-in functions to use several different polynomial representations over the range they are defined on, the usage of model functions having poles is surprisingly rare.
For example, many people have realised that the tangent functions of MS-Excel&register;™ 2002 and VBA6 as well lack the change of modelling paradigm in their model functions. Consequently, over some interval around the mathematically exact value of π/2, they both yield very in-accurate values for tan(x). In the next figure, the effect is demonstrated for the 49 DP values bracketing the exact value of π/2.

The maximum error is found outside the range of this figure; it occurs at pio2_DP, by which the DP value closest to π/2 is denoted here, and which is larger than π/2 by about 1.22E-17. In case of XL2002, the maximum relative error is found to be about -4.1E-4, in case of VBA6, it is 3.3E-5. Obviously, different model functions are being used for the tangent function in XL2002 and VBA6, resp. The relative errors of the tangent function in XL2002 are about 12.4 times bigger than those in VBA6 and of opposite sign! This means XL2002 can hardly be used to try out an algorithm in a spreadsheet before encoding it in VBA6… interesting…

Some people obviously know of the trigonometric identity tan(x)=1/tan(π/2-x), and have tried to use it in order to obtain a more reliable value of tan(x) in case x is very close to π/2. Applying this trigonometric identity naively to a value of x near π/2 in DP even leads to relative errors that are at least about 5 orders of magnitude bigger than those found in case of the flawed built-in tangent functions are. Thus, defining w_DP = pio2_DP – x_DP and then using 1/tan(w_DP) as a substitute for tan(x) is a very bad idea, as can be seen from the next figure.

From comparisons to results of MP software, it is known that around zero the tangent functions in XL2002 and VBA6 as well yield identical values that moreover are correct to DP precision. This means there must be some odd effect that renders the difference pio2_DP – x meaningless in case x is close to pio2_DP. This effect is easily identified: it is called catastrophic cancellation. π is an irrational number that definitely cannot be represented exactly by the sparse set that all the rational DP values form within the dense set of real values. Mathematicians would say something like “The set comprising all the DP values is of measure zero within the set of real numbers”, by which they mean that the probability any given real number can be exactly represented by a DP value is almost always zero.
Imho, this points out one of FORTRAN’s really big crimes: calling a data type “real” that in fact can only represent rational numbers; moreover, it can represent even only a tiny subset of all the rational numbers. Nevertheless, the set of DP numbers has been cleverly designed in order to cover a range of numerical values that should be sufficient for almost all numerical efforts of humans. Unfortunately, this claim does not hold true for the precision of the DP numbers.

The DP value nearest to π/2, pio2_DP, is found to be 3F F9 21 FB 54 44 2D 18 in big endian DP hex notation, which corresponds to about +1.570796326794896558 in decimal notation. Theoretically, exactly identical problems should occur around pi_DP, which is represented by 40 09 21 FB 54 44 2D 18 or about +3.141592653589793116, resp., because not only the error will double from 6.12E-17 to 1.22E-16, but the distance between adjacent DP number will do that too, as 1 < π/2 < 2 and 2 < π < 4. The next figure provides the position of π in the three floating-point formats SP, DP, and QP. All the differences are given in ULPs, an abbreviation that means “unit in the last place” and thus is a relative measure. The corresponding absolute measures are given in the figure. There is one important detail: in SP, which was the dominant floating point format until about 1985, the value nearest to π was larger than the mathematical value, whereas in DP as well as in QP it is smaller than π. This caused much trouble when code was ported from SP to DP about thirty years ago… and fortunately will cause much less trouble when code is ported from DP to QP in the near future.

It is now clear why the cancellation that occurs when π/2-x is calculated as pio2_DP-x_DP is to be called catastrophic: the closer this difference gets to zero, the more prominent the consequences of the in-accuracy in pio2_DP will be. Fortunately, the cure is easy enough: simply provide all the bits lost in cancellation in order to keep the result accurate to the full precision of 53 bits. This can be achieved by defining the DP variable v_DP, which is meant to represent the DP values of π/2-x, as follows: v_DP = [(pio2_DP – x_DP) + pio2_DPcorr]. Therein, pio2_DPcorr contains the trailing 53 bits that need to be added to pio2_DP in order to yield an approximation to π/2 with a precision of 106 bits. Any bits that are lost due to cancellation in the difference (pio2_DP – x_DP) will be filled in from pio2_DPcorr, thus maintaining a precision of 53 bits under all circumstances. Consequently, 1/tan(v) does not show any systematic errors and should be used to replace tan(x) in the region where it is flawed, as can be seen from the next figures.

According to my experience, the tangent function in XL2010 is still flawed around π/2, but the one in VBA7 is not. Keep in mind that mathematically unnecessary brackets can really matter in Excel, i.e., in case x_DP is very close to pio2_DP, entering something corresponding to “=(pio2_DP – x_DP)” in a cell will produce exact results, whereas “= pio2_DP – x_DP” will yield a large interval filled with zeros around x_DP = pio2_DP.

In general, it is hard to predict whether a software implements a tangent function that is precise around π/2 or not. In order to enable you to check for that, here’s to you the two DP values bracketing π2 together with the correct DP values of the tangent function at these values:

x_lo_DP : 3F F9 21 FB 54 44 2D 18 // +1.570796326794896558
x_hi_DP : 3F F9 21 FB 54 44 2D 19 // +1.570796326794896780
tan(x_lo_DP)_DP : 43 4D 02 96 7C 31 CD B5 // +16331239353195370.0
tan(x_hi_DP)_DP : C3 36 17 A1 54 94 76 7A // -6218431163823738.0

In case you would like to implement your own tangent function using the helper variable v as defined above, you need to subtract your x from the value of x_lo_DP given above; the DP that needs to be added to that result in order to maintain the full precision of 53 bits is:

pio2corr : 3C 91 A6 26 33 14 5C 07 // +6.123233995736766036E-17

The level of precision of such pairs of DP values is often referred to by the term “double-double” (DD). In this context, replacing a single line of code in DP precision by its equivalent in DD made the algorithm work properly. As long as the IEEE QP floating-point data type is not widely available in numerical software, using libraries written in DD may solve many problems that are caused by catastrophic cancellation in 53 bits. Depending on whether the programming language used to implement such a library makes use of the high-precision internal registers of modern CPUs or not, some care must be taken in order to make them work properly. Imho, VBA does not use them, but modern versions of FORTRAN, C, and Python etc. do. In this case, the line v = (pio2_DP – x_DP) + pio2_DP_corr would need to be modified in order to assure that the difference (pio2_DP – x_DP) is rounded down to DP precision before it is used again in the addition part. Thus, it has to be pulled out of the CPU and written to memory, from where it is to be re-read then.

Obviously, the procedure of partial enhancement of precision as described in this post has a variety of applications. One of the most important examples of where it is necessary to be implemented is the so-called argument reduction that is frequently applied in all subjects that deal with periodic processes.

follow-up: using multiple versions of gcc on Windows

Posted in Compiler, Computer, Fortran, gcc, Programming Languages, Software on April 23, 2012 by giorgiomcmlx

I got some mails in which I was asked to describe the solution to the “liblto_plugin-0.dll not found” problem a bit more detailed. So let us assume that you installed (or plan to do that) all versions of gcc previously obtained from as subdirectories of the folder “X:\myCompilers” (myCompilers might be a path name!). Let us further assume, for example, that you decided to call these subdirectories “gcc461” and “gcc470” according to the version numbers of the gcc suites during the installations. At least, this convention will turn out to be a smart idea. After having firstly saved the environment variables to disk and secondly having erased all references to gcc from them, you might then want to create a DOS batch file called “go_gcc.bat”, for example, in the root directory of drive “X:”. Two lines of this batch file then set two environment variables to the values required by a specific gcc version obtained from

set PATH=X:\myCompilers\gcc%1%2%3\bin;E:\Projekte\COMPILER\gcc%1%2%3\libexec\gcc\i686-pc-mingw32\%1.%2.%3;%PATH%
set EQ_LIBRARY_PATH=X:\myCompilers\gcc%1%2%3\i686-pc-mingw32\lib

Of course, you have to use the saved information about the path names on your PC in order to correctly substitute all the symbolic paths names used in this post. Finally, if you want to use gcc and thus start cmd, you only need to enter “X:\go_gcc 4 6 3” in order to enable version 4.6.3 of the gcc suite, for example. If you plan to install multiple versions of gcc, it is even smarter to use “gcc_x.y.z” as folder names, as you then only need to use a single batch variable that holds the value of “x.y.z”:

set PATH=X:\myCompilers\gcc_%1\bin;E:\Projekte\COMPILER\gcc_%1\libexec\gcc\i686-pc-mingw32\%1;%PATH%
set EQ_LIBRARY_PATH=X:\myCompilers\gcc_%1\i686-pc-mingw32\lib

You would want to start this batch file typing in “X:\go_gcc 4.6.3”.

I hope things are obvious now.

strange error message by GCC on Windows platform

Posted in Compiler, Computer, Fortran, Programming Languages, Software on April 22, 2012 by giorgiomcmlx

After having installed a second version of the GCC suite on the MS-Windows platform of my PC, it turned out to be impossible to use the gfortran front end of the older one. I had downloaded both editions from the web page.

When trying to link a program using the older gcc/gfortran version, I always got an error message that looked like “fatal error: -fuse-linker-plugin, but liblto_plugin-0.dll not found”. Unfortunately, even a very extensive web search did not help. As I had not changed the position of the old files in the directory tree, it must have been the installation routine that created that mess. Indeed, the problem turned out to be caused by the installation routine adding the paths to two directories of the newer gcc version at the beginning of two environment variables, i.e., the user variable EQ_LIBRARY_PATH, and the system variable PATH. Because the name of the file is still the same in the newer version, the older gcc suite detects a wrong version of that file and raises the error message. Obviously, the error message should rather read “gcc found a wrong version of …”.

Thus, the cure to the problem is to edit these two environment variables manually. After having saved their proper content to text files by typing “set > set.txt” and “path > path.txt” on the command line, e.g., erase all information related to the gcc versions from both variables. In order to change PATH permanently, you need to have admin privileges. After that, just use a batch file on the command line to set the content of both variables according to the version of gcc that you are going to use in that very session of the command interpreter and all is fine. At least, that’s how it worked for me.

The Veltkamp-Dekker route to extended precision

Posted in Computer, multi-precision calculations, Programming Languages, Software tools on February 13, 2012 by giorgiomcmlx

Particularly in case of embedded systems, memory usage and calculation speed really matter even nowadays. Thus, if it can be proven in advance that only some steps of a long algorithm need to be carried out on a higher level of accuracy than the rest of the code, a method relying on the so-called Veltkamp-Dekker split (VDS) algorithm for floating-point numbers may be used in these very steps.

The Veltkamp – Dekker split of floating-point numbers

The idea behind the VDS algorithm is to divide the mantissa figures of a floating-point number numb (nearly) equally into those of the two numbers numb_hi and numb_lo of the same kind without overlap. Let mfigs denote the mathematical maximum of figures of the mantissa (or more exact, the significand), and let base denote the base of the number system in use. Then, the steps of the VDS algorithm can be described in pseudo code as follows:

1. expo = mfigs - Floor( mfigs / 2 )
2. fact = base ^ expo + 1
3. help = FLPTmult( fact ; numb )
4. numb_hi = FLPTsubt( help ; FLPTsubt( help ; numb ) )
5. numb_lo = FLPTsubt( numb ; numb_hi ) )

In this code, "Floor" means rounding down to the nearest integer, "FLPTmult ( a ; b )" means calculating the product of a and b to exactly the same precision that both a and be have, and "FLPTsubt ( a ; b )" means subtracting b from a on exactly the level of precision that both a and be have. Mixed-precision calculations will be described later.
In case of IEEE754-SP(DP) numbers, base equals 2, and mfigs takes a value of 24(53) inclusive the hidden leading bit, thus yielding a value of expo of 12(27). The first of the two splitted numbers, i.e. numb_hi, carries the leading 12(26) mantissa bits of the original number numb, and the second one, i.e. numb_lo, carries its trailing 12(27) bits. In other words, any reasonable value of expo determines the minimum amount of trailing empty bits of numb_hi.
numb_hi always inherits its sign from numb. Because the sign of numb_lo may be negative, the exponent of numb_hi may be larger than that of numb is. Consequently, the number of significant bits is not preserved either.

The following examples use SP numbers in order to demonstrate important characteristics to the VDS algorithm. The decimal representation of these numbers is calculated utilising an arbitrary precision library in order to enable displaying all significant decimal figures.

numbSP := sum over n from -21 to 2 of 2^n
= +7.999999523162841796875

numbSP has 24 significant bits by construction. The VDS algorithm yields the following results when applied to this number (both SP hex and bin representations are in little endian order):

numbSP = 40 FF FF FF
= + 7.999999523162841796875

numbSP_hi = 41 00 00 00
= + 8.0

numbSP_lo = B5 00 00 00
= - 0.000000476837158203125

Contrary to the hex notation, the binary notation given here does not directly correspond to the byte values in memory but separates the sign bit from the exponent bits and shows the hidden leading mantissa bit as well. Double slashes were inserted between all the three parts of any SP number in order to increase readability. As can be easily seen, numbSP has 24 significant bits, whereas both numbSP_hi and numbSP_lo have only 1.

The number Pi now serves as an example in case the decimal is not exactly representable by a FP number:

numb := Pi &approx; +3.14159265358979323846…

numbSP = 40 49 0F DB
= + 3.1415927410125732421875

numbSP_hi = 40 49 10 00
= + 3.1416015625

numbSP_lo = B7 14 00 00
= - 0.0000088214874267578125

From comparing the bit pattern of numbSP to that of numbSP_hi, it can be seen that using 12 as value of expo removes exactly the 12 trailing bits of numbSP. A closer look reveals that these bits are not simply cleared but that the value of numbSP_hi is the value of numbSP rounded to that smaller level of binary accuracy. Like in the previous example, the sum of the significant bits in both numbSP_hi and numbSP_lo is 18 and thus much lower than 24, which is the number of significant bits of numbSP.

The main purpose of such a VDS split is to enable calculations on a higher level of accuracy than provided by the data format actually used. Using any of the numerical data formats provided by a programming language, all elementary mathematical operations take two numbers as input and yield one number as output. The trick harnessed in order to achieve a higher level of accuracy is to apply the VDS method to any input number before it is processed. Thus, all the elementary mathematical operations now take four numbers as input and yield a pair of two numbers as output. This additive pair of numbers exactly represents the algebraic result of the elementary operation. The high part of these two output numbers is the closest approximation to the exact result that is possible with only one FP number. The low part gives the correction that needs to be added to the high part in order to get the exact result.
"Exact" here means correct to a higher level of floating-point accuracy; of course, it is still impossible to exactly represent Pi at any level of FP accuracy, for example. The routines for adding and multiplying two numbers that were previously subject to the VDS method will be described in a following post.


Dekker’s paper is written in English and can be accessed online. You may want to click on the Union Jack to switch the CMS output to the English language.
This archive is one of Germany’s scientific treasuries! You may access many of the original works of Carl Friedrich Gauß, Leonhard Euler, David Hilbert, and so on. Of course, many of the documents will be written in German or even in Latin… In case of hand-written medieval papers, even the notation of formulae often is quite different from that used nowadays…

"Library for Double-Double and Quad-Double Arithmetic" by Y. Hida, X. S. Li, and D. H. Bailey, December 29, 2007.

How to specify constants accurately to DP precision?

Posted in Computer, Mathematics, Programming Languages, Software on February 7, 2012 by giorgiomcmlx

I like code that is highly portable even from one programming language to another. For widely used languages, there are FOSS parsers that manage the translation of the code’s syntax. However, when it comes to specifying the value of a constant to DP precision, a variety of problems may arise. For example, the software architects might have decided not to allow you to specify as many digits as are necessary for a unique DP number or you might not know the decimal separator used at execution time.

Imho, the most reliable way to exchange DP numbers across all kinds of borders is to store the DP number as a sequence of 16 hex symbols together with the endianness used for encoding. Obviously, this requires writing as well as testing routines for encoding and decoding, resp., in all programming languages of interest. Moreover, usually the code of high-level programming languages is mainly designed for readability by humans. Most programmers will realise that the sequence "3.14159265358979324" specifies Pi, but I doubt that many of them will notice "LEDP: 40 09 21 FB 54 44 2D 18" (where LE stands in for little endian) has exactly the same meaning.

When looking for an alternative that is highly portable and readable as well, I came across the method of rational approximation, which is valuable particularly in the regime 1E-3 to 1E+3. Usually, a product or ratio of two n-digit numbers has got 2·n significant digits. 17 digits are necessary to specify a unique DP number. Thus, it might be expected that it is possible to uniquely approximate a DP number by a ratio of two integer numbers with at maximum 9 digits, at least, if its magnitude is not too different from one. The following figure demonstrates how the ratios of two integers approximate Pi the closer the more significant digits are used.

rational approximation to Pi

The ratio of the two integers having the smallest amount of significant figures yet exactly reproducing the DP representation of Pi is found to be 245850922 / 78256779 . The figure also gives the notation of the ratios as continued fractions in standard (only positive coefficients) as well as extended (negative coefficients allowed, too) notation. The deviations that can be observed between the blue and red points, resp., in the regime beyond the DP accuracy limit just show the effect of this limit on calculations. Only the values obtained by using Maxima’s data type "bigfloat" are correct.

Concluding, pseudo-code statements like

Double Precision Constant Pi = CDBLE("245850922")/CDBLE("78256779");

are my favourites in the future. It will be interesting for me to observe how long faulty statements like this one

var pi = 3.14159265358979;

survive in documents found on the web. That statement does not yield Pi but a value corresponding to "LEDP: 40 09 21 FB 54 44 2D 11" that is seven ULPs smaller than Pi!