Archive for the Mathematics Category

To SSE, or not to SSE, that is the question

Posted in binary, decimal, gnumeric, Maxima, Microsoft Excel, multi-precision calculations, Number processing, Number systems, Software tools, Spreadsheet, VBA, xNumbers on February 25, 2014 by giorgiomcmlx

Some time ago, I discovered that in a thirty year old floating-point (FP) library the value of the mathematical expression 1 – x2 was determined by calculating 2·(1-x)-(1-x)2 for 0.5 ≤ x ≤ 1. At a first glance, this method appeared rather peculiar to me, but then I realised that the Lemma of Sterbenz (see, e.g. page 45 of this slide collection on FP calculations) guarantees that 1-x is calculated in FP without any round-off error in the specified range. The multiplication by 2 does not introduce errors as well. Yet, there are still two sources of errors: firstly, in case (1-x) has more than 26 significant bits, (1-x)2 is not exactly representable in FP, and secondly, finally taking the difference might cause a round-off error as well.

In my tests though, this old formula outperformed all other versions on average, and so I shared it with developer friends. Surprisingly, some of them responded with a statement like "The results produced by this formula do not differ from those obtained using (1-x)·(1+x)." In the end, the reason for this paradox could be identified: all my tests were executed on SSE registers, calculations on which strictly adhere to the double-precision floating-point format, whereas the calculations not showing any improvement had harnessed the intermediate extended double-precision format of the x87 stack. Modern x87-compatible FPUs provide both of these two very different kinds of intermediate FP number storage.

The SSE have been invented in favour of fast parallel data processing, whereas the x87 double-extended stack provides both 11 additional mantissa bits and 4 additional exponent bits to intermediate calculations in order to try to maintain an overall accuracy of 53 bits and to reduce the probability of intermediate under-/over-flows as well. The decision whether a piece of software uses the x87 or the SSE instruction sets has to be made up at compile time, because the generated byte code is different. Contrary, both the rounding-mode and the exception-handling properties may be influenced at runtime by changing the so called floating-point control word (FPUCW). I could not find an article on Wikipedia, so please see e.g. this collection of or this introduction to the FPUCW world. I also have come across these imho nice summaries that are written from C-programmers points of view: The pitfalls of verifying floating-point computations as well as SciLab is not naive. Unfortunately, values of the FPUCW having an identical meaning differ across platforms. For example, a value of the MS-FPUCW of hex 8001F corresponds to a GNU-GCC-FPUCW value of hex 37F …  Fortunately, the ability to change the rounding mode during runtime enables programmers to harness interval arithmetic in order to compute reliable FP error estimates of their code for both the x87 and SSE instructions sets. The rounding mode can be chosen from the set UP, DOWN, CHOP, and TO NEAREST – TIES TO EVEN, the latter usually being the default setting. Only if the x87 instruction has been selected at compile time, the precision of the calculations may be changed during runtime as well.

The 11 additional significand bits of the x87 extDP format correspond to roughly 3 decimal figures, and thus are a valuable tool in order to minimise the effects of cancellation in intermediate calculations. The 4 additional exponent bits allow for an intermediate product of up to 16 huge (tiny) doubles without causing an over-(under-)flow. Yet, harnessing the extDP format bears one dirty little secret: when the result of an intermediate calculation is transferred from the x87 stack firstly into an extDP x87 register and finally into the DP main memory, the intermediate value is subject to a double-rounding procedure. In combination with the standard rounding mode "TO NEAREST – TIES TO EVEN" this can lead to a slightly wrong result in two situations: an intermediate calculation rounds up/down to an extDP value representing an exact DP tie that is again rounded up/down to a DP value. The following tables give an example:

decimal number    DP hex    comment
+18446744073709551616.0 43F0000000000000 exact DP Num1
+6143.0 40B7FF0000000000 exact DP Num2
+18446744073709557759.0 Num1 + Num2,
not representable in DP
+18446744073709555712.0 43F0000000000001 DP_lo: nearest DP
+18446744073709559808.0 43F0000000000002 DP_hi: next larger DP
-2047.0 error of DP_lo
+2049.0 error of DP_hi

When adding Num1 to Num2 on the x87 extDP stack, the hex representation of the final DP sum in memory will be 43F0000000000002 instead of the closest DP 43F0000000000001. This can be understood by analysing the binary representation of the numbers involved. In order to clearly show the rounding effects, some separators are used: a "V" indicates the border between any two DP half-bytes, each of which corresponds to a single character in the hex representation, an "X" shows where the DP bits end and the extDP bits begin, and finally a "Y" marks the end of the extDP bit sequence.

Num1 + Num2
same with separators
1st step: round (up) to extDP
2nd step: round DP tie (up) to even

In this example, the addition creates a bit sequence, the mantissa of which exceeds the 64 bits of the extDP format. Thus, in a first step, the result is rounded to 64 bits. These extDP bits happen to represent a so-called tie (a number exactly half-way between the two adjacent DP numbers) of the DP format. In this example, the tie-breaking rule TO EVEN finally causes another rounding up procedure, so that the value stored in memory does not represent the DP number closest to the mathematical result of the summation, but the next larger one. Yet, it is still one of the two DP numbers bracketing the true value.

Only two different bit sequences among the 2^13 possible ones lead to this kind of error:
leadingbitsIXOIIIIIIIIIIYItrailingbits ++
leadingbitsOXIOOOOOOOOOOYOtrailingbits —

so that the odds ratio of this error should be expected to be close to 1/2^12 = 1/4096 ≈ 2.44E-4 . Of course, if the bit sequence of an intermediate result does not exceed the 64 bits of the extDP format, there will be only a single rounding step yielding a correct result. When analysing the results of 5.75E5 renormalisation steps of double-double numbers in VBA7, the sum of which is determined to exceed the 64 bit extDP limit, and that were introduced here at the end of this post, I found 1480 errors and thus an odds ratio of about 2.57E-4. Due to the relatively large sample size, I guess the deviation from the theoretical value is caused by non-randomness of the bit sequences involved.

Double-rounding is unavoidable when using the extDP registers of the x87 FPU, but it only slightly increases the representation error of about 0.5 ULPs by at maximum 0.05&percent;. Moreover, this only happens with an odds ratio of about 1/4096. Thus, this disadvantage of rarely occurring false double rounding is usually more than counterbalanced by the gain of 11 bits of intermediate precision in every calculation. However, there is a special kind of multi-precision software that cannot be run on the x87 stack due to the double rounding, i.e. all the double-double as well as quad-double software that use chained DP variables in order to store multi-precision values (see, e.g. this implementation). The double rounding on the x87 stack can lead to violations of the basic assumption of this kind of software, i.e., that the leading DP value always represents the DP value closest to the multi-precision value stored in a chain of doubles. Therefore, arbitrary precision tools being determined to use the x87 stack, for example Xnumbers, need to harness completely different algorithms for internal number storage. The numerical trick using a double-double representation of π/2 introduced in the aforementioned post works reliably, because the firstly calculated intermediate difference does not involve more than 64 significant bits.

The FP instruction set actually used by third-party numerical software is often unknown and needs to be determined by using test cases. In order to avoid problems with the digits chopping feature of most spreadsheet software, examples need to be created that use exact DP values corresponding to less than 14 significant decimal digits. The previously given example can be reformulated as sum of +1.8446744E+19 ≡ 43EFFFFFFDDAD230 and +73709557759 ≡ 4231296E97FF0000. In order to check the DP result for correctness without any debugging capabilities like those provided by Xnumbers for MS XL, e.g., both DP numbers that were summed up before now need to be successively subtracted from the sum, the bigger one at first, and the smaller one afterwards. It is important to assure that all intermediate differences are rounded to DP, which can most easily be achieved by assigning them to different DP variables, or different cells in a spreadsheet. Contrary to the first addition, the two final subtractions do not introduce any FP error, because they are subject to the lemma of Sterbenz. A final result of -2047 corresponds to calculations that have not been executed on the x87, whereas a value of +2049 indicates the opposite. Here are my PY3- and CAS-Maxima scripts:

Python 3     CAS-Maxima
num1 = +1.8446744E+19 +1.8446744E+19;
num2 = +73709557759.0 +73709557759.0;
sum12 = (num1+num2) %o1+%o2;
sum12sub1 = (sum12-num1) %-%o1;
sum12sub1sub2=(sum12sub1-num2) %-%o2;

The following table summarises the results I found in some spreadsheet and scripting software running on MS WIN7x64prof. On Linux, all SW listed in the table that is available on that OS as well returns -2047.

spreadsheet SW   result        scripting SW   result
GNUmeric 1.12.9 +2049 PY2.7.3×64 -2047
GScalc 11.4.2 -2047 PY3.2.3×64 -2047
KSspread 9.1.0 +2049 R2.15.2×64 -2047
LOcalc -2047 R3.0.0×32 +2049
LT SYM 3.0.1 -2047 R3.0.0×64 -2047
MS XL14/VBA7 +2049 FreeMat 4.2 -2047
SM PM 2012 -2047 Maxima x32 +2049

As you can see from the table, the situation is completely messed up. It seems to me as if on the Win32 platform the x87 with its extDP format was used by default, whereas on the Win64 platform the SSE part of the FPU was involved. Of course, any software can override the default setting, and MS XL64 as well as VBA7 indeed do so. Generally speaking, specifying the FPUCW as well as the kind of FPU during compilation is a very good idea in case of numerical libraries that were developed and optimised for only one of the two scenarios. In case these two setting are omitted and the unknown defaults are used at runtime, small deviations are to be expected only in case the software was designed not to harness the extDP precision of the x87 stack. Otherwise, SSE lacking the additional 11 bits of precision will very likely cause things like polynomial approximations to special functions to fail at runtime. In particular, source code of libraries originally designed for the x87 must not be re-compiled for SSE without closely inspecting the results. Libraries that were built before the advent of the x87 can be a very valuable source of inspiration for changes in the code that are likely to become necessary.


Mathematical constants in program code

Posted in binary, Compiler, Computer, decimal, hexadecimal, Mathematics, multi-precision calculations, Number systems, Software, Software tools, xNumbers on October 2, 2012 by giorgiomcmlx

This text is published under the Creative Commons’ non-commercial share-alike (CC-NC-SA) license V3.0. It was created by GiorgioMCMLX at carolomeetsbarolo on in October 2012.

In source codes concerned with numerical subjects, usually the values of all frequently used expressions that do not change during execution time are precomputed externally and assigned to variables as a sequence of decimal figures. For example, in some fictious programming language that lacks an internal constant corresponding to the mathematical constant π, such a statement could look like

Declare Constant pi_on_3 as Double Precision = 1.04719755119659775E+00_8B

The mathematically exact value of π/3 = 1.047197551196597746154214461… comprises incountably many decimal figures. Thus, it was decided to round it to a precision of 18 decimal significant figures in the so-called scientific notation. The variable pi_on_3 is defined to be of the type double precision (DP) binary floating-point (FP) that uses 8 bytes = 64 bits for each value: 1 sign bit, 11 exponent bits, and 52 plus 1 implicit significand bits. This is the most widespread FP type used in numerical code nowadays. At a first glance, the declaration seems to be quite natural, but it neglects the very property of π that led to its invention: π is a so-called irrational number; moreover, it even belongs to the set of transcendental numbers among them. Consequently, π and all of its multiples definitely cannot be represented exactly by any of the floating-point (FP) variable types used in computers, because all their values are binary rational numbers that all correspond to decimal rational numbers. Nearly all mathematical constants belong at least to the group of irrational numbers. Otherwise, their invention would not have made much sense…

From the point of view of a compiler, 1.04719755119659775E+00 is not a number but a string consisting of decimal characters. The transformation of such numeric strings found in code into values of numerical variables is an example of what is called parsing. When a compiler meets a statement like that given above it will call its parser in order to determine a DP number that will substitute the value of the decimal string in the calculations. Parsing of decimals will almost always cause round-off errors (a special form of the more general quantisation errors), because the rational DP numbers form a sparse set of isolated points within the continuum of all the real numbers. As a consequence, all the incountably many real numbers comprised by a finite interval need to be parsed onto a single DP number.

Usually, it is tried to parse a numerical string to the nearest DP. Then, the magnitude of the quantisation error is somewhere between zero and half the distance between a pair of consecutive DP numbers. This full distance is called an ULP (unit in the last place). It is a relative measure: it doubles its absolute value at all the DP numbers that are exact powers of two. Each DP number DPval represents the interval of real numbers from DPval-0.5ULPs to DPval+0.5ULPs. Disregarding exact powers of two, the absolute distance between a pair of consecutive DP numbers is equal to the length of the interval that both DP numbers stand in for. Parsing a numerical string to the nearest DP thus
limits the unavoidable quantisation error to ±0.5ULPs. The absolute value of an ULP depends on the exponent of the DP number it is associated with: 1 ULP = 2^(LPo2-52), where LPo2 denotes the true leading power of two that is encoded in biased form in the exponent. LPo2 equals zero for all the DP numbers found in the interval [1;2); in this range, 1 ULP = 1/2^52 ≅ 2.22…E-16. This special absolute value is called machine epsilon.

Consequently, all exact binary DP numbers can be uniquely identified if at least 17 decimal significant figures (SFs) are provided. Unfortunately, this theorem seems to be widely misunderstood. In particular, it does not mean that any given numerical string consisting of 17 decimal SFs uniquely identifies a binary DP number. The limit of 17 decimal SFs only holds for parsing numerical strings the values of which correspond to exact DPs, for example when printing the results of DP calculations. In the other direction, for example when parsing numerical strings to DP numbers in order to specify the input of DP calculations, at maximum about 770 (seven-seven-zero, no typo!) decimal SFs might be needed. Thus, the fictious programmer, who might even have thought that specifying 18 decimal SFs is more than sufficient, might be terribly wrong.

For any parsing-to-nearest-DP process to succeed, it is necessary that the value of a given numerical string can be discriminated from the border between the quantisation intervals of a pair of consecutive DPs, which is called a "tie". Obviously, the rule "round/parse to nearest" has to be complemented by a tie-breaking instruction. Usually, this is "ties to even", and means that in case of a tie the DP number having an even binary representation is chosen, i.e., the trailing bit of which is zero. Ties formally involve just one additional bit as compared to the pair of DP numbers they separate; if the power of two that bit corresponds to is negative, this one extra bit exactly requires one extra decimal SF. The approximate relation between the total amount of decimal significant figures SF_dec of an exact DP and its leading power of two LPo2 has already been estimated my means of linear regression in a previous post:

SF_dec ≅ - 0,699*LPo2 + 52,5  ; LPo2 ≤ 52
SF_dec ≅ + 0,301*LPo2 + 0,496 ; LPo2 ≥ 52

For the most negative normalised value of LPo2, i.e. -1022, this formula predicts SF_dec ≅ 767. Consequently, at maximum about 770 decimal SFs assure that any numerical string having a value within the range of normalised DP numbers can definitely be parsed to the nearest DP number. Unfortunately, there is no way of predicting the amount of SF_dec that are really needed for a given decimal string without parsing it. The maximum amount will only be needed if the number is extremely close to a tie. On the other hand, if the number is close to an exact DP, 17 digits would do the job. So the scaling law that governs the amount of SF_dec actually needed for parsing-to-nearest-DP of a given decimal number needs to worked out.

The distance between a given decimal number and the nearest DP tie can be calculated harnessing multiple-precision software. All of the calculations presented here were done using the FOSS XL-Addin called xNumbers. A simple VBA-script was used to output the properties of the multi-precision results obtained when dividing π by all integers between 1 and 2E+7. In order to assure that the absolute value of the relative measure ULP remained constant, all results were scaled by a power of two so that all the final results are found in the range between one and two. This kind of scaling does not change the relative position of a decimal number with respect to the quantisation interval of the nearest DP, because these intervals just scale likewise! The scaled results were then ordered by decreasing distance to the nearest DP tie. In xNumbers, there is a built-in function "dgmat" that outputs the amount of matching digits of two xNs. Adding one to the result of "dgmat" thus yields a good estimate of the amount of significant decimal figures that would allow for correctly parsing each of the scaled results to the DP nearest to it. The following table summarises some of the results.

amount of decimal significant figures for selected multiples of pi

It can be seen that 19 decimal SFs enable parsing π/3 to the nearest DP, whereas for π/59425 23 SFs are needed. Of course, you should be aware that a "stupid" parser might use not all the figures and thus yield an incorrect result! In order to find the scaling law that relates the distance dist_tie between a decimal number and its nearest DP tie to the amount SF_dec of decimal SFs that needs to be specified for enabling parsing to nearest DP, the following figure depicts the values found in the last two columns of the preceding table (orange rhombuses). The normalised quantity dist_tie takes values between zero (exact tie) and one (exact DP); its values are computed by removing the sign from the non-normalised values as well as by dividing them by 0.5ULPs.

relation between distance to tie and amount of decimal significant figures for selected multiples of pi

The line drawn in blue colour corresponds to the finally deduced scaling law:

SF_dec ≅ 17 - log10(dist_tie)

Summarising, the amount of decimal significant figures to which a mathematical constant has to be specified in order to enable parsing-to-the-nearest-DP depends on the distance between this constant and its nearest DP tie. Unfortunately, this distance is not known until the constant has been parsed, so that this has to be done in the multi-precision environment in which the constant is calculated! This solution is mathematically correct but practically unsafe and stupid as well, because all the parsing work would have to be done at least twice: once when writing the code and once again during code execution. A simple improvement is obvious though: instead of using the probably very long numerical string corresponding to the mathematically exact value of a constant, simply the numerical string comprising the 17 decimal significant figures of the nearest DP is to be used.

This method can still be greatly improved by avoiding any parsing at run-time, which can be achieved by reading/writing bit patterns directly from/to the memory of a computer. All major programming languages provide means for this kind of manipulation. It usually converts a DP number into 16 hexadecimal figures of which each corresponds to 4 bits, and vice versa. Unfortunately, two different schemes co-exist that define the ordering of the 8 bytes of a DP number: little-endian and big-endian byte order. Within a byte, the bits always follow the big-endian scheme. Thus, most humans like to read these
hexadecimal numbers in big-endian notation, and that is why it is used here. Of course, the endianness of stored hexadecimal strings should be fixed (e.g., to big-endian), and the conversion routines should account for the endianness of the machine they are executed on. Concluding, the statement given at the beginning of this post should be replaced by something like this:

Declare pi_on_3 as Double Precision
pi_on_3 = HEX2DP("3FF0C152382D7366")

where HEX2DP calls the function that converts the 16 hexadecimal figures into a DP number. The code of this function depends on the programming language as well as on the hardware platform.

How to specify constants accurately to DP precision?

Posted in Computer, Mathematics, Programming Languages, Software on February 7, 2012 by giorgiomcmlx

I like code that is highly portable even from one programming language to another. For widely used languages, there are FOSS parsers that manage the translation of the code’s syntax. However, when it comes to specifying the value of a constant to DP precision, a variety of problems may arise. For example, the software architects might have decided not to allow you to specify as many digits as are necessary for a unique DP number or you might not know the decimal separator used at execution time.

Imho, the most reliable way to exchange DP numbers across all kinds of borders is to store the DP number as a sequence of 16 hex symbols together with the endianness used for encoding. Obviously, this requires writing as well as testing routines for encoding and decoding, resp., in all programming languages of interest. Moreover, usually the code of high-level programming languages is mainly designed for readability by humans. Most programmers will realise that the sequence "3.14159265358979324" specifies Pi, but I doubt that many of them will notice "LEDP: 40 09 21 FB 54 44 2D 18" (where LE stands in for little endian) has exactly the same meaning.

When looking for an alternative that is highly portable and readable as well, I came across the method of rational approximation, which is valuable particularly in the regime 1E-3 to 1E+3. Usually, a product or ratio of two n-digit numbers has got 2·n significant digits. 17 digits are necessary to specify a unique DP number. Thus, it might be expected that it is possible to uniquely approximate a DP number by a ratio of two integer numbers with at maximum 9 digits, at least, if its magnitude is not too different from one. The following figure demonstrates how the ratios of two integers approximate Pi the closer the more significant digits are used.

rational approximation to Pi

The ratio of the two integers having the smallest amount of significant figures yet exactly reproducing the DP representation of Pi is found to be 245850922 / 78256779 . The figure also gives the notation of the ratios as continued fractions in standard (only positive coefficients) as well as extended (negative coefficients allowed, too) notation. The deviations that can be observed between the blue and red points, resp., in the regime beyond the DP accuracy limit just show the effect of this limit on calculations. Only the values obtained by using Maxima’s data type "bigfloat" are correct.

Concluding, pseudo-code statements like

Double Precision Constant Pi = CDBLE("245850922")/CDBLE("78256779");

are my favourites in the future. It will be interesting for me to observe how long faulty statements like this one

var pi = 3.14159265358979;

survive in documents found on the web. That statement does not yield Pi but a value corresponding to "LEDP: 40 09 21 FB 54 44 2D 11" that is seven ULPs smaller than Pi!

Decimal significant figures of IEEE754 DP numbers

Posted in ArPrec, binary, Computer, decimal, Fortran, gnumeric, Mathematics, Microsoft Excel, multi-precision calculations, Number processing, Number systems, Programming Languages, Python, Software, Software tools, Spreadsheet, VBA, xNumbers on January 27, 2012 by giorgiomcmlx

The amount of decimal figures used to show a binary DP number in decimal floating-point format varies among the spreadsheet programs as well as programming languages I use. For example, the following table shows the default amount of decimal figures used when displaying the square root of two in decimal notation. The abbreviations are explained at the end of this post.

sqrt(2): square root of two ( programming languge || spreadsheet software )

30: 1.41421356237309504880168872421b0 (MXM fpprec:30;sqrt(2b0);)
18: 1.41421356237309515 (F77RM for DOS as of 1992)
17: 1.4142135623730951 (C, F77OW, F99/95G, JS, PY || GNM )
16: 1.414213562373095 (MXM ev(sqrt(2),numer);)
15: 1.41421356237310 ( || MSXL, LOC)
12: 1.41421356237 (F95S)

Obviously, there is a contradiction between the numbers shown in the first and second lines, resp. The reason for that is Maxima not using the IEEE 754 DP number format. Thus, the first line shows the mathematically correct value of sqrt(2) rounded to 30 decimal significant figures (SFs), whereas all the other lines show rounded values of the very decimal number that represents the DP approximation to sqrt(2). Both the decimal and binary fractions corresponding to the exact value of sqrt(2) have infinitely many SFs, because it is an irrational number. According to the IEEE754 specification, the binary DP representation of sqrt(2) comprises at maximum 53 binary SFs. Thus, it is a rational number in the binary system. All DP numbers correspond to rational decimal numbers. This is easily seen when writing the sum of at maximum 53 different powers of two (Po2) as a product of a factor times the Po2 corresponding to the 53rd bit. If that Po2 is non-negative, a product of two integers is obtained; if it is negative, the result is a fraction of two integers. Either form describes a rational decimal number. The remaining question to answer is how many decimal SFs this number has.
If all the Po2 values of a DP number are non-negative, i.e. the leading hidden bit corresponds to a value of Po2 ≥ 52, we might expect a scaling with a factor of log10(2) = 0.30103, because that is the position of the leading decimal figure in decimal columnar number notation. The trailing figure is always found on position 1, because its value cycles through the sequence 1&rarrow;cycle(2&rarrow;4&rarrow;8&rarrow;6&rarrow;). If all the Po2 values are negative, i.e. the leading hidden bit corresponds to a Po2 < 0, the position of the leading decimal figure scales like -log10(2), and the position of the trailing decimal figure like -1, so that the number of SFs in this case can be expected to scale like -1 – (-0.30103) = -0.69897. The following figures show the practically determined amount of decimal SFs of DP numbers for two different values of the amount of binary SFs, i.e. 1 and 53, resp.

decimal significant figures of DP numbers; overview
decimal significant figures of DP numbers; details near leading power of two equals zero

A simple linear regression yields the following relations between the leading Po2 (LPo2) of a DP number and the amount of its decimal SFs:

pure power of two, only hidden bit set
SF_01Po2(LPo2≥0) = + 0.301023(31) ⋅ LPo2 + 0.504(18)
SF_01Po2(LPo2≤0) = - 0.698923(31) ⋅ LPo2 + 0.516(19)

full 53-bit resolution
SF_53Po2(LPo2≥52) = + 0.301034(33) ⋅ LPo2 + 0.496(20)
SF_53Po2(LPo2≤52) = - 0.698963(28) ⋅ LPo2 + 52.504(16)

Of course, the fractional values resulting from these calculations have to be rounded to an integer value and two has to be added for safety (if I will find the time I will plot the confidence intervals of these regression lines…). The values of scaling factors found in case of 53-bit resolution agree closely with our expectations. The smallest normalised DP numbers (corresponding to 53 binary SFs) can be correctly displayed in the decimal system only if the really huge number of 767 (yes, seven hundred and sixty seven) decimal places are used! If fixed point notation is to be used, the amount of decimal places even increases up to a value of 1022 + 52 + 2 + 1 = 1077 columns inclusive sign noted in the first and leading zero and decimal separator noted in the second and third columns, resp.

The graph giving the details of the relations near LPo2 = 0 shows that the minimum amount of decimal SFs needed to exactly display a 53-bit DP number is 16. Thus, displaying at maximum only 15 decimal figures like many popular spreadsheet software do prevents any 53-bit DP number from being displayed correctly in the decimal system! On the other hand, because the deviation between an arbitrary decimal fraction and the value of its DP approximation on average starts on the 16-th decimal SF, it does not make much sense to use more than 18 decimal SFs when displaying the final result of calculations performed with DP numbers.
However, if some intermediate results seem strange and are to be further analysed, extended-precision frameworks like xNumbers or ArPrec are mandatory in order to deal with the huge amounts of digits that are likely to occur. Alternatively, you may decide to sit down together with your children under the Christmas tree and write add-with-carry as well as subtract-with-borrow codes that can handle up to 1500 decimal figures. If you manage to obey the KISS principle during coding, your work will be highly portable to any programming language, but that’s another story.

List of abbreviations with download links, if applicable.
CAS: computer algebra system

FOSS: free and open source software

MXM: Maxima, a FOSS CAS

C: name of a programming language

gcc: gnu C compiler

F: abbreviation for FORTRAN, the name of a programming language

F77OW: OpenWatcam Fortran 77

F77RM: Ryan-McFarland Fortran 77 for DOS as of 1987, malware????legal????

F99/95G: the Fortran 90/95/99/… front end for the GNU C compiler

F95S: Salford® FTN95 compiler

F99I: Intel® Fortran compiler for Linux

JS: JavaScript(MS: JScript), name of a scripting language targeting at web browsers

PY: Python, name of a programming language

GNM: gnumeric spreadsheet software

LOC: LibreOffice Calc spreadsheet software

MSXL: Microsoft® Excel®

Update of the DBL – HEX VBA library

Posted in binary, Computer, Fortran, hexadecimal, Microsoft Excel, Number systems, Programming Languages, Software, Software tools, Spreadsheet, VBA, xNumbers on October 23, 2011 by giorgiomcmlx

I’ve just added some VBA functions to the DBL to HEX conversion library. There are two important new functions: xNum2DblHex as well as HEX2DBL.

The VBA function xNum2DblHex yields the hex representation of the very IEEE754DP floating-point number nearest to an extended number. The rounding convention used in this function is to round up always if the 54th mantissa bit, which is the most significant bit not being encoded in the IEEE754DP standard, would take a value of 1. Extended numbers do not only arise from multi-precision tools like xNumbers, they also occur when importing numbers with 18 significant digits from a FORTRAN output file into a spreadsheet, e.g. As I originally wrote the DBL2HEX function in FORTRAN (this language has a standard ‘equivalence’ statement), it is now possible to check whether Fortran DP quantities (declared by real*8 statements, e.g.) have the same binary representation after being imported into XLS as they originally had in FORTRAN. This is an important prerequisite for comparing the performance of an identical algorithm that is executed in different environments like XLS, VBA, PY, FOR, … in a spreadsheet environment.

The VBA function HEX2DBL converts any valid hex representation of an IEEE754DP number into a VBA double. In order to accomplish that task it harnesses the same trick as its inverse function DBL2HEX does, i.e., calling the VBA Win32/64 API function RtlMoveMemory.

You may download the complete VBA conversion library if you obey its GPL V3 license.

One of the next posts will be concerned with answering the question of how to ‘optimally’ calculate a quantity like y = 1 – x*x when x is close to 1.

Mind the gap when using IEEE754 DP numbers

Posted in binary, Computer, Mathematics, Microsoft Excel, Number processing, Number systems, Programming Languages, Software, Spreadsheet, Uncategorized, VBA on September 11, 2011 by giorgiomcmlx

The gap between two adjacent IEEE754 DP numbers is more or less proportional to the magnitude of these numbers. The overwhelming majority of DP numbers do not exactly equal a power of 2. In this case, the gaps between such a DP number and its direct neighbours on both the smaller as well as the bigger sides have a magnitude of about 1.11E-16 times the smallest power of 2 that is larger than the number. For example, consider the DP number 1.5 (hex 3F F8 00 00 00 00 00 00). The smallest power of 2 which is bigger than 1.5 is 2^1=2. Thus, the biggest DP number being smaller than 1.5 is about 1.5 – 2⋅1.11E-16 = 1,499999999999999778 (exactly hex 3F F7 FF FF FF FF FF FF), and the smallest DP number being bigger than 1.5 is about 1.5 + 2⋅1.11E-16 = 1,500000000000000222 (exactly hex 3F F8 00 00 00 00 00 01).

At all exact powers of 2 the gap between adjacent DP numbers doubles in size. This means the question “does x equal 1.0” will yield the answer “true” for any real x in the range 1.0-5.55E-17 <= x < 1.0+1.11E-16, if rounding to the nearest is used in order to convert an arbitrary real number x into an IEEE754 DP number. Please note that in this special case, the interval limits are located asymmetrically around 1.0! That happens for all exact powers of 2. The size of the gap between two adjacent DP numbers in the interval from 1 to 2, i.e. 2.22E-16, is called machine epsilon.

The following figures show the discretisation effect occurring when using DP numbers. In particular, the region around 1 is explored using cell formulae in the spreadsheet application Microsoft®™ Excel®™ 2002 as well as using the VBA 6®™ built-in programming language. The FOSS add-in xNumbers is used to show how the results look like if the calculations are carried out with considerably higher numerical accuracy, i.e., 30 instead of the 17 decimal places in case of the IEEE754 DP numbers. The vertical scaling had to be shifted downwards by 1 in the figures, because XL cannot deal with axis limits of 1+/-2E-15, for example. The discretisation effect occurs in the first step of all shown calculations, i.e., when the relatively small quantity x is added to 1, and is thus not influenced by the final shift process.

machine epsilon in MS Excel and VBA

There is a remarkable difference between the results obtained using XL directly on the one hand and those obtained by invoking the programming language VBA 6 from XL on the other hand. MS seems to have implemented some “politically correct mathematics” into XL here in order to maximise the interval for which the answer to “does x equal 1” yields true… The results of the XL cell formulae are clearly not IEEE754 compliant.

details of machine epsilon in MS Excel and VBA

DP numbers are most dense around 0. Neglecting the special concept of “denormalised numbers”, the absolute value of the smallest numbers different from 0 is 2^(-1022), which is about 2.225E-308. The absolute value of the biggest DP number is 2^1023+2^1022+…+2^972+2^971 (hex
7F EF FF FF FF FF FF FF), which nearly equals 1.7976931348623157E+308. The adjacent DP number on its lower side, i.e. 1.7976931348623155E+308, is smaller by an amount of 2^971, which is about 2.00E+292. In the biggest dyad of the DP number system, the gap of 2.00E+292 between two adjacent DP numbers is huge in absolute terms, but still small in relative terms: 2.00E+292/1,79769E+308 = 1.11E-16, just as it was to be expected.

I have “translated” a routine from FORTRAN into a VBA array function that gives you all relevant information about the numerical characteristics of your computer. You may download the VBA6 code and use it if you follow the GPL v3 (included).

The size of the DP gap is usually small even in absolute terms for all numbers humans deal with in their every day or business life. However, it should be kept in mind that the DP gap around a number is roughly proportional to its magnitude. For example, in the range between 2^52 and 2^53, which equals about 4.5036E+15 to 9.0072E+15, only the integers (gap is 2^0=1) are exactly representable by DP numbers. In the next bigger dyad from 9.0072E+15 to 1.80144E+16, only the even integers (gap is 2^1=2) are exactly representable by DP numbers, and so on. Of course, this is only a consequence of limiting the mantissa of IEEE DP numbers 53 bits or roughly 17 significant decimal digits.

Even in science and engineering, numbers are very rarely known with an accuracy of more than 10 significant decimal digits. In fact, even most scientific measurements typically yield an accuracy of less than 5 significant decimal digits. This is why usually virtually nobody is bothered by neither the discreteness of the set of DP numbers in the continuum of the real numbers nor their limited accuracy of about 17 decimal digits. Consequently, people are quite surprised if even very simple calculations fail due to an effect called catastrophic cancellation. This will be the subject of a follow-up post.

Do you really know how your spreadsheet works?

Posted in binary, Computer, decimal, Mathematics, Number processing, Number systems, Software, Spreadsheet on August 27, 2011 by giorgiomcmlx

Many people all over the world use spreadsheet calculation software on their computers. Only few of them, though, seem to be aware what is really happening beneath the GUI. Many effects that seem strange at a first glance can be understood when considering how most recent spreadsheet programs process numbers. It is reasonable to assume that at least the programming language the spreadsheet software is built in uses a number format called Double Precision. This has two important consequences: firstly, such a software (and the computer it is running on as well) represents numbers using the binary numeral system ( wikipedia has a more advanced description), and secondly, only a limited amount of Double Precision (DP) numbers is available to represent any number that is entered into a cell, e.g. The distance between a DP number and its two direct neighbours nearly always equals 2^-52 times the leading power of 2 that occurs in that number. The situation is different for the few numbers exactly equalling a power of 2: because the leading power of 2 changes at these numbers, the distance to their next smaller DP neighbour is half the distance to their next larger DP neighbour. The DP numbers are most dense around zero and most sparse near their maximum, which is about 10^307.

Consequently, software internally using DP numbers can only exactly represent a decimal input if that equals a binary fraction and as well contains a minimum and maximum power of 2 that do not differ by more than 52. For example, 4, 1, 0.5, 0.25, 1.25, and 0.6125 are exactly representable in the binary as well as decimal numeral system. But what about numbers like sqrt(2), 0.9, 0.1 &hell;? They are rounded to a DP number using a standardised algorithm. Thus, there is a double dilemma with the doubles. Firstly, most of the numbers humans want to work with cannot be exactly represented by a DP number but only by what is called a string. Secondly, most of the usual DP numbers a computer can deal with would require at maximum 751 decimal digits to represent them exactly. Fortunately, in case of IEEE 754 DP numbers 17 decimal digits uniquely identify a DP number (18, if false rounding is to be avoided).

Usually, spreadsheet software displays only 15 digits. But then you cannot spot that the result of 0.5-0.4 is rounded to a different DP number than the direct input of 0.1, as demonstrated in a previous post.
The joint consequences of different spacing of DP neighbours as well as rounding to a DP can lead to considerable differences between the results of mathematically equivalent formulae used to calculate a result. The following graph shows the DP differences between the DP results of y1=1-x⋅x and y2=(1-x)⋅(1+x) for 30,000 different values of DP x between (0;1).

Example how DP numbers influence a calculation

The absolute value of the difference does not exceed 1.1⋅10^-16, but the smaller the results for y1 as well as y2 get as x approaches 1, the more different DP numbers exist to represent the difference between y1 and y2. The amount of available levels that have a non-zero value doubles at all the x values where y decreases below a power of 2, i.e., at about x≈0.707 (x>1/sqrt(2), y<1/2), x≈0.866 (x>sqrt(3)/2, y<1/4), x≈0.935 (x>sqrt(7/8), y<1/8) …

Example how DP numbers influence a calculation

But unfortunately, even for x close to 1, the absolute values of the differences still nearly fill the interval between +/-6⋅10^-17, though the value of y gets very small there. Thus, the accuracy of the results is questionable and moreover, the question arises which of both methods is more reliable. These aspects will be treated in a follow-up post.