This text is published under the Creative Commons’ non-commercial share-alike (CC-NC-SA) license V3.0. It was created by GiorgioMCMLX at carolomeetsbarolo on wordpress.com in October 2012.

In source codes concerned with numerical subjects, usually the values of all frequently used expressions that do not change during execution time are precomputed externally and assigned to variables as a sequence of decimal figures. For example, in some fictious programming language that lacks an internal constant corresponding to the mathematical constant π, such a statement could look like

`Declare Constant pi_on_3 as Double Precision = 1.04719755119659775E+00_8B`

The mathematically exact value of π/3 = 1.047197551196597746154214461… comprises incountably many decimal figures. Thus, it was decided to round it to a precision of 18 decimal significant figures in the so-called scientific notation. The variable pi_on_3 is defined to be of the type double precision (DP) binary floating-point (FP) that uses 8 bytes = 64 bits for each value: 1 sign bit, 11 exponent bits, and 52 plus 1 implicit significand bits. This is the most widespread FP type used in numerical code nowadays. At a first glance, the declaration seems to be quite natural, but it neglects the very property of π that led to its invention: π is a so-called irrational number; moreover, it even belongs to the set of transcendental numbers among them. Consequently, π and all of its multiples definitely cannot be represented exactly by any of the floating-point (FP) variable types used in computers, because all their values are binary rational numbers that all correspond to decimal rational numbers. Nearly all mathematical constants belong at least to the group of irrational numbers. Otherwise, their invention would not have made much sense…

From the point of view of a compiler, 1.04719755119659775E+00 is not a number but a string consisting of decimal characters. The transformation of such numeric strings found in code into values of numerical variables is an example of what is called parsing. When a compiler meets a statement like that given above it will call its parser in order to determine a DP number that will substitute the value of the decimal string in the calculations. Parsing of decimals will almost always cause round-off errors (a special form of the more general quantisation errors), because the rational DP numbers form a sparse set of isolated points within the continuum of all the real numbers. As a consequence, all the incountably many real numbers comprised by a finite interval need to be parsed onto a single DP number.

Usually, it is tried to parse a numerical string to the nearest DP. Then, the magnitude of the quantisation error is somewhere between zero and half the distance between a pair of consecutive DP numbers. This full distance is called an ULP (unit in the last place). It is a relative measure: it doubles its absolute value at all the DP numbers that are exact powers of two. Each DP number DPval represents the interval of real numbers from DPval-0.5ULPs to DPval+0.5ULPs. Disregarding exact powers of two, the absolute distance between a pair of consecutive DP numbers is equal to the length of the interval that both DP numbers stand in for. Parsing a numerical string to the nearest DP thus

limits the unavoidable quantisation error to ±0.5ULPs. The absolute value of an ULP depends on the exponent of the DP number it is associated with: 1 ULP = 2^(LPo2-52), where LPo2 denotes the true leading power of two that is encoded in biased form in the exponent. LPo2 equals zero for all the DP numbers found in the interval [1;2); in this range, 1 ULP = 1/2^52 ≅ 2.22…E-16. This special absolute value is called machine epsilon.

Consequently, all exact binary DP numbers can be uniquely identified if at least 17 decimal significant figures (SFs) are provided. Unfortunately, this theorem seems to be widely misunderstood. In particular, it does not mean that any given numerical string consisting of 17 decimal SFs uniquely identifies a binary DP number. The limit of 17 decimal SFs only holds for parsing numerical strings the values of which correspond to exact DPs, for example when printing the results of DP calculations. In the other direction, for example when parsing numerical strings to DP numbers in order to specify the input of DP calculations, at maximum about 770 (seven-seven-zero, no typo!) decimal SFs might be needed. Thus, the fictious programmer, who might even have thought that specifying 18 decimal SFs is more than sufficient, might be terribly wrong.

For any parsing-to-nearest-DP process to succeed, it is necessary that the value of a given numerical string can be discriminated from the border between the quantisation intervals of a pair of consecutive DPs, which is called a "tie". Obviously, the rule "round/parse to nearest" has to be complemented by a tie-breaking instruction. Usually, this is "ties to even", and means that in case of a tie the DP number having an even binary representation is chosen, i.e., the trailing bit of which is zero. Ties formally involve just one additional bit as compared to the pair of DP numbers they separate; if the power of two that bit corresponds to is negative, this one extra bit exactly requires one extra decimal SF. The approximate relation between the total amount of decimal significant figures SF_dec of an exact DP and its leading power of two LPo2 has already been estimated my means of linear regression in a previous post:

`SF_dec ≅ - 0,699*LPo2 + 52,5 ; LPo2 ≤ 52`

SF_dec ≅ + 0,301*LPo2 + 0,496 ; LPo2 ≥ 52

For the most negative normalised value of LPo2, i.e. -1022, this formula predicts SF_dec ≅ 767. Consequently, at maximum about 770 decimal SFs assure that any numerical string having a value within the range of normalised DP numbers can definitely be parsed to the nearest DP number. Unfortunately, there is no way of predicting the amount of SF_dec that are really needed for a given decimal string without parsing it. The maximum amount will only be needed if the number is extremely close to a tie. On the other hand, if the number is close to an exact DP, 17 digits would do the job. So the scaling law that governs the amount of SF_dec actually needed for parsing-to-nearest-DP of a given decimal number needs to worked out.

The distance between a given decimal number and the nearest DP tie can be calculated harnessing multiple-precision software. All of the calculations presented here were done using the FOSS XL-Addin called xNumbers. A simple VBA-script was used to output the properties of the multi-precision results obtained when dividing π by all integers between 1 and 2E+7. In order to assure that the absolute value of the relative measure ULP remained constant, all results were scaled by a power of two so that all the final results are found in the range between one and two. This kind of scaling does not change the relative position of a decimal number with respect to the quantisation interval of the nearest DP, because these intervals just scale likewise! The scaled results were then ordered by decreasing distance to the nearest DP tie. In xNumbers, there is a built-in function "dgmat" that outputs the amount of matching digits of two xNs. Adding one to the result of "dgmat" thus yields a good estimate of the amount of significant decimal figures that would allow for correctly parsing each of the scaled results to the DP nearest to it. The following table summarises some of the results.

It can be seen that 19 decimal SFs enable parsing π/3 to the nearest DP, whereas for π/59425 23 SFs are needed. Of course, you should be aware that a "stupid" parser might use not all the figures and thus yield an incorrect result! In order to find the scaling law that relates the distance dist_tie between a decimal number and its nearest DP tie to the amount SF_dec of decimal SFs that needs to be specified for enabling parsing to nearest DP, the following figure depicts the values found in the last two columns of the preceding table (orange rhombuses). The normalised quantity dist_tie takes values between zero (exact tie) and one (exact DP); its values are computed by removing the sign from the non-normalised values as well as by dividing them by 0.5ULPs.

The line drawn in blue colour corresponds to the finally deduced scaling law:

`SF_dec ≅ 17 - log10(dist_tie)`

Summarising, the amount of decimal significant figures to which a mathematical constant has to be specified in order to enable parsing-to-the-nearest-DP depends on the distance between this constant and its nearest DP tie. Unfortunately, this distance is not known until the constant has been parsed, so that this has to be done in the multi-precision environment in which the constant is calculated! This solution is mathematically correct but practically unsafe and stupid as well, because all the parsing work would have to be done at least twice: once when writing the code and once again during code execution. A simple improvement is obvious though: instead of using the probably very long numerical string corresponding to the mathematically exact value of a constant, simply the numerical string comprising the 17 decimal significant figures of the nearest DP is to be used.

This method can still be greatly improved by avoiding any parsing at run-time, which can be achieved by reading/writing bit patterns directly from/to the memory of a computer. All major programming languages provide means for this kind of manipulation. It usually converts a DP number into 16 hexadecimal figures of which each corresponds to 4 bits, and vice versa. Unfortunately, two different schemes co-exist that define the ordering of the 8 bytes of a DP number: little-endian and big-endian byte order. Within a byte, the bits always follow the big-endian scheme. Thus, most humans like to read these

hexadecimal numbers in big-endian notation, and that is why it is used here. Of course, the endianness of stored hexadecimal strings should be fixed (e.g., to big-endian), and the conversion routines should account for the endianness of the machine they are executed on. Concluding, the statement given at the beginning of this post should be replaced by something like this:

`Declare pi_on_3 as Double Precision`

pi_on_3 = HEX2DP("3FF0C152382D7366")

where HEX2DP calls the function that converts the 16 hexadecimal figures into a DP number. The code of this function depends on the programming language as well as on the hardware platform.