Quad precision in C – make it a double double!

Fortran provides quad-precision floating point numbers in certain implementations. Declaring a number in the following manner, will create a variable with IEEE 128-bit quad precision (the 16 implies 16 bytes):

real (kind=16) :: quad_x
real (kind=8)  :: double_x
real (kind=4)  :: single_x

In order to see its affect, consider a piece of Fortran code which calculates π using the equation of Leibniz. Run with double precision provides the following output:

How many terms to calculate pi to? 10000
The value of pi is: 3.1416925847879611

Run with quad precision on the other hand:

How many terms to calculate pi to? 10000
The value of pi is: 3.14169258478796109557151794433593750

In C, 128 bits can be accessed via the type __float128. This may not be availabe in all compilers, but GCC (4.6 and newer) does seem to  support it – however, according to the GCC specs, it is said to be “an order of magnitude or two slower” than double precision. To make it even *easier* to use, they provide a bunch of functions, like quadmath_snprintf which converts the number into a string so it can be output. But why make it so difficult? Fortran deals with every variable output in the same way (i.e. transparently).

Here’s what to do – call it a double double, and make the conversion specifier for I/O %TH.

Easy eh?

Posted in: C

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.