What I learned from Floating Points

OK, I confess: I didn’t read the legendary What Every Computer Scientist Should Know About Floating-Point Arithmetic by Goldberg. It’s a must-read, but sometimes you just need to find an answer rather than master a topic like floating point. Here, I’ll list the things I learned from working with FPs, in case you are looking for an answer too.

A float is less than FLT_MIN, or a double is less than DBL_MIN. How is that possible?

Looking for an underflow error? How would you decide whether underflow has occurred? Underflow means that the computed number is smaller in magnitude than the smallest number the type can represent. The documentation of strtod states the following:

If the correct value would cause underflow, the function returns a value whose magnitude is no greater than the smallest normalized positive number and sets errno to ERANGE.

We expect a number which is less than or equal to FLT_MIN or DBL_MIN. Let’s check what exactly those macros are.

FLT_MIN : The value of this macro is the minimum normalized positive floating point number that is representable in type float. It is supposed to be no more than 1E-37. Source.

DBL_MIN : Same as FLT_MIN, for double.
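As a quick sanity check of those definitions, the sketch below compares the macros with std::numeric_limits<T>::min(), which reports the same smallest normalized positive value in C++, and checks the 1E-37 bound. The function names are made up for illustration.

```cpp
#include <cfloat>
#include <limits>

// Hypothetical helper: true when the C macros agree with std::numeric_limits<T>::min().
bool minMacrosMatchLimits() {
    return FLT_MIN == std::numeric_limits<float>::min()
        && DBL_MIN == std::numeric_limits<double>::min();
}

// Hypothetical helper: true when both minima respect the "no more than 1E-37" bound.
bool minMacrosWithinBound() {
    return FLT_MIN <= 1e-37 && DBL_MIN <= 1e-37;
}
```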

In that case, you may ask, what is a normalized value? A normalized number has no leading zeros in its significand: the first digit is nonzero. A subnormal number keeps the exponent at its minimum and allows leading zeros instead. In decimal notation, the idea looks like this:

Normalized : 1.17549e-38 (FLT_MIN)
Denormalized : 0.00117549e-35 (same number, written with leading zeros)

As you can see, even though the decimal exponent of FLT_MIN is -38, we can write smaller numbers by allowing subnormal values. So when a string represents a float smaller than FLT_MIN, the value may still be representable, and the converted result can be smaller than FLT_MIN. Let’s look at the code:

#include <cerrno>
#include <cfloat>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <sstream>
using namespace std;

// Parse a string just below FLT_MIN and report what strtof() produced.
void smallerF()
{
  cout << "MIN FLOAT test" << endl;
  errno = 0;  // clear errno so a stale ERANGE is not misread
  stringstream ss;
  ss << "1.17549430e-39";
  cout << "Min possible :\t" << FLT_MIN << endl;
  cout << "str: \t\t" << ss.str() << endl;
  float converted = strtof(ss.str().c_str(), nullptr);
  cout << "Error: \t\t" << (errno == ERANGE ? "Range error" : "No error") << endl;
  cout << "Converted: \t" << converted << endl;
  cout << "Normal :\t" << (isnormal(converted) ? "True" : "False") << endl;
  cout << "Comparison: \t" << (converted < FLT_MIN ? "smaller" : "greater or equal") << endl;
}

The output:


Min possible :  1.17549e-038
str:            1.17549430e-39
Error:          Range error
Converted:      1.17549e-039
Normal :        True
Comparison:     smaller

But wait, how? The number is smaller than FLT_MIN in every respect, its exponent is smaller than FLT_MIN_10_EXP, and it is still normal? Let’s take a closer look at both numbers. My favorite tool for examining floats is IEEE-754 Analysis by Michael Lubow.

Let’s check how FLT_MIN is represented in memory.

| Decimal | Hex | Exponent (binary) | Exponent (decimal) | Significand (decimal) |
|---|---|---|---|---|
| 1.17549e-037 | 0x021FFFD9 | 00000100 | -123 | 1.24999535084 |
| 1.17549e-038 | 0x007FFFE0 | 00000000 | -126 | 0.99999618530 |
| 1.17549e-039 | 0x000CCCC9 | 00000000 | -126 | 0.19999921322 |
| 1.17549e-040 | 0x000147AD | 00000000 | -126 | 0.01999986172 |

See the pattern? The exponent is at its minimum. We can’t lower it any further; that is the limit behind FLT_MIN_10_EXP. (Why is it -126 and not -38? Because the exponent is base 2: 2^-126 ≈ 1.175494e-38.) However, something else changes: the significand, which drops below 1.
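To reproduce this kind of breakdown without an online tool, here is a minimal sketch that splits a float into its raw fields. It assumes float is a 32-bit IEEE-754 type; the FloatFields struct and both function names are hypothetical, introduced only for this example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical struct holding the raw IEEE-754 single-precision fields.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits
};

// Split a float into its raw fields. Assumes a 32-bit IEEE-754 float.
FloatFields decode(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the bytes instead of casting pointers
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// An all-zero exponent field with a nonzero fraction marks a subnormal.
bool isSubnormalBits(float value) {
    const FloatFields f = decode(value);
    return f.exponent == 0 && f.fraction != 0;
}
```

An exponent field of 00000000, as in the last three rows of the table, is exactly what isSubnormalBits() looks for.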

Well, that looks subnormal. Why didn’t isnormal() return false, then? Good question, and I don’t have the answer. From what I see in my tests, isnormal() doesn’t return false until the number reaches 1.17549430e-46.
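One possible explanation, offered as an assumption and not verified on the setup above: if isnormal() ends up classifying the value as a double instead of a float, a float subnormal looks normal, because double has a much wider exponent range. That would also fit the 1.17549430e-46 observation, since that string is below the smallest float subnormal and converts to zero. The sketch below only demonstrates the effect of the widening; the helper names are made up for illustration.

```cpp
#include <cmath>
#include <cstdlib>

// Hypothetical helper: convert a decimal string with strtof, as the test function above does.
float parseFloat(const char* s) { return std::strtof(s, nullptr); }

// Hypothetical helper: classify the value as the float it is.
bool normalAsFloat(float x) { return std::isnormal(x); }

// Hypothetical helper: classify the same value after widening it to double.
bool normalAsDouble(float x) { return std::isnormal(static_cast<double>(x)); }
```

With these helpers, "1.17549430e-39" is not normal as a float but is normal once widened to double, while "1.17549430e-46" becomes zero and is not normal either way.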

A word about errno for underflow: the standard library is not obligated to set errno to ERANGE when underflow occurs. Do not count on it if you need to support multiple compilers.
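If you need a check that does not depend on errno, one option is to parse the text as a double first and compare it with FLT_MIN yourself. This is only a sketch: floatUnderflows is a hypothetical name, it only covers float, and it relies on double having a wider range, so inputs that are too small even for double come back as zero and are not flagged. Values within rounding distance of FLT_MIN may also be flagged even though they round up to FLT_MIN as a float.

```cpp
#include <cfloat>
#include <cmath>
#include <cstdlib>

// Hypothetical helper: true when the text holds a nonzero value whose
// magnitude is below the smallest normalized float.
bool floatUnderflows(const char* text) {
    const double wide = std::strtod(text, nullptr);  // parse with the wider type
    return wide != 0.0 && std::fabs(wide) < FLT_MIN;
}
```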
