how PPC Cpu handle "double" and "float" in terms of speed


kas1e

Posted on: 2018/11/20 13:11 #1

Home away from home

While porting some stuff from the Pandora, I have a question: how do our PPCs handle "double" and "float" in terms of speed? For example, on ARM floats are much faster than doubles (especially on old architectures like the Cortex-A8 that equips the Pandora). What is the case with our CPUs? Maybe there are also some CPU-specific GCC options which can make things better for, let's say, the X1000 and X5000 CPUs?

Join us to improve dopus5!
AmigaOS4 on youtube

Deniil

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/20 15:41 #2

Quite a regular

Double is faster! The CPU can only do double, so when dealing with floats it first needs to convert them internally.

I did some tests long ago on a 603e and found that particular test to be something like 5-10% faster with double, and that's with a 32-bit memory bus!!

Software developer for Amiga OS3 and OS4.
Develops for OnyxSoft and the Amiga using E and C and occasionally C++

Daytona675x

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/21 8:22 #3

Not too shy to talk

@kas1e
Roughly speaking: for the CPU it usually doesn't matter much whether you use double or single precision. Single-precision values are converted to their double representation for internal use when they are loaded into or stored from FPU registers, for free. What's not free are explicit conversion instructions.
On Tabor, for example, it's different: there the FPU internally distinguishes between single and double precision and also offers simple SIMD capabilities for two single-precision values. AltiVec, by the way, wants single precision too.
For many instructions you can specify whether to use single or double precision (e.g. fadd vs fadds). As far as I know, some instructions are a bit faster in single-precision mode.
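To make the point about explicit conversions concrete, here is a minimal sketch in C (the function names are hypothetical). A double literal in a float expression forces a promotion to double and a rounding back to float, which on classic PowerPC FPUs typically shows up as a separate round-to-single instruction. An optimising compiler may remove it in simple cases like this one, so check the generated assembly before drawing conclusions.

```c
/* Sketch: the same scaling written two ways. Names are hypothetical. */

/* 0.5 is a double literal, so x is promoted to double, multiplied in
   double precision, and the result is rounded back to float on return. */
float halve_mixed(float x)
{
    return x * 0.5;
}

/* 0.5f keeps the whole expression in single precision. */
float halve_single(float x)
{
    return x * 0.5f;
}
```

Both functions return the same value here because multiplying by 0.5 is exact; the difference, if any, is only in the instructions emitted.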

Anyway, in real world you almost always want to follow this rule of thumb:
use the lowest precision required to do the task.

Double precision puts twice the pressure on caches and memory, which is the main reason why using single precision most often results in (significantly) higher performance on our systems.
I can definitely tell you: if you manage to kill your cache, your performance is gone, you won't get it back, simple as that - and the other way around: you can do tons of calculations on your data if you manage to have your stuff in the cache in time.
Therefore usually single precision floats are one of your best friends when it's about performance.
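The cache argument can be put into numbers with a small sketch. The vertex layouts and function names below are hypothetical and only compare the memory footprint of the same data in both precisions, assuming the usual 4-byte float and 8-byte double.

```c
#include <stddef.h>

/* Hypothetical vertex layouts, only to compare memory footprint. */
typedef struct { float  x, y, z; } VertexF;
typedef struct { double x, y, z; } VertexD;

/* Bytes needed for n vertices of each layout. */
size_t footprint_float(size_t n)  { return n * sizeof(VertexF); }
size_t footprint_double(size_t n) { return n * sizeof(VertexD); }
```

Under that assumption, 100000 vertices need 1.2 MB as floats and 2.4 MB as doubles, so the double version covers only half as many vertices per cache line and per cache.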

However, *don't* naively change a program's float / double usage unless you know exactly what you're doing. Although, for example, the average 3D game's vertex data and internal calculations are most often done with single-precision data, you shouldn't just blindly tell your compiler to compile everything for single precision! You may end up getting very subtle bugs.

The Vampire guys ran into such an issue recently (which was fixed quickly). They gave their FPU a slightly too low internal precision (which is somewhat equivalent to telling your compiler to always use single precision). While most things worked, the timing of certain demos just went bogus depending on your system time: the demo would run fine if the clock was set to 1970 and it would mess up when running in 2018. This is a phenomenon called "catastrophic cancellation", which may happen if you take two really big numbers which are close to each other, subtract one from the other, and falsely expect the result to still have any sort of precision or a value other than 0.

With a bit more precision those calculations with those input values were okay again.
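The effect is easy to reproduce in C. This is only an illustrative sketch with hypothetical function names, not the code of the affected demos; it assumes timestamps in seconds since 1970 and a frame step of 20 ms.

```c
/* Sketch: time deltas computed from large absolute timestamps. */

/* Single precision: near 1.5e9 adjacent floats are 128 apart,
   so a 20 ms step is lost as soon as the timestamps are stored. */
float delta_float(double t0, double t1)
{
    volatile float a = (float)t0;  /* volatile forces rounding to float */
    volatile float b = (float)t1;
    return b - a;
}

/* Double precision: adjacent values near 1.5e9 are about 2.4e-7 apart,
   so the 20 ms step survives. */
double delta_double(double t0, double t1)
{
    return t1 - t0;
}
```

With t0 = 1542758400.0 (late 2018) and t1 = t0 + 0.02, delta_float returns exactly 0 while delta_double returns roughly 0.02. With t0 = 0.0 (1970) both return roughly 0.02.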

So, to sum it up:
- use single prec whenever you can for best performance because of cache/memory.
- use doubles when you have to.
- don't change the float / double behaviour of other people's code unless you know exactly what you're doing.

@Deniil
Are you sure that you really measured the right thing (e.g. not some explicit conversions) and a real-world situation (e.g. not just some tight loop with a fixed 32 byte dataset)? Because that's really not the result one would expect.

Edited by Daytona675x on 2018/11/21 8:41:34

[Facebook] [YouTube Channel] [Atomic Bomberman Discord]

kas1e

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/21 8:45 #4

Home away from home

@Daniel
Thanks for the detailed answer!
It is for a game which, while 3D based, depends much more on the CPU than on the GPU. By default it builds with doubles, but it has an option (and has been tested with it) to be built with floats. On ARM that gives a good extra boost, so it should probably be OK for PPC as well then.


Daytona675x

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/21 8:56 #5

Not too shy to talk

@kas1e
If the game has a well-tested explicit single-precision code path, I strongly suggest: go for it.


kas1e

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/21 9:10 #6

Home away from home

@Daniel
Btw, I just made a simple test:


#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

// This cannot be larger because of the limited precision of the float data type
#define NUMLOOPS 67108860

// Elapsed time in seconds between two gettimeofday() samples
static double deltime(struct timeval *end, struct timeval *begin)
{
  double rv = (end->tv_usec - begin->tv_usec) * 1e-6;
  rv       += (end->tv_sec - begin->tv_sec);
  return rv;
}

// pi = 4 * (1/1 - 1/3 + 1/5 - 1/7 ...)
static float pi_float(void)
{
  float rv = 0;

  for (float i=1; i<NUMLOOPS; i += 4.0f) {
    rv += (1.0f / i) - (1.0f / (i+2.0f));
  }

  return 4.0f * rv;
}

static double pi_double(void)
{
  double rv = 0;

  for (double i=1; i<NUMLOOPS; i += 4.0) {
    rv += (1.0 / i) - (1.0 / (i+2.0));
  }

  return 4.0 * rv;
}

int main(void) {
  struct timeval startTime, endTime;

  gettimeofday(&startTime, NULL);
  float fpi = pi_float();
  gettimeofday(&endTime, NULL);
  printf("Float Pi generation took %0.3f s and yields %.9f\n",
         deltime(&endTime, &startTime), fpi);

  gettimeofday(&startTime, NULL);
  double dpi = pi_double();
  gettimeofday(&endTime, NULL);
  printf("Double Pi generation took %0.3f s and yields %.9f\n",
         deltime(&endTime, &startTime), dpi);

  return 0;
}

And this is what I get:

Quote:

4/0.RAM Disk:> double_float_tests2
Float Pi generation took 0.608 s and yields 3.141383648
Double Pi generation took 0.868 s and yields 3.141592624

4/0.RAM Disk:> double_float_tests2
Float Pi generation took 0.607 s and yields 3.141383648
Double Pi generation took 0.868 s and yields 3.141592624

4/0.RAM Disk:> double_float_tests2
Float Pi generation took 0.608 s and yields 3.141383648
Double Pi generation took 0.869 s and yields 3.141592624

4/0.RAM Disk:>

So float takes about 30% less time here?
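As a side note on the NUMLOOPS comment in the test: a float has a 24-bit significand, so from 2^26 = 67108864 upwards adjacent floats are 8 apart and a step of 4 no longer changes the loop counter. A minimal sketch to check this (the helper name is hypothetical):

```c
/* Returns 1 if adding step to x in single precision changes the value. */
int float_step_advances(float x, float step)
{
    volatile float next = x + step;  /* volatile forces rounding to float */
    return next != x;
}
```

For x = 67108860.0f and step = 4.0f the counter still advances. For x = 67108864.0f it does not, so a float loop with a larger limit would never terminate.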

@Deniil
Are you sure that on the 603e double was 5-10% faster than float? Maybe you meant it the other way round, floats being faster? :)


kas1e

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/21 9:30 #7

Home away from home

@Daniel
And in the game it gives ~4 fps more with floats in the end :)


yescop

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/21 10:17 #8

Just popping in

@kas1e
In your double function, you didn't cast the constants to double the way you did in the float function.
So I think there is an implicit conversion, and this costs time.
Change your double function to
rv += 2.0 / (i * (i + 2.0));
and see the difference compared with the float function and the old double function.
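The suggested rewrite can be sketched as a complete function. It uses the identity 1/i - 1/(i+2) = 2 / (i * (i + 2)); the function name and the limit parameter are hypothetical. Note that in C an unsuffixed literal such as 1.0 already has type double, so the original double loop contains no float-to-double conversions; any gain from this version would come from doing one division per iteration instead of two.

```c
/* Leibniz series with each pair of terms combined:
   1/i - 1/(i+2) = 2 / (i * (i + 2)), one division instead of two. */
double pi_double_combined(double limit)
{
    double rv = 0.0;

    for (double i = 1.0; i < limit; i += 4.0) {
        rv += 2.0 / (i * (i + 2.0));
    }

    return 4.0 * rv;
}
```

Calling pi_double_combined(67108860.0) runs over the same range as the test program and can be timed the same way with gettimeofday().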

Since a lot of months without an amiga to love, I was lost.

Now I feel happiness again with a Sam Flex 800 .

Deniil

Re: how PPC Cpu handle "double" and "float" in terms of speed

Posted on: 2018/11/22 9:40 #9

Quite a regular

Interesting!

I also expected float to be faster, mostly because of memory bandwidth but also because of the cache in my algorithm, and was surprised when double was faster.

I'll see if I can find those old test results again, and the algorithm. It was some audio processing routine of some sort.

Different CPUs are probably not equally efficient at converting float<->double internally. Some may do it in zero cycles (while loading); others need extra steps.

