With QEMU amigaone I get good results (slower than a real G4, but we know QEMU's FPU emulation is slow, and this benchmark combines that with memory access, which also gets slower for bigger blocks):
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 170922 microseconds.
(= 85461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2540.6     0.068012     0.062977     0.077582
Scale:           1009.5     0.163934     0.158498     0.172685
Add:             1217.1     0.208123     0.197197     0.223799
Triad:            965.6     0.261186     0.248550     0.281111
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
But with QEMU sam460ex something seems to be wrong:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 192721 microseconds.
(= 64240 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1672.3     0.098496     0.095678     0.109226
Scale:            771.9     0.212180     0.207291     0.219799
Add:              968.0     0.253143     0.247930     0.264263
Triad:            760.0     0.326087     0.315804     0.345664
-------------------------------------------------------------
Failed Validation on array a[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 1.153301e+12, AvgAbsErr: 1.141173e+12, AvgRelAbsErr: 9.894843e-01
For array a[], 9895936 errors were found.
Failed Validation on array b[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 2.306602e+11, AvgAbsErr: 2.282429e+11, AvgRelAbsErr: 9.895204e-01
AvgRelAbsErr > Epsilon (1.000000e-13)
For array b[], 9895936 errors were found.
Failed Validation on array c[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 3.075469e+11, AvgAbsErr: 3.043100e+11, AvgRelAbsErr: 9.894753e-01
AvgRelAbsErr > Epsilon (1.000000e-13)
For array c[], 9895936 errors were found.
-------------------------------------------------------------
which is odd, as the FPU emulation is the same, so maybe there are still some memory access issues left. (This is with my current development version, but I get the same result with QEMU 8.0.0, so at least it's not something I broke recently. I don't know yet what causes it.)
Compiling stream.c with gcc 10.2.1 for Linux (with -mcpu=powerpc -O3 -DVERBOSE) and running it on QEMU sam460ex with a Linux guest instead of AmigaOS, I get:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 208608 microseconds.
(= 104304 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2126.6     0.076052     0.075236     0.077315
Scale:            775.6     0.208164     0.206299     0.212734
Add:              950.8     0.256040     0.252422     0.261350
Triad:            825.2     0.294674     0.290838     0.302348
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Results Validation Verbose Results:
Expected a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
Observed a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
Rel Errors on a, b, c: 0.000000e+00 0.000000e+00 0.000000e+00
-------------------------------------------------------------
So the validation error only happens on AmigaOS with the binary from @sailor. Could it be some problem with gcc 6 and -O3? I don't have an AmigaOS compiler set up right now, so I can't try it, but if somebody can compile it with gcc 10, or without -O3, and check whether that runs correctly on QEMU sam460ex, that may help narrow down why this fails.
Edited by balaton on 2024/4/28 13:56:30
Edited by balaton on 2024/4/28 15:50:34
Edited by balaton on 2024/4/28 15:51:35