Also, the libc libraries (newlib/clib) need to be recompiled for P1022 support; we need a new SDK to produce the right binaries for the A1222 without any workaround.
A special SPE-compiled version of newlib.library for the Tabor/A1222 has existed since version 53.54 (released to beta testers in October 2019).
The exposed ABI of the library's "main" interface is and has to remain that of generic PPC code (e.g. floating point parameters and results are passed in the emulated FPU registers) because otherwise it wouldn't be possible to run existing non-SPE compiled programs on the A1222.
What might however be possible would be to also expose the SPE ABI functions directly through another "main.spe" interface but in order for it to be usable special versions of the startup code and libc will likely also be needed.
The ABI for SPE code generated by gcc is identical to the soft-float ABI in that double precision floats are passed as register pairs (r3/r4, r5/r6, r7/r8, r9/r10), even though for the SPE they could be passed in a single 64-bit register.
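To make the register-pair point concrete, here is a minimal sketch in plain C that splits a double into the two 32-bit words that would travel in a pair such as r3/r4. The gpr_pair type and double_to_gpr_pair() are made-up names for illustration, and the high-word-first order is an assumption based on PowerPC being big-endian; the real register assignment is done by the compiler, not by code like this.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the two 32-bit words a double occupies when it
   is passed in a GPR pair such as r3/r4 (assumed order: high word first). */
typedef struct {
    uint32_t hi; /* would travel in the first register of the pair, e.g. r3 */
    uint32_t lo; /* would travel in the second register, e.g. r4 */
} gpr_pair;

/* Split an IEEE 754 double into its high and low 32-bit words. */
static gpr_pair double_to_gpr_pair(double value)
{
    uint64_t bits;
    gpr_pair pair;

    assert(sizeof(bits) == sizeof(value)); /* holds for 64-bit IEEE doubles */
    memcpy(&bits, &value, sizeof(bits));   /* well-defined type punning */
    pair.hi = (uint32_t)(bits >> 32);
    pair.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    return pair;
}
```

For 1.0 (bit pattern 0x3FF0000000000000) this gives hi = 0x3FF00000 and lo = 0.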
Quote:
And please, how do I use the soft-float C library? Is it something like: "gcc -mcrt=clib2 -msoft-float .... -lm" ?
Yes.
Quote:
And how are floating-point parameters passed when I use "-mcpu=powerpc -msoft-float"? Via GPR registers? They are 32-bit in the PowerPC ABI. Or via the stack?
Same as regular integers: first in the 8 parameter registers (r3 to r10), and if more parameters are used, on the stack. float = int32 = one 32 bit register, double = int64 = two 32 bit registers.
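As a rough illustration of that rule, the following sketch assigns registers to a parameter list the way it is described above. Everything here (arg_kind, assign_gprs) is a hypothetical model, not compiler code. It assumes the 8 registers are r3 to r10 and that a double always starts on an odd register, which is what the pairs r3/r4, r5/r6, r7/r8, r9/r10 mentioned earlier suggest; stack layout is not modelled at all.

```c
#include <assert.h>

/* Hypothetical parameter kinds for this sketch. */
typedef enum { ARG_INT32, ARG_FLOAT, ARG_DOUBLE } arg_kind;

enum { FIRST_GPR = 3, LAST_GPR = 10 }; /* r3..r10: the 8 parameter registers */

/* Fill first_reg[i] with the first GPR used by argument i, or 0 if the
   argument no longer fits in r3..r10 and would go on the stack.
   Simplified model: once one argument spills, all later ones spill too. */
static void assign_gprs(const arg_kind *args, int count, int *first_reg)
{
    int next = FIRST_GPR;
    int i;

    assert(count >= 0);
    for (i = 0; i < count; i++) {
        int words = (args[i] == ARG_DOUBLE) ? 2 : 1;

        if (words == 2 && (next % 2) == 0)
            next++; /* assumed: pairs start on an odd register (r3, r5, r7, r9) */

        if (next + words - 1 <= LAST_GPR) {
            first_reg[i] = next;
            next += words;
        } else {
            first_reg[i] = 0;    /* passed on the stack */
            next = LAST_GPR + 1; /* no more registers are handed out */
        }
    }
}
```

With the list (int, double, float) the model puts the int in r3, the double in r5/r6 (r4 is skipped) and the float in r7.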
An SPE C library is required, but as long as there is none, and if for some reason rebuilding clib4 for it doesn't work, the old, already existing soft-float clib2 could be used for now:
- Build everything which doesn't use (much) float/double code with -msoft-float and use the soft-float C library.
- Put code which uses float/double calculations in separate sources compiled with -mabi=spe -mfloat-gprs=double instead.
- Make sure SPE functions called from soft-float code, and the other way round, are compatible, for example by only using pointers to float/double instead of direct float/double parameters. May not even be required if they are compatible anyway, as salass00 wrote.
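The pointer-only interface from the last point could look roughly like this. Both function names are invented for the example, and in a real project the two functions would sit in separate source files built with the different flags named in the comments; they are shown in one file here only so the sketch compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function that would live in a source file compiled with
   -mabi=spe -mfloat-gprs=double. Only pointers and an integer count cross
   the call boundary, so no float/double value is passed in registers. */
static void spe_scale_array(double *dst, const double *src,
                            const double *factor, size_t count)
{
    size_t i;

    assert(dst != NULL && src != NULL && factor != NULL);
    for (i = 0; i < count; i++)
        dst[i] = src[i] * (*factor);
}

/* Hypothetical caller that would live in a source file compiled with
   -msoft-float and linked against the soft-float clib2. */
static void soft_float_caller(double *values, size_t count)
{
    const double factor = 2.0; /* stays in memory; only its address is passed */

    spe_scale_array(values, values, &factor, count);
}
```

The cost is an extra memory access per parameter, which should not matter much when the SPE side works on whole arrays.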
@salass00 Quote:
What might however be possible would be to also expose the SPE ABI functions directly through another "main.spe" interface but in order for it to be usable special versions of the startup code and libc will likely also be needed.
Unless the way I implemented the newlib libc.(a|so) stub functions was changed, only new startup code (crtbegin) using interface "spe" instead of "main" should be required.
Edited by joerg on 2024/4/18 15:13:07 Edited by joerg on 2024/4/18 15:17:55
Quote:
- Build everything which doesn't use (much) float/double code with -msoft-float and use the soft-float C library.
- Put code which uses float/double calculations in separate sources compiled with -mabi=spe -mfloat-gprs=double instead.
And what if I need to use math library functions (sin, cos, ...)? Do you know which is faster: calling them via newlib in the standard PowerPC way, i.e. using the LTE emulator, or using clib2 with integer emulation as described here? Of course I can measure it; I am asking just in case.
Quote:
- Make sure SPE functions called from soft-float code, and the other way round, are compatible, for example by only using pointers to float/double instead of direct float/double parameters. May not even be required if they are compatible anyway, as salass00 wrote.
At least printf, fprintf and sin() are not identical (newlib.library 53.84): calling them from SPE code returns nonsense. These are the ones I tested.
AmigaOS3: Amiga 1200 AmigaOS4: Micro A1-C, AmigaOne XE, Pegasos II, Sam440ep, Sam440ep-flex, AmigaOne X1000 MorphOS: Efika 5200b, Pegasos I, Pegasos II, Powerbook, Mac Mini, iMac, Powermac Quad
Quote:
At least printf, fprintf and sin() are not identical (newlib.library 53.84): calling them from SPE code returns nonsense.
There is no soft-float newlib, the function calls are -mhard-float using the PowerPC ABI with FPU registers, even if the A1222 version is internally using SPE code. You have to use clib2 for now.
Quote:
Unless the way I implemented the newlib libc.(a|so) stub functions was changed only a new startup code (crtbegin) using interface "spe" instead of "main" should be required.
My first SPE-modified application, stream, is finished!
I need some apps for benchmarking the A1222+, and since almost none exist, I have to make them myself. It is on OS4Depot now. It is only one small, easy piece, but it is also my first C code in 20+ years, so I am happy...
#stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 371870 microseconds.
(= 185935 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             760.6     0.214048     0.210373     0.223806
Scale:            328.4     0.496581     0.487278     0.506078
Add:              429.6     0.564791     0.558662     0.571497
Triad:            429.3     0.568371     0.559030     0.578980
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
#
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 51567 microseconds.
(= 25783 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2647.9     0.064595     0.060425     0.071738
Scale:           4057.5     0.044220     0.039433     0.047337
Add:             3822.7     0.067003     0.062782     0.075438
Triad:           3823.4     0.069087     0.062771     0.071987
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
People are dying. Entire ecosystems are collapsing. We are in the beginning of a mass extinction. And all you can talk about is money and fairytales of eternal economic growth. How dare you! – Greta Thunberg
With QEMU amigaone I get good results (slower than real G4 but we know QEMU FPU is slow and this combines that with memory access that also gets slower for bigger blocks):
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 170922 microseconds.
(= 85461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2540.6     0.068012     0.062977     0.077582
Scale:           1009.5     0.163934     0.158498     0.172685
Add:             1217.1     0.208123     0.197197     0.223799
Triad:            965.6     0.261186     0.248550     0.281111
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
But with QEMU sam460ex something seems to be wrong:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 192721 microseconds.
(= 64240 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1672.3     0.098496     0.095678     0.109226
Scale:            771.9     0.212180     0.207291     0.219799
Add:              968.0     0.253143     0.247930     0.264263
Triad:            760.0     0.326087     0.315804     0.345664
-------------------------------------------------------------
Failed Validation on array a[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 1.153301e+12, AvgAbsErr: 1.141173e+12, AvgRelAbsErr: 9.894843e-01
For array a[], 9895936 errors were found.
Failed Validation on array b[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 2.306602e+11, AvgAbsErr: 2.282429e+11, AvgRelAbsErr: 9.895204e-01
AvgRelAbsErr > Epsilon (1.000000e-13)
For array b[], 9895936 errors were found.
Failed Validation on array c[], AvgRelAbsErr > epsilon (1.000000e-13)
Expected Value: 3.075469e+11, AvgAbsErr: 3.043100e+11, AvgRelAbsErr: 9.894753e-01
AvgRelAbsErr > Epsilon (1.000000e-13)
For array c[], 9895936 errors were found.
-------------------------------------------------------------
which is odd, as the FPU emulation is the same, so maybe there are still some memory access issues left. (This is with my current development version, but the result is the same with QEMU 8.0.0, so at least it is not something I broke recently; I don't know yet what causes it.)
Compiling stream.c with gcc 10.2.1 for Linux (with -mcpu=powerpc -O3 -DVERBOSE) and running it on QEMU sam460ex with Linux guest instead of AmigaOS I get:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 208608 microseconds.
(= 104304 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2126.6     0.076052     0.075236     0.077315
Scale:            775.6     0.208164     0.206299     0.212734
Add:              950.8     0.256040     0.252422     0.261350
Triad:            825.2     0.294674     0.290838     0.302348
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Results Validation Verbose Results:
Expected a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
Observed a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
Rel Errors on a, b, c: 0.000000e+00 0.000000e+00 0.000000e+00
-------------------------------------------------------------
So the validation error only happens on AmigaOS with the binary from @sailor. Could it be some problem with gcc 6 and -O3? I don't have an AmigaOS compiler set up now, so I can't try it, but if somebody can compile it with gcc 10 or without -O3 and verify whether that runs correctly on QEMU sam460ex, that may help to locate why this fails. The same binary runs on a real Sam460EX, as confirmed above, so the problem must be in QEMU, but I have no idea how to debug it.
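For anyone who wants to check the "Expected" numbers by hand: as far as I can tell from stream.c 5.10, the validation replays the four kernels on three scalars instead of the arrays. The sketch below does the same; stream_expected() is my own name for it, and the starting values (a=1, b=2, c=0, scalar 3, a doubled once by the timer calibration loop) are taken from my reading of the STREAM source rather than from this thread.

```c
#include <assert.h>

/* Reproduce STREAM's expected per-element values: start from a=1, b=2, c=0,
   double a once (the timing-calibration loop), then apply the four kernels
   ntimes with scalar = 3. */
static void stream_expected(int ntimes, double *a, double *b, double *c)
{
    const double scalar = 3.0;
    double aj = 1.0, bj = 2.0, cj = 0.0;
    int k;

    assert(ntimes >= 0);
    aj = 2.0 * aj;
    for (k = 0; k < ntimes; k++) {
        cj = aj;               /* Copy  */
        bj = scalar * cj;      /* Scale */
        cj = aj + bj;          /* Add   */
        aj = bj + scalar * cj; /* Triad */
    }
    *a = aj;
    *b = bj;
    *c = cj;
}
```

With ntimes = 10 this yields 1153300781250, 230660156250 and 307546875000, the same values as in the verbose Linux output above. All of them are exactly representable in a double, so any element that differs points to a wrong load, store or FPU result rather than to rounding.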
Edited by balaton on 2024/4/28 13:56:30 Edited by balaton on 2024/4/28 15:50:34 Edited by balaton on 2024/4/28 15:51:35 Edited by balaton on 2024/4/28 19:10:58
8.System:> Work:Benchmark/stream-5.10-AOS/
8.Work:Benchmark/stream-5.10-AOS> stream_spe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 293711 microseconds.
(= 146855 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             787.1     0.204503     0.203269     0.208423
Scale:            492.9     0.326322     0.324588     0.329637
Add:              568.0     0.424966     0.422508     0.427871
Triad:            541.6     0.445014     0.443115     0.449225
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Standard PowerPC FPU code with the LTE emulator:
8.Work:Benchmark/stream-5.10-AOS> stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 1032608 microseconds.
(= 516304 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             788.5     0.204721     0.202919     0.208100
Scale:            148.0     1.081844     1.080804     1.085773
Add:              154.7     1.554502     1.551267     1.557342
Triad:            148.2     1.622773     1.619742     1.626540
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
LTE FPU emulation is very fast: more than 25% of the speed of native SPE FPU code. Unfortunately, the majority of 3D games do not work with LTE, so the interpretive emulator must be used, and it is very slow.
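The percentage can be recomputed from the two tables. The sketch below assumes STREAM's usual accounting (Copy and Scale touch 2 arrays, Add and Triad touch 3, 1 MB = 10^6 bytes, rate taken from the minimum time); the function names are made up for this example.

```c
#include <assert.h>

/* Best rate in MB/s as STREAM reports it: bytes moved divided by the
   minimum time, with 1 MB = 10^6 bytes. arrays_touched is 2 for Copy and
   Scale, 3 for Add and Triad. */
static double stream_rate_mbs(int arrays_touched, double elements,
                              double bytes_per_element, double min_time_s)
{
    assert(min_time_s > 0.0);
    return 1.0e-6 * arrays_touched * elements * bytes_per_element / min_time_s;
}

/* Speed of one run relative to another, as a percentage. */
static double relative_speed_percent(double rate_mbs, double reference_mbs)
{
    assert(reference_mbs > 0.0);
    return 100.0 * rate_mbs / reference_mbs;
}
```

For Scale, 2 arrays of 10000000 elements at 8 bytes each and a minimum time of 1.080804 s give about 148.0 MB/s for LTE, against 492.9 MB/s for SPE, so LTE reaches roughly 30% of the SPE speed. Add and Triad come out at roughly 27%, while Copy is the same in both runs because it does no floating-point arithmetic.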