Re: A1222 support in the SDK and problems
@flash

Quote:

Also libc libraries (newlib/clib) needs to be recompiled for P1222 support, we need new sdk to produce the right binaries for A1222 without any workaround.


A special SPE compiled version of newlib.library for the Tabor/A1222 exists since version 53.54 (released to beta testers in October 2019).

The exposed ABI of the library's "main" interface is and has to remain that of generic PPC code (e.g. floating point parameters and results are passed in the emulated FPU registers) because otherwise it wouldn't be possible to run existing non-SPE compiled programs on the A1222.

What might however be possible would be to also expose the SPE ABI functions directly through another "main.spe" interface but in order for it to be usable special versions of the startup code and libc will likely also be needed.

The ABI for SPE code generated by gcc is identical to soft-float ABI in that double precision floats are passed as register pairs (r3/r4, r5/r6, r7/r8, r9/r10) even though for the SPE they could be passed in a single 64-bit register.
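
To make "passed as register pairs" concrete, here is a minimal sketch of how one 64-bit double maps onto two 32-bit words. The struct and function names are made up for illustration, and the sketch assumes the usual big-endian PowerPC ordering, where the high word goes into the lower-numbered register (r3) and the low word into the next one (r4).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the two 32-bit words a double occupies
 * when it is passed in a GPR pair such as r3/r4. */
struct gpr_pair {
    uint32_t hi; /* lower-numbered register (e.g. r3), assumed big-endian order */
    uint32_t lo; /* higher-numbered register (e.g. r4) */
};

/* Split a 64-bit IEEE 754 double into two 32-bit words. */
static struct gpr_pair double_to_gpr_pair(double value)
{
    uint64_t bits;
    struct gpr_pair pair;

    memcpy(&bits, &value, sizeof bits); /* copy the raw bit pattern */
    pair.hi = (uint32_t)(bits >> 32);
    pair.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    return pair;
}

/* Reassemble the double from the two words, as a callee would. */
static double gpr_pair_to_double(struct gpr_pair pair)
{
    uint64_t bits = ((uint64_t)pair.hi << 32) | pair.lo;
    double value;

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

For 1.0 this gives hi = 0x3FF00000 and lo = 0x00000000. With the SPE a single 64-bit register could hold both halves, but the gcc ABI described above still splits them.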

Re: A1222 support in the SDK and problems
@sailor
Quote:
And please, how to use soft-float C library?
Is it something like: "gcc -mcrt=clib2 -msoft-float .... -lm" ?

Yes.

Quote:
And how is floating-point parameters passed when I used "-mcpu=powerpc -msoft-float" ? Via GPR registers? They are 32-bit in powerpc ABI. Or via stack?

Same as regular integers: the first parameters are passed in the 8 argument registers (r3 to r10), and any further parameters are passed on the stack.
float = int32 = one 32-bit register, double = int64 = two 32-bit registers.
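
As a rough illustration of that rule, the following sketch models which registers the arguments of a soft-float call end up in. It is a simplified model, not the compiler's actual logic: it only knows 32-bit and 64-bit scalar arguments, and it assumes that 64-bit values use the pairs r3/r4, r5/r6, r7/r8 and r9/r10. All names are invented for the example.

```c
#include <stddef.h>

/* ARG_WORD: int or float (one register).
 * ARG_DWORD: int64 or double (two registers). */
enum arg_kind { ARG_WORD, ARG_DWORD };

struct arg_slot {
    int first_reg; /* 3..10 for r3..r10, or 0 if passed on the stack */
    int num_regs;  /* 1 or 2, or 0 if passed on the stack */
};

/* Simplified model of GPR assignment for a soft-float call. */
static void assign_gprs(const enum arg_kind *kinds, size_t count,
                        struct arg_slot *slots)
{
    int next = 3; /* r3 is the first argument register */
    size_t i;

    for (i = 0; i < count; i++) {
        int need = (kinds[i] == ARG_DWORD) ? 2 : 1;

        /* 64-bit values start at an odd register: r3, r5, r7 or r9 */
        if (need == 2 && next % 2 == 0)
            next++;

        if (next + need - 1 <= 10) {
            slots[i].first_reg = next;
            slots[i].num_regs = need;
            next += need;
        } else {
            /* no registers left: this argument goes on the stack */
            slots[i].first_reg = 0;
            slots[i].num_regs = 0;
            next = 11;
        }
    }
}
```

For a call with the arguments (int, double, float, double, int) this model yields r3, r5/r6, r7, r9/r10 and finally the stack; r4 and r8 are skipped to keep the doubles in their pairs.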

An SPE C library is required, but as long as there is none, and if for some reason rebuilding clib4 for it doesn't work, the old, already existing soft-float clib2 could be used for now:
- Build everything which doesn't use (much) float/double code with -msoft-float and use the soft-float C library.
- Put code which uses float/double calculations in separate sources compiled with -mabi=spe -mfloat-gprs=double instead.
- Make sure SPE functions called from soft-float code, and the other way round, are compatible, for example by only using pointers to float/double instead of direct float/double parameters. This may not even be required if they are compatible anyway, as salass00 wrote.
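
A minimal sketch of the pointer-only boundary from the last point. The file split, the function names and the calculation are hypothetical; only the compiler flags in the comments are the ones named above. Both halves are shown in one block here, but in a real project they would be two source files.

```c
#include <stddef.h>

/* --- spe_math.c (hypothetical file): built with
 *     -mabi=spe -mfloat-gprs=double --- */

/* dst[i] = src[i] * (*factor). Every float/double crosses the
 * function boundary through a pointer, never as a direct parameter
 * or return value. */
void spe_scale(double *dst, const double *src, const double *factor,
               size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        dst[i] = src[i] * *factor;
}

/* --- main_part.c (hypothetical file): built with -msoft-float and
 *     linked against the soft-float C library --- */

void scale_by_three(double *values, size_t count)
{
    const double factor = 3.0; /* lives in memory, passed by address */

    spe_scale(values, values, &factor, count);
}
```

The idea is that only addresses (plain integers in GPRs) cross between the two halves, so it doesn't matter whether both sides agree on how a double itself would be passed.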

@salass00
Quote:
What might however be possible would be to also expose the SPE ABI functions directly through another "main.spe" interface but in order for it to be usable special versions of the startup code and libc will likely also be needed.

Unless the way I implemented the newlib libc.(a|so) stub functions was changed, only new startup code (crtbegin) using the interface "spe" instead of "main" should be required.


Edited by joerg on 2024/4/18 15:13:07
Edited by joerg on 2024/4/18 15:17:55
Re: A1222 support in the SDK and problems
@joerg
@salass00

Thank you for the explanation.

AmigaOS3: Amiga 1200
AmigaOS4: Micro A1-C, AmigaOne XE, Pegasos II, Sam440ep, Sam440ep-flex, AmigaOne X1000
MorphOS: Efika 5200b, Pegasos I, Pegasos II, Powerbook, Mac Mini, iMac, Powermac Quad
Re: A1222 support in the SDK and problems
@joerg
Quote:

- Build everything which doesn't use (much) float/double code with -msoft-float and use the soft-float C library.
- Put code which uses float/double calculations in separate sources compiled with -mabi=spe -mfloat-gprs=double instead.

And what if I need to use math library functions (sin, cos, ...)? Do you know which is faster: calling them through newlib the standard PowerPC way, i.e. using the LTE emulator, or using clib2 with its integer (soft-float) emulation?
Of course, I can measure it; I am just asking in case you already know.

Quote:
- Make sure SPE functions called from soft-float code, and the other way round, are compatible, for example by only using pointers to float/double instead of direct float/double parameters. May not even be required if they are compatible anyway, as salass00 wrote.

At least printf, fprintf and sin() are not compatible (newlib.library 53.84): calling them from SPE code returns nonsense. These are the ones I tested.

AmigaOS3: Amiga 1200
AmigaOS4: Micro A1-C, AmigaOne XE, Pegasos II, Sam440ep, Sam440ep-flex, AmigaOne X1000
MorphOS: Efika 5200b, Pegasos I, Pegasos II, Powerbook, Mac Mini, iMac, Powermac Quad
Re: A1222 support in the SDK and problems
@sailor
Quote:
at least printf, fprintf and sin() are not identical ( newlib.library 53.84 )- calling from SPE code returns nonsence.

There is no soft-float newlib. Its functions are called with the -mhard-float PowerPC ABI, which uses FPU registers, even if the A1222 version is internally using SPE code.
You have to use clib2 for now.
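
To illustrate why such a call returns nonsense, here is a toy model of the mismatch. It is not AmigaOS or newlib code: the register file is just a struct, and the function names are invented. It assumes that the soft-float/SPE caller puts the first double into r3/r4, while a hard-float callee reads its first floating point argument from f1.

```c
#include <stdint.h>
#include <string.h>

/* Toy register file, only what is needed to show the mismatch. */
struct toy_cpu {
    uint32_t gpr[32]; /* general purpose registers r0..r31 */
    double   fpr[32]; /* (emulated) FPU registers f0..f31 */
};

/* Caller using the soft-float/SPE convention: the double argument
 * is placed in the r3/r4 pair, f1 is left untouched. */
static void caller_passes_in_gprs(struct toy_cpu *cpu, double arg)
{
    uint64_t bits;

    memcpy(&bits, &arg, sizeof bits);
    cpu->gpr[3] = (uint32_t)(bits >> 32);
    cpu->gpr[4] = (uint32_t)(bits & 0xFFFFFFFFu);
}

/* Caller using the hard-float convention: the argument goes to f1. */
static void caller_passes_in_fpr(struct toy_cpu *cpu, double arg)
{
    cpu->fpr[1] = arg;
}

/* Callee with the hard-float ABI: takes its first floating point
 * argument from f1, whatever that register currently contains. */
static double callee_reads_fpr(const struct toy_cpu *cpu)
{
    return cpu->fpr[1];
}
```

If f1 still holds some stale value and the argument is passed with caller_passes_in_gprs(), callee_reads_fpr() returns the stale value instead of the argument. Only the caller_passes_in_fpr() variant matches what the callee expects.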

Re: A1222 support in the SDK and problems
@joerg

Quote:

Unless the way I implemented the newlib libc.(a|so) stub functions was changed only a new startup code (crtbegin) using interface "spe" instead of "main" should be required.


Yes, that should still work.

Re: A1222 support in the SDK and problems
My first SPE-modified application, stream, is finished!

I need some apps for benchmarking the A1222+, and since almost none exist, I have to make them myself.
It is on OS4Depot now. It is only one small, easy piece, but it is also my first C code in 20+ years, so I am happy...

AmigaOS3: Amiga 1200
AmigaOS4: Micro A1-C, AmigaOne XE, Pegasos II, Sam440ep, Sam440ep-flex, AmigaOne X1000
MorphOS: Efika 5200b, Pegasos I, Pegasos II, Powerbook, Mac Mini, iMac, Powermac Quad
Re: A1222 support in the SDK and problems
@sailor

Great job!!!
On my SAM460ex I got these results:
#stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size 10000000 (elements), Offset (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 371870 microseconds.
   (= 185935 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             760.6     0.214048     0.210373     0.223806
Scale:            328.4     0.496581     0.487278     0.506078
Add:              429.6     0.564791     0.558662     0.571497
Triad:            429.3     0.568371     0.559030     0.578980
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
#
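
For anyone wondering how the "Best Rate" column relates to the times: as far as I understand stream.c, it is the number of bytes a kernel moves divided by its minimum time. Copy and Scale touch two arrays per element, Add and Triad three. A small sketch of that arithmetic (the function name is mine, not from stream.c):

```c
/* Bandwidth in MB/s (1 MB = 10^6 bytes) for one STREAM kernel.
 * arrays_touched: 2 for Copy and Scale, 3 for Add and Triad. */
static double stream_rate_mb_s(int arrays_touched, double bytes_per_element,
                               double elements, double min_time_s)
{
    double bytes_moved = arrays_touched * bytes_per_element * elements;

    return 1.0e-6 * bytes_moved / min_time_s;
}
```

With the values from the output above (8 bytes per element, 10000000 elements), stream_rate_mb_s(2, 8, 10000000, 0.210373) gives about 760.6 for Copy and stream_rate_mb_s(3, 8, 10000000, 0.558662) gives about 429.6 for Add, matching the table.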

Re: A1222 support in the SDK and problems
@jabirulo

Just for comparison, this is from an X1000:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size 10000000 (elements), Offset (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 51567 microseconds.
   (= 25783 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2647.9     0.064595     0.060425     0.071738
Scale:           4057.5     0.044220     0.039433     0.047337
Add:             3822.7     0.067003     0.062782     0.075438
Triad:           3823.4     0.069087     0.062771     0.071987
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

People are dying.
Entire ecosystems are collapsing.
We are in the beginning of a mass extinction.
And all you can talk about is money and fairytales of eternal economic growth.
How dare you!
– Greta Thunberg
Re: A1222 support in the SDK and problems
With QEMU amigaone I get good results (slower than real G4 but we know QEMU FPU is slow and this combines that with memory access that also gets slower for bigger blocks):
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size 10000000 (elements), Offset (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 170922 microseconds.
   (= 85461 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2540.6     0.068012     0.062977     0.077582
Scale:           1009.5     0.163934     0.158498     0.172685
Add:             1217.1     0.208123     0.197197     0.223799
Triad:            965.6     0.261186     0.248550     0.281111
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

But with QEMU sam460ex something seems to be wrong:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size 10000000 (elements), Offset (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 3 microseconds.
Each test below will take on the order of 192721 microseconds.
   (= 64240 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1672.3     0.098496     0.095678     0.109226
Scale:            771.9     0.212180     0.207291     0.219799
Add:              968.0     0.253143     0.247930     0.264263
Triad:            760.0     0.326087     0.315804     0.345664
-------------------------------------------------------------
Failed Validation on array a[], AvgRelAbsErr > epsilon (1.000000e-13)
     Expected Value: 1.153301e+12, AvgAbsErr: 1.141173e+12, AvgRelAbsErr: 9.894843e-01
     For array a[], 9895936 errors were found.
Failed Validation on array b[], AvgRelAbsErr > epsilon (1.000000e-13)
     Expected Value: 2.306602e+11, AvgAbsErr: 2.282429e+11, AvgRelAbsErr: 9.895204e-01
     AvgRelAbsErr > Epsilon (1.000000e-13)
     For array b[], 9895936 errors were found.
Failed Validation on array c[], AvgRelAbsErr > epsilon (1.000000e-13)
     Expected Value: 3.075469e+11, AvgAbsErr: 3.043100e+11, AvgRelAbsErr: 9.894753e-01
     AvgRelAbsErr > Epsilon (1.000000e-13)
     For array c[], 9895936 errors were found.
-------------------------------------------------------------

which is odd as the FPU emulation is the same, so maybe there are still some memory access issues left. (This is with my current development version, but I get the same result with QEMU 8.0.0, so at least it is not something I broke recently. I don't know yet what causes it.)

Compiling stream.c with gcc 10.2.1 for Linux (with -mcpu=powerpc -O3 -DVERBOSE) and running it on QEMU sam460ex with Linux guest instead of AmigaOS I get:
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size 10000000 (elements), Offset (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 208608 microseconds.
   (= 104304 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2126.6     0.076052     0.075236     0.077315
Scale:            775.6     0.208164     0.206299     0.212734
Add:              950.8     0.256040     0.252422     0.261350
Triad:            825.2     0.294674     0.290838     0.302348
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
Results Validation Verbose Results:
    Expected a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
    Observed a(1), b(1), c(1): 1153300781250.000000 230660156250.000000 307546875000.000000
    Rel Errors on a, b, c:     0.000000e+00 0.000000e+00 0.000000e+00
-------------------------------------------------------------
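
For reference, the expected values in the verbose output can be reproduced without any arrays. As far as I can tell from stream.c, the checker replays the four kernels on three scalars: the arrays start as a = 1, b = 2, c = 0, a[] is doubled once during timer calibration, and the scalar used by Scale and Triad is 3. A sketch of that recurrence (struct and function names are mine):

```c
struct stream_expected {
    double a, b, c;
};

/* Replay the STREAM kernels on scalars to get the expected final
 * value of every array element after ntimes iterations. */
static struct stream_expected stream_expected_values(int ntimes)
{
    struct stream_expected e = { 1.0, 2.0, 0.0 }; /* initial contents */
    const double scalar = 3.0;
    int k;

    e.a = 2.0 * e.a; /* a[] is doubled once before the timed loops */
    for (k = 0; k < ntimes; k++) {
        e.c = e.a;                /* Copy  */
        e.b = scalar * e.c;       /* Scale */
        e.c = e.a + e.b;          /* Add   */
        e.a = e.b + scalar * e.c; /* Triad */
    }
    return e;
}
```

Each iteration multiplies a by 15, so with 10 iterations a = 2 * 15^10 = 1153300781250, b = 230660156250 and c = 307546875000, which are the "Expected" values printed above. The failing run on QEMU sam460ex reports the same expected values, so what differs there is the array contents.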

So the validation error only happens on AmigaOS with the binary from @sailor. Could it be some problem with gcc 6 and -O3? I don't have AmigaOS compiler set up now so can't try it but if somebody can compile it with gcc 10 or without -O3 and verify if that runs correctly on QEMU sam460ex that may help further to locate why this fails. But the same binary runs on real Sam460EX as confirmed above so the problem must be in QEMU but I have no idea how to debug it.


Edited by balaton on 2024/4/28 13:56:30
Edited by balaton on 2024/4/28 15:50:34
Edited by balaton on 2024/4/28 15:51:35
Edited by balaton on 2024/4/28 19:10:58
Re: A1222 support in the SDK and problems
A1222+ results:

native SPE FPU:
8.System:> Work:Benchmark/stream-5.10-AOS
8.Work:Benchmark/stream-5.10-AOS> stream_spe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size 10000000 (elements), Offset (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 293711 microseconds.
   (= 146855 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             787.1     0.204503     0.203269     0.208423
Scale:            492.9     0.326322     0.324588     0.329637
Add:              568.0     0.424966     0.422508     0.427871
Triad:            541.6     0.445014     0.443115     0.449225
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays


Standard PowerPC FPU code with the LTE emulator:
8.Work:Benchmark/stream-5.10-AOS> stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size 10000000 (elements), Offset (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 1032608 microseconds.
   (= 516304 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:             788.5     0.204721     0.202919     0.208100
Scale:            148.0     1.081844     1.080804     1.085773
Add:              154.7     1.554502     1.551267     1.557342
Triad:            148.2     1.622773     1.619742     1.626540
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays


LTE FPU emulation is very fast: in the Scale, Add and Triad kernels it reaches more than 25% of the speed of native SPE FPU code.
Unfortunately, the majority of 3D games do not work with LTE, so the interpretive emulator must be used, and it is very slow.
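
The percentages can be checked from the two tables above. A small sketch with the "Best Rate" numbers typed in by hand (struct and function names are only for this example):

```c
/* Best Rate in MB/s on the A1222+, taken from the two runs above. */
struct kernel_result {
    const char *name;
    double spe_mb_s; /* stream_spe, native SPE FPU code */
    double lte_mb_s; /* stream, standard FPU code run through LTE */
};

static const struct kernel_result a1222_results[] = {
    { "Copy",  787.1, 788.5 },
    { "Scale", 492.9, 148.0 },
    { "Add",   568.0, 154.7 },
    { "Triad", 541.6, 148.2 },
};

/* LTE speed as a percentage of the native SPE speed. */
static double lte_percent_of_spe(const struct kernel_result *result)
{
    return 100.0 * result->lte_mb_s / result->spe_mb_s;
}
```

This gives roughly 30% for Scale and 27% for Add and Triad. Copy is the same in both builds, which is expected since it does no floating point arithmetic.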

AmigaOS3: Amiga 1200
AmigaOS4: Micro A1-C, AmigaOne XE, Pegasos II, Sam440ep, Sam440ep-flex, AmigaOne X1000
MorphOS: Efika 5200b, Pegasos I, Pegasos II, Powerbook, Mac Mini, iMac, Powermac Quad