@Hans
Quote:
BTW, these days achieving maximum memory copy speeds often seems to require using multiple cores.
Yes, I've noticed that bandwidth scales more or less linearly when you run a multi-threaded memory test on X5000 Linux (ramsmp-3.5.0). In single-threaded mode it produces the same kind of results as ragemem and my own memory test (the one without cache hint instructions). It would be nice to see whether X5040 owners get a ~4x increase going from a single-threaded to a quad-threaded run.
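To make the single versus multi-threaded comparison concrete, here is a minimal sketch of such a test in C. It is not how ramsmp measures things; the names (copy_bandwidth_mb_s, copy_job, MAX_THREADS) are made up for illustration, it assumes POSIX threads are available, and it uses a plain memcpy per thread on private buffers. Thread start-up time is included in the measurement, so use large buffers and several passes.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_THREADS 16   /* hypothetical upper bound for this sketch */

/* Work description for one thread: it copies between its own buffers. */
struct copy_job {
    unsigned char *src;
    unsigned char *dst;
    size_t bytes;
    int passes;
    unsigned char check;   /* read back from dst so the copies are not elided */
};

static void *copy_worker(void *arg)
{
    struct copy_job *job = arg;
    for (int i = 0; i < job->passes; i++) {
        memcpy(job->dst, job->src, job->bytes);
        job->check ^= job->dst[job->bytes - 1];
    }
    return NULL;
}

/* Wall-clock time in seconds (C11 timespec_get, not a monotonic clock). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns the aggregate copy bandwidth in MB/s, or -1.0 on error. */
double copy_bandwidth_mb_s(int threads, size_t bytes_per_thread, int passes)
{
    struct copy_job job[MAX_THREADS];
    pthread_t tid[MAX_THREADS];
    int started = 0;
    int ok = 1;
    double elapsed = 0.0;

    if (threads < 1 || threads > MAX_THREADS || bytes_per_thread == 0 || passes < 1)
        return -1.0;

    memset(job, 0, sizeof job);
    for (int i = 0; i < threads; i++) {
        job[i].src = malloc(bytes_per_thread);
        job[i].dst = malloc(bytes_per_thread);
        job[i].bytes = bytes_per_thread;
        job[i].passes = passes;
        if (!job[i].src || !job[i].dst) {
            ok = 0;
            break;
        }
        /* Touch every page up front so page faults are not timed. */
        memset(job[i].src, 0x5A, bytes_per_thread);
        memset(job[i].dst, 0x00, bytes_per_thread);
    }

    if (ok) {
        double t0 = now_seconds();
        for (; started < threads; started++) {
            if (pthread_create(&tid[started], NULL, copy_worker, &job[started]) != 0) {
                ok = 0;
                break;
            }
        }
        for (int i = 0; i < started; i++)
            pthread_join(tid[i], NULL);
        elapsed = now_seconds() - t0;
    }

    for (int i = 0; i < threads; i++) {
        free(job[i].src);
        free(job[i].dst);
    }

    if (!ok || elapsed <= 0.0)
        return -1.0;
    return (double)threads * (double)bytes_per_thread * (double)passes / elapsed / 1e6;
}
```

Calling copy_bandwidth_mb_s(1, ...) and copy_bandwidth_mb_s(4, ...) with the same buffer size and dividing the two results gives the scaling factor. The buffers should be much larger than the caches, otherwise you measure cache speed instead of DIMM speed.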
https://openbenchmarking.org/test/pts/ramspeed uses the same ramsmp-3.5.0 in its test suite. The results there for the same type of DIMMs are much higher than what we can achieve on an X5000. But this is not a fair comparison, because there are vector-enabled assembly routines for Intel and AMD.
However, the disappointing part is that our GCC will never generate optimized code for the X5000, because it simply doesn't understand the cache hint instructions for a full cache line. As a result, "-fprefetch-loop-arrays" generates only marginally faster code, which is still slower than my "half cacheline" test.
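For readers who wonder what a hand-written cache hint looks like, here is a rough sketch in C with GCC inline assembly. It is not my actual test code. The function name copy_with_dcbz and the ASSUMED_LINE constant are invented for this example, and it assumes that dcbz clears a full 64-byte block, that the destination is ordinary cacheable memory, and that source and destination do not overlap. On anything that is not PowerPC it simply falls back to memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: dcbz clears 64 bytes. If the real block size is larger,
 * this routine would zero memory it should not touch. */
#define ASSUMED_LINE 64u

/* Copies n bytes from src to dst (no overlap allowed). On PowerPC each
 * full destination line is first established in the cache with dcbz, so
 * the core does not need to fetch the old contents from memory before
 * they are overwritten. */
void copy_with_dcbz(void *dst, const void *src, size_t n)
{
#if defined(__powerpc__) || defined(__PPC__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy the unaligned head so that d becomes line-aligned. */
    size_t head = (size_t)(-(uintptr_t)d & (ASSUMED_LINE - 1));
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Full lines: zero-allocate the line, then fill it. */
    while (n >= ASSUMED_LINE) {
        __asm__ volatile("dcbz 0,%0" : : "r"(d) : "memory");
        memcpy(d, s, ASSUMED_LINE);
        d += ASSUMED_LINE;
        s += ASSUMED_LINE;
        n -= ASSUMED_LINE;
    }

    /* Remaining tail, shorter than one line. */
    memcpy(d, s, n);
#else
    /* Not PowerPC: no dcbz available, use the plain library copy. */
    memcpy(dst, src, n);
#endif
}
```

The compiler has no way to emit this pattern by itself, since it would need to know the block size that dcbz operates on and that the memory is cacheable.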
According to the doc, a combination of the following options is enabled by the -mcpu switch, so AltiVec is probably also an issue:
Quote:
The -mcpu options automatically enable or disable the following options:
-maltivec -mfprnd -mhard-float -mmfcrf -mmultiple
-mpopcntb -mpopcntd -mpowerpc64
-mpowerpc-gpopt -mpowerpc-gfxopt
-mmulhw -mdlmzb -mmfpgpr -mvsx
-mcrypto -mhtm -mpower8-fusion -mpower8-vector
-mquad-memory -mquad-memory-atomic -mfloat128
-mfloat128-hardware -mprefixed -mpcrel -mmma
I will try to play with the options later this week.
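One way to see what a given -mcpu value actually switched on is to look at the feature macros that GCC predefines for the PowerPC target. The sketch below checks only a handful of them (__ALTIVEC__, __VSX__, _ARCH_PPC64, __HTM__); the HAS_* macros, the table and the function name are made up for this example. Compile it once per -mcpu setting and compare the output.

```c
#include <stdio.h>

/* Map GCC's predefined PowerPC feature macros to 0/1 constants. */
#ifdef __ALTIVEC__
#define HAS_ALTIVEC 1
#else
#define HAS_ALTIVEC 0
#endif

#ifdef __VSX__
#define HAS_VSX 1
#else
#define HAS_VSX 0
#endif

#ifdef _ARCH_PPC64
#define HAS_PPC64 1
#else
#define HAS_PPC64 0
#endif

#ifdef __HTM__
#define HAS_HTM 1
#else
#define HAS_HTM 0
#endif

struct target_feature {
    const char *option;   /* the -m option the macro corresponds to */
    int enabled;          /* 1 if the macro was defined at compile time */
};

static const struct target_feature target_features[] = {
    { "-maltivec",   HAS_ALTIVEC },
    { "-mvsx",       HAS_VSX },
    { "-mpowerpc64", HAS_PPC64 },
    { "-mhtm",       HAS_HTM },
};

#define NUM_FEATURES (sizeof target_features / sizeof target_features[0])

/* Prints one line per option and returns how many are enabled. */
int print_target_features(FILE *out)
{
    int enabled = 0;
    for (size_t i = 0; i < NUM_FEATURES; i++) {
        fprintf(out, "%-12s %s\n", target_features[i].option,
                target_features[i].enabled ? "on" : "off");
        enabled += target_features[i].enabled;
    }
    return enabled;
}
```

On a compiler that does not target PowerPC, all four lines simply report "off".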
EDIT: I think I now understand why GCC doesn't support dcbzl/dcbal for the e5500. By default, L1CSR0[DCBZ32] is cleared. This means that dcbz and dcba operate on the full 64-byte cache line. So by default, dcbz/dcba behave like dcbzl/dcbal.
I will check whether L1CSR0[DCBZ32] is still cleared at runtime on AmigaOS. If it is not, and if the e5500 core allows changing it at runtime, it would be even easier to simply clear L1CSR0[DCBZ32]. That makes sense, actually: both generic and e5500-specific code would get the same benefit. But I do wonder if there is a catch. The 32-byte limitation option looks pretty redundant to me, but they wouldn't waste silicon on it without a good reason.
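Reading L1CSR0 directly needs supervisor mode, but the effective dcbz size can also be probed from a normal program: fill a buffer with a non-zero pattern, execute one dcbz on an aligned address and count how many bytes became zero. Below is a minimal sketch of that idea. The names probe_dcbz_size and count_zero_bytes are made up, and it assumes the static buffer sits in ordinary cacheable memory and that the block size is at most 128 bytes. A result of 64 would suggest DCBZ32 is cleared, 32 that it is set.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROBE_BYTES 512u

/* Counts the bytes in buf[0..len) that are zero. */
size_t count_zero_bytes(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0)
            zeros++;
    }
    return zeros;
}

/* Returns the number of bytes cleared by a single dcbz, or 0 when the
 * code was not built for PowerPC. */
size_t probe_dcbz_size(void)
{
#if defined(__powerpc__) || defined(__PPC__)
    static unsigned char raw[PROBE_BYTES + 256];

    /* Align the working area to 256 bytes so that the probed block
     * starts at buf + 256 and stays inside the buffer for block sizes
     * up to 128 bytes. */
    unsigned char *buf =
        (unsigned char *)(((uintptr_t)raw + 255) & ~(uintptr_t)255);

    memset(buf, 0xFF, PROBE_BYTES);
    __asm__ volatile("dcbz 0,%0" : : "r"(buf + 256) : "memory");
    return count_zero_bytes(buf, PROBE_BYTES);
#else
    return 0;
#endif
}
```

This only shows what dcbz does at that moment; it does not tell whether the OS or firmware changes the setting later.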
Edited by geennaam on 2023/2/22 12:26:17