x5000 benchmarks / speed up


joerg

Re: x5000 benchmarks / speed up

Posted on: 2023/2/21 14:38 #61

Home away from home

@Hans
Quote:

The dcba instruction may also be very slow, just like with the G5. It's an illegal instruction on the G5, and is emulated by doing nothing (but with the overhead of an illegal instruction trap).

At least in AmigaOS 4.0, and for the 32-bit CPUs, the kernel didn't emulate it: when dcba is used on a CPU which doesn't support it at all, the program crashes with an ISI exception.
But there may be some CPUs where dcba is implemented as no-op without causing an ISI exception.

@afxgroup
Quote:

So you have to use Exec functions since they are always optimized for the current machine. Maybe they aren't for x5000 but only kernel developers know

I've read somewhere that X5000 optimised functions (maybe X1000 and A1222 as well) are only included in beta kernels, not in public released ones yet, but I don't know if that's still the case.
For the X1000 the 74xy AltiVec versions could have been used with (next to) no changes, but the X5000 (dcbzl instead of dcbz or, if there is such an instruction, dcbal instead of dcba) and the A1222 (integer accesses only, never any float/double loads/stores) need new implementations.

Edited by joerg on 2023/2/21 18:30:36

geennaam

Re: x5000 benchmarks / speed up

Posted on: 2023/2/21 20:03 #62

Quite a regular

Unfortunately, the dcbtl opcode is not recognized by any of the gcc versions in the latest SDK unless I use the compiler switch -mcpu=G5.
But then the generated code will of course crash immediately.

Even -mcpu=e5500 doesn't recognize those 64-byte cache line opcodes.

DCBA is actually still supported by the e5500 (see the e5500 RM). But, depending on a bit in the L1 cache control register, it works on half or full cache lines. The new e5500 opcode dcbal always works on full cache lines. But again, it is not recognized by gcc.

Anyways, I am stuck now.

Edited by geennaam on 2023/2/21 20:21:20

Hans

Re: x5000 benchmarks / speed up

Posted on: 2023/2/22 2:27 #63

Home away from home

@geennaam

Quote:

Unfortunately, the dcbtl opcode is not recognized by any of the gcc versions in the latest SDK unless I use the compiler switch -mcpu=G5.
But then the generated code will of course crash immediately.

Even -mcpu=e5500 doesn't recognize those 64-byte cache line opcodes.

Try: -mcpu=G5 -mno-powerpc64

Quote:

DCBA is actually still supported by the e5500 (see the e5500 RM). But, depending on a bit in the L1 cache control register, it works on half or full cache lines. The new e5500 opcode dcbal always works on full cache lines. But again, it is not recognized by gcc.

Interesting. One annoying thing about the cache hint instructions is their varying behaviour on different PowerPC CPUs.

BTW, these days achieving maximum memory copy speeds often seems to require using multiple cores.

Hans


geennaam

Re: x5000 benchmarks / speed up

Posted on: 2023/2/22 10:51 #64

Quite a regular

@Hans

Quote:

BTW, these days achieving maximum memory copy speeds often seems to require using multiple cores.

Yes, I've noticed that bandwidth scales more or less linearly when you run a multi-threaded memtest on X5000 Linux (ramsmp-3.5.0). In single-threaded mode it produces the same kind of results as ragemem and my memory test (the one without cache hint instructions). It would be nice to see whether X5040 owners get a ~4x increase going from a single-threaded to a quad-threaded test run.

https://openbenchmarking.org/test/pts/ramspeed uses the same ramsmp-3.5.0 in its test suite. Those results for the same type of DIMMs are much higher than we can achieve on an X5000. But this is not a fair comparison, because there are vector-enabled assembly routines for Intel and AMD.

However, the disappointing part is that our GCC will never generate optimized code for the X5000, because it simply doesn't understand the cache hint instructions for a full cache line. Therefore "-fprefetch-loop-arrays" generates only marginally faster code, which is still slower than my "half cacheline" test.
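For comparison with what "-fprefetch-loop-arrays" does automatically, a hand-written copy loop can request one prefetch per 64-byte line through GCC's `__builtin_prefetch`. This is only a sketch: `CACHE_LINE`, `copy_with_prefetch` and `lines_ahead` are names made up for this example, the 64-byte line size is the e5500 value discussed in this thread, and which instruction the compiler actually emits for the builtin with a given -mcpu setting should be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size: 64 bytes for the e5500, per the discussion above. */
#define CACHE_LINE 64u

/* Copy `lines` whole cache lines from src to dst, issuing one read hint
 * and one write hint per line, `lines_ahead` lines in front of the copy. */
void copy_with_prefetch(uint64_t *dst, const uint64_t *src,
                        size_t lines, size_t lines_ahead)
{
    const size_t words_per_line = CACHE_LINE / sizeof(uint64_t);

    assert(dst != NULL && src != NULL);

    for (size_t l = 0; l < lines; l++) {
        if (l + lines_ahead < lines) {
            /* 2nd argument: 0 = read hint, 1 = write hint.
             * 3rd argument: 0 = low temporal locality (streaming data). */
            __builtin_prefetch(src + (l + lines_ahead) * words_per_line, 0, 0);
            __builtin_prefetch(dst + (l + lines_ahead) * words_per_line, 1, 0);
        }
        /* Plain 64-bit copy of the current line. */
        for (size_t w = 0; w < words_per_line; w++)
            dst[l * words_per_line + w] = src[l * words_per_line + w];
    }
}
```

The hints never change the result of the copy, only (possibly) its speed, so the same loop can be timed with different `lines_ahead` values or with the prefetch calls removed.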

According to the docs, the -mcpu switch enables a combination of the following options, so AltiVec is probably also an issue:
Quote:

The -mcpu options automatically enable or disable the following options:

-maltivec -mfprnd -mhard-float -mmfcrf -mmultiple
-mpopcntb -mpopcntd -mpowerpc64
-mpowerpc-gpopt -mpowerpc-gfxopt
-mmulhw -mdlmzb -mmfpgpr -mvsx
-mcrypto -mhtm -mpower8-fusion -mpower8-vector
-mquad-memory -mquad-memory-atomic -mfloat128
-mfloat128-hardware -mprefixed -mpcrel -mmma

I will try to play with the options later this week.

EDIT: I think that I understand now why GCC doesn't support dcbtl/dcbal for the e5500. By default, L1CSR0[DCBZ32] is cleared. This means that dcbz and dcba operate on the full 64-byte cache line. So by default dcbz/dcba is executed as dcbzl/dcbal.
I will check whether L1CSR0[DCBZ32] is still cleared at runtime on AmigaOS. If it isn't, and if the e5500 core allows changing it at runtime, it would be even easier to simply clear L1CSR0[DCBZ32]. That actually makes sense: both generic and e5500-specific code would get the same benefit. But I do wonder if there is a catch, because the 32-byte limitation option looks pretty redundant to me, and they don't waste silicon without good reason.

Edited by geennaam on 2023/2/22 12:26:17
Edited by geennaam on 2023/2/22 12:31:15
Edited by geennaam on 2023/2/22 12:32:18

joerg

Re: x5000 benchmarks / speed up

Posted on: 2023/2/22 16:19 #65

Home away from home

@geennaam
Quote:

I will check whether L1CSR0[DCBZ32] is still cleared at runtime on AmigaOS. If it isn't, and if the e5500 core allows changing it at runtime, it would be even easier to simply clear L1CSR0[DCBZ32].

Only if you'd run your code inside IExec->Disable()/Enable(), or at least IExec->Forbid()/Permit(). If the 32-byte bit is set in AmigaOS 4.1, it means the kernel functions (IUtility->SetMem(), IExec->CopyMemQuick(), etc.) haven't been updated to 64-byte cache lines yet but still use the old 32-byte functions, and they won't work if you change the bit.

Quote:

But I do wonder if there is a catch, because the 32-byte limitation option looks pretty redundant to me, and they don't waste silicon without good reason.

For dcbt(st) there should be no difference, except that old 32-byte code issues 2 dcbt instructions for each cache line. But for dcb[az] it's a required workaround to be able to use old code which assumes a 32-byte cache line:
If dcbz clears 64 bytes, executing such old code may clear 32 bytes too much, and a 64-byte dcba may result in 32 bytes of random data.
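The "32 bytes too much" case can be shown with a small software model. This is not real cache code: `model_dcbz`, `old_clear` and `bytes_clobbered` are hypothetical helpers that only imitate, on an ordinary byte array, what dcbz does to the line containing an address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Software model of dcbz: zero the whole cache line containing `offset`.
 * `line_size` must be a power of two (32 or 64 in this example). */
void model_dcbz(unsigned char *mem, size_t offset, size_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    size_t line_start = offset & ~(line_size - 1);
    memset(mem + line_start, 0, line_size);
}

/* Old-style clear loop written for 32-byte lines: one dcbz per 32 bytes.
 * `start` and `len` are multiples of 32; `hw_line_size` is what the
 * (modelled) hardware really clears per dcbz. */
void old_clear(unsigned char *mem, size_t start, size_t len,
               size_t hw_line_size)
{
    for (size_t off = start; off < start + len; off += 32)
        model_dcbz(mem, off, hw_line_size);
}

/* Count bytes outside [start, start + len) that no longer hold `fill`. */
size_t bytes_clobbered(const unsigned char *mem, size_t mem_len,
                       size_t start, size_t len, unsigned char fill)
{
    size_t count = 0;
    for (size_t i = 0; i < mem_len; i++)
        if ((i < start || i >= start + len) && mem[i] != fill)
            count++;
    return count;
}
```

With a 128-byte buffer filled with 0xAA and a 32-byte region starting at offset 32, `old_clear` with `hw_line_size` 32 leaves everything outside the region untouched, while `hw_line_size` 64 also zeroes offsets 0 to 31, which the old code never meant to touch.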

joerg

Re: x5000 benchmarks / speed up

Posted on: 2023/2/23 18:38 #66

Home away from home

@geennaam
Quote:

But again not recognized by gcc.

Maybe using vbcc/vasm instead works? It doesn't have an e5500 option, but seems to support more PowerPC CPUs than gcc/gas does.

geennaam

Re: x5000 benchmarks / speed up

Posted on: 2024/1/6 21:57 #67

Quite a regular

Here's a small memory test tool which compares cache and ddr bandwidth for both normal transfers and cache hint "optimised" transfers. For now only dcba and dcbz are used. These instructions are supported by at least the e5500 and ppc440 cores.

The implementation is very basic and there's room for optimisation.

https://www.file-upload.org/60ox9m08d7ri

I will also try to implement dcbt and dcbtst in a future update.
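For readers who want to experiment themselves, the basic shape of such a bandwidth measurement can be sketched as below. This is not the code of the tool linked above; `mb_per_second` and `time_copy` are made-up names, 1 MB is taken as 1024 * 1024 bytes, and `clock()` gives only coarse CPU-time resolution, so a real test needs enough passes to run for a measurable time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and a duration into MB/s (1 MB = 1024 * 1024 bytes). */
double mb_per_second(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / (1024.0 * 1024.0) / seconds;
}

/* Copy `len` bytes `passes` times and return the CPU time used, in seconds.
 * A buffer that fits in L1 or L2 measures cache bandwidth, a buffer much
 * larger than L2 measures DDR bandwidth. */
double time_copy(void *dst, const void *src, size_t len, unsigned passes)
{
    clock_t start = clock();
    for (unsigned i = 0; i < passes; i++) {
        memcpy(dst, src, len);
        /* GCC-style compiler barrier so the repeated copies are not merged. */
        __asm__ volatile("" ::: "memory");
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The result of `time_copy` together with `len * passes` can then be fed into `mb_per_second`, after checking that the measured time is greater than zero.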

MigthyMax

Re: x5000 benchmarks / speed up

Posted on: 2024/1/8 5:33 #68

Just popping in

@geennaam

Quote:

DCBA is actually still supported by the e5500 (see the e5500 RM). But, depending on a bit in the L1 cache control register,
it works on half or full cache lines. The new e5500 opcode dcbal always works on full cache lines.
But again, it is not recognized by gcc.

If the current gas does not support it, try the binutils v2.40 betas. At least there is something about 'DCBA' in the code, so maybe they work for you.

https://kas1e.mikendezign.com/aos4/bin ... _binutils_2.40_beta01.zip

amiganuts

Re: x5000 benchmarks / speed up

Posted on: 2024/1/10 20:17 #69

Just popping in

I know I am late to the party here, and I know this is a long shot, but after reading this post: I only have one 2GB Kingston module in my X5040. So I purchased a Kingston Fury Beast 2x8GB DDR3-1866 dual rank kit (KF318C10BBK2/16) from Amazon and should get it next week. Is there a chance that fast bank writing with dual memory would help with some of these weird Grim Reaper errors?

I have also noticed how bad my numbers are in RageMem:

RAM
                X5000 1995    My X5040    Difference
Read 32/64      726/1366      403/734     -44.49% / -46.27%
Write 32/64     1551/1552     652/651     -57.96% / -58.05%
Write           2339          2397        +2.48%
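The percentages in the last column are plain relative differences against the reference column. A minimal sketch of that arithmetic, with `percent_change` being a name chosen for this example only:

```c
#include <assert.h>

/* Relative difference of `measured` against `reference`, in percent.
 * Negative means slower than the reference. */
double percent_change(double reference, double measured)
{
    assert(reference != 0.0);
    return (measured - reference) / reference * 100.0;
}
```

For example, `percent_change(726.0, 403.0)` evaluates to roughly -44.49, matching the Read 32 row.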

zerec

Re: x5000 benchmarks / speed up

Posted on: 2024/1/10 21:16 #70

Just popping in

RageMem 0.37 Results of my AmigaOne X5000/40 2200MHz

2x 4GB Corsair CL11
Max MIPS: 4399

Cache (Mb/s)
Cache L1 (kb): 32
Read32/64: 8253 / 16480
Write32/64: 8254 / 16484

Cache L2 (kb): 512
Read32/64: 4692 / 8495
Write32/64: 5517 / 9689

RAM (Mb/s)
Read32/64: 615 / 1099
Write32/64: 1095 / 1101
Writing: 2434

Video Bus (Mb/s)
Reading: 38
Writing: 542

Edited by zerec on 2024/1/10 21:52:52

OS4 Betatester

SAM460EX @ 1,10GHz, Tabor A1222 @ 2x 1,20GHz
X1000 @ 2x 1,80 GHz, X5000/40 4x 2,20 GHz

amiganuts

Re: x5000 benchmarks / speed up

Posted on: 2024/1/12 20:29 #71

Just popping in

@amiganuts

I got my Kingston Fury RAM. U-Boot recognizes two banks of 8 GB, but AmigaOS only recognizes 2 GB. Which limit variable in U-Boot do I change to set this?

geennaam

Re: x5000 benchmarks / speed up

Posted on: 2024/1/12 20:42 #72

Quite a regular

@amiganuts

AmigaOS, every variant, can only address 2GB. So this is normal. There's extended memory by means of bank switching. But that is hardly used.

The reason to use such large modules is that they are dual rank memory. This gives some bandwidth benefit over single rank.

amiganuts

Re: x5000 benchmarks / speed up

Posted on: 2024/1/12 21:04 #73

Just popping in

Well, my numbers came up a little bit:

2x 8GB Kingston Fury
Max MIPS: 4397

Cache (Mb/s)
Cache L1 (kb): 32
Read32/64: 8183 / 16343
Write32/64: 8185 / 16348

Cache L2 (kb): 512
Read32/64: 4659 / 8436
Write32/64: 5497 / 9619

RAM (Mb/s)
Read32/64: 624 / 1087
Write32/64: 1485 / 1488
Writing: 2456

Video Bus (Mb/s)
Reading: 18
Writing: 535

So RAM Read went from -44.49/-46.27% to -14.05/-20.42%
RAM Write: -57.96/-58.05% to -4.26/-4.12%
Write: +2.48% to +5.00%

My video bus really sucks compared to the X5000 1995. I have an RX560 and the 3.7 drivers.
Read: -64.81%
Write: -0.93%

Maybe more tweaking of U-Boot?

joerg

Re: x5000 benchmarks / speed up

Posted on: 2024/1/12 21:11 #74

Home away from home

@geennaam
Quote:

AmigaOS, every variant, can only address 2GB. So this is normal. There's extended memory by means of bank switching. But that is hardly used.

AmigaOS 4.x can address 4 GB. The single, ancient exec function which used to limit the address space to 2 GB in AmigaOS 3.x has not been used for more than 20 years and was replaced by a function supporting the complete 4 GB.
However, in AmigaOS 4.x the upper 2 GB of the 32-bit, 4 GB (virtual) address space are reserved for PCI(e), U-Boot, etc., so normal applications, which don't support ExtMem, are left with only 2 GB in AmigaOS 4.x as well.
