@balaton
Quote:
I've sent a patch which improves dcbz a bit but it only removes a small overhead that was an easy fix but most of the problem still remains.
DCBZ should only be used in the G2 and G3 parts of AmigaOS as replacement for DCBA. 60x CPUs don't support DCBA and on 750 CPUs it's an "optional instruction" according to 
https://www.nxp.com/docs/en/reference-manual/MPC750UM.pdf
But IIRC at least one of 750FX or 750GX supports DCBA.
DCBA has the advantage that it's a no-op on cache-inhibited, write-through and unmapped memory while DCBZ causes an exception in those cases.
DCBZ is emulated in the kernel exception handler, for example in case someone uses it on cache-inhibited VRAM, but of course that's extremely slow.
G4 CPUs support DCBA, and so do 405-based CPUs according to the PPC405 Core User's Manual.
On G4 CPUs neither DCBA nor DCBZ should be used; the better vector data stream prefetch instructions should be used instead.
Or simply nothing (slower than using the prefetch instructions, but faster than using DCBA or DCBZ), as 2 consecutive, cache-line aligned 128-bit vector stores don't read the cache-line first but just store the 32 bytes.
Allocating a cache-line to skip the read/modify/write cache-line cycle is only required for smaller stores, for example four 64-bit double or eight 32-bit integer stores in the main loop of a memcpy().
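To illustrate the small-store case, here is a minimal sketch in C. It is not the actual newlib.library or kernel code. The names copy_lines(), alloc_line() and CACHE_LINE are made up for this example, and it assumes 32-byte cache lines, a cache-line aligned destination in normal cacheable RAM, and non-overlapping buffers. On anything other than 32-bit PowerPC the allocation step compiles to nothing, so the loop is just a plain copy there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines, as on 60x/G3/G4 class CPUs. */
#define CACHE_LINE 32u

/* Hypothetical helper: allocate (and zero) the destination cache line
 * without reading it from memory first. Only emitted on 32-bit PowerPC;
 * elsewhere it is a no-op so the sketch still builds and runs. */
static inline void alloc_line(void *p)
{
#if defined(__powerpc__) && !defined(__powerpc64__)
    __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
#else
    (void)p;
#endif
}

/* Copy `lines` cache lines using eight 32-bit stores per line.
 * dst must be cache-line aligned and must not overlap src. */
void copy_lines(uint32_t *dst, const uint32_t *src, size_t lines)
{
    assert(((uintptr_t)dst % CACHE_LINE) == 0);

    while (lines--) {
        alloc_line(dst);  /* skip the read part of read/modify/write */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        dst += 8;
        src += 8;
    }
}
```

Since all 32 bytes of each line are overwritten right after the allocation, the zeroing done by DCBZ is never visible. On cache-inhibited memory such as VRAM the DCBZ in this sketch would trap into the exception handler mentioned above, so it's only meant for cacheable RAM.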
My newlib.library memcpy() implementations, on which at least some of the optimized memory copy functions in AmigaOS were based, definitely used DCBA on CPUs supporting it and DCBZ as replacement only on the CPUs not supporting it, for example the 60x CPUs in classic Amigas.
A possible reason to use DCBZ instead of DCBA or the vector stream prefetch instructions is to have common code for all CPUs, instead of different code for each of the supported CPUs as is the case in the AmigaOS kernel and used to be in newlib.library.
That may be the reason Hans is using it in his VRAM copy functions.