The newlib memory copy functions are optimised for the host hardware as best they can be. You might squeeze a bit more, but they are well tried and tested.
Simon
If you're copying within RAM, then the exec.library's mem copy function should already be optimal for whatever machine you're on. Consequently, memcpy() should be too.
If you're transferring to/from VRAM, then WritePixelArray()/ReadPixelArray() are the best option (will even use DMA on platforms where DMA routines are available).
If you really want to embed it in your code, then things get complicated. On AltiVec machines, using AltiVec is best (in a cache-aligned manner). Using doubles is optimal on most others, except for the e500 core (Tabor/A1222), which has a non-standard FPU. Then there are cache instructions that can help boost performance, but which one you should use depends on the CPU (e.g., dcbz on 32-bit CPUs, and dcbzl on 64-bit CPUs).
> 64bit memcpy using doubles
> The newlib memory copy functions
I will try both and time them.
Do you know what the theoretical "write to VRAM" speed is?
> WritePixelArray [...] will even use DMA
Interesting. Could it be used to copy ANY memory area from RAM to a location in VRAM, if the destination pointer is wrapped in a RastPort/bitmap with the same pixel format?
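For reference, the "copy using doubles" idea discussed above could be sketched roughly like this. This is only an illustration: copy_doubles is a hypothetical name (not a newlib or exec.library function), it assumes a double is 8 bytes and that both pointers are 8-byte aligned (otherwise it simply falls back to memcpy()), and it leaves out AltiVec and the dcbz/dcbzl cache hints entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes using 64-bit double loads/stores.
 * Assumption: sizeof(double) == 8 and the target's FPU moves doubles
 * without altering their bit pattern. */
void *copy_doubles(void *dst, const void *src, size_t n)
{
    /* Misaligned pointers: let the library routine handle it. */
    if ((((uintptr_t)dst | (uintptr_t)src) & 7u) != 0)
        return memcpy(dst, src, n);

    double *d = dst;
    const double *s = src;
    size_t blocks = n / sizeof(double);

    /* Main loop: one 64-bit load and one 64-bit store per iteration. */
    for (size_t i = 0; i < blocks; i++)
        d[i] = s[i];

    /* Copy the remaining 0..7 tail bytes one at a time. */
    unsigned char *db = (unsigned char *)(d + blocks);
    const unsigned char *sb = (const unsigned char *)(s + blocks);
    for (size_t i = 0; i < n % sizeof(double); i++)
        db[i] = sb[i];

    return dst;
}
```

Whether this beats the library memcpy() on a given board is something to time rather than assume.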
Did you notice low performance on memcpy or CopyMem? Does it concern a specific board? If so, which one?
In the past, I ran benchmarks on memory operations, and hand-written functions are rarely optimal (or they are better on some models but worse on others).
Quote:
Do you know what the theoretical "write to VRAM" speed is?
GfxBench2D will give you that speed for multiple methods (incl. WritePixelArray()).
Quote:
Could it be used to copy ANY memory area from RAM to a location in VRAM, if the destination pointer is wrapped in a RastPort/bitmap with the same pixel format?
Not sure exactly what you're asking. WritePixelArray() writes to a rastport. If both the source and destination are bitmaps, then use BltBitMap(). You should make the one in RAM "userprivate," with the same bytes-per-row as the one in VRAM. See CompositeYUVBlitStream example here.
@corto
My concern was to write to a VBO in GPU VRAM in the most efficient way.
> Did you notice low performance
Not especially: I was just hoping there was a "most efficient memcpy in PPC asm", so that I could rule out that part of my program as a cause of any slowness.
In fact, the C library's memcpy() already does that, so I have my answer.
@Hans
I meant something like this:
Src is in RAM
Dst is a VBO in GPU VRAM
If I set up a (fake) rastport/bitmap that points to Dst, can WritePixelArray copy the data with DMA?
I mean, if the pixel formats are the same in Src and Dst (array=RGBA, bitmap=RGBA), WritePixelArray should copy transparently with no changes to the data, no?
Quote:
I meant something like this: Src is in RAM, Dst is a VBO in GPU VRAM. If I set up a (fake) rastport/bitmap that points to Dst, can WritePixelArray copy the data with DMA?
I doubt it, because you're actually copying to a shadow buffer in RAM. The driver then converts between big and little endian while copying the data into VRAM.
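To illustrate what a big/little endian conversion during a copy involves, here is a minimal sketch. It is not the driver's actual code: swap32 and copy_swap32 are hypothetical names, and the 32-bit element size is an assumption (the real conversion depends on the data format).

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* Hypothetical helper: copy `count` 32-bit words from src to dst,
 * swapping the byte order of each word on the way. */
void copy_swap32(uint32_t *dst, const uint32_t *src, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = swap32(src[i]);
}
```

A copy like this touches every word, which is why it cannot be treated as a plain block transfer.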
I'm not sure WritePixelArray() would use DMA for RAM=>RAM copies. Added to that, it's a pretty nasty hack that'll stop working the moment the graphics library's bitmap structure changes. Best treat the bitmap internals like a black box...
Using memcpy() should be fine. Theoretically that could be DMA accelerated too, but I'm not sure if that's done on any of our platforms.
Quote:
I mean, if the pixel formats are the same in Src and Dst (array=RGBA, bitmap=RGBA), WritePixelArray should copy transparently with no changes to the data, no?
Yes, WritePixelArray() does a direct byte-for-byte copy when the formats match.
@thellier Using asm is not always (and is in fact rarely) the solution. Before focusing on optimization, it's better to measure performance and then identify the problems.
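A measurement can be as simple as the following sketch, which times any memcpy()-compatible function with the standard clock() call. The names copy_fn and time_copy are made up for this example, and it assumes clock() has enough resolution on your system, so use a large repetition count.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any function with memcpy()'s signature can be plugged in here. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical helper: run `fn` `reps` times over the same buffers
 * and return the elapsed CPU time in seconds. */
double time_copy(copy_fn fn, void *dst, const void *src,
                 size_t n, unsigned reps)
{
    clock_t start = clock();

    /* Calling through a function pointer makes it less likely that
     * the compiler optimises the repeated copies away. */
    for (unsigned i = 0; i < reps; i++)
        fn(dst, src, n);

    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For example, time_copy(memcpy, dst, src, size, 1000) gives the time for 1000 copies; dividing size * 1000 by that time gives bytes per second, which can then be compared between memcpy() and a hand-written routine on the same machine.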
Quote:
Using memcpy() should be fine. Theoretically that could be DMA accelerated too, but I'm not sure if that's done on any of our platforms.
Both CopyMem() and CopyMemQuick() use DMA on Sam440 and Sam460 for larger memory copies, and since memcpy() is just a wrapper for CopyMem(), it does so too.