Re: x5000 benchmarks / speed up
Quite a regular

@mufa

Your values are really low. It looks to me like this is caused by the combination of high latencies and single-rank memory. But it also looks like your memory controller is running at only 666 MHz. I can't believe it's the P5040's fault; after all, it needs to feed 4 CPU cores instead of 2, hence the higher memory controller clock.

We need a CPU-Z-like program to check the SPD and memory controller settings.

I'll create one if I can find the right motivation to do so.



Re: x5000 benchmarks / speed up
Not too shy to talk

@mufa

Is that a screenshot of a P5040? (at 2 GHz)

The documentation states that a P5040 runs at 2.2 GHz:

X5000/020 with P5020 dual-core CPU running at a clock speed of 2 GHz
X5000/040 with P5040 Rev C quad-core CPU running at a clock speed of 2.2 GHz


AmigaOne X5000 -> 2GHz / 16GB RAM / Radeon RX 550 / ATI X1950 / M-Audio 5.1 -> AmigaOS 4.1 FE / Linux / MorphOS
Amiga 1200 -> Recapped / 68ec020 ACA 1221ec / CF HDD / RetroNET connected to the NET
Vampire V4SE TrioBoot
RPI4 AmiKit XE
Re: x5000 benchmarks / speed up
Not too shy to talk

@Skateman

Sysmon's Benchmark tab does not show the processor type, although as you can see, my Amiga's CPU is more than 10% faster than the X5020's and has more MIPS.

Well, so that there is no doubt about which computer I tested on, here is a different Sysmon tab where you can see the CPU type.

[screenshot: Sysmon tab showing the CPU type]



Re: x5000 benchmarks / speed up
Home away from home

@geennaam
Quote:

Your values are really low. It looks to me like this is caused by the combination of high latencies and single-rank memory. But it also looks like your memory controller is running at only 666 MHz. I can't believe it's the P5040's fault; after all, it needs to feed 4 CPU cores instead of 2, hence the higher memory controller clock.


There is some problem with the X5000/040 in terms of bandwidth (and it happens not only on Mufa's hardware but on all the others). We found the same issue with Hans some time ago (he has an X5000/040): every FPS number we compared, across all the games, was always worse for him, even though we have the same cards/components/etc.

Some games give 30% slower FPS results on the 040, and I haven't heard of anyone on the team reporting those issues or finding the root cause.

Even the GART support in the graphics drivers shows _SLOWER_ speed when comparing the 020 and the 040 (the 020 is faster). That's not with Ragemem but with our internal tests: say the 020 gives us 550 MB/s, then the 040 gives something like 350 MB/s.

As I don't have an 040, I can't test it all properly, find the root cause and make a proper bug report; that's work for the 040 beta testers. But the fact is that there are issues somewhere, and it's unknown for now in which component: hardware, or software (kernel and such).

My guess is that it could be something about the U-Boot default values, or the default kernel initialisation, or something of that sort (I can't believe it's the hardware, as it was done by Varisys as well). But again, nobody has made a proper report about it, and yet there is most definitely some issue somewhere on the 040.

Join us to improve dopus5!
AmigaOS4 on youtube
Re: x5000 benchmarks / speed up
Not too shy to talk

@mufa

I see.... thanks for the screenshot.


AmigaOne X5000 -> 2GHz / 16GB RAM / Radeon RX 550 / ATI X1950 / M-Audio 5.1 -> AmigaOS 4.1 FE / Linux / MorphOS
Amiga 1200 -> Recapped / 68ec020 ACA 1221ec / CF HDD / RetroNET connected to the NET
Vampire V4SE TrioBoot
RPI4 AmiKit XE
Re: x5000 benchmarks / speed up
Quite a regular

I am still curious what is going on with the DDR3 performance of the X5000.

I wrote a little benchmark program in plain C to test the three cache levels and, beyond them, the DDR3 memory. I wasn't able to build an e5500-optimised binary because gcc in the SDK doesn't support the -mcpu=e5500 target, so it's a generic PPC build.

Contrary to rumours on certain forums, the 2 MByte CPC is enabled and configured as an L3 copy-back cache.

Each test transfers 2 GByte of data, so the number of passes for each test is 2 GByte/blocksize.

The result is the average over the passes.
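
The core of each test is just a plain loop over the block, repeated until 2 GByte have been moved. A minimal sketch of what such a test looks like (not the actual Memspeed source; the clock()-based timing and the loop shape are stand-ins):

#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Sketch of one write test: stream 32-bit stores over a block and
 * repeat until 2 GByte have been transferred in total. */
static double write32_test(uint32_t *block, size_t block_bytes)
{
    const uint64_t total  = 2ULL << 30;            /* 2 GByte per test   */
    const size_t   words  = block_bytes / sizeof(uint32_t);
    const uint64_t passes = total / block_bytes;

    clock_t start = clock();
    for (uint64_t p = 0; p < passes; p++)
        for (size_t i = 0; i < words; i++)
            block[i] = (uint32_t)i;                /* plain C store      */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    return ((double)total / (1024.0 * 1024.0)) / secs;  /* average MB/s  */
}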

4.Work:Development/projects/memspeed> memspeed
Memspeed V0.1:

Write 32bit integer:
--------------------------------------
Block size      1 KB:   7296.37 MB/s
Block size      2 KB:   7303.30 MB/s
Block size      4 KB:   7496.34 MB/s
Block size      8 KB:   7529.59 MB/s
Block size     16 KB:   7546.76 MB/s
Block size     32 KB:   7509.58 MB/s
Block size     64 KB:   5283.98 MB/s
Block size    128 KB:   5327.65 MB/s
Block size    256 KB:   5323.72 MB/s
Block size    512 KB:   5189.57 MB/s
Block size   1024 KB:   4718.91 MB/s
Block size   2048 KB:   4398.85 MB/s
Block size   4096 KB:   1994.95 MB/s
Block size   8192 KB:   1764.40 MB/s
Block size  16384 KB:   1715.94 MB/s
Block size  32768 KB:   1720.12 MB/s
Block size  65536 KB:   1721.06 MB/s
Block size 131072 KB:   1710.87 MB/s
Block size 262144 KB:   1720.05 MB/s

Write 64bit integer:
--------------------------------------
Block size      1 KB:   7625.49 MB/s
Block size      2 KB:   7488.31 MB/s
Block size      4 KB:   7697.63 MB/s
Block size      8 KB:   7735.36 MB/s
Block size     16 KB:   7756.57 MB/s
Block size     32 KB:   7702.04 MB/s
Block size     64 KB:   5143.28 MB/s
Block size    128 KB:   5255.21 MB/s
Block size    256 KB:   5255.51 MB/s
Block size    512 KB:   5180.61 MB/s
Block size   1024 KB:   4774.23 MB/s
Block size   2048 KB:   4499.13 MB/s
Block size   4096 KB:   1993.73 MB/s
Block size   8192 KB:   1750.30 MB/s
Block size  16384 KB:   1716.59 MB/s
Block size  32768 KB:   1712.13 MB/s
Block size  65536 KB:   1718.73 MB/s
Block size 131072 KB:   1706.55 MB/s
Block size 262144 KB:   1712.28 MB/s

Read 32bit integer:
--------------------------------------
Block size      1 KB:   7360.47 MB/s
Block size      2 KB:   7432.67 MB/s
Block size      4 KB:   7592.63 MB/s
Block size      8 KB:   7628.53 MB/s
Block size     16 KB:   7628.96 MB/s
Block size     32 KB:   7520.78 MB/s
Block size     64 KB:   4413.41 MB/s
Block size    128 KB:   4486.52 MB/s
Block size    256 KB:   4455.85 MB/s
Block size    512 KB:   4442.14 MB/s
Block size   1024 KB:   1752.00 MB/s
Block size   2048 KB:   1504.30 MB/s
Block size   4096 KB:    778.60 MB/s
Block size   8192 KB:    717.67 MB/s
Block size  16384 KB:    713.41 MB/s
Block size  32768 KB:    714.39 MB/s
Block size  65536 KB:    711.96 MB/s
Block size 131072 KB:    713.14 MB/s
Block size 262144 KB:    712.18 MB/s

Read 64bit integer:
--------------------------------------
Block size      1 KB:   7660.88 MB/s
Block size      2 KB:   7648.96 MB/s
Block size      4 KB:   7724.62 MB/s
Block size      8 KB:   7766.72 MB/s
Block size     16 KB:   7650.40 MB/s
Block size     32 KB:   7694.40 MB/s
Block size     64 KB:   4497.88 MB/s
Block size    128 KB:   4420.66 MB/s
Block size    256 KB:   4498.84 MB/s
Block size    512 KB:   4457.12 MB/s
Block size   1024 KB:   1761.04 MB/s
Block size   2048 KB:   1525.33 MB/s
Block size   4096 KB:    776.64 MB/s
Block size   8192 KB:    717.97 MB/s
Block size  16384 KB:    713.57 MB/s
Block size  32768 KB:    712.62 MB/s
Block size  65536 KB:    714.85 MB/s
Block size 131072 KB:    713.39 MB/s
Block size 262144 KB:    713.53 MB/s

Write 32bit float:
--------------------------------------
Block size      1 KB:   7344.76 MB/s
Block size      2 KB:   7486.04 MB/s
Block size      4 KB:   7468.08 MB/s
Block size      8 KB:   7579.15 MB/s
Block size     16 KB:   7591.77 MB/s
Block size     32 KB:   7560.02 MB/s
Block size     64 KB:   5238.74 MB/s
Block size    128 KB:   5209.86 MB/s
Block size    256 KB:   5281.30 MB/s
Block size    512 KB:   5258.00 MB/s
Block size   1024 KB:   4843.57 MB/s
Block size   2048 KB:   4593.66 MB/s
Block size   4096 KB:   1985.41 MB/s
Block size   8192 KB:   1748.74 MB/s
Block size  16384 KB:   1708.76 MB/s
Block size  32768 KB:   1709.08 MB/s
Block size  65536 KB:   1698.84 MB/s
Block size 131072 KB:   1708.93 MB/s
Block size 262144 KB:   1706.82 MB/s

Write 64bit float:
--------------------------------------
Block size      1 KB:  14671.21 MB/s
Block size      2 KB:  14723.52 MB/s
Block size      4 KB:  14958.48 MB/s
Block size      8 KB:  15108.87 MB/s
Block size     16 KB:  15149.16 MB/s
Block size     32 KB:  15088.17 MB/s
Block size     64 KB:   9300.81 MB/s
Block size    128 KB:   8960.95 MB/s
Block size    256 KB:   9294.13 MB/s
Block size    512 KB:   9258.15 MB/s
Block size   1024 KB:   5371.73 MB/s
Block size   2048 KB:   4845.98 MB/s
Block size   4096 KB:   2003.67 MB/s
Block size   8192 KB:   1766.85 MB/s
Block size  16384 KB:   1707.37 MB/s
Block size  32768 KB:   1716.26 MB/s
Block size  65536 KB:   1719.92 MB/s
Block size 131072 KB:   1714.04 MB/s
Block size 262144 KB:   1708.27 MB/s

Read 32bit float:
--------------------------------------
Block size      1 KB:   7228.52 MB/s
Block size      2 KB:   7508.05 MB/s
Block size      4 KB:   7585.09 MB/s
Block size      8 KB:   7623.84 MB/s
Block size     16 KB:   7555.20 MB/s
Block size     32 KB:   7559.27 MB/s
Block size     64 KB:   4440.56 MB/s
Block size    128 KB:   4383.84 MB/s
Block size    256 KB:   4407.89 MB/s
Block size    512 KB:   4404.87 MB/s
Block size   1024 KB:   1756.79 MB/s
Block size   2048 KB:   1517.62 MB/s
Block size   4096 KB:    777.15 MB/s
Block size   8192 KB:    715.84 MB/s
Block size  16384 KB:    712.32 MB/s
Block size  32768 KB:    713.52 MB/s
Block size  65536 KB:    712.54 MB/s
Block size 131072 KB:    712.34 MB/s
Block size 262144 KB:    712.24 MB/s

Read 64bit float:
--------------------------------------
Block size      1 KB:  14763.19 MB/s
Block size      2 KB:  14706.17 MB/s
Block size      4 KB:  15021.15 MB/s
Block size      8 KB:  14814.83 MB/s
Block size     16 KB:  15228.22 MB/s
Block size     32 KB:  14990.95 MB/s
Block size     64 KB:   8336.21 MB/s
Block size    128 KB:   8342.40 MB/s
Block size    256 KB:   8334.50 MB/s
Block size    512 KB:   8047.62 MB/s
Block size   1024 KB:   3170.09 MB/s
Block size   2048 KB:   2831.46 MB/s
Block size   4096 KB:   1354.34 MB/s
Block size   8192 KB:   1301.80 MB/s
Block size  16384 KB:   1286.87 MB/s
Block size  32768 KB:   1293.76 MB/s
Block size  65536 KB:   1291.12 MB/s
Block size 131072 KB:   1291.57 MB/s
Block size 262144 KB:   1289.57 MB/s


You can clearly distinguish each cache level, all the way down to the DDR memory.

One thing I noticed is that the 64-bit integer read/write speed is on the same level as the 32-bit one. This is different for floats, where the L1 and L2 cache speed nearly doubles for 64-bit floats. I don't know whether this is a limitation of the e5500 core or of the fact that I built a generic PPC binary. The 64-bit float performance can also explain the write performance of the X1000: if ragemem is optimised for AltiVec, or can otherwise make use of a 128-bit load/store unit, that effectively doubles the theoretical memory bandwidth compared to the X5000.

Another observation worth mentioning: as soon as a write hits the L3 cache or the DDR3 memory, the speed drops to the same level for all four write scenarios (int32, int64, fp32, fp64). Only the 64-bit float read benefits from the wider access.

The e5500 cores, the L3 cache and the DDR3 controllers are all connected to the CoreNet Coherency Fabric (CCF). This internal bus runs at 800 MHz and is advertised to support a sustained read performance of 128 bytes per clock cycle, i.e. a sustained read bandwidth of about 100 GByte/s (800 MHz x 128 bytes = 102.4 GByte/s). Furthermore, the reference manual claims that this is a "low latency" datapath to the L3 cache. The L3 cache is likewise advertised to support a sustained read bandwidth of ~100 GByte/s.
So if the bandwidth of the L3 cache and the CoreNet Coherency Fabric is not the issue, the only viable options left are latency, or another bottleneck between the e5500 and the CCF.
The reference manual doesn't say how the DDR3 controller is connected to the CCF, but I imagine it is a 128-bit interface. The large difference in speed between the L3 cache and the DDR3 controller can be explained by the latency (wait states) introduced by crossing clock domains (800 MHz vs 666 MHz), and of course by the latencies of the DDR3 memory itself.

The following topic on the NXP forum suggests that it is indeed latency that is killing the performance of my test loops.

Quote:
We never considered the L2 cache a core accelerator in the P2020. When enabled, it adds latency to the core's transactions because of the time required for cache-hit checking. If there is no hit, the transaction is sent to the coherent bus, for example to DDR. The L2 cache can only help speed up the core's operations if the data being read is absent from the L1 cache but valid in the L2. This depends on the code and may or may not happen.
The main features of the L2 cache are:
- it allows I/O masters to access the cache (a feature called 'stashing'), in which case the core reads data from the L2 cache instead of from DDR;
- it allows data to be shared between two cores.


So the QorIQ L2 cache is optimised for I/O handling and inter-core communication rather than for raw processing speed, which sounds logical considering that this processor was designed as a network processor.

This also explains why the bandwidth of my simple copy loop drops considerably as we go down the cache hierarchy.

I repeated my memory test with the L3 cache configured as a FIFO, and even with it completely disabled, but I could hardly notice a difference in DDR3 performance.

So the next test will be to see what happens when I disable the L2 cache, or to test the DDR3 performance with a DMA transfer.

Re: x5000 benchmarks / speed up
Just can't stay away

@geennaam

I never tried to implement optimised read/write/copy code on my X5000, I only did it for some of the other systems supported by AmigaOS4, but from your results it's quite obvious that you aren't using DCBT (for reads) and DCBA (for writes; if DCBA isn't supported by the 5020, use DCBZ instead).
On none of the PPC CPUs supported by AmigaOS4 do you get anywhere near the maximum speeds without using either the DCBxy instructions or, where supported, the AltiVec streaming instructions.
Important: the cache line size of the 5020 is, or may be depending on some CPU configuration bits, different from that of most other PPC CPUs supported by AmigaOS4.
AFAIK the OS4 kernel implements correct (or at least much better optimised) 5020 code only in the beta versions of the kernel, not the public ones.

Very simple test for much faster writes: use a DCBZ loop (or IUtility->SetMem(), BZero(), newlib memset(), bzero(), etc.). Of course that way you won't get the results most software will get, but they will be much closer to the hardware limits.
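
For what it's worth, such a DCBZ loop could look roughly like this; a sketch only, assuming a 64-byte cache line and a line-aligned buffer whose size is a multiple of the line (both required for dcbz to be safe here):

#include <stddef.h>

/* Zero a buffer one cache line at a time: dcbz establishes each line
 * in the data cache as zeroes without reading it from memory first. */
static void dcbz_fill(void *buf, size_t bytes)
{
    const size_t line = 64;          /* assumed cache line size */
    char *p = (char *)buf;
    for (size_t i = 0; i < bytes; i += line)
        __asm__ volatile ("dcbz 0,%0" : : "r" (p + i) : "memory");
}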

Probably not relevant for the X5000 at all, but for example with the 603e and 604 CPUs (most likely not a problem of the CPUs themselves but of the rest of the BlizzardPPC/CyberstormPPC hardware), using the caches in write-through instead of copy-back mode resulted in overall faster system performance.


Edited by joerg on 2022/2/18 17:03:31
Re: x5000 benchmarks / speed up
Just popping in

@geennaam

Any chance I could get a copy of your benchmark program? I'm curious what the X1000 numbers would be.

Re: x5000 benchmarks / speed up
Just can't stay away

Replaced memory on my X5040:

Quote:

Detected UDIMM KF1600C10D3/8G
Detected UDIMM KF1600C10D3/8G


Ragemem before:

Quote:

READ32: 598 MB/Sec
READ64: 1000 MB/Sec
WRITE32: 812 MB/Sec
WRITE64: 812 MB/Sec
WRITE: 2431 MB/Sec (Tricky)


Ragemem after:

Quote:

READ32: 666 MB/Sec
READ64: 1224 MB/Sec
WRITE32: 1593 MB/Sec
WRITE64: 1599 MB/Sec
WRITE: 2471 MB/Sec (Tricky)


Shaderjoy C++ single thread compilation before: ~126 seconds, after: ~117 seconds.

Re: x5000 benchmarks / speed up
Home away from home

@Capehill

I wouldn't have thought it would be so slow???

Here are my X1000 results:
Quote:

---> RAM <---
READ32: 2805 MB/Sec
READ64: 4049 MB/Sec
WRITE32: 2350 MB/Sec
WRITE64: 2266 MB/Sec
WRITE: 334 MB/Sec (Tricky)

Apart from the last one it wipes the floor with your X5040...huh?

People are dying.
Entire ecosystems are collapsing.
We are in the beginning of a mass extinction.
And all you can talk about is money and fairytales of eternal economic growth.
How dare you!
– Greta Thunberg
Re: x5000 benchmarks / speed up
Home away from home

@Raziel

Read64/Write64: how is that implemented?
Have you looked at which assembler opcodes it uses?
It can be faster to read/write as doubles.

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.
Re: x5000 benchmarks / speed up
Home away from home

@LiveForIt

No idea.

I just let RageMem (v0.37) run from the shell (without options), and that is what it spat out.

People are dying.
Entire ecosystems are collapsing.
We are in the beginning of a mass extinction.
And all you can talk about is money and fairytales of eternal economic growth.
How dare you!
– Greta Thunberg
Re: x5000 benchmarks / speed up
Just popping in

@Raziel

Quote:
Apart from the last one it wipes the floor with your X5040...huh?

As noted earlier in this thread (post 36), the X1000 seems to be much faster than the X5000 at accessing memory, at least according to RageMem. Geennaam speculated (post 37) why this might be. And the author of RageMem has noted that it was designed for earlier, lower-spec NG Amigas, and may not be that accurate for X1/X5-class machines.

Re: x5000 benchmarks / speed up
Home away from home

@All
Yeah, it looks like ragemem can be a little bit off there when comparing the memory-access speed of the x5000 and the x1000. I mean, it clearly shows the x1000 as better, but real-world tests (at least the 3D tests) show that things are faster on the x5000.

Join us to improve dopus5!
AmigaOS4 on youtube
Re: x5000 benchmarks / speed up
Not too shy to talk

X5000
Kingston Fury KF318C10BBK2/8
2x4GB DDR3 1866MHz

RAGEMEM v0.37 compiled 11/06/2010

CPU: Freescale P5020 (E5500 core) rev 1.2, 1995 MHz
Cache sizes: L1: 32 KB, L2: 512 KB, L3: none
Cache line: 64 bytes

---> CPU <---
MAX MIPS: 3988

---> L1 <---
READ32:  7533 MB/Sec
READ64:  15044 MB/Sec
WRITE32: 7535 MB/Sec
WRITE64: 15048 MB/Sec

---> L2 <---
READ32:  4286 MB/Sec
READ64:  7722 MB/Sec
WRITE32: 5020 MB/Sec
WRITE64: 8817 MB/Sec

---> RAM <---
READ32:  681 MB/Sec
READ64:  1217 MB/Sec
WRITE32: 1477 MB/Sec
WRITE64: 1482 MB/Sec
WRITE:   2313 MB/Sec (Tricky)

---> VIDEO BUS <---
READ:  23 MB/Sec
WRITE: 540 MB/Sec

Re: x5000 benchmarks / speed up
Quite a regular

@joerg

It's now clear to me that, unlike modern AMD/Intel CPUs, these NXP PowerPCs lack automatic hardware cache management (prefetching). Hence the slow transfer speeds with plain GCC-generated code.

I have implemented quick and dirty DCBT/DCBA support and I can already see a speedup to DDR3.

(My assumption is that the cache line size of the e5500 is 64 bytes.)
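
For illustration, the touched read loop is along these lines (a sketch under that 64-byte assumption, not the exact Memspeed code):

#include <stddef.h>
#include <stdint.h>

/* Read loop with a software touch: dcbt hints the next cache line
 * into the data cache before the loads reach it. */
static uint32_t read32_dcbt(const uint32_t *block, size_t bytes)
{
    const char *p = (const char *)block;
    uint32_t sum = 0;
    for (size_t i = 0; i < bytes; i += 64) {
        __asm__ volatile ("dcbt 0,%0" : : "r" (p + i + 64)); /* touch next line */
        for (size_t j = 0; j < 64 / sizeof(uint32_t); j++)
            sum += block[i / sizeof(uint32_t) + j];
    }
    return sum;   /* returned so the compiler cannot drop the loads */
}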

Memspeed V0.2:

Write 32bit integer:
--------------------------------------
Block size  16384 KB:  2785.91 MB/s  (+62%)

Write 64bit integer:
--------------------------------------
Block size  16384 KB:  2040.55 MB/s  (+19%)

Read 32bit integer:
--------------------------------------
Block size  16384 KB:  1587.62 MB/s  (+123%)

Read 64bit integer:
--------------------------------------
Block size  16384 KB:  1292.63 MB/s  (+80%)

Write 32bit float:
--------------------------------------
Block size  16384 KB:  2784.67 MB/s  (+63%)

Write 64bit float:
--------------------------------------
Block size  16384 KB:  2211.16 MB/s  (+30%)

Read 32bit float:
--------------------------------------
Block size  16384 KB:  1590.21 MB/s  (+123%)

Read 64bit float:
--------------------------------------
Block size  16384 KB:  1718.09 MB/s  (+33%)



Are bcopy(), memset() and memcpy() "inspired" by the Apple PowerPC assembly code? From what I've heard, that is lightning fast.


Edited by geennaam on 2023/2/20 23:35:07
Re: x5000 benchmarks / speed up
Quite a regular

@joerg

I am probably doing something wrong, but as soon as the buffers no longer fit into the CPU caches, the copy performance is very poor.

I tried bcopy(), memcpy() and IExec->CopyMemQuick().

All give more or less the same result:
Copy (CopyMemQuick):
--------------------------------------
Block size      1 KB:  3423.14 MB/s
Block size      2 KB:  3657.73 MB/s
Block size      4 KB:  3751.87 MB/s
Block size      8 KB:  3845.11 MB/s
Block size     16 KB:  3722.72 MB/s
Block size     32 KB:  2970.01 MB/s
Block size     64 KB:  2959.63 MB/s
Block size    128 KB:  2979.04 MB/s
Block size    256 KB:  2940.45 MB/s
Block size    512 KB:  1487.38 MB/s
Block size   1024 KB:  1313.59 MB/s
Block size   2048 KB:   541.94 MB/s
Block size   4096 KB:   501.63 MB/s
Block size   8192 KB:   496.15 MB/s
Block size  16384 KB:   496.39 MB/s


I am not doing any DCBT/DCBA here, because my assumption is that those functions are already optimised.
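
(For anyone comparing these at home, mind the argument order: bcopy() is source-first, memcpy() is destination-first, and CopyMemQuick() is source-first and expects longword-aligned buffers and a size that is a multiple of 4. A usage sketch, not the benchmark itself:)

#include <string.h>      /* memcpy()              */
#include <strings.h>     /* bcopy()               */
#include <proto/exec.h>  /* IExec->CopyMemQuick() */

void copy_three_ways(void *dst, const void *src, size_t bytes)
{
    bcopy(src, dst, bytes);                /* BSD order: source first    */
    memcpy(dst, src, bytes);               /* C order: destination first */
    IExec->CopyMemQuick(src, dst, bytes);  /* Exec: source, dest, size   */
}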

Re: x5000 benchmarks / speed up
Home away from home

@geennaam

For 64-bit CPUs such as the P50x0 you should use dcbzl. Dcbz may (or may not) zero only half a cache line, so that its behaviour stays consistent with older PowerPC CPUs; that forces it to fetch the remaining 32 bytes, which isn't what you want.

The dcba instruction may also be very slow, just like on the G5. It's an illegal instruction on the G5, and is emulated by doing nothing (but with the overhead of an illegal-instruction trap).

Hans


P.S. You can query the cache line size via the exec.library's GetCPUInfo() function (GCIT_CacheLineSize).
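
Something like the following, assuming each tag's ti_Data points to storage that receives the value (a sketch; check the autodocs for the exact semantics):

#include <proto/exec.h>

/* Query the CPU's cache line size at run time instead of hard-coding 64. */
uint32 get_cache_line_size(void)
{
    uint32 lineSize = 0;
    IExec->GetCPUInfoTags(GCIT_CacheLineSize, &lineSize, TAG_DONE);
    return lineSize;
}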

http://hdrlab.org.nz/ - Amiga OS 4 projects, programming articles and more.
https://keasigmadelta.com/ - more of my work
Re: x5000 benchmarks / speed up
Just can't stay away

@geennaam
Quote:
It's now clear to me that, unlike modern AMD/Intel CPUs, these NXP PowerPCs lack automatic hardware cache management (prefetching). Hence the slow transfer speeds with plain GCC-generated code.
GCC has options to add data cache instructions, for example -fprefetch-loop-arrays, but with GCC 2.95.x and 3.4.x I got very poor results with it, maybe even slower than without it.

Quote:
I have implemented quick and dirty DCBT/DCBA support and I can already see a speedup to DDR3.
For best speed, at least on the old CPUs for which I implemented the memcpy() etc. functions, you need to issue the DCB* not for the next read/write but 32 or 64 bytes in advance: for example, 1 or 2 DCBT/DCBA before the loop, and inside the loop at the current address + 32 or + 64 bytes.
On CPUs with a 64-byte cache line size try + 64 or + 128 bytes instead, or maybe even more; you'll have to test how many cache lines have to be touched in advance for best results.
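
In loop form the idea is roughly this (a sketch: the touch distance and the 64-byte line are exactly the things to tune, buffers must be line-aligned and line-multiple, and per Hans's note above dcbzl is the safer variant on the 64-bit cores):

#include <stddef.h>
#include <stdint.h>

/* Copy with cache touches ahead of the current position: dcbt pre-reads
 * a source line ahead of use, dcbz claims the destination line so no
 * read-before-write occurs. Assumes 64-byte, line-aligned buffers. */
static void copy_ahead(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes; i += 64) {
        __asm__ volatile ("dcbt 0,%0" : : "r" ((const char *)src + i + 128));
        __asm__ volatile ("dcbz 0,%0" : : "r" ((char *)dst + i) : "memory");
        for (size_t j = 0; j < 8; j++)         /* 8 x 8 bytes = one line */
            dst[i / 8 + j] = src[i / 8 + j];
    }
}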

Quote:
Are bcopy(), memset() and memcpy() "inspired" by the Apple PowerPC assembly code? From what I've heard, that is lightning fast.
No, I implemented them for newlib myself, using different code for different CPUs, but only for the 603/4, 750, 74xy and 440ep CPUs, and only using integer and/or double accesses.
For example, on CPUs with a 32-byte cache line size, using 2 cache-line-aligned 128-bit vector writes you need neither DCBA/DCBZ nor the vector streaming instructions; the cache-line reads before the writes simply aren't done. But using 4 64-bit writes (integer or double) instead is too slow, and there is always a cache-line read before the write if you don't use DCB* or the AltiVec streaming instructions.
Someone else implemented the AltiVec version, using the vector streaming instructions, for the 74xy CPUs.

AFAIK the code was later moved from newlib.library to Exec or the HAL, and newlib.library now only calls the Exec/Utility functions. For the 440 (and probably the 460 CPUs too) versions using DMA were added, but the DMA code is only used for very large copies because of the setup overhead.
Very likely optimised versions for the X1000, X50[24]0 and A1222 were implemented as well, but I don't know anything about such newer parts of newlib and/or the kernel.


Edited by joerg on 2023/2/21 7:50:08
Re: x5000 benchmarks / speed up
Amigans Defender

All Exec functions are optimised per CPU, and all the newlib code is in Exec, so you should use the Exec functions, since they are always optimised for the current machine. Maybe they aren't for the x5000, but only the kernel developers know.

i'm really tired...