Board index » All Posts (geennaam)




Re: Should I expect better video playback performance? (Southern Island/X1000 Updates)


This thread is hilarious!!

The faithful are still cheering for the home team from the grandstand in front of Hyperion HQ. And while you can hear the crickets singing inside Hyperion HQ and watch the tumbleweeds slowly roll across the parking lot, they are still booing and throwing cans of beer towards the grandstand of Aeon.
"How dare you progress to a point where it cannot run on top of a bare OS4 anymore", the angry mob shouts. "Just wait and watch the tumbleweeds pass by, like us!!!" Replacing software that is stuck in the 80s with 21st-century software for 21st-century hardware is the work of the devil!!! Yes, we want your commercial software; yes, we know it counts as a best seller when you reach 50 sales. But still we demand discounts!! Or even better, hand over your commercial software to Hyperion for free. We want your software, but we just don't want you. Because we only cheer for the home team.
And release the d#mn Tabor, because then we can be angry at you for not delivering AmigaOS4 for it like promised!! Oh, and we know that that was Hyperion's task, but we blame you anyway. Because that is what we do!! And no, we don't understand the reason for those replacement commands, because we don't want to. We just want to blame.

It took me half a year to realise that the death of AmigaOS4 is being caused by the enemy within. Those people, at nearly twice my age, using childish and pathetic words like MathyOS or MorphOS 2.0, are the final nail in AmigaOS4's coffin.

Re: x5000 benchmarks / speed up

I am still curious what is going on with the DDR3 performance of the X5000.

I wrote a little benchmark program in plain C that works its way through the three cache levels until it reaches the DDR3 memory. I wasn't able to build a specifically e5500-optimised binary because gcc in the SDK doesn't support the -mcpu=e5500 target, so it's a generic PPC build.

Contrary to rumours on certain forums, the 2 MByte CPC is enabled and configured as an L3 copy-back cache.

Each test transfers 2 GByte of data. The number of passes for each test is 2 GByte / block size.

The result is the average time of the passes.
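For illustration, here is a minimal sketch of the kind of loop such a test runs. This is not the actual memspeed source: it assumes a POSIX timer (clock_gettime), only covers the 32bit integer write case, and uses volatile simply to keep the compiler from optimising the stores away.

/*
 * Minimal sketch of a cache/memory write-bandwidth loop, assuming a
 * POSIX timer; the real tool on AmigaOS 4 presumably uses a native
 * timer instead. Each block size is written repeatedly until 2 GByte
 * have been transferred, and the average throughput is printed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define TOTAL_BYTES (2ull * 1024 * 1024 * 1024)   /* 2 GByte per test */

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

int main(void)
{
    /* 1 KByte (L1 resident) up to 256 MByte (DDR3 resident) */
    for (size_t kb = 1; kb <= 262144; kb *= 2) {
        size_t block = kb * 1024;
        size_t words = block / sizeof(uint32_t);

        /* volatile keeps the compiler from eliding the stores */
        volatile uint32_t *buf = malloc(block);
        if (buf == NULL)
            return 1;

        size_t passes = (size_t)(TOTAL_BYTES / block);
        double t0 = now_seconds();
        for (size_t p = 0; p < passes; p++)
            for (size_t i = 0; i < words; i++)
                buf[i] = (uint32_t)i;              /* 32bit integer write */
        double dt = now_seconds() - t0;

        printf("Block size %9zu Kb  %10.2f MB/s\n",
               kb, (TOTAL_BYTES / (1024.0 * 1024.0)) / dt);
        free((void *)buf);
    }
    return 0;
}

The actual output of my run is below.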

4.Work:Development/projects/memspeed> memspeed
Memspeed V0.1:

Write 32bit integer:
--------------------------------------
Block size      1 Kb    7296.37 MB/s
Block size      2 Kb    7303.30 MB/s
Block size      4 Kb    7496.34 MB/s
Block size      8 Kb    7529.59 MB/s
Block size     16 Kb    7546.76 MB/s
Block size     32 Kb    7509.58 MB/s
Block size     64 Kb    5283.98 MB/s
Block size    128 Kb    5327.65 MB/s
Block size    256 Kb    5323.72 MB/s
Block size    512 Kb    5189.57 MB/s
Block size   1024 Kb    4718.91 MB/s
Block size   2048 Kb    4398.85 MB/s
Block size   4096 Kb    1994.95 MB/s
Block size   8192 Kb    1764.40 MB/s
Block size  16384 Kb    1715.94 MB/s
Block size  32768 Kb    1720.12 MB/s
Block size  65536 Kb    1721.06 MB/s
Block size 131072 Kb    1710.87 MB/s
Block size 262144 Kb    1720.05 MB/s

Write 64bit integer:
--------------------------------------
Block size      1 Kb    7625.49 MB/s
Block size      2 Kb    7488.31 MB/s
Block size      4 Kb    7697.63 MB/s
Block size      8 Kb    7735.36 MB/s
Block size     16 Kb    7756.57 MB/s
Block size     32 Kb    7702.04 MB/s
Block size     64 Kb    5143.28 MB/s
Block size    128 Kb    5255.21 MB/s
Block size    256 Kb    5255.51 MB/s
Block size    512 Kb    5180.61 MB/s
Block size   1024 Kb    4774.23 MB/s
Block size   2048 Kb    4499.13 MB/s
Block size   4096 Kb    1993.73 MB/s
Block size   8192 Kb    1750.30 MB/s
Block size  16384 Kb    1716.59 MB/s
Block size  32768 Kb    1712.13 MB/s
Block size  65536 Kb    1718.73 MB/s
Block size 131072 Kb    1706.55 MB/s
Block size 262144 Kb    1712.28 MB/s

Read 32bit integer:
--------------------------------------
Block size      1 Kb    7360.47 MB/s
Block size      2 Kb    7432.67 MB/s
Block size      4 Kb    7592.63 MB/s
Block size      8 Kb    7628.53 MB/s
Block size     16 Kb    7628.96 MB/s
Block size     32 Kb    7520.78 MB/s
Block size     64 Kb    4413.41 MB/s
Block size    128 Kb    4486.52 MB/s
Block size    256 Kb    4455.85 MB/s
Block size    512 Kb    4442.14 MB/s
Block size   1024 Kb    1752.00 MB/s
Block size   2048 Kb    1504.30 MB/s
Block size   4096 Kb     778.60 MB/s
Block size   8192 Kb     717.67 MB/s
Block size  16384 Kb     713.41 MB/s
Block size  32768 Kb     714.39 MB/s
Block size  65536 Kb     711.96 MB/s
Block size 131072 Kb     713.14 MB/s
Block size 262144 Kb     712.18 MB/s

Read 64bit integer:
--------------------------------------
Block size      1 Kb    7660.88 MB/s
Block size      2 Kb    7648.96 MB/s
Block size      4 Kb    7724.62 MB/s
Block size      8 Kb    7766.72 MB/s
Block size     16 Kb    7650.40 MB/s
Block size     32 Kb    7694.40 MB/s
Block size     64 Kb    4497.88 MB/s
Block size    128 Kb    4420.66 MB/s
Block size    256 Kb    4498.84 MB/s
Block size    512 Kb    4457.12 MB/s
Block size   1024 Kb    1761.04 MB/s
Block size   2048 Kb    1525.33 MB/s
Block size   4096 Kb     776.64 MB/s
Block size   8192 Kb     717.97 MB/s
Block size  16384 Kb     713.57 MB/s
Block size  32768 Kb     712.62 MB/s
Block size  65536 Kb     714.85 MB/s
Block size 131072 Kb     713.39 MB/s
Block size 262144 Kb     713.53 MB/s

Write 32bit float:
--------------------------------------
Block size      1 Kb    7344.76 MB/s
Block size      2 Kb    7486.04 MB/s
Block size      4 Kb    7468.08 MB/s
Block size      8 Kb    7579.15 MB/s
Block size     16 Kb    7591.77 MB/s
Block size     32 Kb    7560.02 MB/s
Block size     64 Kb    5238.74 MB/s
Block size    128 Kb    5209.86 MB/s
Block size    256 Kb    5281.30 MB/s
Block size    512 Kb    5258.00 MB/s
Block size   1024 Kb    4843.57 MB/s
Block size   2048 Kb    4593.66 MB/s
Block size   4096 Kb    1985.41 MB/s
Block size   8192 Kb    1748.74 MB/s
Block size  16384 Kb    1708.76 MB/s
Block size  32768 Kb    1709.08 MB/s
Block size  65536 Kb    1698.84 MB/s
Block size 131072 Kb    1708.93 MB/s
Block size 262144 Kb    1706.82 MB/s

Write 64bit float:
--------------------------------------
Block size      1 Kb   14671.21 MB/s
Block size      2 Kb   14723.52 MB/s
Block size      4 Kb   14958.48 MB/s
Block size      8 Kb   15108.87 MB/s
Block size     16 Kb   15149.16 MB/s
Block size     32 Kb   15088.17 MB/s
Block size     64 Kb    9300.81 MB/s
Block size    128 Kb    8960.95 MB/s
Block size    256 Kb    9294.13 MB/s
Block size    512 Kb    9258.15 MB/s
Block size   1024 Kb    5371.73 MB/s
Block size   2048 Kb    4845.98 MB/s
Block size   4096 Kb    2003.67 MB/s
Block size   8192 Kb    1766.85 MB/s
Block size  16384 Kb    1707.37 MB/s
Block size  32768 Kb    1716.26 MB/s
Block size  65536 Kb    1719.92 MB/s
Block size 131072 Kb    1714.04 MB/s
Block size 262144 Kb    1708.27 MB/s

Read 32bit float:
--------------------------------------
Block size      1 Kb    7228.52 MB/s
Block size      2 Kb    7508.05 MB/s
Block size      4 Kb    7585.09 MB/s
Block size      8 Kb    7623.84 MB/s
Block size     16 Kb    7555.20 MB/s
Block size     32 Kb    7559.27 MB/s
Block size     64 Kb    4440.56 MB/s
Block size    128 Kb    4383.84 MB/s
Block size    256 Kb    4407.89 MB/s
Block size    512 Kb    4404.87 MB/s
Block size   1024 Kb    1756.79 MB/s
Block size   2048 Kb    1517.62 MB/s
Block size   4096 Kb     777.15 MB/s
Block size   8192 Kb     715.84 MB/s
Block size  16384 Kb     712.32 MB/s
Block size  32768 Kb     713.52 MB/s
Block size  65536 Kb     712.54 MB/s
Block size 131072 Kb     712.34 MB/s
Block size 262144 Kb     712.24 MB/s

Read 64bit float:
--------------------------------------
Block size      1 Kb   14763.19 MB/s
Block size      2 Kb   14706.17 MB/s
Block size      4 Kb   15021.15 MB/s
Block size      8 Kb   14814.83 MB/s
Block size     16 Kb   15228.22 MB/s
Block size     32 Kb   14990.95 MB/s
Block size     64 Kb    8336.21 MB/s
Block size    128 Kb    8342.40 MB/s
Block size    256 Kb    8334.50 MB/s
Block size    512 Kb    8047.62 MB/s
Block size   1024 Kb    3170.09 MB/s
Block size   2048 Kb    2831.46 MB/s
Block size   4096 Kb    1354.34 MB/s
Block size   8192 Kb    1301.80 MB/s
Block size  16384 Kb    1286.87 MB/s
Block size  32768 Kb    1293.76 MB/s
Block size  65536 Kb    1291.12 MB/s
Block size 131072 Kb    1291.57 MB/s
Block size 262144 Kb    1289.57 MB/s


You can clearly distinguish between each cache level up to the DDR memory.

One thing I noticed is that the 64bit integer read/write speed is on the same level as 32bit integers. This is different for floats, where the L1 and L2 cache speed nearly doubles for 64bit floats. I do not know whether this is a limitation of the e5500 core or a consequence of having built a generic PPC binary. The 64bit float performance can also explain the write performance of the X1000: if ragemem is optimised for AltiVec or can otherwise make use of a 128bit load/store unit, this effectively doubles the theoretical memory bandwidth compared to the X5000.

Another observation worth mentioning is that as soon as a write hits the L3 cache or the DDR3 memory, the speed drops to the same level for all four write scenarios (int32,int64,fp32,fp64). Only the 64bit float read can benefit from the 64bit access.

The e5500 cores, the L3 cache, and the DDR3 controllers are all connected to the CoreNet Coherency Fabric (CCF). This internal bus runs at 800MHz and is advertised to sustain a read throughput of 128 bytes per clock cycle, which amounts to a sustained read bandwidth of ~100GByte/s. Furthermore, the reference manual claims that this is a "low latency" datapath to the L3 cache. The L3 cache is also advertised to support a sustained read bandwidth of ~100GByte/s.
So if the bandwidth of the L3 cache and the CoreNet Coherency Fabric is not the issue, then the only viable options left are latency or another bottleneck between the e5500 and the CoreNet Coherency Fabric.
The reference manual doesn't mention how the DDR3 controller is connected to the CCF, but I can imagine that it is a 128bit interface. The large difference in speed between the L3 cache and the DDR3 controller can be explained by the latency (wait states) introduced by crossing clock domains (800MHz vs 666MHz), and of course by the latencies of the DDR3 memory itself.
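The latency suspicion could be checked separately with a pointer-chase style test rather than a streaming loop. Here is a minimal sketch (not part of memspeed; the buffer size, stride and POSIX timer are assumptions, and a serious version would randomise the chain so the hardware prefetcher cannot hide the DDR3 latency):

/*
 * Hedged sketch of a latency test: a chain of dependent pointer loads,
 * so each load has to wait for the previous one and the result
 * reflects load-to-use latency rather than streaming bandwidth.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t bytes  = 64u * 1024 * 1024;   /* well beyond the 2MB L3 */
    const size_t stride = 64;                  /* one cache line */
    const size_t count  = bytes / stride;

    char *buf = malloc(bytes);
    if (buf == NULL)
        return 1;

    /* Element i points to element i+1 (wrapping around at the end). */
    for (size_t i = 0; i < count; i++)
        *(void **)(buf + i * stride) = buf + ((i + 1) % count) * stride;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    const size_t loads = 20u * 1000 * 1000;
    void **p = (void **)buf;
    for (size_t i = 0; i < loads; i++)
        p = (void **)*p;                       /* dependent load chain */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);

    /* print p so the chase cannot be optimised away */
    printf("average latency: %.1f ns (last ptr %p)\n", ns / loads, (void *)p);
    free(buf);
    return 0;
}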

The following topic on the NXP forum suggests that it's the latency that is killing performance in my test loops.

Quote:
We nowhere considered the L2 cache as a core's accelerator in the P2020. When enabled, it inserts additional latency to core's transactions due to time required for cache hit checking. If no hit, the transaction is sent to coherent bus, for example to DDR. L2 cache can help to speed up core's operations only if the data being read is absent in the L1 cache but valid in L2 one. This depends on the code and may or may not happen.
Main features of the L2 cache are as follows:
- allows access to the cache for the I/O masters (feature called 'stashing'), in this case the core reads data from the L2 cache instead of DDR;
- allows to share data between two cores.


So the QorIQ L2 cache is optimised for I/O handling and inter-core communication rather than raw processing speed, which sounds logical considering that this processor is designed to be a network processor.

This also explains why the bandwidth of my simple copy loop drops considerably as we go down the cache hierarchy.

I repeated my memory test with the L3 cache configured as a FIFO, and even with it completely disabled, but I could hardly notice a difference in DDR3 performance.

So the next test would be to see what happens when I disable the L2 cache, or to test DDR3 performance with a DMA transfer.

Re: x5000 benchmarks / speed up

@mufa

Your values are really low. It looks to me like this is caused by the combination of high latencies and single-rank memory. But it also looks like your memory controller is running at 666MHz too. I cannot believe that it's the P5040's fault; after all, it needs to feed 4 CPU cores instead of 2, hence the increase in memory controller clock speed.

We need a CPU-Z like program to check the SPD and memory controller settings.

I'll create one if I can find the right motivation to do so.



Re: x5000 benchmarks / speed up

@msteed

First of all, the ragemem results are bogus for both the X1000 and the X5000. As a rule of thumb for the maximum throughput, you can use 80% of the maximum bus bandwidth:
X1000: 0.8 x 1600 x 8 = 10240 MByte/s (20480 MByte/s with controller interleaving)
X5000: 0.8 x 1333 x 8 = 8531 MByte/s (17062 MByte/s with controller interleaving)
So the ragemem results are only about 10% of the expected throughput.
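For reference, the same rule-of-thumb numbers as a quick C check (the 0.8 factor and the 8-byte bus width are taken straight from the calculation above; nothing else is assumed):

/* Rule of thumb from the post: 80% of (MT/s x 8 bytes), doubled when
 * controller interleaving is active. */
#include <stdio.h>

int main(void)
{
    const char  *name[] = { "X1000 (DDR3-1600)", "X5000 (DDR3-1333)" };
    const double mts[]  = { 1600.0, 1333.0 };

    for (int i = 0; i < 2; i++) {
        double mb_s = 0.8 * mts[i] * 8.0;      /* single controller */
        printf("%s: %5.0f MByte/s (%5.0f MByte/s interleaved)\n",
               name[i], mb_s, 2.0 * mb_s);
    }
    return 0;
}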

I don't have enough knowledge of the X1000, X5000 and ragemem to pinpoint where the difference comes from. But I can think of (a combination of) several reasons that can explain the difference.

1) The memory is simply faster. 1600MT/s versus 1333MT/s can explain 20% at the same timings. But CL5-5-5 timing (at 1600MT/s) is noticeably faster than the standard CL9-9-9 for DDR3, so there is another 10%-15%.

2) The X5000 internal bus seems to run at 800MHz while the memory runs at 666MHz, so there are wait states involved when crossing clock domains, and the same holds to/from the cache. The memory controller might contain FIFOs to speed up throughput with DMA transfers, but under software control this would only benefit write access. This might explain why writes are faster than reads. (I don't know anything about the bus and cache speeds of the X1000.)

3) Ragemem will most likely not use DMA, so it is probably a software loop. Maybe the X1000 has a lower branch delay and therefore the memory controller gets its data faster.

4) DDR3 has a fixed burst length of 8 words; DDR2 has a programmable burst length of 4 or 8. If the DDR2 burst length is programmed to 4 and the software loop can only fill one word per burst, then fewer cycles are wasted until the next memory access. (This looks like the main reason why we only get about 10% of the expected throughput in ragemem.)

5) The X1000 memory controller is simply more efficient.


It would be interesting to see the results of the lucky few with an X5000/40. The P5040 memory controller is also capable of running at 1600MT/s.

Re: x5000 benchmarks / speed up

@kas1e

You are right, it's 2x8GB = 16GB. I've corrected it.

Re: x5000 benchmarks / speed up

@kas1e

I've dug a bit deeper into the subject and it all makes sense to me now. So here is what I know so far. (But you can also skip directly to the end for the performance values.)

The technology:
The first implementation of dual-channel memory on PCs merged two 64bit DDR3 buses into one 128bit bus (which was my understanding of this technology until today). This doubles the bandwidth of the memory bus. The timing of both DIMMs had to be closely matched for the 128bit bus to work properly. But as it turned out, most consumer applications saw little benefit from a 128bit DDR3 memory bus. So the 128bit bus idea was dropped and replaced with a scheme that divides (interleaves) the memory accesses over the two channels (e.g. the even accesses on channel 1 and the odd accesses on channel 2). Since both channels operate concurrently, and an internal access has to wait for the (by comparison) slow DDR3 interface anyway, the effective memory bandwidth is doubled again.
The difference with the former 128bit-wide dual-channel bus is that the bus remains 64bit for consumer applications, but at twice the perceived speed. As a result, a lot of those applications saw a considerable increase in performance. Another advantage is that the DIMMs operate independently from each other, so the timing doesn't have to be closely matched anymore; only the memory layout needs to be the same. It is still preferable that both DIMMs have similar timings, but this is not required. The slowest timing determines the timing for both DIMMs.

Now back to the X5000. The P5020 supports two tricks to increase the effective bandwidth of DDR3 memory accesses. The first trick is controller interleaving and the second one is rank interleaving.

Controller interleaving:
P5020 controller interleaving supports 4 modes (cache-line, page, bank and super-bank). The cache-line controller interleaving mode is basically the same as modern dual-channel technology on PCs: a DDR3 memory access the size of a cache line is divided (interleaved) over the two available memory controllers. This effectively doubles the memory bandwidth for an application on the X5000 too. The other modes are more like rank interleaving and are meant to reduce latencies, but cache-line interleaving provides by far the largest performance benefit.
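As a toy illustration of the cache-line mode (this is not the real P5020 address decode, which is defined in the reference manual; it just assumes 64-byte cache lines and lets the bit above the line offset pick the controller):

/*
 * Toy illustration only: consecutive cache lines alternate between the
 * two memory controllers, so a streaming access uses both in parallel.
 */
#include <stdio.h>
#include <stdint.h>

static unsigned controller_for(uint64_t phys)
{
    return (unsigned)((phys >> 6) & 1);   /* bit just above the 64-byte offset */
}

int main(void)
{
    for (uint64_t addr = 0; addr < 6 * 64; addr += 64)
        printf("cache line at 0x%03llx -> controller %u\n",
               (unsigned long long)addr, controller_for(addr));
    return 0;
}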

Rank (chip select) interleaving:
First a bit of background on the internals of DDR3 memory. DDR3 DRAM is organized in multiple two-dimensional arrays of rows and columns called banks. A row (also called a page) in a DDR3 DIMM bank has a total length of 8 KByte (for 8x 8bit DDR3 chips) and must be opened, by means of an activate command, before the individual bytes of that 8 KByte page can be accessed with column addressing. So the same page in the same bank of all eight 8bit DDR3 chips in a rank is opened at once. When the access (read/write) to the page columns is completed, the row/page must be closed with a precharge command before the next row/page can be opened with a new activate command.
Rank interleaving reorganizes the memory addressing in such a way that the next consecutive memory address is not on the next row in the same bank of the rank, but on the same row/page of the same bank on the next rank (chip select). One activate command opens the same row in the same bank across all chip selects. In the case of dual-rank memory with 2 chip selects and 8x 8bit chips per chip select, the effective row/page size doubles from 8 KByte to 16 KByte. The benefit of this addressing method is that it reduces the number of row/page activate and precharge commands, and therefore the average read/write latency. This results in a lower overall access time and thus a higher effective bandwidth. (Note that the DIMM is still 64bit: while the activate and precharge commands can be sent to both ranks simultaneously, only 64 bits of data can be accessed at a time.)
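To put a rough number on the saved commands, here is a purely illustrative calculation using the 8 KByte and 16 KByte page sizes from the paragraph above (the 1 MByte sweep size is an arbitrary example and assumes a perfectly sequential access pattern):

/*
 * Illustrative arithmetic only: with a single rank, every 8 KByte of a
 * sequential sweep needs its own activate/precharge pair, while rank
 * interleaving doubles the effective page to 16 KByte and roughly
 * halves the number of those commands.
 */
#include <stdio.h>

int main(void)
{
    const unsigned sweep_kb        = 1024;  /* sequential sweep, in KByte */
    const unsigned page_kb_single  = 8;     /* one rank: 8 KByte row/page */
    const unsigned page_kb_dualint = 16;    /* two interleaved ranks      */

    printf("single rank      : %u activate/precharge pairs\n",
           sweep_kb / page_kb_single);      /* 128 */
    printf("rank interleaved : %u activate/precharge pairs\n",
           sweep_kb / page_kb_dualint);     /* 64  */
    return 0;
}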

The impact of the memory controller and DIMM itself on overall memory performance of the X5000 can be ranked from high to low in the following order:
1. Cache line controller interleaving
2. Rank interleaving
3. Faster timing of the DIMM itself. (Lowering latencies by one or two clocks has less impact than omitting commands with latencies of >10 clocks.)
Normally, the clock speed of the DDR3 DIMM would matter as well for the true latency of a DIMM, but the X5000/20 is limited to just 666MHz/1333MT/s.

The DIMMs:
It turns out that the CPU-Z screenshots in my previous post only tell half of the story. In the early days of DDR3 DIMMs, DDR3 chips of double density were often more than twice as expensive as two DDR3 chips of half the density (for the same annual quantity of DIMMs). So it was more economical to fit 4GB DIMMs with two ranks of 2GB each. When the price of DDR3 silicon went down, the package price became dominant, and manufacturers started to fit their DIMMs with a single rank of 4GB because that was now the most economical configuration. Unfortunately, this often happened under the same SKU.
So the CPU-Z screenshots are not wrong; they simply no longer apply to kas1e's DIMM (and mine). I verified this by removing the heat spreader from my 4GB Corsair DIMM: it indeed contains 8x 8bit DDR3 chips (only one side is fitted), so a single rank. And that is why uboot produced the error message about rank interleaving: you cannot interleave ranks with just a single rank available on your DIMM. Skateman's DIMM is an 8GB DIMM with 8x 8bit DDR3 chips on both sides (16 chips in total), and this 128 bits in total is divided over two ranks. That's why rank interleaving works for Skateman.
Since the DIMMs of both kas1e and Skateman run on similar SPD JEDEC timings, we can already see that rank interleaving results in about 11% higher write speeds in ragemem.
The first post in this thread shows that controller interleaving results in about 65% higher write speeds in ragemem.
The effect of faster DIMM timing is a bit trickier to predict, because we currently cannot control the latencies like in a PC BIOS/UEFI. The uboot SPL simply takes the values from the SPD EEPROM on a DIMM (often the JEDEC-defined CL9-9-9-24 at 1333MT/s) and applies them to the memory controller. These so-called JEDEC timings are rather relaxed compared to the maximum capability of the DIMM. Fortunately, DIMMs like the Kingston Fury Beast 1866MT modules come with more optimized timing values in their SPD EEPROM (CL8-9-8-24 instead of the JEDEC CL9-9-9-24 at 1333MT/s). So the modules are not necessarily faster than comparable DIMMs from other manufacturers, but uboot will configure the P5020 memory controller with the faster timings from the SPD EEPROM.

Test result:
For the sake of Amiga science, I've bought both the Fury Beast single-rank (2x4GB) and dual-rank (2x8GB) kits.

Memory modules used in this test:
- Corsair Vengeance LP 2x4GB DDR3-1600 Single Rank (CML8GX3M2A1600C9) -> CL9-9-9-24 @ 1333MT/s
- Kingston Fury Beast 2x4GB DDR3-1866 Single Rank (KF318C10BRK2/8) -> CL8-9-8-24 @ 1333MT/s
- Kingston Fury Beast 2x8GB DDR3-1866 Dual Rank (KF318C10BBK2/16) -> CL8-9-8-24 @ 1333MT/s

Here are the ragemem results with their Uboot initialization output:

Corsair Vengeance LP:
DRAM:  Initializing....using SPD
Detected UDIMM CML8GX3M2A1600C9
Detected UDIMM CML8GX3M2A1600C9
Not enough bank(chip-select) for CS0+CS1 on controller 0, interleaving disabled!
Not enough bank(chip-select) for CS0+CS1 on controller 1, interleaving disabled!
6 GiB left unmapped
8 GiB (DDR3, 64-bit, CL=9, ECC off)
DDR Controller Interleaving Mode: cache line

Read32 :  665 MB/Sec
Read64 : 1195 MB/Sec
Write32: 1429 MB/Sec
Write64: 1433 MB/Sec


Kingston Fury Beast 2x4GB (Single rank):
DRAM:  Initializing....using SPD
Detected UDIMM KF1866C10D3/4G
Detected UDIMM KF1866C10D3/4G
Not enough bank(chip-select) for CS0+CS1 on controller 0, interleaving disabled!
Not enough bank(chip-select) for CS0+CS1 on controller 1, interleaving disabled!
6 GiB left unmapped
8 GiB (DDR3, 64-bit, CL=8, ECC off)
DDR Controller Interleaving Mode: cache line

Read32 :  682 MB/Sec
Read64 : 1225 MB/Sec
Write32: 1479 MB/Sec
Write64: 1483 MB/Sec


Kingston Fury Beast 2x8GB (Dual rank):
DRAM:  Initializing....using SPD
Detected UDIMM KF1866C10D3/8G
Detected UDIMM KF1866C10D3/8G
14 GiB left unmapped
16 GiB (DDR3, 64-bit, CL=8, ECC off)
DDR Controller Interleaving Mode: cache line
DDR Chip-Select Interleaving Mode: CS0+CS1

Read32 :  685 MB/Sec
Read64 : 1261 MB/Sec
Write32: 1638 MB/Sec
Write64: 1644 MB/Sec


--------:  CL9 SR (base) |  CL8 SR        |  CL8 DR
Read32  :   665          |   682 (+2.6%)  |   685 (+3.0%)
Read64  :  1195          |  1225 (+2.5%)  |  1261 (+5.5%)
Write32 :  1429          |  1479 (+3.5%)  |  1638 (+14.6%)
Write64 :  1433          |  1483 (+3.5%)  |  1644 (+14.7%)

Conclusion:
As predicted, the controller interleaving mode gives the biggest boost in performance (write: ~+65%; see post #1). Rank interleaving comes second (write: ~+10.9%) and improved timing (CL9 -> CL8; write: +3.5%) comes third. But I am sure that I could have pushed these modules further if our uboot allowed manual editing of the timings. At CL6 we could see a boost similar to that of rank interleaving alone.


Re: x5000 benchmarks / speed up

@kas1e

This is the fastest single-rank DDR3 kit that I can find: the Kingston KF318C10BBK2/8 (also available in red and blue).

So I'll order a set in the next few days and see if it makes a difference.

Re: x5000 benchmarks / speed up

@kas1e

You can always revive the DIMM offline with the right tools. The SPD is just an I2C EEPROM.

Re: x5000 benchmarks / speed up

@kas1e

You could change the register values of the DDR controllers directly, but changing DDR parameters on the fly is a bad idea and will most likely result in a system freeze.

I don't see anything useful in the CPLD section of that document, but I've found the I2C controller and SPD addresses.

At least we could dump the contents of the SPD on each DIMM (they must be equal). It would also be possible to change the JEDEC timings (if the SPD EEPROM is not write protected); the values would be applied at the next boot. But make one mistake and the machine will not boot anymore.


Re: x5000 benchmarks / speed up

@Skateman

You are as fast as your RAM.

Brand  : Crucial (DR)  | Kingston (SR)         |
READ32 :  665 MB/Sec   |  670 MB/Sec           | <+1%
READ64 : 1195 MB/Sec   | 1218 MB/Sec           | +1.9%
WRITE32: 1429 MB/Sec   | 1585 MB/Sec           | +10.9%
WRITE64: 1433 MB/Sec   | 1589 MB/Sec           | +10.9%
WRITE  : 2318 MB/Sec   | 2320 MB/Sec (Tricky)  | <+1%

Sigh, 11% is worth the effort. Now I have to buy new RAM.

Re: x5000 benchmarks / speed up

@kas1e


See here for information about the uboot memory controller options.

Uboot should also support some low-level DDR tweaking. It's supposedly enabled by creating a new environment variable called "ddr_interactive" and setting it to any value.

After reset, it should bring you to the fsl ddr debugger. The debugger allows you to change DDR parameters and print the contents of the SPD.

But unfortunately this doesn't work for me. The env variable is simply ignored on our uboot.

Re: x5000 benchmarks / speed up

@kas1e

I've checked your hwconfig env with mine and I have the same values.

"ctlr_intlv" sets the memory controller interleaving mode.
"bank_intlv" sets the chip-select interleaving mode within a single controller or DIMM.

According to pages 9 and 14 of NXP AN3939, both interleaving types can be activated at the same time.

The following uboot output shows that the chip-select interleaving mode is failing, while the controller interleaving mode is still set to cache-line:
DRAM:  Initializing....using SPD
Detected UDIMM CML8GX3M2A1600C9
Detected UDIMM CML8GX3M2A1600C9
Not enough bank(chip-select) for CS0+CS1 on controller 0, interleaving disabled!
Not enough bank(chip-select) for CS0+CS1 on controller 1, interleaving disabled!
6 GiB left unmapped
8 GiB (DDR3, 64-bit, CL=9, ECC off)
DDR Controller Interleaving Mode: cache line


I've tried to set different controller interleaving modes, but uboot always sets it to cache-line (Is it restricted or hard coded?)

The funny part is that bank interleaving for Skateman using CS0+CS1 succeeds but should not be available because his single rank memory uses only one CS for each controller.

So either uboot is doing something fishy, or the CPU-Z output is wrong and Skateman's DIMMs are dual rank after all (I've found a dual-rank "B1" version of Skateman's memory DIMMs). In that case the problem, if there is a problem at all, is still a mystery.

The NXP application note mentions that interleaving performance depends on the application anyway.

@skateman

Can you post the RAM performance output of ragemem? If your values are close to ours, then the topic can be closed.


Re: x5000 benchmarks / speed up

@kas1e

The P5020 has two memory controllers, and each features physical chip select lines CS0-CS3. So each controller offers the maximum number of CS lines for a single DIMM slot.

I'm not sure what "CS0+CS1" is supposed to mean in the context of uboot: physical chip selects, or simply the number of chip selects in use. But the uboot documentation suggests that it wants to interleave controllers in cache-line mode, not chip selects.

But if uboot tries to cache-line interleave physical CS0 and CS1 of memory controller 1, then the issue is that kas1e's Corsair memory is dual rank. Dual-rank memory needs two chip selects, and interleaving them in cache-line mode will not work because they are on the same controller.

The memory from Skateman is single rank. If uboot means to interleave the first chip select in use (on DIMM slot 1) and the second chip select in use (on DIMM slot 2), then cache-line interleaving makes sense.

Edit:
Found a document that explains that the four featured QorIQ DDR interleaving modes (cache-line, page, bank and super-bank) are indeed implemented between controllers, and not between individual banks on the same rank like on PCs. Therefore, the wiki page linked in the os4welt post doesn't directly apply to the X5000. So if uboot sees CS0 and CS1 on the same memory controller, interleaving apparently fails. There is an option to interleave between banks on each controller separately, but I doubt that this is implemented in our uboot.

This explains why Skateman's memory works with interleaving while kas1e's can't use it. It might be fixed in uboot by grouping CS0+CS2 and CS1+CS3 in the case of dual-rank memory. Alternatively, independent single-controller bank (chip select) interleaving could be implemented.

Edit2:

Some manufacturers show the rank configuration on the sticker on the DIMM itself: 1Rx8 is single rank, 2Rx8 is dual rank. Alternatively, you can search for the memory type number in combination with CPU-Z; if you are lucky, someone has posted a screenshot of the SPD tab of the module you intend to buy. Kingston has an overview and selector with the number of ranks on their site.

Corsair (kas1e): [image]

Kingston (Skateman): [image]

Re: X5000 - GfxBench2D score

@SinanSam460

Ok. Didn't know that. I thought that only the video decoding issues were solved for the RX550/560/570.

So maybe 2.8 isn't the latest driver either. Time to run the updater.

Re: X5000 - GfxBench2D score

@sailor


I think we can see where the noticeable differences in that score come from:
1) RadeonRX driver version 2.10. IIRC, the latest public release is 2.8.

2) It's a Gigabyte RX580 4 GB. Maybe this means that the RX580 is now supported for 3D and/or video acceleration as well?

3) When I compare the score with my own X5000/RX570 score, the difference seems mostly caused by improved memcopy performance: WritePixelArray Memory copy ~525MB/s -> ~1100MB/s ; ReadPixelArray ~25MB/s -> ~1000MB/s. Is this due to DMA/GART support?

The second-best score is with an RX570. Although it's a different RX570, it's better for comparing with my setup. The RadeonRX driver is 2.11 (I have 2.8) and pcigraphics.card is 54.1 (I have 53.18).

Somehow, the performance of everything except memcopy and the "random score" is worse. The former can be explained by a difference in GPU clock speed; the latter by the huge memcopy improvement, as memcopy is part of the random test.
The memcopy performance improvement is similar to that of the RX580 card above: WritePixelArray memory copy ~525MB/s -> ~1050MB/s; ReadPixelArray ~25MB/s -> ~950MB/s.
The bottom-line score improved from 8248 to 9171.



Re: Are 4k screens supported by any AmigaNG compatible gfx-card?

The current RX driver already supports 4k and uwxga.
My main monitor works like a charm at 3440x1440@61Hz. (Asus VG34VQL1B).
It also works with my LG C9 OLED TV at 3840x2160@30Hz. The 30Hz makes window movement sluggish, so it's not really usable. But it works.

Let's hope that the updated timing will allow for 4k@60Hz.

Re: Amiga X5k CPU Cooler

@Gregor

I just moved my X5000 to the Cooler Master SL600M Black Edition. There are two 200mm fans mounted in the bottom of the case and the air comes out at the top. The graphics card is mounted vertically to avoid blocking the airflow, so the mainboard receives the full airflow from the 200mm fans. But the reason for buying this new case was not cooling; it was to make the PCIe x4 slot available.

The fans are a bit noisy, so I will replace them with two Noctua 200mm fans tomorrow.

The box of the 60mm Noctua fan doesn't mention the type number of the LNA cable. It just mentions that the maximum rpm with the included LNA cable is reduced to 2300.

Re: Amiga X5k CPU Cooler

My X5000 came with a Gelid Silent 6 fan. Noise level is not an issue because the two fans on my GFX card are louder, but I changed it to the Noctua nevertheless (just because I can). Unfortunately, the Gelid fan was mounted with M2.5x20mm screws, which are too long for the Noctua fan, so I had to get four shorter M2.5 screws with washers to mount it.
So beware.
The new version that I have isn't shipped with a reducer because it now supports PWM, so I have it connected to the CPU fan connector.

According to the cputemp docky, the CPU temperature is stable at 49 degC at a CPU load of 100%. The same docky indicates that the fan is spinning at 2880 rpm at a PWM setting of 31%. Both cannot be true, because this fan can only do 3000 rpm (+/- 10%) at 100% PWM (12V DC).

But anyway, it works. And it runs about 4 degC cooler compared to the Gelid fan.

Edit: It actually came with a reducer after all. It's a so-called LNA cable (low noise adapter). After mounting this cable, the RPM value dropped to 1860 while the PWM stays at 31%. The LNA cable should reduce the RPM from 3000 to 2300; considering the 2880 reading without the LNA cable, I'd have expected a value around 2200 rpm. Not sure whether this is an issue with the cable or with the RPM reading.
The temperature is still at 49 degC. So either I've hit the limit of the heatsink, or the CPU manages to dump a lot of heat into the mainboard.


Re: Odyssey 1.23 progress: r5 RC2

@kas1e

Are those statically linked libraries plain C or C++?

Re: Odyssey 1.23 progress: r5 RC2

@kas1e

If I understand you correctly, you want to split the statically linked library into a regular Amiga shared library (.library) and a statically linked wrapper library.
The statically linked wrapper contains calls to the newly created Amiga library.
To make it a clean implementation, the statically linked wrapper must open and close this library each time the Amiga library function is executed, afaik.

In the case of the jpeg library, you could also create a wrapper for jpeg datatypes.
