My MiniGL experiments,recompilation,tips,etc...


Karlos

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/6 10:08 #41

Just popping in

@LiveForIt

> Is there any part that can be optimized by using AltiVec?

Not without introducing extra complexity and branches in the code to cater for machines that aren't AltiVec enabled. There are some functions that have AltiVec alternatives implemented, but the vast majority of the code does not.

> Are there any complex switches that can be replaced by table lookups?

On faster CPUs (covering multiple architectures), I have observed a trend that switch case is almost always faster than table lookups. The compiler is free to convert any switch case into one or more jump tables anyway.
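For anyone who wants to compare the two on their own machine, here is a minimal sketch of the same dispatch written both ways. The enum and function names are made up for illustration and are not MiniGL identifiers; which variant is faster is something to measure, not assume.

```c
/* Hypothetical primitive types; not taken from MiniGL. */
enum prim { PRIM_POINTS, PRIM_LINES, PRIM_TRIANGLES, PRIM_COUNT };

/* Variant 1: switch. The compiler may lower this to a jump table,
   a compare chain, or a plain data lookup, whichever it judges best. */
int verts_per_prim_switch(enum prim p)
{
    switch (p) {
    case PRIM_POINTS:    return 1;
    case PRIM_LINES:     return 2;
    case PRIM_TRIANGLES: return 3;
    default:             return 0;
    }
}

/* Variant 2: hand-written lookup table with an explicit bounds check. */
int verts_per_prim_table(enum prim p)
{
    static const int table[PRIM_COUNT] = { 1, 2, 3 };
    return ((unsigned)p < PRIM_COUNT) ? table[p] : 0;
}
```

Both functions return the same values, so they can be swapped in a benchmark without touching the callers.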

> Is there any loops that can be unrolled?
> (Other micro optimisations)

The compiler does this already.

> Are there any malloc() / free() calls that are made too often? Maybe there is a way to work around them.

I seem to recall there was some MGLPolygon allocation going on in the past, but I replaced all that ages ago.

> Data being unnecessarily copied as parameters when it could be global; parameter passing does generate extra store operations.

That's not really a suitable approach for shared libraries - writeable global data aren't thread safe. I had to fix some issues caused by that very problem in the past. All the common stuff is stored in one or more structures that are passed by reference.
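A minimal sketch of the pass-by-reference pattern, assuming a made-up context structure (the real MiniGL structures are different and much larger):

```c
/* Hypothetical per-context state; the real MiniGL structures differ. */
struct gl_context {
    float current_color[4];
    unsigned long vertices_submitted;
};

/* All mutable state is reached through the pointer, so two tasks that
   each own a context never write to the same memory. */
void ctx_set_color(struct gl_context *ctx, float r, float g, float b, float a)
{
    ctx->current_color[0] = r;
    ctx->current_color[1] = g;
    ctx->current_color[2] = b;
    ctx->current_color[3] = a;
}

void ctx_submit_vertices(struct gl_context *ctx, unsigned long count)
{
    ctx->vertices_submitted += count;
}
```

Only one pointer is passed per call, so the parameter overhead stays small while each opener of the library keeps its own state.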

The fat MGLVertex was a promising lead, but it seems that it's not necessarily the major limiting factor here (it should still be trimmed however).

AmigaBlitter

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/6 10:26 #42

Quite a regular

Many of you probably already know these, but...

https://www-01.ibm.com/chips/techlib/t ... PowerPC_440_Embedded_Core

http://www.ibm.com/developerworks/systems/library/es-plib1app.html

https://www.ibm.com/developerworks/library/l-ppc/#resources
https://www.ibm.com/developerworks/library/l-ppc/

http://www.powerdeveloper.org/forums/viewtopic.php?t=1426

http://www.codeproject.com/Articles/6 ... C-and-C-Code-Optimization

https://cs.brown.edu/courses/cs033/doc ... /c_optimization_notes.pdf

http://people.cs.clemson.edu/~dhouse/ ... s/405/papers/optimize.pdf

http://arxiv.org/ftp/arxiv/papers/1203/1203.0681.pdf

http://leto.net/docs/C-optimization.php

Retired

Radov

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/6 11:45 #43

Just popping in

Let me add my 2 cents ;)

The PPC 440 core is claimed to be able to track up to four load misses ("up to three outstanding line fills, up to four outstanding load misses"). Consider a vertex data structure of 96 bytes and a vertex processing loop like this (in pseudo-code):
dcbt (currentVertexStructure)
dcbt (currentVertexStructure+32)
dcbt (currentVertexStructure+64)
loop:
dcbt (nextVertexStructure)
dcbt (nextVertexStructure+32)
dcbt (nextVertexStructure+64)

(...currentVertexStructure processing...)
branch loop

Touching the data lines early should minimize the cache miss penalty.
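The same idea can be sketched in C with GCC's `__builtin_prefetch`, which GCC typically lowers to a data cache touch instruction on PowerPC. The 96-byte structure below is a stand-in and not the real MGLVertex layout, and the 32-byte line size is an assumption taken from the +32/+64 offsets in the pseudo-code.

```c
#include <stddef.h>

#define CACHE_LINE 32  /* assumed line size, matching the +32/+64 offsets */

/* Hypothetical 96-byte vertex: 24 floats. Not the real MGLVertex. */
struct vertex96 { float f[24]; };

/* Touch the three cache lines covered by one vertex. */
static void prefetch_vertex(const struct vertex96 *v)
{
    const char *p = (const char *)v;
    __builtin_prefetch(p);
    __builtin_prefetch(p + CACHE_LINE);
    __builtin_prefetch(p + 2 * CACHE_LINE);
}

/* Sums the first component of every vertex while touching the next
   vertex's cache lines ahead of use. */
float sum_first_component(const struct vertex96 *verts, size_t count)
{
    float sum = 0.0f;
    size_t i;

    if (count > 0)
        prefetch_vertex(&verts[0]);

    for (i = 0; i < count; i++) {
        if (i + 1 < count)
            prefetch_vertex(&verts[i + 1]);
        sum += verts[i].f[0];  /* stand-in for the real per-vertex work */
    }
    return sum;
}
```

Whether this helps depends on how much work is done per vertex; with very little work the prefetch has no time to complete before the data is needed.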

LiveForIt

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/6 19:10 #44

Home away from home

@Karlos

Quote:

Not without introducing extra complexity and branches in the code to cater for machines that aren't altivec enabled. There are some functions that have altivec alternatives implemented but the vast majority of the code does not.

Just compile two versions of the library, use preprocessor directives, or override the interface table, or something like that. (That's sort of what FFmpeg does.)
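A rough sketch of the interface-table idea, assuming a hypothetical one-entry table and a caller-supplied flag for AltiVec support (how that flag is obtained is left out here). The "AltiVec" function is only a placeholder that reuses the scalar loop so the sketch builds anywhere.

```c
#include <stddef.h>

typedef void (*scale_fn)(float *dst, const float *src, float k, size_t n);

/* Portable scalar version; always available. */
static void scale_scalar(float *dst, const float *src, float k, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Placeholder for a vectorised version. A real build would compile this
   with AltiVec enabled and use vector intrinsics; here it is a stub. */
static void scale_altivec_stub(float *dst, const float *src, float k, size_t n)
{
    scale_scalar(dst, src, k, n);
}

/* Hypothetical interface table, filled in once when the library opens. */
struct math_iface { scale_fn scale; };

void math_iface_init(struct math_iface *iface, int cpu_has_altivec)
{
    iface->scale = cpu_has_altivec ? scale_altivec_stub : scale_scalar;
}
```

The branch on CPU type happens once at init time; after that every call goes through the pointer, so the hot path has no per-call feature check.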

Quote:

On faster CPUs (covering multiple architectures), I have observed a trend that switch case is almost always faster than table lookups. The compiler is free to convert any switch case into one or more jump tables anyway.

In some cases GCC might automatically optimize your code to use table lookups, but you have no control over what is checked first; GCC might decide to check the default case first, even if it is the least used case.

But sometimes GCC will not be able to do that for you, because the case numbers are not sequential.

Quote:

> Is there any loops that can be unrolled?
> (Other micro optimisations)

The compiler does this already.

True, but it can be worth disassembling the code to check; GCC does not always do what you expect. What it generates is not always the most efficient code: often values are pushed back to RAM and pulled from RAM again when they could have been kept in a register.

Quote:

> Are there any malloc() / free() calls that are made too often? Maybe there is a way to work around them.

I seem to recall there was some MGLPolygon allocation going on in the past, but I replaced all that ages ago.

In any case, storing things on the stack is faster than on the heap if you're only going to keep them for a short while.
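As an illustration of that point, here is a small sketch that uses a fixed stack buffer for short-lived scratch data and only falls back to malloc() for large inputs. The function and the 64-element threshold are made up for the example.

```c
#include <stdlib.h>
#include <string.h>

#define SMALL_MAX 64  /* arbitrary threshold for this sketch */

/* Reverses n ints in place via a scratch copy. Small inputs use a stack
   buffer; larger ones fall back to the heap. Returns 0 on success and
   -1 if the heap allocation fails. */
int reverse_ints(int *data, size_t n)
{
    int stack_buf[SMALL_MAX];
    int *tmp = stack_buf;
    size_t i;

    if (n > SMALL_MAX) {
        tmp = malloc(n * sizeof *tmp);
        if (tmp == NULL)
            return -1;
    }

    memcpy(tmp, data, n * sizeof *tmp);
    for (i = 0; i < n; i++)
        data[i] = tmp[n - 1 - i];

    if (tmp != stack_buf)
        free(tmp);
    return 0;
}
```

In the common small case there is no allocator call at all, only a stack pointer adjustment.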

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.

Karlos

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/7 10:15 #45

Just popping in

So I've implemented the "last used index per vertex" check in the Permedia2's implementation of W3D_DrawElements().

It certainly isn't any slower, but I need to write some synthetic tests on vertex-sharing indexed triangle lists to see the effect on performance. It should be reasonable as the Permedia2 driver is generally up against the bus limit.
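The post does not spell out how the check works, so the following is only one plausible reading: stamp each vertex with the id of the draw call that last processed it, and skip vertices already stamped in the current call. All names are hypothetical and none of this is the Permedia2 driver's code.

```c
#include <stddef.h>

#define MAX_VERTS 1024  /* arbitrary limit for this sketch */

/* Must be zero-initialised before first use. */
struct reuse_cache {
    unsigned stamp[MAX_VERTS];  /* id of the last call that processed each vertex */
    unsigned call_id;
};

/* Walks an indexed triangle list and returns how many vertices actually
   had to be processed, given that shared vertices are handled once. */
size_t draw_elements_count_unique(struct reuse_cache *c,
                                  const unsigned short *indices, size_t n)
{
    size_t processed = 0;
    size_t i;

    c->call_id++;  /* new draw call: all older stamps become stale */

    for (i = 0; i < n; i++) {
        unsigned short v = indices[i];
        if (v >= MAX_VERTS)
            continue;  /* out of range for this sketch */
        if (c->stamp[v] != c->call_id) {
            c->stamp[v] = c->call_id;
            processed++;  /* stand-in for the transform/upload work */
        }
    }
    return processed;
}
```

For two triangles sharing an edge (six indices, four distinct vertices) this processes four vertices instead of six, which is the kind of saving a synthetic vertex-sharing test would measure.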

corto

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/7 22:08 #46

Not too shy to talk

Sorry to join the discussion late; I've been reading it since it started.
It seems that MiniGL can be improved, and it's great to see that some work has already begun.

You talked about Valgrind, and I can confirm that such a tool can't really be ported to AmigaOS. But tools like that are certainly necessary.

I launched Quake3 on my MicroAOne (that is at the maximum of its capabilities) and using my profiler Hieronymus, I've found that much time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.

Note that I recently added an alternative mode to my profiler that allows profiling on L2 cache misses (at least on my G3 CPU at the moment). That will be interesting too.

I will have to test Cow3D and compile MiniGL with debug symbols to confirm some results.

Karlos

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/7 23:07 #47

Just popping in

How does the instrumentation in Hieronymus work, exactly?

Warp3D drivers and MiniGL can be compiled with basic inbuilt profiling that I added to help me diagnose some performance issues, but it isn't a true hierarchical profiler as only registered functions are timed.
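For readers unfamiliar with that style of profiling, here is a minimal sketch of the general idea using the standard clock() call. It is not the actual Warp3D/MiniGL instrumentation; the slot structure and function names are invented for the example.

```c
#include <time.h>

/* Hypothetical timing slot, one per registered function. */
struct prof_slot {
    const char   *name;
    clock_t       total;  /* accumulated clock ticks */
    unsigned long calls;
};

/* Call at the top of a registered function. */
clock_t prof_begin(void)
{
    return clock();
}

/* Call just before the registered function returns. */
void prof_end(struct prof_slot *slot, clock_t start)
{
    slot->total += clock() - start;
    slot->calls++;
}
```

Because only functions that call prof_begin()/prof_end() are timed, time spent in unregistered callees is simply folded into their caller, which is why this is not a hierarchical profile.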

Hans

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/8 0:03 #48

Home away from home

@corto
Quote:

I launched Quake3 on my MicroAOne (that is at the maximum of its capabilities) and using my profiler Hieronymus, I've found that much time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.

How much compared to the time spent in MiniGL?

Please be aware that MiniGL will drop down to sending one triangle/primitive at a time under certain conditions, and that could increase the amount of time spent in the driver by a fair amount. In Quake 3's case, its engine uses compiled vertex arrays, and MiniGL will drop down to sending a single triangle to Warp3D at a time if even one triangle in the array is clipped. This happens very frequently; so frequently that an early version of the W3D_SI driver managed only a few fps with Open Arena (which uses the Q3 engine).
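To show why that fallback is so costly, here is a toy model of the behaviour described above. It only counts driver calls; the function name and the per-triangle clip flags are assumptions for the example, not MiniGL internals.

```c
#include <stddef.h>

/* Returns the number of driver calls needed to draw `count` triangles.
   clipped[i] is non-zero if triangle i needs clipping. */
size_t driver_calls_needed(const unsigned char *clipped, size_t count)
{
    size_t i;

    if (count == 0)
        return 0;

    for (i = 0; i < count; i++) {
        if (clipped[i])
            return count;  /* fallback: one call per triangle */
    }
    return 1;  /* nothing clipped: whole array in one call */
}
```

With a 1000-triangle array, a single clipped triangle turns one driver call into 1000, so the per-call overhead in the driver dominates.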

Hans

Join Kea Campus' Amiga Corner and support Amiga content creation
https://keasigmadelta.com/ - see more of my work

corto

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/8 4:56 #49

Not too shy to talk

@Karlos
Hieronymus is a statistical profiler. It collects samples (let's say 50 or 60 times per second) that record the address of the instruction being executed. Then it finds the corresponding program and function.
Statistically, that gives the proportions of time consumed by the different running applications.

So it is not intrusive and gives a good view of system activity. And when you run a program, you also see the percentage of time spent in the libraries it uses.

The idea is the same as the "perf" tool that comes with Linux.
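The attribution step can be sketched as follows. This is not Hieronymus' code; the region table, names and addresses are hypothetical, and a real profiler would look the ranges up from the loaded modules' symbol information.

```c
#include <stddef.h>

/* A code range [start, end) belonging to one module or function. */
struct region {
    const char   *name;
    unsigned long start;
    unsigned long end;
    unsigned long hits;
};

/* Credits each sampled instruction address to the region containing it.
   Samples that fall outside every region are ignored. */
void attribute_samples(struct region *regions, size_t nregions,
                       const unsigned long *samples, size_t nsamples)
{
    size_t s, r;
    for (s = 0; s < nsamples; s++) {
        for (r = 0; r < nregions; r++) {
            if (samples[s] >= regions[r].start && samples[s] < regions[r].end) {
                regions[r].hits++;
                break;
            }
        }
    }
}

/* Share of all samples that landed in one region, in percent. */
double region_percent(const struct region *r, size_t nsamples)
{
    return nsamples ? 100.0 * (double)r->hits / (double)nsamples : 0.0;
}
```

The percentages are only estimates: their accuracy depends on how many samples were collected and on the sampling not being synchronised with the workload.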

@Hans
Thanks for the information. I'm not very comfortable giving results yet, so take these as early, unconfirmed figures. CPU time:
59% in W3D_Radeon.library
13% in ATIRadeon.chip
11% in Quake3
1% in minigl.library

Note that these come from the alternative sampling mode I've just developed (using the performance monitor), so I would like to compare them with the "standard" mode.
I obtained them yesterday very late, and was too tired to make other runs or check with other programs.

Karlos

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/8 7:27 #50

Just popping in

@Corto

Interesting. However, is such a method not prone to sample aliasing problems? How can you differentiate between code that spends 1% of its time in a function called at (any multiple of) the sampling frequency that you just happen to be in when you measure, and code that spends 99% of its time elsewhere?

corto

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/8 20:03 #51

Not too shy to talk

@Karlos
The case you describe can theoretically happen; nothing is impossible. But statistically, there is no workload like that. If a program consumes 1% of the CPU time, then over 10 seconds of sampling at 50 Hz you will hit it about 5 times.
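The arithmetic behind "about 5 times", written out under the assumption that the samples are independent of the workload (so the hit count is roughly binomial):

```latex
\[
N = f \cdot T = 50\,\text{Hz} \times 10\,\text{s} = 500 \text{ samples},
\qquad
E[\text{hits}] = N p = 500 \times 0.01 = 5,
\qquad
\sigma = \sqrt{N p (1 - p)} = \sqrt{500 \times 0.01 \times 0.99} \approx 2.2
\]
```

So a run of this length would typically show somewhere around 3 to 7 hits for a 1% consumer; longer runs narrow the relative spread.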

Karlos

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/8 22:47 #52

Just popping in

That would seem to depend on the sample rate. A simpler example: code that spends 10ms in function A() and 10ms in function B() alternately for some compute-bound period of time, so that one full A/B cycle takes 20ms. If you sample at 50Hz you are far more likely to see that it spends 100% in A or B than any other distribution of the two. Which one would depend purely on the relative latency of the monitor versus execution of the code. It wouldn't matter how long you profile for; in the absence of any other factors, you'd get an all A or all B result. Unless you managed to sample exactly at a transition point.

Admittedly a contrived example but I guess there's no perfect way to profile. You either do it non intrusively and get approximate or potentially biased results. Or you add instrumentation and accept the changes you make to the running code affect the results.

I suppose only a cycle exact cpu simulator that gathers the statistics on cache misses etc. as it executes would give a true reflection of how code should perform.

corto

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/9 6:47 #53

Not too shy to talk

@Karlos
Right, a simulator could be more accurate. A hardware trace system would be even better.
But for now, I think a statistical profiler is useful and could show some surprises. About the sampling frequency, you're right; this is why Brendan Gregg (a master of system performance) uses 99 Hz.
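The aliasing effect is easy to reproduce with a small simulation. The sketch below assumes an alternating A/B workload whose full cycle matches the 20 ms period of a 50 Hz sampler (10 ms in each function), and approximates 99 Hz as a 10101 µs period; all names are made up for the example.

```c
/* Counts how many of `nsamples` land in function A.
   Within every cycle, A runs during [0, a_us) and B for the rest.
   All times are in microseconds. */
unsigned long count_hits_in_a(unsigned long a_us, unsigned long cycle_us,
                              unsigned long period_us, unsigned long phase_us,
                              unsigned long nsamples)
{
    unsigned long hits = 0;
    unsigned long i;

    for (i = 0; i < nsamples; i++) {
        /* Position of sample i inside the repeating A/B cycle. */
        unsigned long t = (phase_us + i * period_us) % cycle_us;
        if (t < a_us)
            hits++;
    }
    return hits;
}
```

With a 20000 µs sample period every sample falls at the same point of the 20000 µs cycle, so the result is all A or all B depending only on the starting phase. With the 10101 µs period the sample position drifts through the cycle and the counts approach the true 50/50 split.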

shadowsun

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/16 6:09 #54

Just popping in

@thellier

I noticed some graphical bugs with your 2.22 minigl library with Celestia on a SamFlex 800 / 9200:
http://www.os4depot.net/index.php?fun ... y/scientific/celestia.lha

thellier

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/20 8:23 #55

Not too shy to talk

@shadowsun
@all

Please do not use the minigl.library that I have built:
It is not faster with Quake; it is even slower.
It has some bugs that I introduced with the modifications I made (lines and other stuff in Glexcess...).

My sources and binary are only given as an example of "a modification that might have accelerated MiniGL but didn't".

Alain Thellier

jabirulo

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2015/7/20 21:46 #56

Just can't stay away

@thellier

Using Wazp3D I get a GrimReaper crash log after quitting Cow3D. Is there anything you can fix, or is there some Wazp3D setting I'm missing or have set wrong?

I updated the Cow3D source code to use AOS4.1FE's gfx_lib, but the original Cow3D crashes here too when quitting.
My system: Sam460ex / 2GB / RadeonHD6570_1GB and radeonhd.chip V2.10


Crash log for task "Cow3D-AmigaOS4"
Generated by GrimReaper 53.19
Crash occured in module Warp3D.library at address 0x7CB236DC
Type of crash: DSI (Data Storage Interrupt) exception
Alert number: 0x80000003

Register dump:
GPR (General Purpose Registers):
   0: FFFF6F90 4B159AE0 00000000 4DEF3034 4B341034 5EF29320 00000000 00000001
   8: 00000000 5EDFB488 00000064 5EDFB488 4B159AE0 5B6505E0 53150000 53150000
  16: 53150000 53150000 53150000 4F5F0000 52EE0000 53150000 4BD7A034 4E00A764
  24: 4BD6A034 5EE00000 5EE00000 4BD7EDA8 5EF292B4 5EF29320 4B159AE8 00000001

FPR (Floating Point Registers, NaN = Not a Number):
   0:              nan              0.9                0                0
   4:                0       4.5036e+15                1                1
   8:                0       4.5036e+15                1                0
  12:          0.99999                0     -1.00582e+16    -8.77077e+305
  16:    -3.97084e+249    -1.09544e+305     3.58194e+142     -1.00147e-80
  20:     5.61745e+306      7.37886e-81    -2.29218e+231    -1.25693e+308
  24:      -4.7407e-20      -8.0026e-11    -8.95249e+307    -3.57625e+295
  28:     2.84197e+182     4.02526e+305     -5.34013e-05     1.37112e+146

FPSCR (Floating Point Status and Control Register): 0x82008000

SPRs (Special Purpose Registers):
           Machine State (msr) : 0x0002F030
                Condition (cr) : 0x4DCE5D80
      Instruction Pointer (ip) : 0x7CB236DC
       Xtended Exception (xer) : 0x0181A874
                   Count (ctr) : 0x4DCE60F8
                     Link (lr) : 0x0002000E
            DSI Status (dsisr) : 0x5A9D2D34
            Data Address (dar) : 0x4DCE60F8

..

Symbol info:
Instruction pointer 0x7CB236DC belongs to module "Warp3D.library" (PowerPC)
Symbol: SOFT3D_FreeTexture + 0xC in section 1 offset 0x0002F6B8

Stack trace:
    SOFT3D_FreeTexture()+0xc (section 1 @ 0x2F6B8)
    W3D_FreeTexObj()+0x19c (section 1 @ 0x443E0)
    [CoW3D-5.c:1518] CloseWarp3D()+0xd8 (section 1 @ 0x2E74)
    [CoW3D-5.c:1753] main()+0x78 (section 1 @ 0x8370)
    native kernel module newlib.library.kmod+0x000020ac
    native kernel module newlib.library.kmod+0x00002d14
    native kernel module newlib.library.kmod+0x00002ef0
    _start()+0x170 (section 1 @ 0x16C)
    native kernel module dos.library.kmod+0x00025678
    native kernel module kernel+0x0003caf0
    native kernel module kernel+0x0003cb70

Capehill

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2017/1/27 19:51 #57

Just can't stay away

@BSzili

Maybe I missed the info but did you finish the vertex buffer for clipped triangles? Any conclusions there?

Did anybody try inlining hg_ClipCode or V_ToScreen functions?

And what is the meaning of "align" member? It was documented for padding usage but it seems to be used as a condition in the code...

Capehill

Re: My MiniGL experiments,recompilation,tips,etc...

Posted on: 2017/2/5 19:37 #58

Just can't stay away

Hacked the vertex array code a bit, now it draws all triangles with one call (when using compiled vertex arrays a la Q3). On my Sam440 there seems to be about 1% FPS boost so not much to write home about but at least the direction is correct.

It would be interesting to hear how faster systems perform, though.

Lib + dirty patch here: http://capehill.kapsi.fi/minigl/
