Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@LiveForIt

> Is there any part that can be optimized by using AltiVec?

Not without introducing extra complexity and branches in the code to cater for machines that aren't altivec enabled. There are some functions that have altivec alternatives implemented but the vast majority of the code does not.

> Are there any complex switches that can be replaced by table lookups?

On faster CPUs (covering multiple architectures), I have observed a trend that switch case is almost always faster than table lookups. The compiler is free to convert any switch case into one or more jump tables anyway.
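
For illustration, a minimal C sketch of the two forms being compared. The enum and function names are hypothetical and not taken from MiniGL. With dense case values like these, a compiler such as GCC may well lower the switch to a jump or lookup table by itself.

```c
#include <assert.h>

/* Hypothetical primitive types; not MiniGL's real enum. */
typedef enum {
    PRIM_POINTS, PRIM_LINES, PRIM_TRIANGLES, PRIM_QUADS, PRIM_COUNT
} Prim;

/* Switch form: the compiler chooses between compare chains and a table. */
static int verts_per_prim_switch(Prim p)
{
    switch (p) {
    case PRIM_POINTS:    return 1;
    case PRIM_LINES:     return 2;
    case PRIM_TRIANGLES: return 3;
    case PRIM_QUADS:     return 4;
    default:             return 0;
    }
}

/* Hand-written table form: one bounds check plus one memory load. */
static int verts_per_prim_table(Prim p)
{
    static const int table[] = { 1, 2, 3, 4 };
    /* Sanity check: the table must cover every enum value. */
    assert(sizeof table / sizeof table[0] == (unsigned)PRIM_COUNT);
    return ((unsigned)p < (unsigned)PRIM_COUNT) ? table[p] : 0;
}
```

Both functions return the same values for every input, so which one is faster is purely a matter of measuring on the target CPU.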

> Are there any loops that can be unrolled?
> (Other micro optimisations)

The compiler does this already.

> Are there any malloc() / free() calls that are made too often? Maybe there is a way to work around them.

I seem to recall there was some MGLPolygon allocation going on in the past, but I replaced all that ages ago.

> Data being unnecessarily copied as parameters when it could be global; parameter passing does generate extra store operations.

That's not really a suitable approach for shared libraries - writeable global data aren't thread safe. I had to fix some issues caused by that very problem in the past. All the common stuff is stored in one or more structures that are passed by reference.
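
A minimal sketch of the "structure passed by reference" pattern, assuming a made-up RenderContext type and to_screen() helper (MiniGL's real structures and functions differ). Each caller owns its context, so two threads never share writeable state.

```c
#include <assert.h>

/* Hypothetical per-context state; MiniGL's real structures differ. */
typedef struct {
    float scale_x, scale_y;    /* viewport scale  */
    float offset_x, offset_y;  /* viewport offset */
} RenderContext;

/* All state arrives through the pointer, so nothing here touches
   writeable global data. */
static void to_screen(const RenderContext *ctx, float x, float y,
                      float *sx, float *sy)
{
    assert(ctx && sx && sy);
    *sx = x * ctx->scale_x + ctx->offset_x;
    *sy = y * ctx->scale_y + ctx->offset_y;
}
```

The cost is one pointer argument per call, which is usually kept in a register rather than stored to memory.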

The fat MGLVertex was a promising lead, but it seems that it's not necessarily the major limiting factor here (it should still be trimmed however).

Re: My MiniGL experiments,recompilation,tips,etc...
Quite a regular

Retired
Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
Let me add my 2 cents ;)

The PPC 440 core is claimed to be able to keep up to four load misses outstanding ("up to three outstanding line fills, up to four outstanding load misses"). Consider a vertex data structure of 96 bytes and a vertex processing loop (in pseudo-code) such as:
dcbt currentVertexStructure
dcbt (currentVertexStructure+32)
dcbt (currentVertexStructure+64)
loop:
dcbt nextVertexStructure
dcbt (nextVertexStructure+32)
dcbt (nextVertexStructure+64)

(...currentVertexStructure processing...)
branch loop

Touching the data lines early should help to minimize the cache miss penalty.
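
The same idea can be sketched in C, assuming GCC or Clang, whose __builtin_prefetch is typically emitted as dcbt on PowerPC. The 96-byte Vertex type and the summing "work" are placeholders, not MiniGL code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 96-byte vertex (24 floats): three 32-byte cache lines. */
typedef struct { float f[24]; } Vertex;

/* Request all three cache lines of one vertex. */
static void touch_vertex(const Vertex *v)
{
    const char *p = (const char *)v;
    __builtin_prefetch(p);
    __builtin_prefetch(p + 32);
    __builtin_prefetch(p + 64);
}

/* Stand-in for the real per-vertex work: sums the first component. */
static float process_vertices(const Vertex *verts, size_t count)
{
    float sum = 0.0f;
    assert(sizeof(Vertex) == 96);
    if (count == 0)
        return sum;
    touch_vertex(&verts[0]);
    for (size_t i = 0; i < count; i++) {
        if (i + 1 < count)
            touch_vertex(&verts[i + 1]); /* fetch next while working on current */
        sum += verts[i].f[0];            /* ...current vertex processing... */
    }
    return sum;
}
```

A prefetch is only a hint, so the result is identical with or without it; whether it helps depends on how much work is done per vertex between the touch and the first use.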

Re: My MiniGL experiments,recompilation,tips,etc...
Home away from home
@Karlos

Quote:
Not without introducing extra complexity and branches in the code to cater for machines that aren't altivec enabled. There are some functions that have altivec alternatives implemented but the vast majority of the code does not.


Just compile two versions of the library, use preprocessor directives, or override the interface table, or something like that. (That’s sort of what FFmpeg does.)
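
A rough sketch of the interface-table variant, under stated assumptions: MathOps, init_math_ops() and the has_altivec flag are invented for illustration, and the "AltiVec" routine is only a placeholder that forwards to the scalar one. A real build would compile that routine with -maltivec and detect the vector unit at library open time.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*ScaleFn)(float *dst, const float *src, float k, size_t n);

/* Portable scalar version; always available. */
static void scale_scalar(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Placeholder for a vectorised version; here it simply forwards. */
static void scale_altivec(float *dst, const float *src, float k, size_t n)
{
    scale_scalar(dst, src, k, n);
}

/* Hypothetical interface table, filled in once at start-up. */
typedef struct { ScaleFn scale; } MathOps;

static void init_math_ops(MathOps *ops, int has_altivec)
{
    assert(ops);
    ops->scale = has_altivec ? scale_altivec : scale_scalar;
}
```

The CPU check happens once, so the hot loops contain no extra branches, only an indirect call.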

Quote:
On faster CPUs (covering multiple architectures), I have observed a trend that switch case is almost always faster than table lookups. The compiler is free to convert any switch case into one or more jump tables anyway.


In some cases GCC might automatically optimize your code to use table lookups, but you do not have control over what is checked first. GCC might decide to check the default case first, even if the default case is the least used one.

But sometimes GCC will not be able to do that for you, because the case values are not sequential.

Quote:

> Are there any loops that can be unrolled?
> (Other micro optimisations)

The compiler does this already.


True, but it can be worth disassembling the code to check, because GCC does not always do what you expect. What it generates is not always the most efficient code: often values are pushed back to RAM and pulled from RAM again when they could have been kept in a register.

Quote:
> Are there any malloc() / free() calls that are made too often? Maybe there is a way to work around them.

I seem to recall there was some MGLPolygon allocation going on in the past, but I replaced all that ages ago.


In any case, storing things on the stack is faster than allocating them on the heap, if you're only going to keep them for a short while.

(NutsAboutAmiga)

Basilisk II for AmigaOS4
AmigaInputAnywhere
Excalibur
and other tools and apps.
Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
So I've implemented the "last used index per vertex" check in the Permedia2's implementation of W3D_DrawElements().

It certainly isn't any slower, but I need to write some synthetic tests on vertex-sharing indexed triangle lists to see the effect on performance. It should be reasonable as the Permedia2 driver is generally up against the bus limit.

Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
I'm sorry to join the discussion late; I have been reading it since it started.
It seems that MiniGL can be improved, and it's great to see that some improvements have already begun.

You talked about Valgrind, and I can confirm that such a tool can't really be ported to AmigaOS. But tools like that are certainly necessary.

I launched Quake3 on my MicroAOne (which is at the maximum of its capabilities) and, using my profiler Hieronymus, I found that much of the time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.

Note that I recently added an alternative mode to my profiler that allows profiling on L2 cache misses (at least on my G3 CPU at the moment). That will be interesting too.

I will have to test Cow3D and compile MiniGL with debug symbols to confirm some results.

Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
How does the instrumentation in Hieronymus work, exactly?

Warp3D drivers and MiniGL can be compiled with basic inbuilt profiling that I added to help me diagnose some performance issues, but it isn't a true hierarchical profiler, as only registered functions are timed.

Re: My MiniGL experiments,recompilation,tips,etc...
Home away from home
@corto
Quote:
I launched Quake3 on my MicroAOne (which is at the maximum of its capabilities) and, using my profiler Hieronymus, I found that much of the time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.

How much compared to the time spent in MiniGL?

Please be aware that MiniGL will drop down to sending one triangle/primitive at a time under certain conditions, and that could increase the amount of time spent in the driver by a fair amount. In Quake 3's case, the engine uses compiled vertex arrays, and MiniGL will drop down to sending a single triangle to Warp3D at a time if even one triangle in the array is clipped. This happens very frequently; so frequently that an early version of the W3D_SI driver managed only a few fps with Open Arena (which uses the Q3 engine).
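
To give a feel for the cost, here is a small C sketch that only counts driver calls. Both functions are hypothetical models, not MiniGL code: the first mirrors the fallback described above, the second is one possible alternative that batches each run of unclipped triangles.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the described behaviour: a single clipped triangle forces
   one call per triangle for the whole array. */
static size_t calls_with_fallback(const unsigned char *clipped, size_t n)
{
    assert(clipped || n == 0);
    for (size_t i = 0; i < n; i++)
        if (clipped[i])
            return n;
    return n ? 1 : 0;
}

/* Hypothetical alternative: one call per run of unclipped triangles,
   plus one call for each clipped triangle. */
static size_t calls_with_runs(const unsigned char *clipped, size_t n)
{
    size_t calls = 0;
    int in_run = 0;
    assert(clipped || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (clipped[i]) {
            calls++;
            in_run = 0;
        } else if (!in_run) {
            calls++;
            in_run = 1;
        }
    }
    return calls;
}
```

For eight triangles with only the fourth one clipped, the first model makes eight calls and the second makes three.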

Hans

http://hdrlab.org.nz/ - Amiga OS 4 projects, programming articles and more.
https://keasigmadelta.com/ - more of my work
Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
@Karlos
Hieronymus is a statistical profiler. It collects samples (let's say 50 or 60 times per second) that record the address of the instruction being executed. Then it finds the corresponding program and function.
Statistically, that gives the proportion of time consumed by the different running applications.

So it is not intrusive and gives a great view of system activity. And when you run a program, you also see the percentage of time spent in the libraries that it uses.

The idea is the same as that of the "perf" tool that comes with Linux.
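
The attribution step can be sketched in C as follows. This is not Hieronymus source code: the Symbol structure and attribute_samples() are assumptions made for illustration, and a real profiler would look addresses up in the loaded modules' symbol tables rather than in a small array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol entry: a function covering [start, end). */
typedef struct {
    const char *name;
    uintptr_t start, end;
    unsigned hits;
} Symbol;

/* Credit each sampled instruction address to the symbol containing it.
   Returns the number of samples that matched no symbol. */
static unsigned attribute_samples(Symbol *syms, size_t nsyms,
                                  const uintptr_t *pcs, size_t npcs)
{
    unsigned unknown = 0;
    assert(syms && pcs);
    for (size_t i = 0; i < npcs; i++) {
        size_t s;
        for (s = 0; s < nsyms; s++) {
            if (pcs[i] >= syms[s].start && pcs[i] < syms[s].end) {
                syms[s].hits++;
                break;
            }
        }
        if (s == nsyms)
            unknown++;
    }
    return unknown;
}
```

Dividing each hit count by the total number of samples gives the percentage of time per function or module.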

@Hans
Thanks for the information. About the results: I was not very comfortable giving them now, so take them as early, live results that are not confirmed yet. CPU time:
59% in W3D_Radeon.library
13% in ATIRadeon.chip
11% in Quake3
1% in minigl.library

Note that these were obtained with the alternative sampling mode I've just developed (using the performance monitor), so I would like to compare them with the "standard" mode.
I obtained them very late yesterday and was too tired to make other runs or check with other programs.

Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@Corto

Interesting. However, is such a method not prone to sample aliasing problems? How can you differentiate between code that spends 1% of its time in a function called at (any multiple of) the sampling frequency that you just happen to be in when you measure and code that spends 99% of its time elsewhere?


Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
@Karlos
The case you describe can theoretically happen; nothing is impossible. But statistically, there is no workload like that. If a program consumes 1% of the CPU time, then over 10 seconds of sampling at 50 Hz you will hit it about 5 times.

Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
That would seem to depend on the sample rate. A simpler example: code that spends 10ms in function A() and 10ms in function B() alternately for some compute-bound period of time. If you sample at 50Hz, every sample lands at the same point in that 20ms cycle, so you are far more likely to see that it spends 100% in A or B than you are any other distribution of the two. Which one would depend purely on the relative latency of the monitor versus execution of the code. It wouldn't matter how long you profile for: in the absence of any other factors, you'd get an all-A or all-B result, unless a sample happened to land exactly on a transition point.
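
This kind of aliasing is easy to simulate. The sketch below is an idealised model with made-up names and whole-millisecond timings: it assumes a perfectly periodic workload and a perfectly regular sampler, which no real system has.

```c
#include <assert.h>

/* Workload: A runs for a_ms, then B for b_ms, repeating forever.
   Returns how many of n samples, taken every period_ms starting at
   phase_ms, land inside A. */
static unsigned samples_in_a(unsigned a_ms, unsigned b_ms,
                             unsigned period_ms, unsigned phase_ms,
                             unsigned n)
{
    unsigned hits = 0;
    assert(a_ms + b_ms > 0);
    for (unsigned i = 0; i < n; i++) {
        unsigned t = (phase_ms + i * period_ms) % (a_ms + b_ms);
        if (t < a_ms)
            hits++;
    }
    return hits;
}
```

In this model the lock-in appears when the A+B cycle is as long as the sampling period: samples_in_a(10, 10, 20, 3, 500) counts all 500 samples in A, and a phase of 13 ms puts all of them in B. A period slightly off the cycle length, for example 21 ms, drifts through the cycle and gives 250 of 500, which is the true 50% share.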

Admittedly a contrived example but I guess there's no perfect way to profile. You either do it non intrusively and get approximate or potentially biased results. Or you add instrumentation and accept the changes you make to the running code affect the results.

I suppose only a cycle exact cpu simulator that gathers the statistics on cache misses etc. as it executes would give a true reflection of how code should perform.

Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
@Karlos
Right, a simulator could be more accurate. A hardware trace system would be even better.
But for now, I think a statistical profiler is useful and could show some surprises. About the sampling frequency, you're right; this is why Brendan Gregg (a master of system performance) uses 99 Hz.

Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in
@thellier

I noticed some graphical bugs with your 2.22 minigl library with Celestia on a SamFlex 800 / 9200:
http://www.os4depot.net/index.php?fun ... y/scientific/celestia.lha

Re: My MiniGL experiments,recompilation,tips,etc...
Not too shy to talk
@shadowsun
@all

Please do not use the minigl.library that I built:
It is not faster with Quake; it is even slower.
It has some bugs that I introduced with the modifications I made (lines and other stuff in Glexcess..)

My sources and binary are only given as an example of "a modification that might have accelerated minigl but didn't".

Alain Thellier

Re: My MiniGL experiments,recompilation,tips,etc...
Just can't stay away
@thellier

Using Wazp3D I get a GrimReaper crash log after quitting Cow3D. Is there anything you can fix, or is there some Wazp3D setting I'm missing (or have set wrong)?

I updated the Cow3D source code to use AOS4.1FE's gfx_lib, but the original Cow3D crashes here too when quitting.
My system: sam460ex/2GB/RadeonHD6570_1GB and radeonhd.chip V2.10

Crash log for task "Cow3D-AmigaOS4"
Generated by GrimReaper 53.19
Crash occured in module Warp3D.library at address 0x7CB236DC
Type of crash: DSI (Data Storage Interrupt) exception
Alert number: 0x80000003

Register dump:
GPR (General Purpose Registers):
   0: FFFF6F90 4B159AE0 00000000 4DEF3034 4B341034 5EF29320 00000000 00000001
   8: 00000000 5EDFB488 00000064 5EDFB488 4B159AE0 5B6505E0 53150000 53150000
  16: 53150000 53150000 53150000 4F5F0000 52EE0000 53150000 4BD7A034 4E00A764
  24: 4BD6A034 5EE00000 5EE00000 4BD7EDA8 5EF292B4 5EF29320 4B159AE8 00000001

FPR (Floating Point Registers, NaN = Not a Number):
   0:              nan              0.9                0                0
   4:                0       4.5036e+15                1                1
   8:                0       4.5036e+15                1                0
  12:          0.99999                0     -1.00582e+16    -8.77077e+305
  16:    -3.97084e+249    -1.09544e+305     3.58194e+142     -1.00147e-80
  20:     5.61745e+306      7.37886e-81    -2.29218e+231    -1.25693e+308
  24:      -4.7407e-20      -8.0026e-11    -8.95249e+307    -3.57625e+295
  28:     2.84197e+182     4.02526e+305     -5.34013e-05     1.37112e+146

FPSCR (Floating Point Status and Control Register): 0x82008000

SPRs (Special Purpose Registers):
           Machine State (msr) : 0x0002F030
                Condition (cr) : 0x4DCE5D80
      Instruction Pointer (ip) : 0x7CB236DC
       Xtended Exception (xer) : 0x0181A874
                   Count (ctr) : 0x4DCE60F8
                     Link (lr) : 0x0002000E
            DSI Status (dsisr) : 0x5A9D2D34
            Data Address (dar) : 0x4DCE60F8
..
Symbol info:
Instruction pointer 0x7CB236DC belongs to module "Warp3D.library" (PowerPC)
Symbol: SOFT3D_FreeTexture + 0xC in section 1 offset 0x0002F6B8

Stack trace:
    SOFT3D_FreeTexture()+0xc (section 1 0x2F6B8)
    W3D_FreeTexObj()+0x19c (section 1 0x443E0)
    [CoW3D-5.c:1518] CloseWarp3D()+0xd8 (section 1 0x2E74)
    [CoW3D-5.c:1753] main()+0x78 (section 1 0x8370)
    native kernel module newlib.library.kmod+0x000020ac
    native kernel module newlib.library.kmod+0x00002d14
    native kernel module newlib.library.kmod+0x00002ef0
    _start()+0x170 (section 1 0x16C)
    native kernel module dos.library.kmod+0x00025678
    native kernel module kernel+0x0003caf0
    native kernel module kernel+0x0003cb70

Re: My MiniGL experiments,recompilation,tips,etc...
Just can't stay away
@BSzili

Maybe I missed the info but did you finish the vertex buffer for clipped triangles? Any conclusions there?

Did anybody try inlining hg_ClipCode or V_ToScreen functions?

And what is the meaning of "align" member? It was documented for padding usage but it seems to be used as a condition in the code...

Re: My MiniGL experiments,recompilation,tips,etc...
Just can't stay away
Hacked the vertex array code a bit, now it draws all triangles with one call (when using compiled vertex arrays a la Q3). On my Sam440 there seems to be about 1% FPS boost so not much to write home about but at least the direction is correct.

It would be interesting to hear how faster systems perform, though.

Lib + dirty patch here: http://capehill.kapsi.fi/minigl/
