Board index » All Posts (Karlos)




Re: My MiniGL experiments,recompilation,tips,etc...
Just popping in


That would seem to depend on the sample rate. A simpler example: code that spends 20ms in function A and 20ms in function B, alternately, for some compute-bound period of time. If your sampling period is a multiple of the full 40ms A/B cycle, you are far more likely to see 100% of the time in A or in B than any other distribution of the two. Which one depends purely on the relative phase of the monitor versus the execution of the code. It wouldn't matter how long you profile for; in the absence of any other factors, you'd get an all-A or all-B result, unless you managed to start sampling 20ms in and caught the transition point.

Admittedly a contrived example, but I guess there's no perfect way to profile. You either do it non-intrusively and get approximate or potentially biased results, or you add instrumentation and accept that the changes you make to the running code affect the results.

I suppose only a cycle-exact CPU simulator that gathers statistics on cache misses etc. as it executes would give a true reflection of how the code should perform.
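The aliasing effect described above can be simulated. This is a minimal sketch with an idealised workload and sampler (all names are illustrative, not profiler code): a run that alternates 20ms in A and 20ms in B, sampled once per full 40ms cycle. Every sample lands in the same function, and which function that is depends only on the sampler's phase.

```c
#include <stdint.h>

/* Idealised model of the scenario above: the workload spends 20ms in A,
 * then 20ms in B, repeating. function_at() reports which function is
 * running at a given millisecond: 0 = A, 1 = B. */
static int function_at(int t_ms)
{
    return (t_ms / 20) % 2;
}

/* Sample at a fixed period and count hits per function. With period_ms
 * equal to the full 40ms A+B cycle, every sample attributes to the same
 * function - which one depends only on the phase offset. */
static void sample_run(int phase_ms, int period_ms, int duration_ms,
                       int counts[2])
{
    counts[0] = counts[1] = 0;
    for (int t = phase_ms; t < duration_ms; t += period_ms)
        counts[function_at(t)]++;
}
```

Sampling one second at a 40ms period with phase 0 attributes every sample to A and none to B; with phase 20 it is the reverse, even though the real split is exactly 50/50.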



Re: My MiniGL experiments,recompilation,tips,etc...


@Corto

Interesting. However, is such a method not prone to sample-aliasing problems? How can you differentiate between code that spends 1% of its time in a function called at (any multiple of) the sampling frequency, which you just happen to be in when you measure, and code that spends 99% of its time elsewhere?




Re: My MiniGL experiments,recompilation,tips,etc...


How does the instrumentation in Hieronymus work, exactly?

Warp3D drivers and MiniGL can be compiled with basic built-in profiling that I added to help me diagnose some performance issues, but it isn't a true hierarchical profiler, as only registered functions are timed.



Re: My MiniGL experiments,recompilation,tips,etc...


So I've implemented the "last used index per vertex" check in the Permedia2's implementation of W3D_DrawElements().

It certainly isn't any slower, but I need to write some synthetic tests on vertex-sharing indexed triangle lists to see the effect on performance. It should be reasonable as the Permedia2 driver is generally up against the bus limit.



Re: My MiniGL experiments,recompilation,tips,etc...


@LiveForIt

> Is there any part that can be optimized by using AltiVec?

Not without introducing extra complexity and branches in the code to cater for machines that aren't AltiVec-enabled. There are some functions that have AltiVec alternatives implemented, but the vast majority of the code does not.

> Are there any complex switches that can be replaced with table lookups?

On faster CPUs (covering multiple architectures), I have observed a trend that a switch case is almost always faster than a table lookup. The compiler is free to convert any switch case into one or more jump tables anyway.
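For illustration (a made-up mapping, not MiniGL code), here are the two equivalent forms being compared: a switch the compiler may itself lower to a jump table, and the explicit lookup table it would replace.

```c
/* Illustrative switch the compiler can lower to a jump table or even a
 * branchless computation - the format codes and sizes are invented. */
static int bytes_per_texel_switch(int format)
{
    switch (format) {
        case 0:  return 1;   /* e.g. 8-bit luminance */
        case 1:  return 2;   /* e.g. RGB565 */
        case 2:  return 3;   /* e.g. RGB888 */
        case 3:  return 4;   /* e.g. ARGB8888 */
        default: return 0;
    }
}

/* The hand-written table alternative: costs an explicit bounds check plus
 * a data load that may miss the cache. */
static const int texel_size_table[4] = { 1, 2, 3, 4 };

static int bytes_per_texel_table(int format)
{
    return (format >= 0 && format < 4) ? texel_size_table[format] : 0;
}
```

Both return the same results; the difference is only in what the compiler and the cache make of them.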

> Is there any loops that can be unrolled?
> (Other micro optimisations)

The compiler does this already.

> Are there any malloc() / free() calls that happen too often? Maybe there is a way to work around them.

I seem to recall there was some MGLPolygon allocation going on in the past, but I replaced all that ages ago.

> Is data being needlessly copied as parameters when it could be global? Parameter passing does generate extra store operations.

That's not really a suitable approach for shared libraries: writeable global data isn't thread-safe. I had to fix some issues caused by that very problem in the past. All the common state is stored in one or more structures that are passed by reference.

The fat MGLVertex was a promising lead, but it seems it's not necessarily the major limiting factor here (it should still be trimmed, however).



Re: My MiniGL experiments,recompilation,tips,etc...


@Hans

It must be the interleaved functions that get the packed colours wrong, because I unit tested all the colour formats for the separate-pointers code and fixed all the broken conversions there. I didn't test the v5 functions at that time, however, as I was under the impression they were not broken. The R100 and R200 do support packed colour formats directly, as far as I recall.



Re: My MiniGL experiments,recompilation,tips,etc...


@BSzili

In theory, because all the T&L is done by this stage, you could convert the colour values from normalized float to uint8 (just use a single uint32 rather than four separate bytes).

That would reduce the size a fair bit (by 24 bytes), and there are already routines for this in util.h; see fast_normalized_to_u8(). They do absolutely require that the input values are clamped to 0.0 - 1.0, however.

The colour format passed to W3D would need to be updated accordingly. For Permedia, this representation is used anyway and for R100 and R200 it's directly supported too.
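A sketch of the kind of conversion involved. This is an illustrative packer, not the actual fast_normalized_to_u8() from util.h, and the ARGB byte layout is an assumption: four clamped, normalized floats become one packed uint32, cutting 16 bytes of colour down to 4.

```c
#include <stdint.h>

/* Illustrative RGBA packer - assumes inputs are already clamped to
 * 0.0 - 1.0, exactly as the util.h routines require. The ARGB byte
 * layout here is an assumption, not necessarily what W3D expects. */
static uint32_t pack_rgba8(float r, float g, float b, float a)
{
    uint32_t ri = (uint32_t)(r * 255.0f + 0.5f);   /* round, not truncate */
    uint32_t gi = (uint32_t)(g * 255.0f + 0.5f);
    uint32_t bi = (uint32_t)(b * 255.0f + 0.5f);
    uint32_t ai = (uint32_t)(a * 255.0f + 0.5f);
    return (ai << 24) | (ri << 16) | (gi << 8) | bi;
}
```

With two such colours per vertex, replacing four floats each with one uint32 saves 24 bytes, matching the figure above.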

I've managed to get my A1200 out of storage and will investigate what can be done to optimise DrawElements when discrete triangles are passed in strip/fan order. If this works, it might be applicable more generally.



Re: My MiniGL experiments,recompilation,tips,etc...


@BSzili

Not bad! How big is the residual drawing part now? Does it contain, for example, surplus sets of texture coordinates for texture units that aren't present? I wonder what could be trimmed.



Re: My MiniGL experiments,recompilation,tips,etc...


@Hans

Actually, strip optimisations are discussed in the page you linked as something to do as a final-stage optimisation. The reason given for using the triangles function was to limit the number of different GL calls; the engine renders pretty much everything using that one call.

Regarding the increase in traffic, it's not modest at the driver level. I don't know about the Radeon HD code, but the older drivers will repeatedly fetch each indexed vertex's data every time that index is used. Indexing only reduces the amount of data you define, not how much goes over the bus. Whether it's duplicated data in a Radeon command packet or duplicated data in a FIFO, the index matters only for source-data retrieval, not for what is then passed to the GPU.

I think I can see an obvious optimisation for the Permedia here, at least: keep track of the last index associated with each vertex register set (starting with your index trackers set to something impossible, like -1, at the start of each DrawElements call) and only submit updates to a given vertex register set if its index changes. If the indices are passed in fan or strip order, the end result would be the same behaviour as if you had explicitly requested a fan or strip: one vertex update per triangle rendered.

This approach might also work on radeon, but I would need to revisit the documentation for it.
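The index-tracking scheme just described can be sketched as follows. The names are illustrative (submit_vertex() and draw_triangle() stand in for whatever the Permedia2 driver actually does to load a vertex register set and kick off a render); this is a sketch of the idea, not the driver code.

```c
#include <stdint.h>

static int submits;   /* counts simulated vertex register uploads */

static void submit_vertex(int reg, uint32_t index)
{
    (void)reg; (void)index;
    submits++;        /* stand-in for writing one vertex's data to the chip */
}

static void draw_triangle(void)
{
    /* stand-in for issuing a render from the three vertex register sets */
}

/* For each triangle, upload an index only if it is not already resident in
 * one of the three register sets, evicting a register the current triangle
 * does not need. Trackers start at -1, an impossible index. */
static void draw_elements(const uint32_t *idx, int tri_count)
{
    int32_t last[3] = { -1, -1, -1 };

    for (int t = 0; t < tri_count; t++) {
        const uint32_t *tri = &idx[3 * t];
        for (int v = 0; v < 3; v++) {
            int resident = 0;
            for (int r = 0; r < 3; r++)
                if (last[r] == (int32_t)tri[v]) { resident = 1; break; }
            if (resident)
                continue;                        /* skip redundant upload */
            for (int r = 0; r < 3; r++) {        /* find an evictable reg */
                int needed = 0;
                for (int u = 0; u < 3; u++)
                    if (last[r] == (int32_t)tri[u]) needed = 1;
                if (!needed) {
                    submit_vertex(r, tri[v]);
                    last[r] = (int32_t)tri[v];
                    break;
                }
            }
        }
        draw_triangle();
    }
}
```

For N triangles whose indices arrive in strip order this performs N+2 uploads instead of 3N, the same traffic as an explicit strip, and degrades gracefully to 3 uploads per triangle when nothing is shared.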



Re: My MiniGL experiments,recompilation,tips,etc...


If the Quake 3 engine really is hammering GL_TRIANGLES but passing vertex indices in fan/strip order, then it ought to be theoretically possible to convert the incoming discrete triangles into fans or strips where they share edges. That could give up to a 2/3 reduction in the amount of data that needs to be transferred over the bus. Combine this with more efficient vertex packing in the first place and you might have a result.
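To make the potential saving concrete, here is a hypothetical helper (not MiniGL API) that scans a GL_TRIANGLES index array for strip-order runs and counts how many vertices would cross the bus if each run were sent as a strip (N+2 vertices for an N-triangle run) rather than as discrete triangles (3N vertices). Winding-order flips on alternate triangles are ignored for simplicity.

```c
#include <stdint.h>

/* Count bus vertices if strip-order runs were rendered as strips.
 * Strip order means a triangle's first two indices are the previous
 * triangle's last two (winding flips ignored - an assumption). */
static int strip_order_vertex_count(const uint32_t *idx, int tri_count)
{
    int total = 0;
    int in_run = 0;

    for (int t = 0; t < tri_count; t++) {
        const uint32_t *tri = &idx[3 * t];
        if (in_run && tri[0] == tri[-2] && tri[1] == tri[-1]) {
            total += 1;            /* one new vertex extends the strip */
        } else {
            total += 3;            /* start a new strip run */
            in_run = 1;
        }
    }
    return total;
}
```

For Q triangles all in one strip-order run this gives Q+2 versus 3Q, approaching the 2/3 reduction mentioned above as Q grows.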



Re: My MiniGL experiments,recompilation,tips,etc...


@BSzili

That's pretty conspicuous: a 31% drop in performance.



Re: My MiniGL experiments,recompilation,tips,etc...


@Hans
Quote:
The accesses don't have to be totally random. For example, this document about optimizing drivers for Quake 3 states that the vertices in the GL_TRIANGLES array are in tri-strip order. So, in that case the vertex accesses are always within a small window.


Well, random was meant to be the pathological worst case. However, the above also sounds bad, unless I have misunderstood it: why render triangle strips out of discrete triangles when drivers can render triangle strips directly?

Rendering a strip (or fan) requires 1 extra vertex for each additional triangle, whereas you need 3 for each discrete triangle.



Re: My MiniGL experiments,recompilation,tips,etc...


Even without cachegrind we aren't helpless. One way to test the fat-vertex hypothesis would be to write synthetic tests for Warp3D directly:

1) Test W3D_DrawArray/Elements with a triangle list using a compact, minimal vertex structure.

2) Run the same tests again using a fat vertex structure (the same size as MGLVertex) in which the W3D vertex structure is embedded. Initialise the extra space with any old crap to ensure the cache isn't hot for just the parts we care about.

Time both tests carefully for different sized vertex arrays.

I wouldn't be surprised if, once the vertex array exceeds some CPU-dependent magic number, the performance drops suddenly, especially for DrawElements, which generally makes random accesses into the vertex array. This would correspond to the point at which you keep having to re-fetch the data from RAM because you can't keep enough vertices in the cache.
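The two tests described above can be sketched like this. The sizes and field names are illustrative (the real structures are W3D_Vertex and MGLVertex); the point is that the access loops are identical and only the stride differs, so any timing gap between them comes from cache behaviour.

```c
#include <stdint.h>

/* Illustrative vertex layouts: the same 16 bytes of useful data, once
 * compact and once padded out to a fat 128-byte stride, mimicking a
 * small W3D-style vertex embedded in a fat MGL-style one. */
typedef struct { float x, y, z, w; } CompactVertex;             /* 16 bytes */
typedef struct { float x, y, z, w; char pad[112]; } FatVertex;  /* 128 bytes */

/* Sum one field via random indices, DrawElements-style. Time both loops
 * over increasing array sizes: once the fat array exceeds the data cache,
 * its loop should slow down disproportionately, because each access drags
 * a cache line that is mostly padding. */
static float sum_compact(const CompactVertex *v, const uint32_t *idx, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[idx[i]].x;
    return s;
}

static float sum_fat(const FatVertex *v, const uint32_t *idx, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += v[idx[i]].x;
    return s;
}
```

The CPU-dependent "magic number" mentioned above would show up as the array size at which the fat loop's time-per-access jumps while the compact loop's does not.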



Re: My MiniGL experiments,recompilation,tips,etc...


What is really needed is the ability to run parts of this code through a tool like cachegrind. I've done it for whole binaries on Linux, but I'm not sure what we can do here.



Re: My MiniGL experiments,recompilation,tips,etc...


@thellier

I will try to find some time this weekend to merge your changes back into the SVN repository. I will start with just the compiler-warning fixes for now and examine what else can be incorporated without pulling in any new bugs, such as the line-draw issue you mention.

Don't be too disheartened at the lack of apparent performance improvement at this stage. You have eliminated one of several potential problem areas, and it may be that your changes further increase performance once other, more significant bottlenecks are eliminated.





Re: My MiniGL experiments,recompilation,tips,etc...


Speeding up MiniGL was always going to be a "simple" task for someone as long as it was just talk.

The reality is that it's not so easy. I wrote some basic built-in profiling to try to identify slow or often-called code and performed some optimisations based on that, but the problems are mostly not going to be solved that way. Your optimisations will probably help the slowest machines; on faster ones, other factors become more important. Cache utilisation is a much bigger issue there, I think. It is interesting to note that a lot of older MiniGL builds are faster despite using theoretically* slower V3 Warp3D calls. This is probably due to the more compact MiniGL vertex structures back then (supporting fewer features), which lead to better cache usage.

*In practice: simpler and easier to write optimised driver code for than split/interleaved pointers, even if the legacy W3D_Vertex format is a bit silly.



Re: more Warp3d testing with Microbe 3D


If you are using an R100/200 system, try with and without the z buffer options described here:

http://wiki.amigaos.net/wiki/UserDoc:Warp3D#Configuration_5

I have previously observed a noticeable performance improvement for applications that don't need a stencil buffer when using a 16-bit Z buffer.



Re: Where are the OS41FE's MiniGL sources ?


I should point out that I used the guides linked above.



Re: Where are the OS41FE's MiniGL sources ?


@thellier

All the versions I compiled were from the updates-kc branch (which was created to coincide with concurrent updates to Warp3D upon which it depended).

I used a cross-compiler environment from Linux:

ppc-amigaos-gcc -v
Using built-in specs.
Target: ppc-amigaos
Configured with: ../adtools-gcc-4.4.2/configure --target=ppc-amigaos --prefix=/usr/local/amiga --enable-languages=c,c++ --enable-haifa --enable-sjlj-exceptions --disable-libstdcxx-pch
Thread model: single
gcc version 4.4.3 (GCC)

One or two versions may have been compiled on the A1 but that was ages ago.



Re: BSzili port requests


@BSzili

It's quite likely you'll run into problems with 2K 32-bit textures on 16-bit displays.

http://forum.hyperion-entertainment.b ... ic.php?f=14&t=2955#p33905



