
Board index » All Posts (corto)




Re: SDL1 open issues
Not too shy to talk


@Capehill

I opened issues #16 and #17, which are still valid. The build needs clarification (documentation): options, native build or cross-compilation, which config file is used, ...


Re: (solved) Hieronymus not working on Sam440 ?


@Severin

The message on the X1000 does not say much, but even if it did, it would be as obscure as on the Sam440.

The initial implementation of Hieronymus relies on a system feature that was not fully implemented on the Sam440 or the X1000. That was fixed for the Sam440 in, let's say, Update 5. It is still not fixed on the X1000.

For the X1000, I developed a second implementation that works well on the MicroAOne but has a problem on the X1000. Lately, though, I have come to think there may still be some hope. We will see after some tests.


Re: My MiniGL experiments,recompilation,tips,etc...


@Karlos
Right, a simulator could be more accurate. A hardware trace system would be even better.
But for now, I think a statistical profiler is useful and could show some surprises. About the sampling frequency, you're right; this is why Brendan Gregg (an expert on system performance) uses 99 Hz.
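
To illustrate the sampling frequency point, here is a minimal sketch of lockstep sampling. The workload is my own assumption, not something measured: a hypothetical task that wakes up every 10 ms (100 Hz) and runs for 1 ms each time, so it uses 10% of the CPU. The function name and the 2.5 ms offset are also made up for the example.

```python
from fractions import Fraction

def count_hits(sample_hz, duration_s=10):
    """Count samples that land while a hypothetical periodic task is running."""
    period = Fraction(1, 100)   # assumed: task wakes up every 10 ms
    busy = Fraction(1, 1000)    # assumed: task runs 1 ms per wake-up (10% CPU)
    offset = Fraction(1, 400)   # assumed: first sample 2.5 ms after a wake-up
    total = sample_hz * duration_s
    hits = 0
    for n in range(total):
        t = offset + Fraction(n, sample_hz)  # exact time of sample n
        if t % period < busy:                # is the task running at time t?
            hits += 1
    return hits, total

hits_100, total_100 = count_hits(100)  # sampler in lockstep with the task
hits_99, total_99 = count_hits(99)     # sampler slowly drifts across the period

print(f"100 Hz: {hits_100}/{total_100} samples hit the task")
print(f" 99 Hz: {hits_99}/{total_99} samples hit the task")
```

At 100 Hz every sample falls at the same point of the task's period, so here the task is never seen. At 99 Hz the sample point drifts through the period and the share of hits approaches the real 10%.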


Re: My MiniGL experiments,recompilation,tips,etc...


@Karlos
The use case you describe can theoretically happen; nothing is impossible. But statistically, there is no workload like that. If a program consumes 1% of the CPU time, then over 10 seconds of sampling at 50 Hz (500 samples) you will hit it about 5 times.
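
The arithmetic can be sketched as follows. This assumes each sample is independent and hits the program with probability equal to its CPU share (a simplification, and the variable names are mine):

```python
import math

cpu_share = 0.01   # program uses 1% of the CPU time
sample_hz = 50     # sampling frequency
duration_s = 10    # length of the profiling run

samples = sample_hz * duration_s        # number of samples taken
expected_hits = samples * cpu_share     # average number of samples in the program

# Under the independence assumption the hit count is binomial.
std_dev = math.sqrt(samples * cpu_share * (1 - cpu_share))
p_never_seen = (1 - cpu_share) ** samples

print(f"samples taken:        {samples}")
print(f"expected hits:        {expected_hits:.1f}")
print(f"standard deviation:   {std_dev:.2f}")
print(f"chance of zero hits:  {p_never_seen:.2%}")
```

So "more or less 5" means a handful of hits with a spread of about 2, and under this assumption the chance of missing the program completely is well below 1%.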


Re: My MiniGL experiments,recompilation,tips,etc...


@Karlos
Hieronymus is a statistical profiler. It collects samples (let's say 50 or 60 times per second), each recording the address of the instruction that was executing. Then it finds the corresponding program and function.
Statistically, that gives the proportion of time consumed by each running application.

So it is not intrusive and gives a great view of system activity. And when you run a program, you also see the percentage of time spent in the libraries it uses.

The idea is the same as the "perf" tool that comes with Linux.
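
A minimal sketch of the principle, not of how Hieronymus itself is written: sampled addresses are looked up in a sorted symbol table and counted. The symbol table, the addresses and the function names are all invented for the example.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [
    (0x1000, "main"),
    (0x1400, "decode_frame"),
    (0x2000, "draw_triangles"),
]
END_OF_CODE = 0x3000  # hypothetical end of the last function
STARTS = [start for start, _ in SYMBOLS]

def resolve(address):
    """Return the name of the function containing the sampled address."""
    if address < STARTS[0] or address >= END_OF_CODE:
        return "<unknown>"
    index = bisect_right(STARTS, address) - 1  # last symbol starting at or before address
    return SYMBOLS[index][1]

def profile(samples):
    """Turn a list of sampled addresses into a percentage per function."""
    counts = Counter(resolve(address) for address in samples)
    total = sum(counts.values())
    return {name: 100.0 * count / total for name, count in counts.items()}

# Made-up samples, as if the instruction address had been read on each timer tick.
samples = [0x1004, 0x1410, 0x2008, 0x2010, 0x2FFC,
           0x2100, 0x1450, 0x2004, 0x2ABC, 0x9000]
report = profile(samples)

for name, percent in sorted(report.items(), key=lambda item: -item[1]):
    print(f"{percent:5.1f}%  {name}")
```

The profiled program is never modified; all the work is in taking the samples and resolving them afterwards, which is why the approach is not intrusive.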

@Hans
Thanks for the information. I was not very comfortable giving results yet, so take them as early, live results, not confirmed. CPU time:
59% in W3D_Radeon.library
13% in ATIRadeon.chip
11% in Quake3
1% in minigl.library

Note that these come from the alternative sampling mode I've just developed (using the performance monitor), so I would like to compare them with the "standard" mode.
I obtained them very late yesterday and was too tired to do other runs or check with other programs.


Re: My MiniGL experiments,recompilation,tips,etc...


Sorry to join the discussion late; I have been reading it since it started.
It seems that MiniGL can be improved, and it's great to see that some improvements have already begun.

You talked about Valgrind, and I confirm that such a tool can't really be ported to AmigaOS. But tools like that are certainly necessary.

I launched Quake3 on my MicroAOne (which is at the limit of its capabilities) and, using my profiler Hieronymus, found that much of the time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.

Note that I recently added an alternative mode to my profiler that allows profiling on L2 cache misses (at least on my G3 CPU at the moment). That will be interesting too.

I will have to test Cow3D and compile MiniGL with debug symbols to confirm some results.


Re: Any altivec experts? (H.264 codec)


@feanor
OK, so there is a difference between theory and reality.
I did read things like that in Altivec related documents, but only the store variants (dstst) were discouraged (potentially using dcbz instead to clear memory that is about to be written, avoiding fetching data from RAM).


Re: Any altivec experts? (H.264 codec)


@K-L
All the processors mentioned (G4, G5, PA6T) use the same Altivec (VMX, in IBM terminology) instruction set, even if the implementations differ. For example, G5 and PA6T can issue 3 instructions per cycle but have 2 dispatch units (sub-units). In the past, I thought that the PA6T's Altivec was weaker, but now ... I don't know. Another interesting point to study (with the dnetc case).

@feanor
Are you sure dst instructions only exist on G4s? The G5 user manual mentions them, and the 970 even has 8 streams, instead of 4 on the G4.


Re: Any altivec experts? (H.264 codec)


@zzd10h
No, I haven't recompiled it for OS4 yet, as I am investigating another part of the code with tools (mainly perf, in fact) that are only available on Linux ...

@tommysammy
I use this command line:
./ffmpeg_g -cpuflags altivec -benchmark -i Prometheus-1080p-30s.mp4 -f null /dev/null

And really, I often run it prefixed with the perf command.

Note that just like feanor, I extracted 30 seconds from the original video, even if in my case that was not from the very beginning but after a 30-second delay:

./ffmpeg -i ~/Videos/Prometheus\ -\ Trailer.mp4 -ss 30 -t 30 -vcodec copy -acodec copy ~/Videos/Prometheus-1080p-30s.mp4


Re: Any altivec experts? (H.264 codec)


@zzd10h I used this video prometheus-trailer.zip that Severin previously recommended in this thread.

Note that with feanor's patches available on github, I also get a 5% improvement on H264 decoding with 1080p videos (Prometheus and Bourne Ultimatum trailers), on my MacMini under Linux.


Re: Finding where a .library is allocated in memory ?


@thellier

If the crashes are memory related (and really, in any case), make sure the compiler helps you as much as it can. Have you enabled warning options like -Wall -Wextra -Wwrite-strings?

Compiling with another optimization level (for example -O0) might change the behavior and give a clue.

Another strategy would be to build the library as a static library; crashes could be easier to track down.

Maybe check on another system like OS4 (if that makes sense ... maybe your code is specific to UAE), or compile with vbcc, which could warn about other things.


Re: Any altivec experts? (H.264 codec)


@tommysammy

Thank you for opening the bounty, which will hopefully allow feanor to work on this soon. But I think the description is very basic: there are no numbers to give a baseline, only one video is given as a reference, there is only one small mention of H264 (the initial discussion was specifically about optimizing this codec), the title specifies OS4 but the description mentions MorphOS (for me, this work is simply PowerPC related), ...
The goal is described as a "speed advantage" on hardware that feanor does not own. That will make any speedup difficult to see.


Re: Any altivec experts? (H.264 codec)


@gregthecanuck, @Hans

Thanks! At the moment, results give a subset of functions that are time consuming (to optimize):
ff_h264_decode_mb_cabac
get_cabac
hl_decode_mb_simple_8
loop_filter
fill_decode_caches

It remains to check which of these could be Altivec optimized.

About the performance monitor, note that there are specific counters for Altivec that could be used to find Altivec-specific problems, notably whether a vector unit "is waiting for an operand".

Other events could also be checked, like L2 cache misses, DTLB misses, ...

I looked at the get_cabac functions. I am not sure they could be Altivec optimized, but they have asm versions for ARM and x86. It rather looks like there is register pressure there, and these asm versions tend to be branchless.

@Hans Not sure perf is able to list time spent in each subfunction but it can provide a callgraph.
You're right, the chosen duration is maybe too short for accurate results but it is ok to find functions that cause penalty.

@feanor What machines do you own?

I've started to look at the generated code of get_cabac ...


Re: Any altivec experts? (H.264 codec)


Compared to other architectures, the ffmpeg project lacks Altivec-specific code, which would improve performance. I will donate if a bounty is created, both to obtain such code and to read how the work was done (I would like to learn about the approach to this kind of work).

But are we sure this is the main bottleneck in H264 decoding?

I did some tests, not with mplayer but with ffmpeg directly, because we need to focus on what we want to measure (video decoding). Keeping the mplayer layer will make things heavier and more complicated.

When talking about optimization, we must have a benchmark that is:
- measurable
- reproducible
- configurable (easy to set/unset AltiVec)

I extracted a 20-second sequence of the Prometheus video and I only measured the decoding (that takes 100% of the CPU), on my MacMini under Debian 7.

With Altivec:
Quote:

./ffmpeg_g -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
frame= 402 fps= 17 q=0.0 Lsize=N/A time=00:00:20.18 bitrate=N/A
video:25kB audio:3444kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=21.728s
bench: maxrss=28596kB


Without Altivec:
Quote:

./ffmpeg_g -cpuflags 0 -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
frame= 402 fps= 13 q=0.0 Lsize=N/A time=00:00:20.18 bitrate=N/A
video:25kB audio:3444kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=29.008s
bench: maxrss=28668kB


So the first conclusion is that with Altivec the video takes almost 22 seconds of CPU time to decode, compared to 29 seconds without: about 25% less time (29.008 s / 21.728 s is roughly 1.34 times faster).
Note that ffmpeg has an option to set/unset the use of Altivec!
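
As a small sketch of how the two runs can be compared, the snippet below reads the "bench: utime=" lines quoted above and derives both figures. The helper name is mine; the two input lines are copied from the ffmpeg output.

```python
import re

# "bench:" lines copied from the two ffmpeg runs above.
WITH_ALTIVEC = "bench: utime=21.728s"
WITHOUT_ALTIVEC = "bench: utime=29.008s"

def parse_utime(line):
    """Extract the user CPU time in seconds from an ffmpeg -benchmark line."""
    match = re.search(r"utime=([0-9.]+)s", line)
    if match is None:
        raise ValueError(f"no utime found in: {line!r}")
    return float(match.group(1))

fast = parse_utime(WITH_ALTIVEC)
slow = parse_utime(WITHOUT_ALTIVEC)

time_saved = (slow - fast) / slow   # fraction of CPU time removed by Altivec
speedup = slow / fast               # how many times faster the Altivec run is

print(f"CPU time saved: {time_saved:.1%}")
print(f"speedup factor: {speedup:.2f}x")
```

The two numbers describe the same measurement: "time saved" is relative to the slow run, while the speedup factor is relative to the fast one, so it is worth saying which one is quoted.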

Then, I used the Linux perf tool in various ways.

If we want to control improvements, we have to set references and describe a process. I don't mean a complex protocol.

Results here show statistically where the time is spent (again, with and without Altivec enabled):

Quote:

sudo perf record -a ./ffmpeg_g -cpuflags altivec -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
sudo perf report --stdio

9.29% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
7.62% ffmpeg_g ffmpeg_g [.] put_h264_chroma_mc8_altivec
6.59% ffmpeg_g ffmpeg_g [.] get_cabac
5.83% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
3.94% ffmpeg_g ffmpeg_g [.] loop_filter
3.89% ffmpeg_g ffmpeg_g [.] ff_put_pixels16_altivec
3.74% ffmpeg_g ffmpeg_g [.] fill_decode_caches

sudo perf record -a ./ffmpeg_g -cpuflags 0 -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
sudo perf report --stdio

7.16% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
6.89% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_h_lowpass_8
6.13% ffmpeg_g ffmpeg_g [.] put_h264_chroma_mc8_8_c
4.73% ffmpeg_g ffmpeg_g [.] get_cabac
4.70% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_hv_lowpass_8
4.65% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
4.52% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_v_lowpass_8
3.98% ffmpeg_g ffmpeg_g [.] put_h264_qpel16_mc00_8_c
3.42% ffmpeg_g ffmpeg_g [.] avg_h264_chroma_mc8_8_c
3.13% ffmpeg_g ffmpeg_g [.] loop_filter


That gives an idea of the most time consuming functions.
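
To pick out what is left to optimize, a short sketch like the one below can parse the "perf report --stdio" lines from the Altivec run and separate the functions that already have an Altivec version. Treating the "_altivec" suffix as the marker is my assumption, based only on the names visible in the output.

```python
import re

# Lines copied from the "perf report --stdio" output of the Altivec run above.
REPORT = """\
9.29% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
7.62% ffmpeg_g ffmpeg_g [.] put_h264_chroma_mc8_altivec
6.59% ffmpeg_g ffmpeg_g [.] get_cabac
5.83% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
3.94% ffmpeg_g ffmpeg_g [.] loop_filter
3.89% ffmpeg_g ffmpeg_g [.] ff_put_pixels16_altivec
3.74% ffmpeg_g ffmpeg_g [.] fill_decode_caches
"""

LINE = re.compile(r"^\s*([0-9.]+)%\s+\S+\s+\S+\s+\[.\]\s+(\S+)\s*$")

def parse_report(text):
    """Return a list of (symbol, percent) pairs, skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            entries.append((match.group(2), float(match.group(1))))
    return entries

entries = parse_report(REPORT)
# Assumption: a function with an Altivec version has a name ending in "_altivec".
scalar = [(name, pct) for name, pct in entries if not name.endswith("_altivec")]
scalar_names = [name for name, _ in scalar]
scalar_total = sum(pct for _, pct in scalar)

print("Candidates without an Altivec version:")
for name, pct in scalar:
    print(f"  {pct:5.2f}%  {name}")
print(f"Together: {scalar_total:.2f}% of the samples")
```

The percentages of the two runs are relative to different totals, so they should be compared within a run, not subtracted across runs.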

Then I ran "perf stat" to get an overview, and it reports 18% branch misses, which seems high!

Finally, I measured again with other perf events:

Quote:

sudo perf record -a -e instructions ./ffmpeg_g -cpuflags altivec -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null

# Events: 22K instructions
#
# Overhead Command Shared Object Symbol
# ........ ............. ........................... ..................................................
#
12.79% ffmpeg_g ffmpeg_g [.] get_cabac
7.35% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
4.31% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
4.29% ffmpeg_g ffmpeg_g [.] decode_cabac_residual_nondc_internal
4.17% ffmpeg_g ffmpeg_g [.] loop_filter
3.93% ffmpeg_g ffmpeg_g [.] get_cabac_noinline
3.89% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_h_lowpass_8
3.85% ffmpeg_g ffmpeg_g [.] fill_decode_caches
3.62% ffmpeg_g ffmpeg_g [.] put_h264_chroma_mc8_altivec
3.52% ffmpeg_g ffmpeg_g [.] ff_h264_filter_mb

# Events: 22K branch-misses
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................... ..............................................
#
22.30% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
13.48% ffmpeg_g ffmpeg_g [.] decode_cabac_residual_nondc_internal
11.27% ffmpeg_g ffmpeg_g [.] ff_h264_filter_mb
7.56% ffmpeg_g ffmpeg_g [.] fill_decode_caches
4.10% ffmpeg_g ffmpeg_g [.] get_cabac
3.69% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
3.17% ffmpeg_g ffmpeg_g [.] h264_idct_add8_altivec




Re: Any altivec experts? (H.264 codec)


If the wasted processing time is in the decoder, I suggest that we avoid the mplayer layer and use ffmpeg only. With mplayer, we would have to deal with its version, the operating systems and their versions, etc.
It will also be easier to compare with x86 and ARM by building ffmpeg for them with and without SIMD.

Let's choose:
- an ffmpeg revision
- 3 videos (the 1080p prometheus being the first one) to check that different parts of the code are exercised
- 2 or 3 pieces of hardware


Re: Any altivec experts? (H.264 codec)


@tlosm Some of us already did ... since AmigaBlitter previously posted this link

I didn't unpack and read every file, though.


Re: Any altivec experts? (H.264 codec)


I am not sure everyone who was listed is an expert in AltiVec ... but it doesn't matter.

I am OK with asking markos (from freevec.org), but whatever we choose, we have to:
1. Select a few H264 videos to serve as common references
2. Measure the performance baseline with ffmpeg on selected machines
3. Profile to know if bottlenecks are really where we suppose
4. Define targets / expected improvements

These are prerequisites to any bounties or contracts.


Re: Any altivec experts? (H.264 codec)


@AmigaBlitter

Quote:

Using altivec, btw, will cut out the AmigaOne 500 and sam owners.

For those, i would like to suggest you to check out the PPC 440 and 460 internal DSP. This dsp have 24 instructions that can improve audio video decoding.

Here are some interesting documents you could check:

https://www-01.ibm.com/chips/techlib/t ... PowerPC_440_Embedded_Core

this is especially interesting:
https://www-01.ibm.com/chips/techlib/t ... Optimized_dsp_440_app.pdf


I did read these docs, and I also profiled ffmpeg on the 440 years ago. I tried to optimize, but the effects were not visible. The ffmpeg developers know how to program, and I think the code is already efficient.
Many other CPU features could be used, but I'm afraid the MAC instructions won't be enough.

By the way, there is already a macro in ffmpeg that uses one of these MAC instructions in some places.

Looking again at this topic would be another interesting task!

Quote:

something similar exist for the 460 too.


Right. The CPU core is basically the same.


Re: Any altivec experts? (H.264 codec)


@Hans: I think a request could be made to the ffmpeg team. I also want to point out another possible opportunity: freevec.org. Its author specializes in SIMD and AltiVec and recently offered his services (as paid work).


Re: Still interested in Huno's games ?


@fingus

You wrote: "Another story is the fact that wapz3d isn´t running that smooth on my NG, maybe with new RadeonHD-Driver i will dig into the World of 3D Edoshooters again"

There is certainly something to do about that. Which machine do you have?

To answer my friend K-L about the games of my friend Huno :)

Unfortunately, I don't have enough time to play games. And if I had, I would spend it developing. Huno: you know that one day we will really work on a common project!

Like Hans, I think the work of Huno is well received and appreciated.
