Since Hans is the developer of the RadeonHD driver and surely has access to a beta system, I guess he knows that the new driver and FE won't speed up anything.
Hence he brought up this thread.
I have seen the Prometheus trailer being used as a baseline. How far on average is that from being perfect? This difference from baseline to perfect could be the stated improvement required for a bounty to be deemed successful.
The Prometheus trailer IS the baseline. It displays at 23.976 fps, but for most films you need 25 fps or even 30 fps. These tests are also without sound, so you need to add another couple of fps to allow for audio decoding, plus another one for windowed mode.
What about waiting the release of Final Edition and see how FE + RadeonHD 2.4 perform together before taking a decision for this bounty? Or is it a known fact that it won't increase the video framerates and it's only decoding that needs improvement?
As Severin said, his results are with FE + RadeonHD 2.4, and yes, it most certainly does make a difference.
Faster decoding would improve performance even more. Hence this thread...
@Raziel Quote:
Since Hans is the developer of the RadeonHD driver and surely has access to a beta system, I guess he knows that the new driver and FE won't speed up anything.
As I said above, the FE + new driver most certainly does speed things up. That was the whole point of implementing composited video!
@ddni Quote:
@Hans
Thanks.
Yes, I am aware of the benchmark. What I meant was how much of an improvement is expected / acceptable to justify the expense?
I have seen the Prometheus trailer being used as a baseline. How far on average is that from being perfect? This difference from baseline to perfect could be the stated improvement required for a bounty to be deemed successful.
Severin's tests (with FE + RadeonHD 2.4) show that it runs at 18 fps with the loop filter enabled, which would need a 33%+ improvement to hit the needed 24 fps. At this stage we have no idea how much improvement we could expect if the missing AltiVec code were added, so it would be unrealistic to expect or demand a 33% improvement. Don't forget that the H.264 codec in ffmpeg is already partially AltiVec optimized. You can't expect feanor to do all the work and have his payment contingent on something he can't control.
Severin also tested with the loop filter disabled (which is faster at the expense of quality) and got 22 fps. That would need a 10%+ improvement, but I still couldn't give you a realistic estimate of how much improvement we could expect.
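As a quick sanity check on those two targets (plain awk arithmetic only; 24 fps taken as the goal, 18 and 22 fps as Severin's measured rates):

```shell
# Percentage speedup needed to reach 24 fps from each measured rate.
awk 'BEGIN {
  printf "loop filter on  (18 fps): %.1f%% needed\n", (24 / 18 - 1) * 100
  printf "loop filter off (22 fps): %.1f%% needed\n", (24 / 22 - 1) * 100
}'
```

Note the exact figure for the 22 fps case is 9.1%, which the "10%+" above rounds up.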
It really is a bit of a gamble. The added altivec code will make a difference, but we have no idea how much.
Compared to other architectures, the ffmpeg project is missing AltiVec-specific code that would improve performance. I will donate if a bounty is created, both to obtain such code and to read how the work was done (I would like to learn the approach used in this kind of work).
But are we sure this is the main bottleneck in H264 decoding?
I did some tests, not with mplayer but with ffmpeg directly, because we need to focus on what we want to measure (video decoding). Keeping the mplayer layer would make things heavier and more complicated.
When talking about optimization, we must have a benchmark that is:
- measurable
- reproducible
- configurable (easy to set/unset AltiVec)
I extracted a 20-second sequence of the Prometheus video and measured only the decoding (which takes 100% of the CPU), on my Mac Mini under Debian 7.
So the first conclusion is that with AltiVec, the video takes almost 22 seconds to decode, compared to 29 seconds without (a 24% speedup). Note that ffmpeg has an option to enable/disable the use of AltiVec!
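The toggle can be sketched like this (clip.mp4 is a placeholder for the extracted 20 s sequence, and the benchmark lines assume an ffmpeg build with AltiVec support, so they are shown as comments; the live part only re-derives the speedup figure from the measured times):

```shell
# The two benchmark runs (need the actual clip and an AltiVec-capable build):
#   ffmpeg -cpuflags 0       -benchmark -i clip.mp4 -f null -   # AltiVec off
#   ffmpeg -cpuflags altivec -benchmark -i clip.mp4 -f null -   # AltiVec on
# Speedup implied by the measured wall times (29 s scalar vs 22 s AltiVec):
awk 'BEGIN { printf "speedup: %.0f%%\n", (29 - 22) / 29 * 100 }'
```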
Then, I used the Linux perf tool in various ways.
If we want to control improvements, we have to set references and describe a process. I don't mean a complex protocol.
Results here show statistically where the time is spent (again, with and without Altivec enabled):
Quote:
sudo perf record -a ./ffmpeg_g -cpuflags altivec -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
sudo perf report --stdio
How detailed are the perf tool's reports? I assume that ff_h264_decode_mb_cabac calls lots of other functions. Can it list how much it spends in each? Perhaps a video that's decoded in 22-29 seconds is too short to build up enough statistical data for more detail.
It would be great if we could get similar profiler reports for ffmpeg on an x86. If it's detailed enough then we could gain insights into which functions are SSE optimised, and how much of a difference they make.
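For per-callee detail, one possible invocation (a sketch on my part, reusing the paths and ffmpeg_g binary from the earlier command; call-graph quality depends on the kernel and on how ffmpeg was built):

```shell
# Record with call-graph data (-g), then view the report folded by caller/callee.
sudo perf record -g ./ffmpeg_g -cpuflags altivec -benchmark \
    -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
sudo perf report -g graph --stdio    # expands the callees of each hot function
```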
Thanks! At the moment, the results give a subset of time-consuming functions (to optimize):
- ff_h264_decode_mb_cabac
- get_cabac
- hl_decode_mb_simple_8
- loop_filter
- fill_decode_caches
The next step is to check which of these could be AltiVec optimized.
About the performance monitor, note that there are specific counters for AltiVec that could be used to find AltiVec-specific problems, notably whether a vector unit "is waiting for an operand".
Other events could also be checked, like L2 cache misses, DTLB misses, ...
I looked at the get_cabac functions. I'm not sure they could be AltiVec optimized, but they have asm versions on ARM and x86. It rather looks like there is register pressure there, and those asm versions tend to be branchless.
@Hans Not sure perf is able to list the time spent in each subfunction, but it can provide a call graph. You're right, the chosen duration is maybe too short for accurate results, but it is OK for finding the functions that cause penalties.
@feanor What machines do you own?
I've started to look at the generated code of get_cabac ...
Quote:
You're right, the chosen duration is maybe too short for accurate results but it is ok to find functions that cause penalty.
Okay. If you look for longer videos, do realize that CABAC is one of many options in H.264; not all H.264 videos use it. So you'll have to look for videos that were encoded with similar settings. I'm not sure what you could use to check the encoding settings, though.
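One possible way (a suggestion, not something tested here): ffprobe can report the H.264 profile, and Baseline profile streams never use CABAC, while Main/High ones usually do (MediaInfo reports the CABAC flag directly). clip.mp4 below is a placeholder:

```shell
# Print codec and profile of the first video stream of a candidate file.
ffprobe -v error -select_streams v:0 \
        -show_entries stream=codec_name,profile \
        -of default=noprint_wrappers=1 clip.mp4
```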
Interestingly, the Wikipedia article about CABAC says that it's hard to parallelize and vectorize.
I don't think it's a good idea for me to handle the bounty personally.
Kickstarter isn't a bounty system. It's more of a micro-investment system, where people can invest in a project (i.e., you get the money once it has been funded, and not after the project is done).
I only suggested it because we're not getting anywhere with setting up a bounty. There are disagreements over which bounty website to use, and some want the bounty to come with minimum performance guarantees that you simply can't provide.
I know; I've created one in the past. The thing is that this case is not a full-blown vectorization effort that would justify a Kickstarter project. That is, we're not really vectorizing a full codec (like x265, for example), but only adding a few optimizations where needed to get the extra % of performance.
Also, with a Kickstarter there is always the risk of never gathering the necessary funds, or of never receiving them even when gathered. I think a bounty is safer, as you can always request your money back if the bounty fails (at least I've done that before). As a suggestion, here is a possible workflow on Bountysource:
1) fork ffmpeg on GitHub
2) attach it to Bountysource, by installing the relevant plugin on the GitHub project (one can log in to Bountysource with their GitHub account)
3) create the particular tickets on GitHub (they appear automatically on Bountysource)
4) I can state that I will work on this particular ticket (or more) for the requested amount
5) people can donate to these particular tickets
6) when done, I can post the patch inside the ticket itself, or even do a pull request from my own tree, and claim the bounty
7) people can review it and, when happy, accept the claim
That's it. I've already done this before on Bountysource (I did the VSX port of Eigen, which was coincidentally an IBM bounty on Bountysource; I was already 70% done when I discovered that, so it was easy for me :)
I'd never heard of bountysource before, but it sounds like a good system. Whoever creates the bounty should bear in mind the 10% withdrawal fee, though.
Since we're speaking about mplayer, why not enlarge the bounty a bit, if possible, to get extended AltiVec optimization across the whole code base where possible?
And maybe feanor would be interested in getting a full AltiVec-capable AmigaOS4 machine (maybe an X1000) in exchange for his work, if A-Eon itself offered a discount to encourage it?
I don't know how much money would be needed for that, but maybe we can get a new developer that way, and what a developer! :D
Thank you for your kind words, I really appreciate it, but I have to be modest. I consider myself a decent developer, with enough skills to cope with many tasks (not all; kernel/firmware stuff is where I draw the line, I just don't have the necessary experience). Maybe my 'advantage' is that I *really* like SIMD stuff in general, enough to have devoted literally thousands of hours to it. Now, with regard to getting a new Amiga system in exchange for my services: as much as I would like that, I'm pretty sure my wife would object, as that doesn't really help to feed my family, as I'm sure many of you already know. :)
I already have too many projects I'm working on for free, like Debian/armhf, Eigen (where I maintain both the NEON and AltiVec/VSX ports), a new portable SIMD library (yes, I know there are many, but I'm working on some features not found in others, at least I hope it works out that way), and a SIMD book (that of course covers AltiVec), and I have a day job on the side (that unfortunately does not involve SIMD in the slightest). If I'm going to justify allocating yet more hours out of my already pressed schedule, then I'll have to make sure it helps my family; otherwise I'm just not going to do it.
Regarding mplayer itself, I think it's prudent to take it in small steps. Let's see if and how this works out, whether it satisfies the requirements people have set, whether there is more room for improvement, etc. Usually there is *always* room for more optimization, but the result has to justify the effort.