|
|
|
|
Re: SDL1 open issues
|
Posted on: 2016/2/8 21:36
#61
|
Not too shy to talk
|
@Capehill
I opened issues #16 and #17 that are still valid. Clarification (documentation) is required about the build (options, native build or using cross-compilation, used config file, ...).
|
|
|
|
Re: (solved) Hieronymus not working on Sam440 ?
|
Posted on: 2015/8/16 22:22
#62
|
Not too shy to talk
|
@Severin The message on the X1000 does not say much, but even if it had, it would have been as obscure as the one on the Sam440.
The initial implementation of Hieronymus relies on a system feature that was not fully implemented on the Sam440 or the X1000. That was fixed for the Sam440 in, let's say, Update 5. It is still not the case on the X1000. For the X1000, I developed a second implementation that works well on the MicroAOne, but there is a problem with it on the X1000. These days, though, I think there may be some hope left. We will see after some tests.
|
|
|
|
Re: My MiniGL experiments,recompilation,tips,etc...
|
Posted on: 2015/7/9 7:47
#63
|
Not too shy to talk
|
@Karlos Right, a simulator could be more accurate. A hardware trace system would be even better. But for now, I think a statistical profiler is useful and could show some surprises. About the sampling frequency, you're right; this is why Brendan Gregg (a master of system performance) uses 99 Hz.
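A rough sketch of why an odd rate such as 99 Hz helps, assuming the concern is a sampler running in lockstep with a periodic workload. The workload here is hypothetical: it wakes up every 1/50 s and is busy for the first 10% of each period, so its true CPU share is 10%.

```python
from fractions import Fraction

def estimated_busy_share(sample_hz, duration_s=10):
    """Share of samples that land in the busy part of a periodic workload.

    Hypothetical workload: wakes up every 1/50 s and stays busy for the
    first 10% of each period, so its true CPU share is 10%.
    """
    period = Fraction(1, 50)
    busy = period / 10
    total = sample_hz * duration_s
    hits = 0
    for k in range(total):
        t = Fraction(k, sample_hz)  # exact sample time, no rounding drift
        if t % period < busy:
            hits += 1
    return hits / total

print(estimated_busy_share(50))  # sampler in lockstep with the workload
print(estimated_busy_share(99))  # sampler drifts through the period
```

At 50 Hz every sample falls at the same point of the period, so the estimate is badly wrong; at 99 Hz the samples sweep through the period and the estimate lands close to the true 10%.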
|
|
|
|
Re: My MiniGL experiments,recompilation,tips,etc...
|
Posted on: 2015/7/8 21:03
#64
|
Not too shy to talk
|
@Karlos The use case you describe can theoretically happen; nothing is impossible. But statistically, there is no workload like that. If a program consumes 1% of the CPU time, then over 10 seconds of sampling at 50 Hz (500 samples) you will hit it about 5 times.
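The arithmetic can be sketched as follows. The function names are only for illustration, and the "never sampled" estimate assumes that samples are independent of what the program is doing, which is a simplification.

```python
def expected_hits(cpu_share, duration_s, sample_hz):
    """Expected number of samples that land in a program using cpu_share of the CPU."""
    total_samples = duration_s * sample_hz
    return total_samples * cpu_share

def chance_of_no_hit(cpu_share, duration_s, sample_hz):
    """Probability that no sample at all lands in the program.

    Assumption: each sample independently hits the program with
    probability cpu_share.
    """
    total_samples = duration_s * sample_hz
    return (1 - cpu_share) ** total_samples

print(expected_hits(0.01, 10, 50))      # 500 samples in total, about 5 in the program
print(chance_of_no_hit(0.01, 10, 50))   # small, but not zero
```

Under that assumption, missing a 1% program completely over 10 seconds is possible but unlikely (well below 1%).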
|
|
|
|
Re: My MiniGL experiments,recompilation,tips,etc...
|
Posted on: 2015/7/8 5:56
#65
|
Not too shy to talk
|
@Karlos Hieronymus is a statistical profiler. It collects samples (let's say 50 or 60 times per second) that record the address of the instruction being executed. Then it finds the corresponding program and function. Statistically, that gives the proportion of time consumed by each running application.
So it is not intrusive and gives a great view of the system activity. And when you run a program, you also see the percentage of time spent in the libraries it uses.
The idea is the same as the "perf" tool that comes with Linux.
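To illustrate the principle only (this is not Hieronymus code), here is a minimal sketch of how sampled addresses can be attributed to functions. The symbol table, the addresses and the function names are all made up.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start address, name), sorted by start address.
SYMBOLS = [
    (0x1000, "main"),
    (0x1400, "decode_frame"),
    (0x2000, "draw_triangles"),
]
END_OF_CODE = 0x2800  # hypothetical end of the last function

def resolve(address):
    """Return the name of the function containing address, or None."""
    if address < SYMBOLS[0][0] or address >= END_OF_CODE:
        return None
    starts = [start for start, _ in SYMBOLS]
    index = bisect_right(starts, address) - 1
    return SYMBOLS[index][1]

def profile(samples):
    """Turn sampled instruction addresses into a share of samples per function."""
    counts = Counter(resolve(a) or "<unknown>" for a in samples)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Made-up samples, as if taken 50 or 60 times per second.
samples = [0x1004, 0x1410, 0x2004, 0x2010, 0x2100,
           0x2200, 0x1500, 0x2300, 0x2400, 0x2050]
for name, share in sorted(profile(samples).items(), key=lambda item: -item[1]):
    print(f"{share:6.1%}  {name}")
```

The more samples are collected, the closer these shares get to the real proportion of time spent in each function.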
@Hans Thanks for the information. About the results, I was not very comfortable giving them now, so take them as early, live results, not confirmed yet. On CPU time:
- 59% in W3D_Radeon.library
- 13% in ATIRadeon.chip
- 11% in Quake3
- 1% in minigl.library
Note that these come from the alternative sampling mode I've just developed (using the performance monitor), so I would like to compare them with the "standard" mode. I obtained them very late yesterday and was too tired to make other runs or check with other programs.
|
|
|
|
Re: My MiniGL experiments,recompilation,tips,etc...
|
Posted on: 2015/7/7 23:08
#66
|
Not too shy to talk
|
I'm sorry to come late to this discussion, which I've been reading since it started. It seems that MiniGL can be improved, and it's great to see that some work has already begun.
You talked about Valgrind, and I confirm that such a tool can't really be ported to AmigaOS. But tools like that are certainly necessary.
I launched Quake3 on my MicroAOne (which is at the limit of its capabilities) and, using my profiler Hieronymus, I found that much of the time is spent in W3D_Radeon.library. But maybe Quake3 is not the best example.
Note that I recently added an alternative mode to my profiler that allows profiling on L2 cache misses (at least on my G3 CPU at the moment). That will be interesting too.
I will have to test Cow3D and compile MiniGL with debug symbols to confirm some results.
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2015/1/26 11:28
#67
|
Not too shy to talk
|
@feanor OK, a difference between theory and reality. I did read things like that in Altivec-related documents, but only store instructions (dstst) were discouraged (potentially using dcbz instead to clear memory that is about to be written, avoiding fetching data from RAM).
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2015/1/26 10:05
#68
|
Not too shy to talk
|
@K-L All the processors mentioned (G4, G5, PA6T) use the same Altivec (VMX, in IBM terminology) instruction set, even if the implementation is different. For example, G5 and PA6T can issue 3 instructions per cycle but have 2 dispatch units, sub-units. In the past, I thought that the PA6T Altivec was weaker but now ... I don't know. Another interesting point to study (with the dnetc case).
@feanor Are you sure dst instructions only exist on G4s? The G5 user manual mentions them, and the 970 even has 8 streams, instead of 4 on the G4.
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2015/1/21 9:53
#69
|
Not too shy to talk
|
@zzd10h No, I haven't recompiled it for OS4 yet, as I investigate another part of the code with tools (mainly perf, in fact) only available for Linux ...
@tommysammy I use this command line: ./ffmpeg_g -cpuflags altivec -benchmark -i Prometheus-1080p-30s.mp4 -f null /dev/null
And really, I often run it prefixed with the perf command.
Note that just like feanor, I extracted 30 seconds from the original video, even if in my case that was not from the very beginning but after a 30-second delay:
./ffmpeg -i ~/Videos/Prometheus\ -\ Trailer.mp4 -ss 30 -t 30 -vcodec copy -acodec copy ~/Videos/Prometheus-1080p-30s.mp4
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2015/1/21 9:38
#70
|
Not too shy to talk
|
@zzd10h I used this video prometheus-trailer.zip that Severin previously recommended in this thread. Note that with feanor's patches available on github, I also get a 5% improvement on H264 decoding with 1080p videos (Prometheus and Bourne Ultimatum trailers), on my MacMini under Linux.
|
|
|
|
Re: Finding where a .library is allocated in memory ?
|
Posted on: 2014/12/18 8:54
#71
|
Not too shy to talk
|
@thellier
If the crashes are memory related (and in any case, really), make sure the compiler helps you as much as it can. Have you enabled warning options like -Wall -Wextra -Wwrite-strings?
Maybe compiling with another optimization level (that is to say -O0) would change the behavior and give a clue.
Another strategy would be to compile the library as a static library; crashes could then be easier to track.
Maybe check on another system like OS4 (if that makes sense ... maybe your code is specific to UAE) or compile with vbcc that could warn on other things.
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/12/9 20:03
#72
|
Not too shy to talk
|
@tommysammy
Thank you for opening the bounty, which will hopefully allow feanor to work on that soon. But I think the description is very basic: there are no numbers to give a baseline, only one video is given as a reference, there is only one small mention of H264 (the initial discussion was specifically about optimizing this codec), the title specifies OS4 but the description has a word about MorphOS (for me, this work is simply PowerPC related), ... The goal is described as a "speed advantage" on hardware that feanor does not own. It will be difficult to see any speedup.
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/12/1 6:58
#73
|
Not too shy to talk
|
@gregthecanuck, @Hans
Thanks! At the moment, the results give a subset of functions that are time consuming (to optimize):
- ff_h264_decode_mb_cabac
- get_cabac
- hl_decode_mb_simple_8
- loop_filter
- fill_decode_caches
We still have to check which of them could be Altivec optimized.
About the performance monitor, note that there are specific counters for Altivec that could be used to find Altivec-specific problems, notably when a vector unit "is waiting for an operand".
Other events could also be checked, like L2 cache misses, DTLB misses, ...
I looked at the get_cabac functions. I'm not sure they could be Altivec optimized, but they have asm versions for ARM and x86. It rather looks like there is register pressure there, and these asm versions tend to be branchless.
@Hans I'm not sure perf is able to list the time spent in each subfunction, but it can provide a call graph. You're right, the chosen duration may be too short for accurate results, but it is enough to find the functions that cause a penalty.
@feanor What machines do you own?
I've started to look at the generated code of get_cabac ...
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/11/30 23:32
#74
|
Not too shy to talk
|
Compared to other architectures, the ffmpeg project lacks Altivec-specific code, which would improve performance. I will donate if a bounty is created, to obtain such code but also to read how the work has been done (I would like to learn about the approach in this kind of work).
But are we sure this is the main bottleneck in H264 decoding? I did some tests, not with mplayer but with ffmpeg directly, because we need to focus on what we want to measure (video decoding). Keeping the mplayer layer would make things heavier and more complicated.
When talking about optimization, we must have a benchmark that is:
- measurable
- reproducible
- configurable (easy to set/unset AltiVec)
I extracted a 20-second sequence of the Prometheus video and I only measured the decoding (that takes 100% of the CPU), on my MacMini under Debian 7.
With Altivec:
Quote:
./ffmpeg_g -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
frame= 402 fps= 17 q=0.0 Lsize=N/A time=00:00:20.18 bitrate=N/A
video:25kB audio:3444kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=21.728s
bench: maxrss=28596kB
Without Altivec:
Quote:
./ffmpeg_g -cpuflags 0 -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
frame= 402 fps= 13 q=0.0 Lsize=N/A time=00:00:20.18 bitrate=N/A
video:25kB audio:3444kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
bench: utime=29.008s
bench: maxrss=28668kB
So the first conclusion is that with Altivec, the video takes almost 22 seconds to decode, compared to 29 seconds without (about 24% less time, or roughly a 1.3x speedup). Note that ffmpeg has an option to set/unset the use of Altivec!
Then, I used the Linux perf tool in various ways. If we want to verify improvements, we have to set references and describe a process. I don't mean a complex protocol. The results here show statistically where the time is spent (again, with and without Altivec enabled):
Quote:
sudo perf record -a ./ffmpeg_g -cpuflags altivec -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
sudo perf report --stdio
9.29% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
7.62% ffmpeg_g ffmpeg_g [.] put_h264_chroma_mc8_altivec
6.59% ffmpeg_g ffmpeg_g [.] get_cabac
5.83% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
3.94% ffmpeg_g ffmpeg_g [.] loop_filter
3.89% ffmpeg_g ffmpeg_g [.] ff_put_pixels16_altivec
3.74% ffmpeg_g ffmpeg_g [.] fill_decode_caches
sudo perf record -a ./ffmpeg_g -cpuflags 0 -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
sudo perf report --stdio
7.16% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
6.89% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_h_lowpass_8
6.13% ffmpeg_g ffmpeg_g [.] put_h264_chroma_mc8_8_c
4.73% ffmpeg_g ffmpeg_g [.] get_cabac
4.70% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_hv_lowpass_8
4.65% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
4.52% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_v_lowpass_8
3.98% ffmpeg_g ffmpeg_g [.] put_h264_qpel16_mc00_8_c
3.42% ffmpeg_g ffmpeg_g [.] avg_h264_chroma_mc8_8_c
3.13% ffmpeg_g ffmpeg_g [.] loop_filter
That gives an idea of the most time-consuming functions. Then, I ran "perf stat" to get an overview, and it reports 18% branch misses, which seems high! Finally, I measured again, changing the perf event:
Quote:
sudo perf record -a -e instructions ./ffmpeg_g -cpuflags altivec -benchmark -i ~/Videos/Prometheus-1080p-20s.mp4 -f null /dev/null
# Events: 22K instructions
#
# Overhead Command Shared Object Symbol
# ........ ............. ........................... ..................................................
#
12.79% ffmpeg_g ffmpeg_g [.] get_cabac
7.35% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
4.31% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
4.29% ffmpeg_g ffmpeg_g [.] decode_cabac_residual_nondc_internal
4.17% ffmpeg_g ffmpeg_g [.] loop_filter
3.93% ffmpeg_g ffmpeg_g [.] get_cabac_noinline
3.89% ffmpeg_g ffmpeg_g [.] put_h264_qpel8_h_lowpass_8
3.85% ffmpeg_g ffmpeg_g [.] fill_decode_caches
3.62% ffmpeg_g ffmpeg_g [.] put_h264_chroma_mc8_altivec
3.52% ffmpeg_g ffmpeg_g [.] ff_h264_filter_mb
# Events: 22K branch-misses
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................... ..............................................
#
22.30% ffmpeg_g ffmpeg_g [.] ff_h264_decode_mb_cabac
13.48% ffmpeg_g ffmpeg_g [.] decode_cabac_residual_nondc_internal
11.27% ffmpeg_g ffmpeg_g [.] ff_h264_filter_mb
7.56% ffmpeg_g ffmpeg_g [.] fill_decode_caches
4.10% ffmpeg_g ffmpeg_g [.] get_cabac
3.69% ffmpeg_g ffmpeg_g [.] hl_decode_mb_simple_8
3.17% ffmpeg_g ffmpeg_g [.] h264_idct_add8_altivec
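To make the comparison reproducible, the two "bench: utime=" figures from the ffmpeg runs in this post can be turned into numbers with a small script. This is only a sketch: parse_utime is a helper written for this post, and it assumes the output format shown in the quotes.

```python
import re

def parse_utime(ffmpeg_output):
    """Extract the user CPU time in seconds from a 'bench: utime=...s' line."""
    match = re.search(r"bench: utime=([0-9.]+)s", ffmpeg_output)
    if match is None:
        raise ValueError("no 'bench: utime=' line found")
    return float(match.group(1))

# Values copied from the two benchmark runs in this post.
with_altivec = parse_utime("bench: utime=21.728s bench: maxrss=28596kB")
without_altivec = parse_utime("bench: utime=29.008s bench: maxrss=28668kB")

time_saved = 1 - with_altivec / without_altivec  # fraction of CPU time saved
speedup = without_altivec / with_altivec         # how many times faster

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

With the unrounded utime values, the saving comes out at roughly 25% of the CPU time, slightly more than what the rounded 22 and 29 seconds suggest.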
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/11/26 8:58
#75
|
Not too shy to talk
|
If the wasted processing time is in the decoder, I suggest that we avoid the mplayer layer and use ffmpeg only. With mplayer, we would have to take care of its version, the operating systems and their versions, etc. It will also be easier to compare with x86 and ARM, building ffmpeg for them with and without SIMD.
Let's choose:
- an ffmpeg revision
- 3 videos (the 1080p Prometheus one being the first) to check that different parts of the code are exercised
- 2 or 3 pieces of hardware
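As a rough sketch of what such a test matrix could look like on one PowerPC machine, the script below only builds the command lines, it does not run them. The second and third file names are placeholders, the ./ffmpeg_g path is an assumption, and on x86 or ARM the -cpuflags values would be different.

```python
from itertools import product

# Placeholder file names; only the first video has been named so far.
VIDEOS = ["Prometheus-1080p-20s.mp4", "video2.mp4", "video3.mp4"]
# -cpuflags values for a PowerPC build: AltiVec enabled, or no SIMD at all.
CPUFLAGS = {"altivec": "altivec", "scalar": "0"}

def benchmark_commands(ffmpeg="./ffmpeg_g"):
    """Yield (video, variant, argv) for every video and SIMD setting."""
    for video, (variant, flags) in product(VIDEOS, CPUFLAGS.items()):
        argv = [ffmpeg, "-cpuflags", flags, "-benchmark",
                "-i", video, "-f", "null", "/dev/null"]
        yield video, variant, argv

for video, variant, argv in benchmark_commands():
    print(variant, " ".join(argv))
```

Each machine would run the same list against the same ffmpeg revision, so the results stay comparable.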
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/11/13 19:05
#76
|
Not too shy to talk
|
@tlosm Some of us already did ... since AmigaBlitter previously posted this link. I didn't unpack and read each file, though.
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/11/10 16:09
#77
|
Not too shy to talk
|
I am not sure everyone who was listed is an expert in AltiVec ... but it doesn't matter.
I am OK with asking markos (from freevec.org), but whatever we choose, we have to:
1. Select a few H264 videos to have common references
2. Measure the performance baseline with ffmpeg on the selected machines
3. Profile to know whether the bottlenecks are really where we suppose
4. Define targets / expected improvements
These are prerequisites to any bounties or contracts.
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/11/7 8:22
#78
|
Not too shy to talk
|
@AmigaBlitter Quote: Using altivec, btw, will cut out the AmigaOne 500 and sam owners.
For those, i would like to suggest you to check out the PPC 440 and 460 internal DSP. This dsp have 24 instructions that can improve audio video decoding.
Here are some interesting documents you could check:
https://www-01.ibm.com/chips/techlib/t ... PowerPC_440_Embedded_Core
this is especially interesting: https://www-01.ibm.com/chips/techlib/t ... Optimized_dsp_440_app.pdf
I did read these docs and I also profiled ffmpeg on the 440 years ago. I tried to optimize, but the effects were not visible. The ffmpeg developers know how to program and I think the code is already efficient. Many other CPU features could be used, but I'm afraid the MAC instructions won't be enough. By the way, there is already a macro in ffmpeg that uses one of these MAC instructions in some places. Looking at this topic again would be another interesting task!
Quote: something similar exist for the 460 too.
Right. The CPU core is basically the same.
|
|
|
|
Re: Any altivec experts? (H.264 codec)
|
Posted on: 2014/11/6 9:00
#79
|
Not too shy to talk
|
@Hans: I think that a request could be made to the ffmpeg team. I also want to point out this possible opportunity: the author of freevec.org specializes in SIMD and AltiVec and recently offered his services (as paid work).
|
|
|
|
Re: Still interested in Huno's games ?
|
Posted on: 2014/8/7 9:24
#80
|
Not too shy to talk
|
@fingus You wrote: "Another story is the fact that wapz3d isn´t running that smooth on my NG, maybe with new RadeonHD-Driver i will dig into the World of 3D Edoshooters again"
There is certainly something to do about that. Which machine do you have?
To answer my friend K-L about the games of my friend Huno :) Unfortunately, I do not have enough time to play games. And if I had, I would spend it on development.
Huno: You know that one day we will really work on a common project!
Like Hans, I think Huno's work is well received and appreciated.
|
|
|
|