@feanor
Just a layman's question, out of curiosity, about e.g.:
ff_pred16x16_vert_altivec() {
...
// completely unroll the loop
VEC_ST(srcv, 0*stride, src);
VEC_ST(srcv, 1*stride, src);
VEC_ST(srcv, 2*stride, src);
VEC_ST(srcv, 3*stride, src);
VEC_ST(srcv, 4*stride, src);
VEC_ST(srcv, 5*stride, src);
VEC_ST(srcv, 6*stride, src);
VEC_ST(srcv, 7*stride, src);
VEC_ST(srcv, 8*stride, src);
VEC_ST(srcv, 9*stride, src);
VEC_ST(srcv,10*stride, src);
}
Is there really a gain in unrolling this loop, considering the extra code that has to be cached versus one increment plus branch per iteration?
How good is the compiler at optimizing 0*stride -> 0, 1*stride -> stride, 2*stride -> stride<<1, etc.?
Are the shifts x<<1, x<<2, x<<3 really faster than the multiplications 2*x, 4*x, 8*x on this CPU?
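To make the questions concrete, here is a plain-C sketch of the two variants I have in mind. It is not the AltiVec code: memcpy stands in for VEC_ST, only four rows are written to keep it short, and all the names (store_rows_loop, store_rows_unrolled, twice_mul, twice_shift) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ROW_BYTES = 16, ROWS = 4 };   /* 4 rows only, to keep the sketch short */

/* Rolled variant: one 16-byte store per row, plus counter and branch. */
void store_rows_loop(unsigned char *dst, const unsigned char *row, size_t stride)
{
    assert(stride >= ROW_BYTES);     /* rows must not overlap */
    for (size_t i = 0; i < ROWS; i++)
        memcpy(dst + i * stride, row, ROW_BYTES);
}

/* Unrolled variant: same stores, offsets written as constant * stride. */
void store_rows_unrolled(unsigned char *dst, const unsigned char *row, size_t stride)
{
    assert(stride >= ROW_BYTES);
    memcpy(dst + 0 * stride, row, ROW_BYTES);
    memcpy(dst + 1 * stride, row, ROW_BYTES);
    memcpy(dst + 2 * stride, row, ROW_BYTES);
    memcpy(dst + 3 * stride, row, ROW_BYTES);
}

/* Same value written two ways (unsigned, so the shift is well defined). */
size_t twice_mul(size_t x)   { return 2 * x; }
size_t twice_shift(size_t x) { return x << 1; }
```

Both store functions write the same bytes, and twice_mul and twice_shift return the same value, so I suppose compiling this with something like gcc -O2 -S and comparing the assembly would show what a given compiler makes of each. Whether the result carries over to the real AltiVec stores is exactly what I can't judge.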
Just wondering, since you're the expert.