Re: Auto-detect for 128-bit 4-way SSE2

Figures: tcatm

You should benchmark all implementations (measuring CPU time, not wall-clock time) and choose the fastest. While benchmarking, also check that each implementation actually produces correct results.
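Roughly what I mean, as a minimal Python sketch rather than actual miner code. The candidate functions and the `pick_fastest` helper are made-up names for illustration; in a real miner the candidates would be the C, SSE2 and other code paths. `time.process_time()` reports CPU time of the process, so other load on the machine distorts the result less than a wall-clock timer would.

```python
import hashlib
import time


def double_sha256_reference(data: bytes) -> bytes:
    """Hypothetical candidate: plain double SHA-256 via hashlib."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def double_sha256_broken(data: bytes) -> bytes:
    """Hypothetical candidate standing in for a code path that
    miscomputes on this machine (it hashes only once)."""
    return hashlib.sha256(data).digest()


def pick_fastest(candidates, data, expected, rounds=2000):
    """Return the name of the fastest candidate whose output matched
    `expected` on every round. Timing uses CPU time, not wall-clock time."""
    best_name, best_time = None, float("inf")
    for name, func in candidates.items():
        works = True
        start = time.process_time()
        for _ in range(rounds):
            try:
                result = func(data)
            except Exception:
                works = False
                break
            # Verify correctness while benchmarking, not only beforehand.
            if result != expected:
                works = False
                break
        elapsed = time.process_time() - start
        if works and elapsed < best_time:
            best_name, best_time = name, elapsed
    if best_name is None:
        raise RuntimeError("no implementation produced the expected result")
    return best_name


if __name__ == "__main__":
    data = b"\x00" * 80  # dummy 80-byte input, assumed for illustration
    # Assumption: a real check would compare against a fixed known-answer
    # vector instead of trusting one of the candidates.
    expected = double_sha256_reference(data)
    candidates = {
        "broken": double_sha256_broken,
        "reference": double_sha256_reference,
    }
    print("selected:", pick_fastest(candidates, data, expected))
```

A candidate that returns a wrong result or raises is dropped no matter how fast it is, so a broken code path can never win the benchmark.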