You should benchmark all implementations (using cpu time, not realtime) and choose the fastest and while benchmarking check whether the algorithm actually works.
Re: Auto-detect for 128-bit 4-way SSE2
Figures: tcatm
You should benchmark all implementations (using cpu time, not realtime) and choose the fastest and while benchmarking check whether the algorithm actually works.