4 hashes parallel on SSE2 CPUs for 0.3.6
This patch will calculate four hashes on one core using vector instructions. There’s a test programm included that validates the new hash function against the old one so it should be correct.
The patch is against 0.3.6. Improves khash/s by roughly 115%.
http://pastebin.com/XN1JDb53
Performance of stock code (as measured by my test/benchmark program) is about 1500khash/s. My code does 3500khash/s. Both figures are for one core. It scales well because I do 128 hashes at once and keep the datastructures small enough to fit in the CPU cache.
I have two local collision attacks which will squeeze another 300khash/s out, but they are not stable yet.
Awesome, I’ll have to give it a try myself then. 😲
Tell me if it works 😊 Donations are welcome. 17asVKkzRGTFvvGH9dMGQaHe78xzfvgSSA
That’s amazing…
So are you saying you use 128-bit registers to SIMD four 32-bit data at once? I’ve wondered about that for a long time, but I didn’t think it would be possible due to addition carrying into the neighbour’s value.
Darn, it means the next release, the difficulty is going to have to increase to 1000 or so to keep up, LOL 😁
That’s amazing…
So are you saying you use 128-bit registers to SIMD four 32-bit data at once? I’ve wondered about that for a long time, but I didn’t think it would be possible due to addition carrying into the neighbour’s value.
That’s how it works. Four 32 bit values in a 128 bit vector. They’re calculated independently, but at the same time.
Btw. Why are you using this alignup<16> function when attribute ((aligned (16))) will tell the compiler to align at compiletime?
hmm…I wasn’t able to apply the patch (I’m a noobie). Here’s the command I ran from bitcoin-0.3.6/src # patch < XN1JDb53.txt
Output:
1 out of 1 hunk ignored (Stripping trailing CRs from patch.) patching file main.cpp Hunk #1 FAILED at 2555. Hunk #2 FAILED at 2701. 2 out of 2 hunks FAILED (Stripping trailing CRs from patch.) patching file makefile.unix Hunk #1 FAILED at 45. Hunk #2 FAILED at 58.
What’s the proper command to type into linux? Or do you have linux binaries?
Is it a AMD only optimization perhaps?
Or a 64-bit only optimization.
With the patch above, I was unable to build the test program. You?
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
Is there a way we can confirm that the variables are being aligned properly? I’m wondering if the Intel procs are less tolerant of misalignment than the AMD’s.
With the patch above, I was unable to build the test program. You?
Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise “make test” would fail. Under x86_64 I had no such issue.
Quote from: petree on August 02, 2010, 12:22:29 AM UTCThe original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
As I said above I did notice an imporvement in performace too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
Quote from: knightmb on August 01, 2010, 11:47:04 PM UTCIs it a AMD only optimization perhaps?
Or a 64-bit only optimization.
I’m trying on a Q6600 running 64bit linux (ubuntu server) and it makes things slower there, so not 64bit only. And I’m running on my mac laptop which sports an Intel i5 (also 64 bit OSX 10.6), which great speed improvement there, so not AMD only.
Quote from: Ground Loop on August 02, 2010, 12:17:07 AM UTCWith the patch above, I was unable to build the test program. You?
Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise “make test” would fail. Under x86_64 I had no such issue.
Quote from: petree on August 02, 2010, 12:22:29 AM UTCThe original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
As I said above I did notice an imporvement in performace too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
Yes, since applying this patch I’ve generated 2 blocks.
Is it 2x fast on AMD and 1/2 fast on Intel?
Quote from: tcatm on July 31, 2010, 1:12:38 AM UTCBtw. Why are you using this alignup<16> function when attribute ((aligned (16))) will tell the compiler to align at compiletime?
Tried that, but it doesn’t work for things on the stack. I ran some tests.
It doesn’t even cause an error, it just doesn’t align it.
FWIW, there exists -mstackrealign and -mpreferred-stack-boundary=NUM
After 52 hours of trying with no blocks generated, I give up and I am switching back to the vanilla bitcoin.
The probability of getting no blocks within 52 hours at 51,000 khash/s is 0.011%. So I conclude that the patch doesn’t work and I am 99.989% confident about that. I hope that tcatm provides some explanation on how to use the supplied test program.
To use the test program download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg
The program will try to find the correct nonce in each block and detect if the hash function does work correctly. It’ll also benchmark the algorithm.
From what I’ve heard the patch does not work on 32 bit systems. I don’t know why. I’ve developed it on an AMD64 machine and it works fine. If it’s slower on Intel, try to disable Hyperthreading. The big loop in the SSE2 code doesn’t contain any “normal” x86 except for one jump at the end.
Btw, there’s a git repo at http://github.com/tcatm/bitcoin-cruncher/
Oh, that’s a part in the code my patch doesn’t touch. You could try to remove line 2741 (CRITICAL_BLOCK(cs_main)).
CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. The is nothing like that in the original main.cpp.
Thanks! That got probably mixed up when I patched the file with an older diff. I fixed the git: http://github.com/tcatm/bitcoin-cruncher
CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. The is nothing like that in the original main.cpp.
Sorry about that. CRITICAL_BLOCK isn’t perfect. You have to be careful not to break or continue out of it. There’s an assert that catches and warns about break. I can be criticized for using it, but the syntax would be so much more bloated and error prone without it.
Is there a chance the SSE2 code is slow on Intel because of some quirk that could be worked around? For instance, if something works but is slow if it’s not aligned, or thrashing the cache, or one type of instruction that’s really slow? I’m not sure how available it is, but I think Intel used to have a profiler for profiling on a per instruction level. I guess if tcatm doesn’t have a system with the slow processor to test with, there’s not much hope. But it would be really nice if this was working on most CPUs.
I can confirm that the patch now works just fine. I just generated my first 50 BTC with it. And since this patch doubles the speed I think it’s only fair if I donated half of that to tcatm.
I can confirm that the patch now works just fine. I just generated my first 50 BTC with it. And since this patch doubles the speed I think it’s only fair if I donated half of that to tcatm.
That’s nice to hear. Thanks for the donation and thanks to everyone else who donated!
I just tried again with the latest version from your github. I’m still seeing a drop in performance compared to the vanilla source.
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.
It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.
Can we implement a speed test, so different hashing engines are tried and the fastest is chosen?
It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.
My Core2Quad (Q6600) slowed down 50%, my i5 improved ~200%, thus I don’t think what you state is accurate. Maybe starting at some specific Core2?
That big of a difference in speed, by a factor of 4 or 6, feels like it’s likely to be some quirky weak spot or instruction that the old chip is slow with. Unless it’s a touted feature of the i5 that they made SSE2 six times faster.
A quick summary:
Xeon Quad 41% slower
Core 2 Duo 55% slower
Core 2 Duo same (vess)
Core 2 Quad 50% slower
Core i5 200% faster (nelisky)
Core i5 100% faster (vess)
AMD Opteron 105% faster
aceat64:
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.
impossible7:
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)
nelisky:
My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,
impossible7:
on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
My core i5 doubled in speed. My Core 2 Duo is the same speed.
Would be interesting to try it out on older AMD64. There’s been a change that would explain it there:
http://developer.amd.com/documentation/articles/pages/682007171.aspx
Maybe Intel did something similiar without announcing it?
I’ll just pitch in that Phenom and Phenom II processors doubled (roughly). No difference between the two that I can tell.
Sorry I dont have anything older than Phenoms available right now. Might be able to access a old X2 in a week.
- Does not work on 32-bit (though that’s not a problem with the algorithm).
- Patch is against older SVN. There’s a git repo at
http://github.com/tcatm/bitcoin-cruncher - Compiles on every 64bit Linux.
It’s not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I’m planning a pluggable bitcoinminer someday. But at current difficulty it’s easier to work for bitcoins than finding faster ways for mining.
I would really like to have this feature included in an official build sometimes soon along with an internal speed test to determine which algorithm to use. You can always remove the speed test later once you figure out how to determine whether it will be faster or slower without running the speed test.
- Does not work on 32-bit (though that’s not a problem with the algorithm).
- Patch is against older SVN. There’s a git repo at
http://github.com/tcatm/bitcoin-cruncher- Compiles on every 64bit Linux.
It’s not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I’m planning a pluggable bitcoinminer someday. But at current difficulty it’s easier to work for bitcoins than finding faster ways for mining.
-
Do we know why it doesn’t work on 32bit? Is is it because it’s using 128bits and if so, would it help if we dropped it to 64?
-
Thank you, I will look into implementing it on my 64bit systems.
-
Excellent to hear. I’m looking forward to using it.
I was planning on using it on a PE2650 dual proc Xeon @3.2GHz w/HT. I would really like to get this figured out to utilize that system. I am planning one as well. At current difficulty I would agree, except when the system needs to be run anyway and latency isn’t an issue.
MinGW on Windows has trouble compiling it:
g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I”/boost” -I”/db/build_unix” -I”/openssl/include” -I”/wxwidgets/lib/gcc_lib/mswud” -I”/wxwidgets/include” -msse2 -O3 -o obj/sha256.o sha256.cpp
sha256.cpp: In function `long long int vector Ch(long long int vector, long long int vector, long long int vector)’: sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454 Please submit a full bug report, with preprocessed source if appropriate. See URL:http://www.mingw.org/bugs.shtml for instructions. make: *** [obj/sha256.o] Error 1
- Do we know why it doesn’t work on 32bit? Is is it because it’s using 128bits and if so, would it help if we dropped it to 64?
No idea, maybe some alignment problem. Someone was trying to figure it out on IRC. I don’t have a SSE2 capable 32bit system. The additional registers in 64bit mode are also useful. I don’t know if your PE2650 has a recent enough CPU. You might experience a performance drop of 50% if the CPU is too old.
Btw, did anyone with Intel CPU compare performance with Hyperthreading enabled/disabled? The SSE2 loop keeps the arithmetic units and pipelines pretty busy and I can imagine Hyperthreading might decrease performance.
MinGW on Windows has trouble compiling it:
g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I”/boost” -I”/db/build_unix” -I”/openssl/include” -I”/wxwidgets/lib/gcc_lib/mswud” -I”/wxwidgets/include” -msse2 -O3 -o obj/sha256.o sha256.cpp
sha256.cpp: In function `long long int vector Ch(long long int vector, long long int vector, long long int vector)’: sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454 Please submit a full bug report, with preprocessed source if appropriate. See <URL:http://www.mingw.org/bugs.shtml\> (
http://www.mingw.org/bugs.shtml%5C%3E) for instructions. make: *** [obj/sha256.o] Error 1
Looks like we’re triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?
If you haven’t already, try aligning thash. It might matter. Couldn’t hurt.
Quote from: tcatm on August 13, 2010, 3:53:07 PM UTCLooks like we’re triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?
No help from -O0, same error.
MinGW is GCC 3.4.5. Probably the problem.
I’ll see if I can get a newer version of MinGW.
Got the test working on 32-bit with MinGW GCC 4.5. Exactly 50% slower than stock with Core 2.
MinGW GCC 4.5.0: Crypto++ doesn’t work, X86_SHA256_HashBlocks() never returns I only got 4-way working with test.cpp but not when called by BitcoinMiner
MinGW GCC 4.4.1:
Crypto++ works
4-way SIGSEGV
GCC is definitely not aligning __m128i.
Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.
By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only: #define Ch(b, c, d) ((b & c) ^ (~b & d)) #define Maj(b, c, d) ((b & c) ^ (b & d) ^ (c & d)) #define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n)) #define SHR(x, n) _mm_srli_epi32(x, n)
But that’s with -O0.
Well, reporting back.
I got it to compile by specifying -msse and -msse2 to gcc when compiling. I first was hashing about 692kh/s (50% of SVN r130[1400kh/s]) but recompiled and am now receiving about ~1120kh/s. This is currently the equivalent of using both of my CPUs without HyperThreading, though I can verify that it IS using HyperThreading. With HyperThreading turned off, I get ~1350kh/s. Pretty close to the stock build.
Also, does the git contain the patched and updated code?
// SVN r130 Using HT.
08/14/10 19:02 hashmeter 4 CPUs 1392 khash/s
08/14/10 19:32 hashmeter 4 CPUs 1387 khash/s
08/14/10 20:02 hashmeter 4 CPUs 1386 khash/s
08/14/10 20:32 hashmeter 4 CPUs 1380 khash/s
08/14/10 21:02 hashmeter 4 CPUs 1363 khash/s
// With -msse -msse2, first run. Using HT.
08/14/10 21:32 hashmeter 4 CPUs 692 khash/s
08/14/10 22:06 hashmeter 4 CPUs 1011 khash/s
08/14/10 22:11 hashmeter 4 CPUs 1104 khash/s
08/14/10 22:16 hashmeter 4 CPUs 1120 khash/s
// NOT using HT.
08/14/10 22:21 hashmeter 2 CPUs 1359 khash/s
08/14/10 22:26 hashmeter 2 CPUs 1340 khash/s
Just wanted to tell my story and help with whatever information I could.
On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner. So now it doesn’t look like it’s the version of GCC, it’s something else, maybe just the luck of how the stack is aligned.
I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.
I found the problem with Crypto++ on MinGW 4.5.0. Here’s the patch for that:
--- \old\sha.cpp Mon Jul 26 13:31:11 2010
+++
ew\sha.cpp Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
ROUND(14, 0, eax, ecx, edi, edx)
ROUND(15, 0, ecx, eax, edx, edi)
- ASL(1)
+ ASL(label1) // Bitcoin: fix for MinGW GCC 4.5
AS2(add WORD_REG(si), 4*16)
ROUND(0, 1, eax, ecx, edi, edx)
ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
ROUND(14, 1, eax, ecx, edi, edx)
ROUND(15, 1, ecx, eax, edx, edi)
AS2( cmp WORD_REG(si), K_END)
- ASJ( jne, 1, b)
+ ASJ( jne, label1, ) // Bitcoin: fix for MinGW GCC 4.5
AS2( mov WORD_REG(dx), DATA_SAVE)
AS2( add WORD_REG(dx), 64) Tryed Bitcoin 3.10 on Ubuntu Lucid 64 bit on Intel Atom 330.
Using the option -4way produces half the hash/s than not using the option. I tried using 1 to 4 (virtual) cores and -4way option produces less than no option always (arround half). Its probably due to the Intel thing.