4 hashes parallel on SSE2 CPUs for 0.3.6

tcatm July 30, 2010 12:23 UTC Source ·

This patch will calculate four hashes on one core using vector instructions. There’s a test programm included that validates the new hash function against the old one so it should be correct.

The patch is against 0.3.6. Improves khash/s by roughly 115%.

http://pastebin.com/XN1JDb53

tcatm July 30, 2010 21:47 UTC Source ·

Performance of stock code (as measured by my test/benchmark program) is about 1500khash/s. My code does 3500khash/s. Both figures are for one core. It scales well because I do 128 hashes at once and keep the datastructures small enough to fit in the CPU cache.

I have two local collision attacks which will squeeze another 300khash/s out, but they are not stable yet.

knightmb July 30, 2010 21:51 UTC Source ·

Awesome, I’ll have to give it a try myself then. 😲

tcatm July 30, 2010 22:00 UTC Source ·

Tell me if it works 😊 Donations are welcome. 17asVKkzRGTFvvGH9dMGQaHe78xzfvgSSA

Satoshi Nakamoto July 31, 2010 00:29 UTC Source ·

That’s amazing…

So are you saying you use 128-bit registers to SIMD four 32-bit data at once?  I’ve wondered about that for a long time, but I didn’t think it would be possible due to addition carrying into the neighbour’s value.

knightmb July 31, 2010 04:49 UTC Source ·

Darn, it means the next release, the difficulty is going to have to increase to 1000 or so to keep up, LOL 😁

tcatm July 31, 2010 10:12 UTC Source ·
Quote from: satoshi on July 30, 2010, 3:29:20 PM UTC

That’s amazing…

So are you saying you use 128-bit registers to SIMD four 32-bit data at once? I’ve wondered about that for a long time, but I didn’t think it would be possible due to addition carrying into the neighbour’s value.

That’s how it works. Four 32 bit values in a 128 bit vector. They’re calculated independently, but at the same time.

Btw. Why are you using this alignup<16> function when attribute ((aligned (16))) will tell the compiler to align at compiletime?

em3rgentOrdr July 31, 2010 13:42 UTC Source ·

hmm…I wasn’t able to apply the patch (I’m a noobie). Here’s the command I ran from bitcoin-0.3.6/src # patch < XN1JDb53.txt

Output:

1 out of 1 hunk ignored (Stripping trailing CRs from patch.) patching file main.cpp Hunk #1 FAILED at 2555. Hunk #2 FAILED at 2701. 2 out of 2 hunks FAILED (Stripping trailing CRs from patch.) patching file makefile.unix Hunk #1 FAILED at 45. Hunk #2 FAILED at 58.

What’s the proper command to type into linux? Or do you have linux binaries?

knightmb August 2, 2010 08:47 UTC Source ·

Is it a AMD only optimization perhaps?

impossible7 August 2, 2010 09:00 UTC Source ·
Quote from: knightmb on August 01, 2010, 11:47:04 PM UTC

Is it a AMD only optimization perhaps?

Or a 64-bit only optimization.

Ground Loop August 2, 2010 09:17 UTC Source ·

With the patch above, I was unable to build the test program. You?

petree August 2, 2010 09:22 UTC Source ·

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.

Is there a way we can confirm that the variables are being aligned properly? I’m wondering if the Intel procs are less tolerant of misalignment than the AMD’s.

impossible7 August 2, 2010 09:31 UTC Source ·
Quote from: Ground Loop on August 02, 2010, 12:17:07 AM UTC

With the patch above, I was unable to build the test program. You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise “make test” would fail. Under x86_64 I had no such issue.

Quote from: petree on August 02, 2010, 12:22:29 AM UTC

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above I did notice an imporvement in performace too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

nelisky August 2, 2010 13:05 UTC Source ·
Quote from: impossible7 on August 02, 2010, 12:00:55 AM UTC
Quote from: knightmb on August 01, 2010, 11:47:04 PM UTC

Is it a AMD only optimization perhaps?

Or a 64-bit only optimization.

I’m trying on a Q6600 running 64bit linux (ubuntu server) and it makes things slower there, so not 64bit only. And I’m running on my mac laptop which sports an Intel i5 (also 64 bit OSX 10.6), which great speed improvement there, so not AMD only.

petree August 2, 2010 16:12 UTC Source ·
Quote from: impossible7 on August 02, 2010, 12:31:44 AM UTC
Quote from: Ground Loop on August 02, 2010, 12:17:07 AM UTC

With the patch above, I was unable to build the test program. You?

Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise “make test” would fail. Under x86_64 I had no such issue.

Quote from: petree on August 02, 2010, 12:22:29 AM UTC

The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.

As I said above I did notice an imporvement in performace too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?

Yes, since applying this patch I’ve generated 2 blocks.

Satoshi Nakamoto August 2, 2010 19:02 UTC Source ·

Is it 2x fast on AMD and 1/2 fast on Intel?

Quote from: tcatm on July 31, 2010, 1:12:38 AM UTC

Btw. Why are you using this alignup<16> function when attribute ((aligned (16))) will tell the compiler to align at compiletime?

Tried that, but it doesn’t work for things on the stack.  I ran some tests.

It doesn’t even cause an error, it just doesn’t align it.

Jeff Garzik (jgarzik) August 2, 2010 19:15 UTC Source ·

FWIW, there exists -mstackrealign and -mpreferred-stack-boundary=NUM

impossible7 August 2, 2010 20:49 UTC Source ·

After 52 hours of trying with no blocks generated, I give up and I am switching back to the vanilla bitcoin.

The probability of getting no blocks within 52 hours at 51,000 khash/s is 0.011%. So I conclude that the patch doesn’t work and I am 99.989% confident about that. I hope that tcatm provides some explanation on how to use the supplied test program.

tcatm August 2, 2010 21:07 UTC Source ·

To use the test program download this file (or generate it yourself from the blockchain): http://ul.to/hz5wlg The program will try to find the correct nonce in each block and detect if the hash function does work correctly. It’ll also benchmark the algorithm.

From what I’ve heard the patch does not work on 32 bit systems. I don’t know why. I’ve developed it on an AMD64 machine and it works fine. If it’s slower on Intel, try to disable Hyperthreading. The big loop in the SSE2 code doesn’t contain any “normal” x86 except for one jump at the end.

Btw, there’s a git repo at http://github.com/tcatm/bitcoin-cruncher/

tcatm August 6, 2010 10:59 UTC Source ·

Oh, that’s a part in the code my patch doesn’t touch. You could try to remove line 2741 (CRITICAL_BLOCK(cs_main)).

impossible7 August 6, 2010 11:37 UTC Source ·

CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. The is nothing like that in the original main.cpp.

tcatm August 6, 2010 11:54 UTC Source ·

Thanks! That got probably mixed up when I patched the file with an older diff. I fixed the git: http://github.com/tcatm/bitcoin-cruncher

Satoshi Nakamoto August 7, 2010 21:16 UTC Source ·
Quote from: impossible7 on August 06, 2010, 2:37:20 AM UTC

CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. The is nothing like that in the original main.cpp.

Sorry about that.  CRITICAL_BLOCK isn’t perfect.  You have to be careful not to break or continue out of it.  There’s an assert that catches and warns about break.  I can be criticized for using it, but the syntax would be so much more bloated and error prone without it.

Is there a chance the SSE2 code is slow on Intel because of some quirk that could be worked around?  For instance, if something works but is slow if it’s not aligned, or thrashing the cache, or one type of instruction that’s really slow?  I’m not sure how available it is, but I think Intel used to have a profiler for profiling on a per instruction level.  I guess if tcatm doesn’t have a system with the slow processor to test with, there’s not much hope.  But it would be really nice if this was working on most CPUs.

impossible7 August 7, 2010 22:51 UTC Source ·

I can confirm that the patch now works just fine. I just generated my first 50 BTC with it. And since this patch doubles the speed I think it’s only fair if I donated half of that to tcatm.

tcatm August 7, 2010 22:59 UTC Source ·
Quote from: impossible7 on August 07, 2010, 10:51:07 PM UTC

I can confirm that the patch now works just fine. I just generated my first 50 BTC with it. And since this patch doubles the speed I think it’s only fair if I donated half of that to tcatm.

That’s nice to hear. Thanks for the donation and thanks to everyone else who donated!

aceat64 August 8, 2010 06:37 UTC Source ·

I just tried again with the latest version from your github. I’m still seeing a drop in performance compared to the vanilla source.

My system went from ~7100 to ~4200.

This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

tcatm August 8, 2010 11:52 UTC Source ·

It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.

nimnul August 12, 2010 12:18 UTC Source ·

Can we implement a speed test, so different hashing engines are tried and the fastest is chosen?

nelisky August 12, 2010 12:57 UTC Source ·
Quote from: tcatm on August 08, 2010, 2:52:53 AM UTC

It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.

My Core2Quad (Q6600) slowed down 50%, my i5 improved ~200%, thus I don’t think what you state is accurate. Maybe starting at some specific Core2?

Satoshi Nakamoto August 12, 2010 22:07 UTC Source ·

That big of a difference in speed, by a factor of 4 or 6, feels like it’s likely to be some quirky weak spot or instruction that the old chip is slow with.  Unless it’s a touted feature of the i5 that they made SSE2 six times faster.

A quick summary:
Xeon Quad        41% slower
Core 2 Duo        55% slower
Core 2 Duo        same (vess)
Core 2 Quad      50% slower
Core i5            200% faster (nelisky)
Core i5            100% faster (vess)
AMD Opteron    105% faster

aceat64:
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.

impossible7:
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)

nelisky:
My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,

impossible7:
on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)

vess August 12, 2010 22:09 UTC Source ·

My core i5 doubled in speed. My Core 2 Duo is the same speed.

tcatm August 13, 2010 00:42 UTC Source ·

Would be interesting to try it out on older AMD64. There’s been a change that would explain it there: http://developer.amd.com/documentation/articles/pages/682007171.aspx

Maybe Intel did something similiar without announcing it?

Cheater August 13, 2010 06:27 UTC Source ·

I’ll just pitch in that Phenom and Phenom II processors doubled (roughly). No difference between the two that I can tell.

Sorry I dont have anything older than Phenoms available right now. Might be able to access a old X2 in a week.

tcatm August 13, 2010 21:27 UTC Source ·
  1. Does not work on 32-bit (though that’s not a problem with the algorithm).
  2. Patch is against older SVN. There’s a git repo at http://github.com/tcatm/bitcoin-cruncher
  3. Compiles on every 64bit Linux.

It’s not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I’m planning a pluggable bitcoinminer someday. But at current difficulty it’s easier to work for bitcoins than finding faster ways for mining.

NewLibertyStandard August 13, 2010 22:27 UTC Source ·

I would really like to have this feature included in an official build sometimes soon along with an internal speed test to determine which algorithm to use. You can always remove the speed test later once you figure out how to determine whether it will be faster or slower without running the speed test.

sgtstein August 13, 2010 23:17 UTC Source ·
Quote from: tcatm on August 13, 2010, 12:27:14 PM UTC
  1. Does not work on 32-bit (though that’s not a problem with the algorithm).
  2. Patch is against older SVN. There’s a git repo at http://github.com/tcatm/bitcoin-cruncher
  3. Compiles on every 64bit Linux.

It’s not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I’m planning a pluggable bitcoinminer someday. But at current difficulty it’s easier to work for bitcoins than finding faster ways for mining.

  1. Do we know why it doesn’t work on 32bit? Is is it because it’s using 128bits and if so, would it help if we dropped it to 64?

  2. Thank you, I will look into implementing it on my 64bit systems.

  3. Excellent to hear. I’m looking forward to using it.

I was planning on using it on a PE2650 dual proc Xeon @3.2GHz w/HT. I would really like to get this figured out to utilize that system. I am planning one as well. At current difficulty I would agree, except when the system needs to be run anyway and latency isn’t an issue.

Satoshi Nakamoto August 14, 2010 00:49 UTC Source ·

MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I”/boost” -I”/db/build_unix” -I”/openssl/include” -I”/wxwidgets/lib/gcc_lib/mswud” -I”/wxwidgets/include” -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int vector Ch(long long int vector, long long int vector, long long int vector)’: sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454 Please submit a full bug report, with preprocessed source if appropriate. See URL:http://www.mingw.org/bugs.shtml for instructions. make: *** [obj/sha256.o] Error 1

tcatm August 14, 2010 00:50 UTC Source ·
Quote from: sgtstein on August 13, 2010, 2:17:51 PM UTC
  1. Do we know why it doesn’t work on 32bit? Is is it because it’s using 128bits and if so, would it help if we dropped it to 64?

No idea, maybe some alignment problem. Someone was trying to figure it out on IRC. I don’t have a SSE2 capable 32bit system. The additional registers in 64bit mode are also useful. I don’t know if your PE2650 has a recent enough CPU. You might experience a performance drop of 50% if the CPU is too old.

Btw, did anyone with Intel CPU compare performance with Hyperthreading enabled/disabled? The SSE2 loop keeps the arithmetic units and pipelines pretty busy and I can imagine Hyperthreading might decrease performance.

tcatm August 14, 2010 00:53 UTC Source ·
Quote from: satoshi on August 13, 2010, 3:49:18 PM UTC

MinGW on Windows has trouble compiling it:

g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I”/boost” -I”/db/build_unix” -I”/openssl/include” -I”/wxwidgets/lib/gcc_lib/mswud” -I”/wxwidgets/include” -msse2 -O3 -o obj/sha256.o sha256.cpp

sha256.cpp: In function `long long int vector Ch(long long int vector, long long int vector, long long int vector)’: sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454 Please submit a full bug report, with preprocessed source if appropriate. See <URL:http://www.mingw.org/bugs.shtml\> (http://www.mingw.org/bugs.shtml%5C%3E) for instructions. make: *** [obj/sha256.o] Error 1

Looks like we’re triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?

Satoshi Nakamoto August 14, 2010 04:22 UTC Source ·

If you haven’t already, try aligning thash.  It might matter.  Couldn’t hurt.

Quote from: tcatm on August 13, 2010, 3:53:07 PM UTC

Looks like we’re triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?

No help from -O0, same error.

MinGW is GCC 3.4.5.  Probably the problem.

I’ll see if I can get a newer version of MinGW.

Satoshi Nakamoto August 14, 2010 17:55 UTC Source ·

Got the test working on 32-bit with MinGW GCC 4.5.  Exactly 50% slower than stock with Core 2.

Satoshi Nakamoto August 14, 2010 22:06 UTC Source ·

MinGW GCC 4.5.0: Crypto++ doesn’t work, X86_SHA256_HashBlocks() never returns I only got 4-way working with test.cpp but not when called by BitcoinMiner

MinGW GCC 4.4.1:
Crypto++ works
4-way SIGSEGV

GCC is definitely not aligning __m128i.

Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.

By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only: #define Ch(b, c, d)  ((b & c) ^ (~b & d)) #define Maj(b, c, d)  ((b & c) ^ (b & d) ^ (c & d)) #define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n)) #define SHR(x, n)  _mm_srli_epi32(x, n)

But that’s with -O0.

sgtstein August 15, 2010 03:19 UTC Source ·

Well, reporting back.

I got it to compile by specifying -msse and -msse2 to gcc when compiling. I first was hashing about 692kh/s (50% of SVN r130[1400kh/s]) but recompiled and am now receiving about ~1120kh/s. This is currently the equivalent of using both of my CPUs without HyperThreading, though I can verify that it IS using HyperThreading. With HyperThreading turned off, I get ~1350kh/s. Pretty close to the stock build.

Also, does the git contain the patched and updated code?

// SVN r130 Using HT.
08/14/10 19:02 hashmeter   4 CPUs   1392 khash/s
08/14/10 19:32 hashmeter   4 CPUs   1387 khash/s
08/14/10 20:02 hashmeter   4 CPUs   1386 khash/s
08/14/10 20:32 hashmeter   4 CPUs   1380 khash/s
08/14/10 21:02 hashmeter   4 CPUs   1363 khash/s
// With -msse -msse2, first run. Using HT.
08/14/10 21:32 hashmeter   4 CPUs    692 khash/s
08/14/10 22:06 hashmeter   4 CPUs   1011 khash/s
08/14/10 22:11 hashmeter   4 CPUs   1104 khash/s
08/14/10 22:16 hashmeter   4 CPUs   1120 khash/s
// NOT using HT.
08/14/10 22:21 hashmeter   2 CPUs   1359 khash/s
08/14/10 22:26 hashmeter   2 CPUs   1340 khash/s

Just wanted to tell my story and help with whatever information I could.

Satoshi Nakamoto August 15, 2010 03:40 UTC Source ·

On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner.  So now it doesn’t look like it’s the version of GCC, it’s something else, maybe just the luck of how the stack is aligned.

I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.

I found the problem with Crypto++ on MinGW 4.5.0.  Here’s the patch for that:

--- \old\sha.cpp Mon Jul 26 13:31:11 2010
+++ 
ew\sha.cpp Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
  ROUND(14, 0, eax, ecx, edi, edx)
  ROUND(15, 0, ecx, eax, edx, edi)
 
- ASL(1)
+    ASL(label1)   // Bitcoin: fix for MinGW GCC 4.5
  AS2(add WORD_REG(si), 4*16)
  ROUND(0, 1, eax, ecx, edi, edx)
  ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
  ROUND(14, 1, eax, ecx, edi, edx)
  ROUND(15, 1, ecx, eax, edx, edi)
  AS2( cmp  WORD_REG(si), K_END)
- ASJ( jne, 1, b)
+    ASJ(    jne,    label1,  )   // Bitcoin: fix for MinGW GCC 4.5
 
  AS2( mov  WORD_REG(dx), DATA_SAVE)
  AS2( add  WORD_REG(dx), 64)
hugolp August 17, 2010 18:23 UTC Source ·

Tryed Bitcoin 3.10 on Ubuntu Lucid 64 bit on Intel Atom 330.

Using the option -4way produces half the hash/s than not using the option. I tried using 1 to 4 (virtual) cores and -4way option produces less than no option always (arround half). Its probably due to the Intel thing.