Protocol Buffers for Bitcoin
There has been a discussion going on elsewhere about using protocol buffers for bitcoin. To summarise the advantages:
-> Small encoding
-> Very fast
-> Implementations in loads of languages (so writing new clients becomes a lot simpler)
-> Forwards compatible (indeed, this is most of the point of protocol buffers)
-> Extremely simple to use in code
Initially, I would suggest storing the wallet file using protocol buffers. This isn’t a breaking change, and it immediately makes the wallet file easier for other programs to parse. Eventually, I would hope that bitcoin could use protocol buffers for networking as well.
Some people have suggested that protocol buffers might be larger than the custom-written packet layout. I suspect they would actually be smaller, thanks to some of the clever encoding that protocol buffers use. To resolve this, I think a test is in order: I shall encode a wallet file or network packet using protocol buffers and compare its size with the same data in the current scheme. However, I have no idea what’s in a packet. What data is stored in one, and in what format?
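For anyone wondering what the "clever encoding" amounts to: much of it is the base-128 varint, where each byte carries 7 bits of the number and the top bit says whether another byte follows. The following is only a rough sketch of that idea in Python. The function names are made up for illustration and are not part of any protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    for n in (1, 300, 2**32 - 1):
        print(n, "takes", len(encode_varint(n)), "byte(s) as a varint")
```

Small numbers take one byte instead of four, but a full 32-bit value takes five, so whether the overall message ends up smaller depends on the data.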
FYI, it is pointless to make a packet smaller than 60 bytes, which is the minimum size of an Ethernet frame (not counting the 4-byte frame check sequence). Anything smaller is padded up to 60 bytes.
[Deleted] Quote from: martin on July 30, 2010, 11:37:59 AM
Quote from: martin on July 29, 2010, 11:29:31 PM UTC
The encoded protocol buffer is just 55 bytes, whereas the bitcoin version is 85 0x00 sets (each one representing 2 bytes, I assume). This means that my badly designed protocol buffer is less than half the size of the hand-built layout!
The “0x00” groups each represent one byte. The length of the standard version packet is 87 bytes plus 20 for the header. The header could be massively optimized as well:
message start "magic bytes" - 0xF9 0xBE 0xB4 0xD9
command - name of command, 0 padded to 12 bytes "version\0\0\0\0\0"
size - 4 byte int
checksum (absent for messages without data and version messages) - 4 bytes
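To make the header layout above concrete, here is a minimal sketch that builds one in Python. The field order and sizes come from the list above. The little-endian size field and the checksum being the first 4 bytes of a double SHA-256 of the payload are my assumptions about the client, and `build_header` is a made-up name.

```python
import hashlib
import struct

MAGIC = bytes([0xF9, 0xBE, 0xB4, 0xD9])  # message start "magic bytes"


def build_header(command: str, payload: bytes, with_checksum: bool = True) -> bytes:
    """Build a message header: magic, 12-byte command, size, optional checksum."""
    name = command.encode("ascii")
    if len(name) > 12:
        raise ValueError("command name must fit in 12 bytes")
    header = MAGIC
    header += name.ljust(12, b"\x00")          # command, zero padded to 12 bytes
    header += struct.pack("<I", len(payload))  # size (assumed little-endian)
    if with_checksum:
        # Assumption: checksum = first 4 bytes of SHA-256(SHA-256(payload)).
        digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
        header += digest[:4]
    return header


if __name__ == "__main__":
    version = build_header("version", bytes(87), with_checksum=False)
    tx = build_header("tx", b"example payload")
    print(len(version), "byte header for version")
    print(len(tx), "byte header for tx, of which", 12 - len("tx"), "are command padding")
```

The 20-byte figure quoted for the version message is the header without the checksum. With the checksum it would be 24 bytes.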
Obviously, using protocol buffers here, while absolutely a breaking change, would save a fair bit of space. The "I've created a transaction" packet has the command name "tx", which is padded from 2 to 12 bytes, so there are at least 10 bytes of overhead in every one of those packets.
Why do you consider it a breaking change? There’s no reason a client couldn’t first try the new protocol and then retry using the old bitcoin serialization if that fails. Also, I think this is a change that should be made sooner rather than later, while the Bitcoin community is still small. The custom format has already been a major blocker for making new clients, and delaying the change is going to hamper bitcoin’s adoption.
The reason I didn’t use protocol buffers or boost serialization is because they looked too complex to make absolutely airtight and secure. Their code is too large to read and be sure that there’s no way to form an input that would do something unexpected.
I hate reinventing the wheel and only resorted to writing my own serialization routines reluctantly. The serialization format we have is as dead simple and flat as possible. There is no extra freedom in the way the input stream is formed. At each point, the next field in the data structure is expected. The only choices given are those that the receiver is expecting. There is versioning so upgrades are possible.
CAddress is about the only object with significant reserved space in it. (about 7 bytes for flags and 12 bytes for possible future IPv6 expansion)
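As an illustration of what "flat" means here, the sketch below parses an address record with fixed fields in a fixed order and rejects anything of the wrong length. The exact layout (8-byte service flags, 12 reserved bytes, 4-byte IPv4 address, 2-byte port in network byte order) is my assumption about CAddress, not something stated in this thread, and `parse_address` is a hypothetical helper.

```python
import socket
import struct

# Assumed layout: uint64 services, 12 reserved bytes, 4-byte IPv4, 2-byte port.
ADDRESS_FORMAT = struct.Struct("<Q12s4s2s")  # 26 bytes, no padding


def parse_address(data: bytes) -> dict:
    """Parse one fixed-size address record; any other length is rejected."""
    if len(data) != ADDRESS_FORMAT.size:
        raise ValueError("address record must be exactly %d bytes" % ADDRESS_FORMAT.size)
    services, reserved, ip, port = ADDRESS_FORMAT.unpack(data)
    return {
        "services": services,
        "reserved": reserved,
        "ip": socket.inet_ntoa(ip),
        "port": struct.unpack(">H", port)[0],  # port assumed big-endian
    }


if __name__ == "__main__":
    sample = struct.pack("<Q", 1) + bytes(12) + bytes([127, 0, 0, 1]) + struct.pack(">H", 8333)
    print(parse_address(sample))
```

The sender has no way to reorder fields, omit them, or add new ones, which is the property being described, and also the reason the reserved bytes have to be there up front.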
The larger things we have like blocks and transactions can’t be optimized much more for size. The bulk of their data is hashes and keys and signatures, which are incompressible. The serialization overhead is very small, usually 1 byte for size fields.
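The "usually 1 byte for size fields" point can be sketched as a variable-length size prefix: one byte for small values, with a marker byte followed by a wider integer for larger ones. The thresholds and marker bytes below are my understanding of the client's encoding and should be treated as an assumption. `encode_compact_size` is a name invented for this example.

```python
import struct


def encode_compact_size(n: int) -> bytes:
    """Encode a length so that values below 253 take a single byte."""
    if n < 0:
        raise ValueError("size cannot be negative")
    if n < 0xFD:
        return struct.pack("<B", n)            # 1 byte
    if n <= 0xFFFF:
        return b"\xfd" + struct.pack("<H", n)  # marker + 2 bytes
    if n <= 0xFFFFFFFF:
        return b"\xfe" + struct.pack("<I", n)  # marker + 4 bytes
    return b"\xff" + struct.pack("<Q", n)      # marker + 8 bytes


if __name__ == "__main__":
    for size in (32, 72, 253, 70000):
        print(size, "->", len(encode_compact_size(size)), "byte(s) of overhead")
```

Hashes, keys and signatures are all well under 253 bytes long, so their length prefix costs one byte each.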
On Gavin’s idea about an existing P2P broadcast infrastructure, I doubt one exists. There are few P2P systems that only need broadcast. There are some libraries like Chord that try to provide a distributed hash table infrastructure, but that’s a huge, difficult problem that we don’t need or want. Those libraries are also much harder to install than Bitcoin itself.
"The reason I didn’t use protocol buffers or boost serialization is because they looked too complex to make absolutely airtight and secure. Their code is too large to read and be sure that there’s no way to form an input that would do something unexpected."
I hate to sound rude, but that sounds like the danger with the SCRIPT field in transactions. You’re comfortable writing a whole evaluation language letting the blocks suggest operations to the client, but you’re not comfortable using a library like protocol buffers?
Would you consider including an option to write the wallet file out in protocol buffer format instead of the custom format? That way the default can be the custom format which you trust more, and users can export their wallet to protobuf format if they want to move to a new client.
— martin
Why not use XML for that case? The size of the wallet file on disk isn’t exactly a big concern when it comes to export, and XML compresses pretty well. Plus, it’s completely human readable - it would help people to understand what is actually stored.
Indeed, but the version packet is probably the smallest of all the packets sent, so we’ll gain more elsewhere. Also, keep an eye on the main point: the fact that protocol buffers are smaller is a nice aside to the fact that they’re forwards compatible and make bitcoin portable between languages.
Debugging is also easier with non-custom formats. Instead of being the only one using it, you have many other people on different projects looking for and fixing bugs. You also often get tools for decoding/displaying the packet to make it easier to see if something is wrong. IMHO the size of the packet is the least important reason.