YugabyteDB is a horizontally scalable, 100% open source, distributed SQL database providing Cassandra-compatible (YCQL) and PostgreSQL-compatible (YSQL) APIs. For the PostgreSQL API, YugabyteDB uses the actual PostgreSQL 11.2 engine and builds on top of it with a Raft consensus layer.
There are two components in the YugabyteDB architecture: master servers and tablet servers. Master servers know everything about the state of the cluster; they are like an address book of all objects and data stored in the database. Tablet servers store the data, and each tablet server runs an instance of PostgreSQL.
That was a very brief, very high-level, simplified view of the architecture, but it should give a good picture of what’s going on.
Masters and tablet servers communicate using a protobuf (Google Protocol Buffers) API. Let’s locate all *.proto files (without test data):
./ent/src/yb/master/master_backup.proto
./src/yb/cdc/cdc_consumer.proto
./src/yb/cdc/cdc_service.proto
./src/yb/common/common.proto
./src/yb/common/pgsql_protocol.proto
./src/yb/common/ql_protocol.proto
./src/yb/common/redis_protocol.proto
./src/yb/common/wire_protocol.proto
./src/yb/consensus/consensus.proto
./src/yb/consensus/log.proto
./src/yb/consensus/metadata.proto
./src/yb/docdb/docdb.proto
./src/yb/fs/fs.proto
./src/yb/master/master.proto
./src/yb/master/master_types.proto
./src/yb/rocksdb/db/version_edit.proto
./src/yb/rpc/rpc_header.proto
./src/yb/rpc/rpc_introspection.proto
./src/yb/server/server_base.proto
./src/yb/tablet/metadata.proto
./src/yb/tablet/tablet.proto
./src/yb/tserver/backup.proto
./src/yb/tserver/pg_client.proto
./src/yb/tserver/remote_bootstrap.proto
./src/yb/tserver/tserver.proto
./src/yb/tserver/tserver_admin.proto
./src/yb/tserver/tserver_forward_service.proto
./src/yb/tserver/tserver_service.proto
./src/yb/util/encryption.proto
./src/yb/util/histogram.proto
./src/yb/util/opid.proto
./src/yb/util/pb_util.proto
./src/yb/util/version_info.proto
./src/yb/yql/cql/cqlserver/cql_service.proto
./src/yb/yql/redis/redisserver/redis_service.proto
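A command along these lines produces the list above when run from the repository root (the exact exclude pattern for test data is an assumption):

```shell
# Sketch: list all .proto files, skipping anything test-related
find . -name '*.proto' -not -path '*test*'
```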
From the above list it’s easy to roughly figure out the YugabyteDB components:
- the enterprise master backup service: it’s enterprise only in name; this is fully open source and available under the Apache 2 license, just like any other YugabyteDB component (spoiler alert: this service provides the snapshot features),
- change data capture (CDC) service,
- the consensus service,
- DocDB (YugabyteDB storage layer),
- filesystem service,
- master server service,
- tablet server service,
- CQL and Redis services,
- common bits and bobs: here we can find the PgSQL and Redis protocols, the query layer protocol, and the wire protocol,
- base RPC definitions and various utilities.
Let’s peek at one of the files; for brevity I’m going to skip the output.
The output tells us the following:
- the protocol is proto2,
- there are no non-standard protobuf features in use,
- they’re all regular messages and service definitions.
What’s not clear from those proto files: YugabyteDB doesn’t use gRPC.
Instead, YugabyteDB uses an RPC protocol coming from HBase. The choice is somewhat clear if we consider that the founder of Yugabyte, the company behind YugabyteDB, used to be the technical lead for HBase at Facebook.
To understand the wire format, we have to look at the HBase RPC protocol. The HBase RPC protocol description can be found here.
The protocol is pretty easy to understand. It follows the regular request/response paradigm. The client initiates a TCP connection, then sends requests over the established connection. To every request, the client receives a response. The connection can be established over TLS.
After opening a connection, the client sends a preamble and a connection header. In the case of YugabyteDB, this is simply YB1. The server does not respond on successful connection setup. Once the connection is established, the client sends requests and receives responses.
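As a minimal sketch of the connection setup (assuming the preamble is the ASCII bytes YB1 exactly as described above, and a hypothetical master address), it could look like this:

```python
import socket

# The connection preamble, per the text above: the ASCII bytes "YB1".
# Any further connection-header details are YugabyteDB implementation
# specifics; this sketch sends only what the article describes.
PREAMBLE = b"YB1"

def open_yb_connection(host: str, port: int) -> socket.socket:
    """Open a TCP connection and send the preamble.

    The server does not respond on successful setup, so there is
    nothing to read back at this point.
    """
    sock = socket.create_connection((host, port))
    sock.sendall(PREAMBLE)
    return sock

# Usage (hypothetical master address):
# conn = open_yb_connection("127.0.0.1", 7100)
```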
Every request consists of a header and the payload. The header is a protobuf message defined in rpc_header.proto. To send the request, the client serializes the protobuf header and exactly one protobuf request message. The client calculates the byte lengths of the header and the payload, and aggregates the total length. The total length, the serialized header, and the serialized request message are then sent over the socket.
We can look up the relevant method in the YugabyteDB Java client, the code is here. By looking at the Java code, we can infer that YugabyteDB never takes more than one message; it’s always one header and one request message.
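The framing can be sketched as follows. Note the assumption: protobuf messages are commonly written to the wire length-prefixed with a varint, so this sketch prefixes the header and the message with their varint-encoded sizes and leads with a 4-byte big-endian total length; the exact YugabyteDB byte layout should be confirmed against the Java client code referenced above.

```python
import struct

def encode_varint(value: int) -> bytes:
    """Standard protobuf base-128 varint encoding."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def frame_request(serialized_header: bytes, serialized_message: bytes) -> bytes:
    """Build one request frame: 4-byte total length, then the
    varint-delimited header, then the varint-delimited request
    message. The layout is an assumption, see the note above."""
    body = (
        encode_varint(len(serialized_header)) + serialized_header
        + encode_varint(len(serialized_message)) + serialized_message
    )
    return struct.pack(">i", len(body)) + body

# Example with placeholder bytes standing in for real protobuf payloads:
frame = frame_request(b"\x08\x01", b"\x12\x03abc")
```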
The server replies with a response. The response might arrive in chunks. The client is expected to read all chunks until receiving an EOF marker which is indicated by a byte array of 4
The format of the response:
- int32 total data length,
- uint32 response header length,
- protobuf-serialized response header,
- protobuf-serialized response message.
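Following the listed format exactly, a parsing sketch could look like this. Big-endian byte order (the common network byte order) is an assumption, as is the convention that the total data length counts everything after the leading int32:

```python
import struct

def parse_response(frame: bytes) -> tuple[bytes, bytes]:
    """Split one response frame into (header_bytes, message_bytes),
    following the format listed above: int32 total data length,
    uint32 response header length, then the two serialized protobufs.
    Byte-order and length-accounting conventions are assumptions."""
    total_len = struct.unpack_from(">i", frame, 0)[0]
    header_len = struct.unpack_from(">I", frame, 4)[0]
    header_start = 8
    header_end = header_start + header_len
    header_bytes = frame[header_start:header_end]
    # Everything remaining, up to the end of the data, is the message.
    message_bytes = frame[header_end:4 + total_len]
    return header_bytes, message_bytes

# Example with placeholder bytes: 2-byte "header", 3-byte "message".
hdr, msg = parse_response(struct.pack(">iI", 9, 2) + b"HHMMM")
```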
Before proceeding to read the assumed, expected response message, the client needs to look at the response header.
Let’s look up the ResponseHeader definition in the rpc_header.proto file. If the is_error flag is true, the response is an error response and the client must parse it as the ErrorStatusPB message, also defined in rpc_header.proto.
Otherwise, the response is the expected protobuf response object, as defined by the RPC service.
For example, the rpc ListMasters(ListMastersRequestPB) returns (ListMastersResponsePB); operation of the MasterService service defined in the master.proto file takes:
- a header,
- a ListMastersRequestPB request message,
and returns a ListMastersResponsePB as a response.
If the RPC service was to return an error, instead of the ListMastersResponsePB message, it would return the ErrorStatusPB message.
In selected cases the response contains so-called sidecars. We can identify which calls reply with sidecars by searching for set_rows_data_sidecar in the YugabyteDB code. Currently only the Write operations of the tserver_service.proto use them.
Every request follows exactly the same semantics. Each new request implies a new request header. The request header contains the int32 call_id field. This is a sequence number of executed requests, and it is returned in the response header as call_id.
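A sketch of the call_id bookkeeping (the class and method names are hypothetical; only the incrementing sequence number and the request-to-response matching come from the text):

```python
class CallIdSequence:
    """Assigns an incrementing call_id to each outgoing request and
    matches responses back to pending requests by that id."""

    def __init__(self) -> None:
        self._next_id = 0
        self._pending: dict[int, str] = {}  # call_id -> method name

    def next_call_id(self, method: str) -> int:
        call_id = self._next_id
        self._next_id += 1
        self._pending[call_id] = method
        return call_id

    def on_response(self, call_id: int) -> str:
        # The response header echoes the request's call_id, letting the
        # client find which pending request this response answers.
        return self._pending.pop(call_id)

seq = CallIdSequence()
first = seq.next_call_id("ListMasters")   # 0
second = seq.next_call_id("ListMasters")  # 1
```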
A request header contains the optional RemoteMethodPB remote_method field. This field indicates the target service type for which the request is destined. For example, the full name of the master service is yb.master.MasterService. This full name is the result of combining the package value (yb.master) and the service name (MasterService). In versions later than 2.11.1, the service name can be optionally declared with a yb.rpc.custom_service_name service option. Here’s an example:
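A hypothetical sketch of what such a declaration could look like (the service, message, and custom names here are made up for illustration; only the yb.rpc.custom_service_name option name comes from the text):

```proto
service MyBackendService {
  // Overrides the default package-derived full name with a custom one.
  option (yb.rpc.custom_service_name) = "yb.custom.MyService";

  rpc Ping(PingRequestPB) returns (PingResponsePB);
}
```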
§locating the leader master
In most cases, we are communicating with the MasterService. When running YugabyteDB with a replication factor greater than 1, at any given time there is only one leader master; the other masters are acting as followers. In this case, we have to send our requests to the leader master. If we don’t, the master will reply with an error. Follower masters do not forward requests to the leader master. It’s the task of the client to locate and track the leader master.
To find the leader master, the client opens a connection to each master and sends a request whose response contains the master’s role. The consensus.RaftPeerPB.Role field contains the role of the master. It is defined in consensus/metadata.proto, under the RaftPeerPB message.
After locating the leader master, the client can close connections to any non-leader master and continue communicating with the leader master only.
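The discovery procedure can be sketched like this; get_role is a hypothetical stand-in for opening a connection to a master, sending the request described above, and reading the consensus.RaftPeerPB.Role value out of the response:

```python
from typing import Callable, Optional

def find_leader_master(
    masters: list[str],
    get_role: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Ask every master for its Raft role and return the leader, if any.

    get_role is a placeholder for the real network call; it should
    return the role string reported by the master at that address.
    """
    for address in masters:
        if get_role(address) == "LEADER":
            return address
    return None

# Usage with a stubbed role lookup (addresses are hypothetical):
roles = {"m1:7100": "FOLLOWER", "m2:7100": "LEADER", "m3:7100": "FOLLOWER"}
leader = find_leader_master(list(roles), roles.get)  # "m2:7100"
```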
The YugabyteDB master leader can change at any time in the cluster lifecycle. This might be due to:
- the operator requesting a leader change,
- an underlying infrastructure failure forcing a leader election.
When this happens, a message sent to what we think is the current leader will either:
- not go through (if the leader is offline),
- be rejected with an error.
The client should simply follow the same leader discovery procedure and retry the request.
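The rediscover-then-retry flow can be sketched as below (all names are hypothetical; only the flow itself comes from the text):

```python
from typing import Callable, TypeVar

T = TypeVar("T")

class NotTheLeader(Exception):
    """Raised when the contacted master rejects the request."""

def call_with_leader_retry(
    discover_leader: Callable[[], str],
    send: Callable[[str], T],
    max_attempts: int = 3,
) -> T:
    """Send a request to the current leader, rediscovering it on failure."""
    last_error: Exception = NotTheLeader("no attempt made")
    for _ in range(max_attempts):
        leader = discover_leader()
        try:
            return send(leader)
        except (NotTheLeader, ConnectionError) as exc:
            # The leader changed or went away: rediscover and retry.
            last_error = exc
    raise last_error
```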
§it’s undocumented and unofficial
The RPC API isn’t documented and it’s not an official YugabyteDB API. It’s very powerful and there are very interesting things one can build with it, but be advised:
- you’re on your own,
- there really isn’t any documentation,
- figuring out the proper way to navigate it means diving deep into the YugabyteDB source code,
- there are undocumented nuances to what can be called when and how; stepping off the paved path can lead to trouble,
- hence: do not test random stuff on a live system with important data, you can break your cluster.
A good reference on how to start working with it can be found in the yb-admin tool sources: