YugabyteDB is a horizontally scalable, 100% open source, distributed SQL database providing Cassandra-compatible (YCQL) and PostgreSQL-compatible (YSQL) APIs. For the PostgreSQL API, YugabyteDB uses the actual PostgreSQL 11.2 engine and builds on top of it with a Raft consensus layer.
There are two components in the YugabyteDB architecture: master servers and tablet servers. Master servers know everything about the state of the cluster; they are like an address book of all objects and data stored in the database. Tablet servers store the data. Each tablet server runs an instance of PostgreSQL.
That was a very brief, very top-level, simplified view of the architecture, but it should give a good picture of what’s going on.
Masters and tablet servers communicate using a protobuf (Google Protocol Buffers) API. Let’s locate all *.proto files (without test data):
```
./ent/src/yb/master/master_backup.proto
./src/yb/cdc/cdc_consumer.proto
./src/yb/cdc/cdc_service.proto
./src/yb/common/common.proto
./src/yb/common/pgsql_protocol.proto
./src/yb/common/ql_protocol.proto
./src/yb/common/redis_protocol.proto
./src/yb/common/wire_protocol.proto
./src/yb/consensus/consensus.proto
./src/yb/consensus/log.proto
./src/yb/consensus/metadata.proto
./src/yb/docdb/docdb.proto
./src/yb/fs/fs.proto
./src/yb/master/master.proto
./src/yb/master/master_types.proto
./src/yb/rocksdb/db/version_edit.proto
./src/yb/rpc/rpc_header.proto
./src/yb/rpc/rpc_introspection.proto
./src/yb/server/server_base.proto
./src/yb/tablet/metadata.proto
./src/yb/tablet/tablet.proto
./src/yb/tserver/backup.proto
./src/yb/tserver/pg_client.proto
./src/yb/tserver/remote_bootstrap.proto
./src/yb/tserver/tserver.proto
./src/yb/tserver/tserver_admin.proto
./src/yb/tserver/tserver_forward_service.proto
./src/yb/tserver/tserver_service.proto
./src/yb/util/encryption.proto
./src/yb/util/histogram.proto
./src/yb/util/opid.proto
./src/yb/util/pb_util.proto
./src/yb/util/version_info.proto
./src/yb/yql/cql/cqlserver/cql_service.proto
./src/yb/yql/redis/redisserver/redis_service.proto
```
From the above list it’s easy to roughly figure out the YugabyteDB components:
- the enterprise master backup service: it’s enterprise in name only; this is fully open source and available under the Apache 2 license, just like any other YugabyteDB component (spoiler alert: this service provides the snapshot features),
- change data capture (CDC) service,
- the consensus service,
- DocDB (YugabyteDB storage layer),
- filesystem service,
- master server service,
- tablet server service,
- CQL and Redis services,
- common bits and bobs: here we can find the PGSQL and Redis protocols, the query layer protocol, and the wire protocol,
- base RPC definitions and various utilities.
Let’s peek at one of the files; for brevity, I’m going to skip the output.
The output tells us the following:
- the protocol is proto2 based,
- there are no non-standard protobuf features in use,
- they’re all regular messages and service definitions.
§the protocol
What’s not clear from those proto files: YugabyteDB doesn’t use gRPC.
Instead, YugabyteDB uses an RPC protocol that comes from HBase. The choice makes sense if we consider that the founder of Yugabyte, the company behind YugabyteDB, used to be the technical lead for HBase at Facebook.
To understand the wire format, we have to look at the HBase RPC protocol. The HBase RPC protocol description can be found here[1].
The protocol is pretty easy to understand. It follows the regular request/response paradigm. The client initializes a TCP connection, then sends requests over the established connection. To every request, the client receives a response. The connection can be established over TLS.
After opening a connection, the client sends a preamble and a connection header. In the case of YugabyteDB, this is simply YB1. The server does not respond on successful connection setup. Once the connection is established, the client sends requests and receives responses.
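To make this concrete, here’s a minimal Python sketch of the connection setup. The address is illustrative, not something stated above; 7100 is the default master RPC port.

```python
import socket

# Open a TCP connection to a master and send the YB1 preamble.
# Host and port are illustrative; 7100 is the default master RPC port.
sock = socket.create_connection(("127.0.0.1", 7100))
sock.sendall(b"YB1")  # the server sends nothing back on a successful setup
```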
§the request
Every request consists of a header and the payload. The header is a protobuf message defined in rpc_header.proto.
To send the request, the client serializes the protobuf header and exactly one protobuf request message. The client calculates the byte lengths of the header and the payload, and adds them up into the total length. The following is sent over the socket:
- int32 total length,
- uint32 header length,
- protobuf serialized header,
- uint32 message length,
- protobuf serialized message.
We can look up the relevant method in the YugabyteDB Java client; the code is here[2]. By looking at the Java code, we can infer that YugabyteDB never takes more than one message: it’s always one header and one request message.
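Here’s a rough Python sketch of that framing. One assumption on my side, not spelled out above: the header and message length prefixes are written as protobuf varint-encoded uint32 values, which is what the length calculations in the Java client suggest; treat the linked Java code as the authoritative reference.

```python
import struct

def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a protobuf varint."""
    out = bytearray()
    while True:
        bits = value & 0x7F
        value >>= 7
        if value:
            out.append(bits | 0x80)
        else:
            out.append(bits)
            return bytes(out)

def frame_request(serialized_header: bytes, serialized_message: bytes) -> bytes:
    """Build one request frame: exactly one header and one request message."""
    body = (
        encode_varint(len(serialized_header)) + serialized_header
        + encode_varint(len(serialized_message)) + serialized_message
    )
    # int32 total length (big-endian, assumed not to count its own 4 bytes),
    # followed by the two length-prefixed protobuf payloads.
    return struct.pack(">i", len(body)) + body
```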
§the response
The server replies with a response. The response might arrive in chunks. The client is expected to read all chunks until receiving an EOF marker, which is indicated by a byte array of four 0 bytes.
The format of the response:
- int32 total data length,
- uint32 response header length,
- protobuf serialized response header,
- uint32 payload length,
- protobuf serialized response message.
Before proceeding to read the expected response message, the client needs to look at the response header’s is_error property.
Let’s look up the ResponseHeader definition in rpc_header.proto. If the is_error flag is true, the response is an error response and the client must parse it as the ErrorStatusPB message, also defined in rpc_header.proto. Otherwise, the response is the expected protobuf response object, as defined by the RPC service.
For example, the rpc ListMasters(ListMastersRequestPB) returns (ListMastersResponsePB); operation of the MasterService service defined in the master.proto file takes:
- a header,
- a ListMastersRequestPB message,
and expects a ListMastersResponsePB as the response.
If the RPC service were to return an error, instead of the ListMastersResponsePB message it would return the ErrorStatusPB message.
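Mirroring the request side, here’s a rough Python sketch of reading a single response frame and splitting it into header and payload bytes. The same varint length prefix assumption applies, chunking and the EOF marker are not handled, and decoding the header or payload would require the classes generated from rpc_header.proto and the service’s proto file.

```python
import socket
import struct

def read_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed while reading a response")
        buf += chunk
    return buf

def decode_varint(data: bytes, pos: int = 0):
    """Decode a protobuf varint, returning (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        result |= (byte & 0x7F) << shift
        pos += 1
        if not byte & 0x80:
            return result, pos
        shift += 7

def read_response_frame(sock: socket.socket):
    """Return (header bytes, payload bytes) of one response frame."""
    total_length = struct.unpack(">i", read_exact(sock, 4))[0]
    body = read_exact(sock, total_length)
    header_length, pos = decode_varint(body)
    header_bytes = body[pos:pos + header_length]  # parse as ResponseHeader, check is_error
    pos += header_length
    payload_length, pos = decode_varint(body, pos)
    # The expected response message, or ErrorStatusPB when is_error is true.
    payload_bytes = body[pos:pos + payload_length]
    return header_bytes, payload_bytes
```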
§side cars
In selected cases the response consists of so-called side cars. We can identify which calls reply with side cars by searching for set_rows_data_sidecar in the YugabyteDB code[3]. Currently, only the Read and Write operations of tserver_service.proto use them.
Each side car is a separate protobuf message. We can find out how to read them by looking at this Java client code[4].
Every request follows exactly the same semantics. Each new request implies a new request header. The request header contains the int32 call_id field. This is a sequence number of executed requests, and it is returned in the response as int32 call_id.
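As a tiny illustration of that bookkeeping, here’s a Python sketch; the class and its methods are made up for this post and don’t exist in any YugabyteDB client.

```python
import itertools

class CallIdSequence:
    """Hands out increasing call ids and checks that responses echo them."""

    def __init__(self) -> None:
        self._counter = itertools.count(start=1)

    def next_call_id(self) -> int:
        # The value to put into the next request header's call_id field.
        return next(self._counter)

    @staticmethod
    def matches(request_call_id: int, response_call_id: int) -> bool:
        # The response header's call_id must echo the request's call_id.
        return request_call_id == response_call_id
```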
§services
A request header contains the optional RemoteMethodPB remote_method field. This field indicates the target service for which the request is destined. For example, the full name of the MasterService is yb.master.MasterService. This full name is the result of combining the package value (yb.master) and the service name (MasterService).
In versions later than 2.11.1, the service name can optionally be declared with a yb.rpc.custom_service_name service option.
§locating the leader master
In most cases, we are communicating with the MasterService. When running YugabyteDB with a replication factor greater than 1, at any given time there is only one leader master; the other masters act as followers. In this case, we have to send our requests to the leader master. If we don’t, the master will reply with a NOT_THE_LEADER error.
Follower masters do not forward requests to the leader master. It’s the task of the client to locate and track the leader master.
To find the leader master, the client opens a connection to each master and sends a GetMasterRegistrationRequestPB message, expecting a GetMasterRegistrationResponsePB in return. Both messages are defined in master.proto.
The consensus.RaftPeerPB.Role field of the response contains the role of the master. It is defined in consensus/metadata.proto, under the RaftPeerPB message.
After locating the leader master, the client can close connections to any non-leader masters and continue communicating with the leader master only.
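A high-level sketch of that discovery loop could look like the Python below. The send_get_master_registration helper and the leader_role value are stand-ins for code built on the framing shown earlier plus the generated proto classes; neither exists under these names in YugabyteDB.

```python
def find_leader_master(master_addresses, send_get_master_registration, leader_role):
    """Return the (host, port) of the current leader master, or None if not found.

    master_addresses: iterable of (host, port) pairs for all masters.
    send_get_master_registration: hypothetical helper that frames the
        GetMasterRegistration RPC and returns the decoded
        GetMasterRegistrationResponsePB.
    leader_role: the LEADER value of consensus.RaftPeerPB.Role.
    """
    for host, port in master_addresses:
        try:
            response = send_get_master_registration(host, port)
        except OSError:
            continue  # this master is unreachable; try the next one
        if response.role == leader_role:
            return host, port
    return None
```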
The YugabyteDB master leader can change at any time of the cluster lifecycle. This might be due to:
- the operator requesting a leader change,
- an underlying infrastructure failure forcing a leader election.
When this happens, a message sent to what we think is the current leader will either:
- not go through (if the leader is offline), or
- be rejected with a NOT_THE_LEADER error.
The client should simply follow the same leader discovery procedure and retry the request.
§it’s undocumented and unofficial
The RPC API isn’t documented and it’s not an official YugabyteDB API. It’s very powerful and there are very interesting things one can build with it, but be advised:
- you’re on your own,
- there really isn’t any documentation,
- figuring out the proper way to navigate through it means diving deep in the YugabyteDB source code,
- there are undocumented nuances to what can be called when and how, and stepping out of the paved path can lead to trouble;
- hence: do not test random stuff on a live system with important data, you can break your cluster.
A good reference on how to start working with it can be found in the yb-admin tool sources.
Happy hacking!