A brief look at YugabyteDB RPC API

with great power comes great responsibility

YugabyteDB is a horizontally scalable, 100% open source distributed SQL database providing Cassandra-compatible (YCQL) and PostgreSQL-compatible (YSQL) APIs. For the PostgreSQL API, YugabyteDB uses the actual PostgreSQL 11.2 engine and builds on top of it with a Raft consensus layer.

There are two components in the YugabyteDB architecture: master servers and tablet servers. Master servers know everything about the state of the cluster; they are like an address book of all objects and data stored in the database. Tablet servers store the data. Each tablet server runs an instance of PostgreSQL.

That was a very brief, very high-level, simplified view of the architecture, but it should give a good picture of what’s going on.

Masters and tablet servers communicate using a protobuf (Google Protocol Buffers) API.

Let’s locate all *.proto files (without test data):

cd /tmp
git clone https://github.com/yugabyte/yugabyte-db.git
cd yugabyte-db/
git checkout v2.11.1
find . -name "*.proto" | grep -v test | sort
./ent/src/yb/master/master_backup.proto
./src/yb/cdc/cdc_consumer.proto
./src/yb/cdc/cdc_service.proto
./src/yb/common/common.proto
./src/yb/common/pgsql_protocol.proto
./src/yb/common/ql_protocol.proto
./src/yb/common/redis_protocol.proto
./src/yb/common/wire_protocol.proto
./src/yb/consensus/consensus.proto
./src/yb/consensus/log.proto
./src/yb/consensus/metadata.proto
./src/yb/docdb/docdb.proto
./src/yb/fs/fs.proto
./src/yb/master/master.proto
./src/yb/master/master_types.proto
./src/yb/rocksdb/db/version_edit.proto
./src/yb/rpc/rpc_header.proto
./src/yb/rpc/rpc_introspection.proto
./src/yb/server/server_base.proto
./src/yb/tablet/metadata.proto
./src/yb/tablet/tablet.proto
./src/yb/tserver/backup.proto
./src/yb/tserver/pg_client.proto
./src/yb/tserver/remote_bootstrap.proto
./src/yb/tserver/tserver.proto
./src/yb/tserver/tserver_admin.proto
./src/yb/tserver/tserver_forward_service.proto
./src/yb/tserver/tserver_service.proto
./src/yb/util/encryption.proto
./src/yb/util/histogram.proto
./src/yb/util/opid.proto
./src/yb/util/pb_util.proto
./src/yb/util/version_info.proto
./src/yb/yql/cql/cqlserver/cql_service.proto
./src/yb/yql/redis/redisserver/redis_service.proto

From the above list it’s easy to get a rough picture of the YugabyteDB components:

  • the enterprise master backup service: it’s enterprise only in name; this service is fully open source and available under the Apache 2 license, just like any other YugabyteDB component (spoiler alert: it provides snapshot features),
  • change data capture (CDC) service,
  • the consensus service,
  • DocDB (YugabyteDB storage layer),
  • filesystem service,
  • master server service,
  • tablet server service,
  • CQL and Redis services,
  • common bits and bobs; here we can find the PGSQL and Redis protocols, the query layer protocol, and the wire protocol,
  • base RPC definitions and various utilities.

Let’s peek at one of the files; for brevity I’m going to skip the output.

cat ./src/yb/master/master.proto

The output tells us the following:

  • the protocol is proto2 based,
  • there are no non-standard protobuf features in use,
  • they’re all regular messages and service definitions.

§the protocol

What’s not clear from those proto files: YugabyteDB doesn’t use gRPC.

Instead, YugabyteDB uses an RPC protocol coming from HBase. The choice makes sense if we consider that the founder of Yugabyte, the company behind YugabyteDB, used to be the technical lead for HBase at Facebook.

To understand the wire format, we have to look at the HBase RPC protocol. The HBase RPC protocol description can be found here[1].

The protocol is pretty easy to understand. It follows the regular request/response paradigm. The client initiates a TCP connection, then sends requests over the established connection. For every request, the client receives a response. The connection can be established over TLS.

After opening a connection, the client sends a preamble and a connection header. In the case of YugabyteDB, this is simply YB1. The server does not respond on successful connection setup. Once the connection is established, the client sends requests and receives responses.
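
For illustration, here is a minimal connection setup sketch in Python following the description above. The address and port are my assumptions (7100 is the default master RPC port), and this is a sketch, not the official client:

import socket

# Connect to a master server; 7100 is the default master RPC port
# (adjust for your deployment).
sock = socket.create_connection(("127.0.0.1", 7100))

# Send the preamble/connection header: the literal bytes YB1.
# The server does not respond on successful connection setup.
sock.sendall(b"YB1")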

§the request

Every request consists of a header and the payload. The header is a protobuf message defined in rpc_header.proto:

message RemoteMethodPB {
  // Service name for the RPC layer.
  // The client created a proxy with this service name.
  // Example: yb.rpc_test.CalculatorService
  required string service_name = 1;

  // Name of the RPC method.
  required string method_name = 2;
};

// The header for the RPC request frame.
message RequestHeader {
  // A sequence number that is sent back in the Response. Hadoop specifies a uint32 and
  // casts it to a signed int. That is counterintuitive, so we use an int32 instead.
  // Allowed values (inherited from Hadoop):
  //   0 through INT32_MAX: Regular RPC call IDs.
  //   -2: Invalid call ID.
  //   -3: Connection context call ID.
  //   -33: SASL negotiation call ID.
  optional int32 call_id = 1;

  // RPC method being invoked.
  // Not used for "connection setup" calls.
  optional RemoteMethodPB remote_method = 2;

  // Propagate the timeout as specified by the user. Note that, since there is some
  // transit time between the client and server, if you wait exactly this amount of
  // time and then respond, you are likely to cause a timeout on the client.
  optional uint32 timeout_millis = 3;
}

To send the request, the client serializes the protobuf header and exactly one protobuf request message. The client calculates the byte lengths of the header and the payload, and computes the total length. The following is sent over the socket:

  • int32 total length,
  • uint32 header length,
  • protobuf serialized header,
  • uint32 message length,
  • protobuf serialized message.
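
A minimal Python sketch of this framing, assuming big-endian integers and a total length that counts everything after the 4-byte prefix (both are my assumptions; the Java client referenced below is the authority):

import struct

def frame_request(header_bytes: bytes, msg_bytes: bytes) -> bytes:
    # uint32 header length, serialized RequestHeader,
    # uint32 message length, serialized request message.
    body = (struct.pack(">I", len(header_bytes)) + header_bytes
            + struct.pack(">I", len(msg_bytes)) + msg_bytes)
    # int32 total length prefixes the whole frame.
    return struct.pack(">i", len(body)) + body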

We can look up the relevant method in the YugabyteDB Java client; the code is here[2]. From the Java code we can infer that YugabyteDB never sends more than one message: it’s always one header and one request message.

§the response

The server replies with a response. The response might arrive in chunks. The client is expected to read all chunks until receiving an EOF marker, which is indicated by four zero bytes.

The format of the response:

  • int32 total data length,
  • uint32 response header length,
  • protobuf serialized response header,
  • uint32 payload length,
  • protobuf serialized response message.
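
Continuing the Python sketch, reading one response frame could look like the following. Treating a zero total length as the EOF marker is my interpretation of the four zero bytes described above:

import struct

def read_exact(sock, n: int) -> bytes:
    # recv() may return partial chunks; loop until we have n bytes.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("connection closed mid-frame")
        buf += chunk
    return buf

def read_response_frame(sock):
    total = struct.unpack(">i", read_exact(sock, 4))[0]
    if total == 0:
        return None  # four zero bytes: the EOF marker
    frame = read_exact(sock, total)
    hdr_len = struct.unpack(">I", frame[:4])[0]
    header_bytes = frame[4:4 + hdr_len]  # serialized ResponseHeader
    off = 4 + hdr_len
    msg_len = struct.unpack(">I", frame[off:off + 4])[0]
    msg_bytes = frame[off + 4:off + 4 + msg_len]  # serialized response message
    return header_bytes, msg_bytes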

Before proceeding to read the expected response message, the client needs to look at the response header’s is_error property.

Let’s look up the ResponseHeader definition in rpc_header.proto:

message ResponseHeader {
  required int32 call_id = 1;

  // If this is set, then this is an error response and the
  // response message will be of type ErrorStatusPB instead of
  // the expected response type.
  optional bool is_error = 2 [ default = false ];

  // Byte offsets for side cars in the main body of the response message.
  // These offsets are counted AFTER the message header, i.e., offset 0
  // is the first byte after the bytes for this protobuf.
  repeated uint32 sidecar_offsets = 3;

}

If the is_error flag is true, the response is an error response and the client must parse it as an ErrorStatusPB, also defined in rpc_header.proto.

Otherwise, the response is the expected protobuf response object, as defined by the RPC service.

For example, the rpc ListMasters(ListMastersRequestPB) returns (ListMastersResponsePB); operation of the MasterService service defined in the master.proto file takes:

  • a header
  • a ListMastersRequestPB message

and expects:

  • ListMastersResponsePB as a response

If the RPC service were to return an error, instead of the ListMastersResponsePB message, it would return the ErrorStatusPB message.
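
In the Python sketch, the dispatch on is_error could look like this. The yb.* module names are hypothetical (they depend on how you compile the proto files with protoc), and I’m assuming ErrorStatusPB carries a message field:

# Hypothetical modules generated by protoc from rpc_header.proto and master.proto.
from yb.rpc_header_pb2 import ResponseHeader, ErrorStatusPB
from yb.master_pb2 import ListMastersResponsePB

def parse_list_masters(header_bytes: bytes, msg_bytes: bytes):
    header = ResponseHeader()
    header.ParseFromString(header_bytes)
    if header.is_error:
        # The payload is an ErrorStatusPB instead of the expected response type.
        err = ErrorStatusPB()
        err.ParseFromString(msg_bytes)
        raise RuntimeError(f"RPC error: {err.message}")
    resp = ListMastersResponsePB()
    resp.ParseFromString(msg_bytes)
    return resp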

§side cars

In selected cases the response contains so-called side cars. We can identify which calls reply with side cars by searching for set_rows_data_sidecar in the YugabyteDB code[3]. Currently, only the Read and Write operations of tserver_service.proto use them.

Each side car is a separate protobuf message. We can find out how to read them by looking at this Java client code[4].

Every request follows exactly the same semantics. Each new request implies a new request header. The request header contains the int32 call_id field. This is a sequence number identifying the request, and it is echoed back in the response as int32 call_id.
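
The call_id is what lets a client pipeline several requests over one connection and route replies as they come back. A small bookkeeping sketch (my own illustration, not how the Java client structures it):

import itertools

class CallTracker:
    def __init__(self):
        self._next_id = itertools.count(0)  # 0 through INT32_MAX are regular call ids
        self._pending = {}                  # call_id -> reply handler

    def register(self, on_reply) -> int:
        call_id = next(self._next_id)
        self._pending[call_id] = on_reply
        return call_id

    def dispatch(self, call_id: int, msg_bytes: bytes):
        # The ResponseHeader echoes the request's call_id; route accordingly.
        self._pending.pop(call_id)(msg_bytes)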

§services

A request header contains the optional RemoteMethodPB remote_method field. This field indicates the target service type for which the request is destined. For example, the MasterService is yb.master.MasterService. This full name is the result of combining the package value (yb.master) and the service name (MasterService).
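
With the hypothetical generated modules from earlier, building a request header aimed at the MasterService might look like this:

from yb.rpc_header_pb2 import RequestHeader, RemoteMethodPB  # hypothetical module name

header = RequestHeader(
    call_id=1,
    remote_method=RemoteMethodPB(
        service_name="yb.master.MasterService",  # package + service name
        method_name="ListMasters",
    ),
    timeout_millis=10000,
)
header_bytes = header.SerializeToString()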

In versions later than 2.11.1, the service name can be optionally declared with a yb.rpc.custom_service_name service option. Here’s an example:

service MasterCluster {
  option (yb.rpc.custom_service_name) = "yb.master.MasterService";
  ...
}

§locating the leader master

In most cases, we are communicating with the MasterService. When running YugabyteDB with a replication factor greater than 1, at any given time there is only one leader master; the other masters act as followers. In this case, we have to send our requests to the leader master. If we don’t, the master will reply with a NOT_THE_LEADER error.

Follower masters do not forward requests to the leader master. It’s the task of the client to locate and track the leader master.

To find the leader master, the client opens a connection to each master, sends a GetMasterRegistrationRequestPB message, and receives a GetMasterRegistrationResponsePB message.

The response message, found in master.proto, looks like this:

message GetMasterRegistrationResponsePB {
  // Node instance information is always set.
  required NodeInstancePB instance_id = 1;

  // These fields are optional, as they won't be set if there's an
  // error retrieving the host/port information.
  optional ServerRegistrationPB registration = 2;

  // This server's role in the consensus configuration.
  optional consensus.RaftPeerPB.Role role = 3;

  // Set if there an error retrieving the registration information.
  optional MasterErrorPB error = 4;
}

The consensus.RaftPeerPB.Role field contains the role of the master. It is defined in the consensus/metadata.proto, under the RaftPeerPB message:

  // The possible roles for peers.
  enum Role {
    // Indicates this node is a follower in the configuration, i.e. that it participates
    // in majorities and accepts Consensus::Update() calls.
    FOLLOWER = 0;

    // Indicates this node is the current leader of the configuration, i.e. that it
    // participates in majorities and accepts Consensus::Append() calls.
    LEADER = 1;

    // New peers joining a quorum will be in this role for both PRE_VOTER and PRE_OBSERVER
    // while the tablet data is being remote bootstrapped. The peer does not participate
    // in starting elections or majorities.
    LEARNER = 2;

    // Indicates that this node is not a participant of the configuration, i.e. does
    // not accept Consensus::Update() or Consensus::RequestVote() and cannot
    // participate in elections or majorities. This is usually the role of a node
    // that leaves the configuration.
    NON_PARTICIPANT = 3;

    // This peer is a read (async) replica and gets informed of the quorum write
    // activity and provides time-line consistent reads.
    READ_REPLICA = 4;

    UNKNOWN_ROLE = 7;
  };

After locating the leader master, the client can close connections to the non-leader masters and continue communicating with the leader master only.

The YugabyteDB master leader can change at any time during the cluster lifecycle. This might be due to:

  • the operator requesting a leader change,
  • an underlying infrastructure failure forcing a leader election.

When this happens, a message sent to what we think is the current leader will either:

  • not go through (if the leader is offline), or
  • be rejected with a NOT_THE_LEADER error.

The client should simply follow the same leader discovery procedure and retry the request.
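
Tying the sketches together, leader discovery could look roughly like this. Here call() is a hypothetical helper that frames one request, sends it, and parses one response using the pieces sketched earlier; the module names remain hypothetical:

from yb.metadata_pb2 import RaftPeerPB  # hypothetical generated modules
from yb.master_pb2 import (GetMasterRegistrationRequestPB,
                           GetMasterRegistrationResponsePB)

def find_leader_master(master_addrs):
    for host, port in master_addrs:
        try:
            resp = call(host, port,
                        service_name="yb.master.MasterService",
                        method_name="GetMasterRegistration",
                        request=GetMasterRegistrationRequestPB(),
                        response_type=GetMasterRegistrationResponsePB)
        except OSError:
            continue  # this master is unreachable; try the next one
        if resp.role == RaftPeerPB.LEADER:
            return host, port
    raise RuntimeError("no leader master found; retry discovery")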

§it’s undocumented and unofficial

The RPC API isn’t documented and it’s not an official YugabyteDB API. It’s very powerful and there are very interesting things one can build with it, but be advised:

  • you’re on your own,
  • there really isn’t any documentation,
  • figuring out the proper way to navigate through it means diving deep in the YugabyteDB source code,
  • there are undocumented nuances to what can be called, when, and how; stepping off the paved path can lead to trouble;
    • hence: do not test random stuff on a live system with important data; you can break your cluster.

A good reference on how to start working with it can be found in the yb-admin tool sources.

Happy hacking!