The hand tracking server communicates with clients using a simple text protocol over a network socket. We have provided example source code for doing this communication in C++, C#, and Java. The code can serve as a component or library for your applications. You only need to understand the communication protocol if you are writing your own client code in a different language.
Hand tracking messages are sent as UTF-8 encoded plain text over TCP. A newline separates messages, and the space ' ' character tokenizes elements on each line. Each message begins with a string representing the message type. Message types can be divided into four classes: session messages, low-level "pose" data, mid-level "pinch" messages, and high-level "gesture" messages. Our protocol is designed to be modular, so your application may only need to parse and respond to one level of messages. For instance, if you're only interested in mouse emulation, you only need to parse the mid-level pinch messages: PRESSED, RELEASED, MOVED, and DRAGGED.
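For illustration, a minimal Java client might look like the sketch below. The host name and port here are placeholders; substitute the address your hand tracking server actually listens on.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HandTrackingClient {
    public static void main(String[] args) throws Exception {
        // Placeholder host and port -- not specified by this document.
        try (Socket socket = new Socket("localhost", 1988);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {   // messages are newline-separated
                String[] tokens = line.split(" ");     // elements are space-separated
                switch (tokens[0]) {                   // first token is the message type
                    case "PRESSED":  /* handle press   */ break;
                    case "RELEASED": /* handle release */ break;
                    case "MOVED":    /* handle move    */ break;
                    case "DRAGGED":  /* handle drag    */ break;
                    default: break;                    // session, pose, gesture, etc.
                }
            }
        }
    }
}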
Session messages provide basic information about the user and the server. Typically they are sent once at the start of the connection with the client.
Low-level "Pose" messages provide the raw position and skeleton information for each hand. A single message containing data for both hands is sent for every tracked frame.
Mid-level "pinch" messages such as moved, trigger pressed / peace_began, trigger released / peace ended, behave much like mouse events. Although information for the state of both hands is always included in each message, each mouse event applies to one hand. For instance, two moved events, one for each hand, is sent for every tracked frame.
We define gestures as pose or motion sequences lasting multiple frames. We currently provide a very limited set of gestures, but we'll be expanding this set soon!
For example, SIMULTANEOUSLY_PRESSED and INDIVIDUALLY_PRESSED messages fall under the class of gesture messages. They distinguish whether both hands pressed at the same time, which can be used to issue a different command than when each hand pressed individually. A short delay is needed to determine whether both hands pressed simultaneously or one hand pressed individually. Hence, these high-level messages are emitted slightly after their mid-level counterparts. More specifically, for a simultaneous press we wait up to 100 ms to determine whether the other hand has pressed at roughly the same time as the first. This means a PRESSED event will fire up to 100 ms earlier than the corresponding SIMULTANEOUSLY_PRESSED event. Similarly, SIMULTANEOUSLY_RELEASED / INDIVIDUALLY_RELEASED and DRAGGED_BIMANUAL take longer to fire than their mid-level counterparts RELEASED and DRAGGED.
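To make the timing concrete, the sketch below reconstructs the 100 ms disambiguation window described above. It is illustrative only; the server's actual implementation may differ.

public class PressDisambiguator {
    static final long WINDOW_MS = 100;          // the window described above
    private long firstPressTime = -1;
    private String firstPressHand = null;

    // Called when a mid-level PRESSED event arrives for a hand.
    void onPressed(String hand, long nowMs) {
        if (firstPressHand != null && !hand.equals(firstPressHand)
                && nowMs - firstPressTime <= WINDOW_MS) {
            emit("SIMULTANEOUSLY_PRESSED");     // second hand arrived inside the window
            firstPressHand = null;
        } else {
            firstPressTime = nowMs;             // open a new window for this hand
            firstPressHand = hand;
        }
    }

    // Called once the window has elapsed with no second press.
    void onWindowExpired(long nowMs) {
        if (firstPressHand != null && nowMs - firstPressTime > WINDOW_MS) {
            emit("INDIVIDUALLY_PRESSED");
            firstPressHand = null;
        }
    }

    private void emit(String message) { System.out.println(message); }
}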
Several message types are composed in part of a common basic message format: the message type, followed by, for each hand (left first, then right), a 3D position (x, y, z), an orientation quaternion (x, y, z, w), and a single state flag. The example messages at the end of this section show this layout.
POSE messages start with the basic message format (see above) but include additional information encoding the coordinate frames for each joint and the location of the fingertips. Each pose message contains the entire state of both hands. Each pose message also contains a confidence score for each of the seven hand poses per hand. Confidence scores are between 0.0 and 1.0, and sum to one for each hand.
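Because the full field layout of POSE messages is not reproduced here, the sketch below assumes the seven per-hand confidence scores have already been parsed into an array, and simply picks the most likely pose.

public class PoseUtil {
    // Returns the index of the highest-confidence pose for one hand,
    // given the seven confidence scores (which sum to 1.0).
    public static int mostLikelyPose(double[] confidences) {
        int best = 0;
        for (int i = 1; i < confidences.length; i++) {
            if (confidences[i] > confidences[best]) {
                best = i;
            }
        }
        return best;
    }
}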
PRESSED, DRAGGED, RELEASED, and MOVED messages begin with the basic message format (see above), followed by a referenced hand indicator, LEFT or RIGHT.
SIMULTANEOUSLY_PRESSED, INDIVIDUALLY_PRESSED, SIMULTANEOUSLY_RELEASED, INDIVIDUALLY_RELEASED, and DRAGGED_BIMANUAL messages also begin with the basic message format (see above), followed by a referenced hand indicator. For DRAGGED_BIMANUAL, the <referenced hand> indicator specifies whether the left hand, the right hand, or both hands are dragging.
POINT messages are sent to indicate what the user is pointing at. Currently they are sent all the time; in a future release we will be more discriminative. A POINT consists of a start point (first knuckle of the index finger) and an endpoint (the tip of the index finger).
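One common use is to turn a POINT into a pointing ray. The sketch below assumes the two 3D points have already been parsed, and normalizes the vector from the knuckle to the fingertip.

public class PointUtil {
    // Direction of the pointing ray: from the index finger's first knuckle
    // (start) toward the fingertip (end), normalized to unit length.
    public static double[] pointingDirection(double[] start, double[] end) {
        double dx = end[0] - start[0];
        double dy = end[1] - start[1];
        double dz = end[2] - start[2];
        double len = Math.sqrt(dx * dx + dy * dy + dz * dz);
        return new double[] { dx / len, dy / len, dz / len };
    }
}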
WELCOME messages encode the version of the server and the protocol used.
The format of the camera data is as follows.
Here, the extrinsics matrix is the rotation and translation taking us from world space into the OpenCV camera space (looking down the z axis).
The remaining parameters (fx, fy, cx, cy, k1...k6, p1...p2) are all taken from the OpenCV camera model. Note that k4, k5, and k6 are currently always zero; they are only included for forward compatibility with possible future cameras.
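For reference, the sketch below shows how these parameters fit together in the standard OpenCV model, projecting a world-space point to pixel coordinates. The extrinsics are passed here as a 3x3 rotation R and a translation t; the wire layout of the camera message itself is not shown.

public class CameraUtil {
    // Projects a world-space point into pixel coordinates using the standard
    // OpenCV camera model. R (row-major 3x3) and t are the extrinsics.
    public static double[] projectPoint(double[] p, double[][] R, double[] t,
                                        double fx, double fy, double cx, double cy,
                                        double k1, double k2, double k3,
                                        double k4, double k5, double k6,
                                        double p1, double p2) {
        // World space -> camera space (camera looks down the z axis).
        double X = R[0][0] * p[0] + R[0][1] * p[1] + R[0][2] * p[2] + t[0];
        double Y = R[1][0] * p[0] + R[1][1] * p[1] + R[1][2] * p[2] + t[1];
        double Z = R[2][0] * p[0] + R[2][1] * p[1] + R[2][2] * p[2] + t[2];
        // Perspective divide, then radial and tangential distortion.
        double x = X / Z, y = Y / Z;
        double r2 = x * x + y * y;
        double radial = (1 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2)
                      / (1 + k4 * r2 + k5 * r2 * r2 + k6 * r2 * r2 * r2);
        double xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x);
        double yd = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y;
        // Apply intrinsics to get pixel coordinates (u, v).
        return new double[] { fx * xd + cx, fy * yd + cy };
    }
}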
USER messages contain the user's profile name and the skinning information for each hand.
CALIBRATION messages contain progress information about the automatic hand calibration process and the current scale of the user's hands. This information is useful to applications that prompt users to calibrate their hands. Calibration progress is represented as a decimal between 0 and 1. Typical hand scales are numbers between 0.6 and 1.2.
A PRESSED event for the right hand, where the left hand is at position (1,2,3) with rotation (0,0,0,1) and the right hand is at position (4,5,6) with rotation (0,0,0,1), is reported as:
PRESSED 1.0 2.0 3.0 0.0 0.0 0.0 1.0 1 4.0 5.0 6.0 0.0 0.0 0.0 1.0 1 RIGHT
Two "move" messages, one for the left hand and one for the right hand, at coordinates (1,2,3) and (4,5,6) respectively, are separated by a newline (\n):
MOVED 1.0 2.0 3.0 0.0 0.0 0.0 1.0 0 4.0 5.0 6.0 0.0 0.0 0.0 1.0 0 LEFT
MOVED 1.0 2.0 3.0 0.0 0.0 0.0 1.0 0 4.0 5.0 6.0 0.0 0.0 0.0 1.0 0 RIGHT
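Following these examples, a pinch message can be tokenized as shown below. The field names are our own labels for the token positions seen in the examples above.

public class PinchParser {
    // Splits a pinch message such as the MOVED examples above into its tokens.
    public static void parse(String line) {
        String[] tok = line.split(" ");
        String type = tok[0];                              // e.g. MOVED, PRESSED
        double leftX = Double.parseDouble(tok[1]);         // left hand position
        double leftY = Double.parseDouble(tok[2]);
        double leftZ = Double.parseDouble(tok[3]);
        double[] leftQuat = { Double.parseDouble(tok[4]),  // left hand rotation
                              Double.parseDouble(tok[5]),  // (x, y, z, w)
                              Double.parseDouble(tok[6]),
                              Double.parseDouble(tok[7]) };
        int leftState = Integer.parseInt(tok[8]);          // per-hand state flag
        // Tokens 9..16 repeat the same layout for the right hand.
        String referencedHand = tok[tok.length - 1];       // LEFT or RIGHT
        System.out.println(type + " " + referencedHand);
    }
}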