The hand tracking server communicates with clients using a simple text protocol over a network socket. We have provided example source code for doing this communication in C++, C#, and Java. The code can serve as a component or library for your applications. You only need to understand the communication protocol if you are writing your own client code in a different language.
Hand tracking messages are sent as UTF-8 encoded plain text over TCP. A newline separates messages, and the space ' ' character tokenizes elements on each line. Each message begins with a string representing the message type. Message types can be divided into four classes: session messages, low-level "pose" data, mid-level "pinch" messages, and high-level "gesture" messages. Our protocol is designed to be modular, so your application may only need to parse and respond to one level of messages. For instance, if you're only interested in mouse emulation, you only need to parse the mid-level pinch messages: PRESSED, RELEASED, MOVED, and DRAGGED.
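For illustration, a minimal Java client might look like the sketch below. The host name and port here are placeholders; substitute the address your hand tracking server actually listens on.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HandTrackingClient {
    public static void main(String[] args) throws Exception {
        // Placeholder host and port -- not specified by this document.
        try (Socket socket = new Socket("localhost", 1988);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {   // messages are newline-separated
                String[] tokens = line.split(" ");     // elements are space-separated
                switch (tokens[0]) {                   // first token is the message type
                    case "PRESSED":  /* handle press   */ break;
                    case "RELEASED": /* handle release */ break;
                    case "MOVED":    /* handle move    */ break;
                    case "DRAGGED":  /* handle drag    */ break;
                    default: break;                    // session, pose, gesture, etc.
                }
            }
        }
    }
}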
Session messages provide basic information about the user and the server. Typically they are sent once at the start of the connection with the client.
Low-level "Pose" messages provide the raw position and skeleton information for each hand. A single message containing data for both hands is sent for every tracked frame.
Mid-level "pinch" messages such as moved, trigger pressed / peace_began, trigger released / peace ended, behave much like mouse events. Although information for the state of both hands is always included in each message, each mouse event applies to one hand. For instance, two moved events, one for each hand, is sent for every tracked frame.
We define gestures as pose or motion sequences lasting multiple frames. We currently provide a very limited set of gestures, but we'll be expanding this set soon!
For example, SIMULTANEOUSLY_PRESSED and INDIVIDUALLY_PRESSED messages fall under the class of gesture messages. They distinguish whether both hands pressed at the same time, which can be used to issue a different command than when each hand pressed individually. A short delay is needed to determine whether both hands pressed simultaneously or one hand pressed individually. Hence, these high-level messages are emitted slightly after their mid-level counterparts. More specifically, for a simultaneous press we wait up to 100 ms to determine whether the other hand has pressed at roughly the same time as the first. This means a PRESSED event will fire up to 100 ms earlier than the corresponding SIMULTANEOUSLY_PRESSED event. Similarly, SIMULTANEOUSLY_RELEASED / INDIVIDUALLY_RELEASED and DRAGGED_BIMANUAL take longer to fire than their mid-level counterparts RELEASED and DRAGGED.
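To make the timing concrete, the sketch below reconstructs the 100 ms disambiguation window described above. It is illustrative only; the server's actual implementation may differ.

public class PressDisambiguator {
    static final long WINDOW_MS = 100;          // the window described above
    private long firstPressTime = -1;
    private String firstPressHand = null;

    // Called when a mid-level PRESSED event arrives for a hand.
    void onPressed(String hand, long nowMs) {
        if (firstPressHand != null && !hand.equals(firstPressHand)
                && nowMs - firstPressTime <= WINDOW_MS) {
            emit("SIMULTANEOUSLY_PRESSED");     // second hand arrived inside the window
            firstPressHand = null;
        } else {
            firstPressTime = nowMs;             // open a new window for this hand
            firstPressHand = hand;
        }
    }

    // Called once the window has elapsed with no second press.
    void onWindowExpired(long nowMs) {
        if (firstPressHand != null && nowMs - firstPressTime > WINDOW_MS) {
            emit("INDIVIDUALLY_PRESSED");
            firstPressHand = null;
        }
    }

    private void emit(String message) { System.out.println(message); }
}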
Several message types are composed in part of a common basic message format: the message type, followed by, for each hand (left first, then right), a 3D position (x, y, z), an orientation quaternion (x, y, z, w), and a single state flag. The example messages at the end of this section show this layout.
POSE messages start with the basic message format (see above) but include additional information encoding the coordinate frames for each joint and the location of the fingertips. Each pose message contains the entire state of both hands. Each pose message also contains a confidence score for each of the seven hand poses per hand. Confidence scores are between 0.0 and 1.0, and sum to one for each hand.
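Because the full field layout of POSE messages is not reproduced here, the sketch below assumes the seven per-hand confidence scores have already been parsed into an array, and simply picks the most likely pose.

public class PoseUtil {
    // Returns the index of the highest-confidence pose for one hand,
    // given the seven confidence scores (which sum to 1.0).
    public static int mostLikelyPose(double[] confidences) {
        int best = 0;
        for (int i = 1; i < confidences.length; i++) {
            if (confidences[i] > confidences[best]) {
                best = i;
            }
        }
        return best;
    }
}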
PRESSED, DRAGGED, RELEASED, and MOVED messages begin with the basic message format (see above), followed by a referenced hand indicator, LEFT or RIGHT.
SIMULTANEOUSLY_PRESSED, INDIVIDUALLY_PRESSED, SIMULTANEOUSLY_RELEASED, INDIVIDUALLY_RELEASED, and DRAGGED_BIMANUAL messages also begin with the basic message format (see above), followed by a referenced hand indicator. For DRAGGED_BIMANUAL, the <referenced hand> indicator specifies whether the left hand, the right hand, or both hands are dragging.
POINT messages are sent to indicate what the user is pointing at. Currently they are sent all the time; in a future release we will be more discriminative. A POINT consists of a start point (first knuckle of the index finger) and an endpoint (the tip of the index finger).
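One common use is to turn a POINT into a pointing ray. The sketch below assumes the two 3D points have already been parsed, and normalizes the vector from the knuckle to the fingertip.

public class PointUtil {
    // Direction of the pointing ray: from the index finger's first knuckle
    // (start) toward the fingertip (end), normalized to unit length.
    public static double[] pointingDirection(double[] start, double[] end) {
        double dx = end[0] - start[0];
        double dy = end[1] - start[1];
        double dz = end[2] - start[2];
        double len = Math.sqrt(dx * dx + dy * dy + dz * dz);
        return new double[] { dx / len, dy / len, dz / len };
    }
}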
WELCOME messages encode the version of the server and the protocol used.
The format of the camera data is as follows.
Here, the extrinsics matrix is the rotation and translation taking us from world space into the OpenCV camera space (looking down the z axis).
The remaining parameters (fx, fy, cx, cy, k1...k6, p1...p2) are all taken from the OpenCV camera model. Note that k4, k5, and k6 are currently always zero; they are only included for forward compatibility with possible future cameras.
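For reference, the sketch below shows how these parameters fit together in the standard OpenCV model, projecting a world-space point to pixel coordinates. The extrinsics are passed here as a 3x3 rotation R and a translation t; the wire layout of the camera message itself is not shown.

public class CameraUtil {
    // Projects a world-space point into pixel coordinates using the standard
    // OpenCV camera model. R (row-major 3x3) and t are the extrinsics.
    public static double[] projectPoint(double[] p, double[][] R, double[] t,
                                        double fx, double fy, double cx, double cy,
                                        double k1, double k2, double k3,
                                        double k4, double k5, double k6,
                                        double p1, double p2) {
        // World space -> camera space (camera looks down the z axis).
        double X = R[0][0] * p[0] + R[0][1] * p[1] + R[0][2] * p[2] + t[0];
        double Y = R[1][0] * p[0] + R[1][1] * p[1] + R[1][2] * p[2] + t[1];
        double Z = R[2][0] * p[0] + R[2][1] * p[1] + R[2][2] * p[2] + t[2];
        // Perspective divide, then radial and tangential distortion.
        double x = X / Z, y = Y / Z;
        double r2 = x * x + y * y;
        double radial = (1 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2)
                      / (1 + k4 * r2 + k5 * r2 * r2 + k6 * r2 * r2 * r2);
        double xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x);
        double yd = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y;
        // Apply intrinsics to get pixel coordinates (u, v).
        return new double[] { fx * xd + cx, fy * yd + cy };
    }
}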
USER messages contain the user's profile name and the skinning information for each hand.
CALIBRATION messages contain progress information about the automatic hand calibration process and the current scale of the user's hands. This information is useful to applications that prompt users to calibrate their hands. Calibration progress is represented as a decimal between 0 and 1. Typical hand scales are numbers between 0.6 and 1.2.
A PRESSED event for the right hand, where the left hand is at position (1,2,3) with rotation (0,0,0,1) and the right hand is at position (4,5,6) with rotation (0,0,0,1), is reported as:
PRESSED 1.0 2.0 3.0 0.0 0.0 0.0 1.0 1 4.0 5.0 6.0 0.0 0.0 0.0 1.0 1 RIGHT
Two "move" messages, one for the left hand and one for the right hand, at coordinates (1,2,3) and (4,5,6) respectively, are separated by a newline (\n):
MOVED 1.0 2.0 3.0 0.0 0.0 0.0 1.0 0 4.0 5.0 6.0 0.0 0.0 0.0 1.0 0 LEFT
MOVED 1.0 2.0 3.0 0.0 0.0 0.0 1.0 0 4.0 5.0 6.0 0.0 0.0 0.0 1.0 0 RIGHT
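Following these examples, a pinch message can be tokenized as shown below. The field names are our own labels for the token positions seen in the examples above.

public class PinchParser {
    // Splits a pinch message such as the MOVED examples above into its tokens.
    public static void parse(String line) {
        String[] tok = line.split(" ");
        String type = tok[0];                              // e.g. MOVED, PRESSED
        double leftX = Double.parseDouble(tok[1]);         // left hand position
        double leftY = Double.parseDouble(tok[2]);
        double leftZ = Double.parseDouble(tok[3]);
        double[] leftQuat = { Double.parseDouble(tok[4]),  // left hand rotation
                              Double.parseDouble(tok[5]),  // (x, y, z, w)
                              Double.parseDouble(tok[6]),
                              Double.parseDouble(tok[7]) };
        int leftState = Integer.parseInt(tok[8]);          // per-hand state flag
        // Tokens 9..16 repeat the same layout for the right hand.
        String referencedHand = tok[tok.length - 1];       // LEFT or RIGHT
        System.out.println(type + " " + referencedHand);
    }
}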