I'm also interested in info others can share. Here are some ideas for conversation.
The Kinect v1 and v2 sensors deliver raw frames to the CPU at 30 fps, i.e. every 33ms. On top of that is the processing time needed to convert that raw data into a Max-compatible format. When I developed dp.kinect and dp.kinect2, I did performance profiling, hunting down large consumers of CPU. dp.kinect2 has the most recent benefits of that work: it uses careful parallelization and SIMD instructions. This processing time will vary based on your own computer config and the Kinect features you enable. Overwhelmingly, though, your processing time and lag are determined by the 1/30 second frame rate.
For conversation, let's simplify and assume the raw->Max format processing time of dp.kinect(2) is 0.0 ms. That means you can get new visual data from the sensor 33ms after the event occurs. The data is still rather raw. For example, you will have a 512x424 pixel, 16-bit monochrome bitmap that represents the IR image (aka irmap) seen by the sensor. Now you need to process that irmap. Are you going to do body/joint analysis and recognition? Are you going to look for a cluster of at least 50 pixels with values > 512 to "see" an LED on the tip of the conductor's wand? Are you going to use a library to build a neural network that recognizes gestures? There is measurable CPU work needed for this analysis/tracking step, and the CPU time required is not easily guessable; it needs prototyping.
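To make the LED-tracking idea concrete, here's a minimal sketch of that "cluster of at least 50 pixels with values > 512" approach. It assumes the irmap arrives as a flat row-major list of 16-bit values (as you might unpack from a jit.matrix); the function name, thresholds, and 4-connected flood fill are my own illustrative choices, not anything built into dp.kinect.

```python
from collections import deque

IR_W, IR_H = 512, 424          # Kinect v2 irmap dimensions
THRESHOLD = 512                # pixel value that counts as "bright"
MIN_CLUSTER = 50               # minimum blob size to accept as the LED

def find_led_clusters(irmap, width=IR_W, height=IR_H,
                      threshold=THRESHOLD, min_cluster=MIN_CLUSTER):
    """Return (cx, cy, size) centroids of connected bright-pixel clusters.

    irmap is a flat list of width*height 16-bit values, row-major.
    Uses a 4-connected BFS flood fill to group bright pixels.
    """
    visited = [False] * (width * height)
    clusters = []
    for start in range(width * height):
        if visited[start] or irmap[start] <= threshold:
            continue
        # BFS flood fill to collect one connected cluster of bright pixels
        queue = deque([start])
        visited[start] = True
        pixels = []
        while queue:
            idx = queue.popleft()
            pixels.append(idx)
            x, y = idx % width, idx // width
            for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                if 0 <= nx < width and 0 <= ny < height:
                    n = ny * width + nx
                    if not visited[n] and irmap[n] > threshold:
                        visited[n] = True
                        queue.append(n)
        # reject small specks; keep blobs big enough to be the LED
        if len(pixels) >= min_cluster:
            cx = sum(i % width for i in pixels) / len(pixels)
            cy = sum(i // width for i in pixels) / len(pixels)
            clusters.append((cx, cy, len(pixels)))
    return clusters
```

Even this naive version is a useful prototyping baseline: time it on your machine with real frames and you immediately know how much of the 33ms budget this one step consumes.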
Still, you already know you are at least 33ms from the actual event.
What about our minds/bodies? Given they're electrochemical systems, there is lag there too. It takes measurable time for the retina to respond to the photons of light, do some local processing, send that signal up to the brain, do low-level processing, and then higher-level cognitive processing. Then the brain sends a signal to the hands, arms, diaphragm, etc. to play the note "on the 1". There are studies suggesting the raw eye->brain path is 13-40ms and the end-to-end path (eye->brain->hand) is 250-400ms. However...if 250ms were the lag to respond to the downbeat, that would likely be problematic. Well...if we all operate with the same lag, it's not a problem...but let's set that aside for this discussion.
So how does a professional music player respond quicker than 250ms? With training and prediction. There are studies which suggest that the processing in the eye and in the larger nervous system does lag correction. It uses old data, trends of it, and new data to simulate/create/act on things which might not have actually been seen yet. When it eventually *is* seen, our bodies update the lag correction and continue this loop until we die.
Computers can do a similar thing. There are algorithms that can use old data to predict likely new data. Simple algorithms like Holt's double exponential smoothing are built into dp.kinect and dp.kinect2; you can tune the algorithm using the 5 parameters of the @smoothing attribute. There are also libraries from Microsoft (I have yet to integrate them into dp.kinect/2) and other parties which can create neural nets for gesture recognition. This latter gesture approach is interesting to me for your situation because you could train the net with a lot of real-world conductor data. I would hope the result is a gesture recognizer that can recognize all the different ways conductors hit the downbeat (relaxed, plucking, bouncy, forceful stomp, kamikaze plunge, etc.) and can accommodate the moving and swaying of the conductor's body, which, without special handling, greatly changes the real-world spatial location of the "1".
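For the curious, here's the core of Holt's double exponential smoothing in a few lines. This is only the textbook two-parameter form plus a trend-extrapolation term; the full filter family behind an attribute like @smoothing adds things such as jitter and deviation clamps, which I've left out of this sketch.

```python
def holt_filter(samples, alpha=0.5, beta=0.5, predict_ahead=0.0):
    """Holt double exponential smoothing over a 1-D stream of samples.

    alpha smooths the level (position), beta smooths the trend
    (velocity); predict_ahead extrapolates the trend forward, in
    sample units, trading smoothness for reduced perceived lag.
    """
    level = samples[0]
    trend = 0.0
    out = []
    for x in samples[1:]:
        prev_level = level
        # blend the new sample with the forecast from the last step
        level = alpha * x + (1 - alpha) * (level + trend)
        # blend the observed change in level with the old trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
        out.append(level + predict_ahead * trend)
    return out
```

The `predict_ahead` term is the interesting knob for this thread: a nonzero value makes the filter output where the joint is *headed*, not just a smoothed version of where it was, which is exactly the lag-correction trick discussed above.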
These gesture recognizers usually use neural nets which have a component of prediction. Training teaches the recognizer that when a "4" occurs and is followed by a downward movement of the wand, a "1" is likely coming. It keeps matching the incoming spatial and temporal data against its training, thereby tracking a "1" beginning to form. From past data it knows the velocities and accelerations typically associated with a "1". Given that historical model, the current velocities/accelerations, and the tempo of the preceding 1-2-3-4, it can predict temporally and spatially where/when the "1" will occur. With a prediction (which usually carries a likelihood factor) in time and space units, other algorithms can act on that data (e.g. drive a global clock).
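A neural net is overkill to illustrate the prediction step itself, so here's a deliberately simple stand-in, assuming we already track the wand tip's height: estimate vertical velocity from the last two samples and extrapolate when the tip will reach this conductor's learned "1" plane. All the names (`predict_downbeat`, `y_one`) are hypothetical.

```python
def predict_downbeat(t, y, y_one):
    """Linearly extrapolate when the wand tip will reach the "1" plane.

    t, y: the two most recent timestamps (seconds) and heights (meters)
          of the wand tip, oldest first.
    y_one: the learned height at which this conductor's "1" lands.
    Returns the predicted absolute time of the downbeat, or None if
    the wand is not currently descending toward that plane.
    """
    v = (y[1] - y[0]) / (t[1] - t[0])   # vertical velocity, m/s
    if v >= 0 or y[1] <= y_one:         # rising, or already past the plane
        return None
    return t[1] + (y_one - y[1]) / v    # time to cover remaining distance
```

A real recognizer would fit acceleration and the preceding tempo too, and attach a confidence to the estimate, but the output is the same kind of thing: a (time, likelihood) pair some downstream clock can act on *before* the "1" physically happens.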
Having another sensor with a faster frame rate may or may not improve your end result. Yes, you can feed more data with less lag into an algorithm in a given unit of time. This might allow faster-than-needed recognition or, perhaps more importantly, correction of an incorrect prediction. Does this improve the end result? Maybe, maybe not. Why? Because your conductor's tempo is likely less than 150 beats per minute -> 2.5 beats per second -> 400ms per beat. 400ms - 33ms of Kinect lag = 367ms for a gesture recognizer to process data, make a prediction, and get that prediction to the next algorithm (e.g. the clock).
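That budget arithmetic generalizes to any tempo and sensor, so it's worth writing down once (function name is mine, just for illustration):

```python
SENSOR_FPS = 30                      # Kinect v1/v2 frame rate
SENSOR_LAG_MS = 1000 / SENSOR_FPS    # ~33.3 ms per frame

def processing_budget_ms(bpm, sensor_lag_ms=SENSOR_LAG_MS):
    """Milliseconds left per beat for recognition + prediction + clock."""
    beat_ms = 60_000 / bpm           # one beat at this tempo
    return beat_ms - sensor_lag_ms
```

At 150 bpm this gives roughly 367ms of headroom; even a 60 fps sensor only buys back about 17ms of that beat, which is why the faster sensor mostly helps correction speed rather than the basic budget.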
367ms is a long time for computers. I have faith that a good gesture recognizer can generate output faster than that. The recognizer can operate in parallel with the incoming data; it does not have to wait for the next sensor heartbeat to output the gesture it recognized. This lets the clock's "1" event be async from the sensor heartbeat...and I think that's a good thing.