Abstract
We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at the training stage. During the generation (i.e., inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills in the missing tokens. Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from the current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. e.g. 0.141 for T2M-GPT) on the HumanML3D dataset, and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be seamlessly applied to related tasks, such as text-guided temporal inpainting, without further model fine-tuning.
Paper: https://arxiv.org/abs/2312.00063
Code: https://github.com/EricGuo5513/momask-codes (coming Dec. 15)
Project Page: https://ericguo5513.github.io/momask/
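For anyone wondering what "residual tokens of increasing orders" means in practice, here is a rough toy sketch of residual quantization in plain Python. It is not the MoMask code: the codebooks are random rather than learned, the sizes (`dim`, `codes_per_layer`, `num_layers`) are made up, and the all-zero entry in each codebook is my own addition so that an extra layer can never make the reconstruction worse. The idea it illustrates is only the one from the abstract: the base layer picks the closest code, and each later layer quantizes whatever error the previous layers left behind.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected entries of the first len(tokens) layers."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy sizes; a real model learns its codebooks from motion data.
rng = random.Random(0)
dim, codes_per_layer, num_layers = 4, 16, 3

codebooks = []
for layer in range(num_layers):
    scale = 0.5 ** layer  # assumption: finer layers hold smaller corrections
    cb = [[rng.gauss(0, scale) for _ in range(dim)]
          for _ in range(codes_per_layer - 1)]
    cb.append([0.0] * dim)  # assumption: a "no correction" code per layer
    codebooks.append(cb)

frame = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for one pose feature
tokens = rvq_encode(frame, codebooks)          # one token per layer

# Reconstruction error when decoding with 1, 2, ... layers of tokens.
errors = [sq_error(frame, rvq_decode(tokens[:k], codebooks))
          for k in range(1, num_layers + 1)]
print(tokens, errors)
```

Each frame ends up as one small integer per layer, and decoding with more layers gets closer to the original vector. In the paper's setup, as the abstract describes it, the base-layer tokens are what the Masked Transformer fills in and the later layers are what the Residual Transformer predicts.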
This is cool, from a digital visuals perspective, because it is building out a very detailed library of human behaviors to model after. A richer catalog provides a lot more potential for engines that want to render more complex human behaviors.
But it's also kinda... illustrative of the soft upper limits of generative AI, as this is still ultimately a discrete (and presumably fairly limited) library of motions, which means visually distinct characters all fall into the same set of physical behaviors. Both a small rambunctious boy and an elderly infirm woman crossing a gap in the pavement in the same way isn't realistic. And while you can solve this by adding more data points, you still have all small rambunctious boys and all elderly infirm women crossing the gap in the pavement in the same way. And while you can solve this by adding variations... how much modeling are you really willing to do for sufficient variance? Idk.
We get generative in the sense that we can reskin a stick-figure model very quickly using a catalog of behaviors. But we don't get generative in the sense that the computer understands the biomechanics of a human body and can create these stick-figure models in believable states.