CLIP Series Study Notes (6): MotionCLIP

MotionCLIP (ECCV 2022)

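Project page: https://guytevet.github.io/motionclip-page/

Code: https://github.com/GuyTevet/MotionCLIP

The project page is fun and worth a visit. A project whose code is well written and fairly complete (the training code is included) deserves the extra time it takes to read closely. Unlike other papers, this one does not try to improve CLIP or to reproduce CLIP in another domain. Instead, it uses CLIP's aligned image and text embedding space to improve a motion auto-encoder. CLIP therefore only supplies the supervision signal throughout training; it is kept frozen rather than fine-tuned.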

Abstract:

We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label's position in CLIP-space. We further leverage CLIP's unique visual understanding and inject an even stronger signal through aligning motion to rendered frames in a self-supervised manner. We show that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification. For example, the text prompt "couch" is decoded into a sitting down motion, due to lingual similarity, and the prompt "Spiderman" results in a web-swinging-like solution that is far from seen during training. In addition, we show how the introduced latent space can be leveraged for motion interpolation, editing and recognition.
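MotionCLIP is a 3D motion auto-encoder that produces a disentangled, well-behaved latent embedding supporting highly semantic and fine-grained descriptions. To this end, we employ CLIP, a large-scale visual-textual embedding model. Our key insight is that although CLIP was never trained on the motion domain, we can inherit many of the virtues of its latent space by enforcing its powerful, semantic structure onto the motion domain. To do so, we train a transformer-based auto-encoder that is aligned to the latent space of CLIP, using existing motion text labels.
In other words, we train an encoder to find the proper embedding of an input sequence in CLIP space, and a decoder that generates the most fitting motion for a given CLIP-space latent code. To further improve the alignment with CLIP space, we also use CLIP's visual encoder and guide the alignment in a self-supervised manner through synthetically rendered frames. As we show, this step is crucial for out-of-domain generalization, because it allows a finer-grained description of the motion than text alone can provide.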

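Put simply, the paper builds an auto-encoder out of transformers, but the loss is not only the usual auto-encoder reconstruction loss: it also includes losses that align the latent space with CLIP's image and text embeddings. The authors argue that aligning human motion with CLIP space brings two benefits. First, tying the geometric motion space to a semantic space helps semantic descriptions of motion (that is, motions line up better with language). Second, it also yields a better-behaved motion latent space.

The authors give an example of the good results that come from the learned alignment between semantics and motion: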

For example, the CLIP embedding for the phrase "wings" is decoded into a flapping motion like a bird, and "Williams sisters" into a tennis serve, since these terms are encoded close to motion seen during training, thanks to CLIP's semantic understanding.

Technical details

Our training process is illustrated in Figure 2. We train a transformer-based motion auto-encoder, while aligning the latent motion manifold to CLIP joint representation. We do so using

(i) a Text Loss, connecting motion representations to the CLIP embedding of their text labels, and

(ii) an Image Loss, connecting motion representations to CLIP embedding of rendered images that depict the motion visually.

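At inference time, the latent space supports semantic editing:

For style transfer, for example, find a latent vector that represents the style and simply add it to the latent vector of the motion.

For classification, simply check which class's vector the motion lies closest to in the latent space. Going further, the CLIP text encoder gives text-to-motion: the input text is encoded with the CLIP text encoder, and the resulting embedding is fed directly into the motion decoder.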
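The two latent-space operations described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the 3-D vectors are toy stand-ins for real CLIP-space latents, cosine similarity is assumed as the closeness measure, and the names `add_style` and `classify` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def add_style(z_motion, z_style, strength=1.0):
    """Semantic edit: shift a motion latent along a style direction."""
    return [m + strength * s for m, s in zip(z_motion, z_style)]

def classify(z_motion, class_latents):
    """Return the class whose latent is closest (assumed: cosine similarity)."""
    return max(class_latents, key=lambda name: cosine(z_motion, class_latents[name]))

# Toy latents (hypothetical values, not real CLIP embeddings).
class_latents = {"walk": [1.0, 0.0, 0.0], "jump": [0.0, 1.0, 0.0]}
z_walk = [0.9, 0.1, 0.0]
z_style = [0.0, 0.0, 0.5]

z_edited = add_style(z_walk, z_style)          # content latent plus style latent
predicted = classify(z_edited, class_latents)  # nearest class in latent space
```

In the real model, the edited latent would be passed to the motion decoder, and the class latents would typically come from the CLIP text encoder applied to the class names.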

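Motion sequences are represented with the SMPL body model. A sequence of length $T$ is written $p_{1:T}$, where $p_i \in \mathbb{R}^{24 \times 6}$ holds, for frame $i$, 24 orientations in the 6D rotation representation: one for the global body orientation and 23 for the SMPL joints. The mesh vertex locations $v_{1:T}$ are computed according to the SMPL specification, with the shape parameter set to $\beta = 0$ and a gender-neutral body model.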
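To make the 6D rotation representation concrete, here is a minimal sketch of the usual Gram-Schmidt conversion from a 6D vector to a 3x3 rotation matrix. It assumes the six numbers are two 3D vectors that become the first two columns of the matrix; implementations differ on the row or column convention, so treat that as an assumption rather than the repository's exact layout.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    """3D cross product."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rot6d_to_matrix(r6):
    """Gram-Schmidt: orthonormalize two 3D vectors, then complete the basis."""
    a1, a2 = r6[:3], r6[3:]
    b1 = normalize(a1)
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)
    # b1, b2, b3 are the columns of the rotation matrix (assumed convention).
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

# One frame: 24 orientations (global orientation + 23 joints), 6 numbers each.
frame = [[1.0, 0.0, 0.0, 0.0, 1.0, 0.0] for _ in range(24)]
rotations = [rot6d_to_matrix(r6) for r6 in frame]
```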

Transformer Encoder.

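The encoder, denoted $E$, maps a motion sequence $p_{1:T}$ to its latent representation $z_p$. Each frame is embedded into the encoder's dimension independently by a linear projection, and a standard positional encoding is added. The embedded sequence is then fed to the transformer encoder together with an additional learned prefix token $z_{tk}$. The latent representation $z_p$ is the first output of the transformer encoder (the remaining outputs are discarded), i.e. $z_p = E(z_{tk}, p_{1:T})$.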

MotionCLIP encoder transformer: the forward method
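The data flow of the encoder can be sketched without any deep-learning library. This is a toy illustration, not the repository's forward method: the stack of transformer layers is replaced by one single-head self-attention step with identity projections, and all weights and sizes are hypothetical placeholders.

```python
import math

def linear(x, weight, bias):
    """y = W x + b for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def positional_encoding(pos, dim):
    """Standard sinusoidal positional encoding for one position."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def self_attention(tokens):
    """Single-head attention with identity projections (stand-in for the layers)."""
    dim = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[j] for w, v in zip(weights, tokens))
                        for j in range(dim)])
    return outputs

def encode(frames, weight, bias, prefix_token):
    dim = len(prefix_token)
    embedded = [linear(f, weight, bias) for f in frames]        # per-frame projection
    embedded = [[e + p for e, p in zip(x, positional_encoding(t, dim))]
                for t, x in enumerate(embedded)]                # add positions
    tokens = [prefix_token] + embedded                          # prepend prefix token
    return self_attention(tokens)[0]                            # keep first output only

# Toy sizes: 5 frames of 24 x 6 = 144 features, latent dimension 8.
num_frames, feat, dim = 5, 24 * 6, 8
frames = [[0.01 * (t + 1)] * feat for t in range(num_frames)]
weight = [[0.05] * feat for _ in range(dim)]
bias = [0.0] * dim
prefix_token = [0.1] * dim
z = encode(frames, weight, bias, prefix_token)
```

Because the prefix token attends to every frame, its output position summarizes the whole sequence, which is why only that first output is kept as the latent.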

Transformer Decoder

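The decoder, denoted $D$, predicts a motion sequence $\hat{p}_{1:T}$ from a latent representation $z_p$. The latent is fed to the transformer as key and value, while the query sequence is simply the positional encoding of $1:T$. The transformer outputs one representation per frame, which a linear layer then maps to pose space, i.e. $\hat{p}_{1:T} = D(z_p)$. A differentiable SMPL layer is then used to obtain the mesh vertex locations $\hat{v}_{1:T}$.

Losses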
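The decoder's query/key/value arrangement can be illustrated with a heavily simplified sketch. It is not the repository's decoder: learned attention projections and feed-forward sublayers are omitted, and the output weights are hypothetical placeholders. The point it shows is that with a single key/value token, every frame attends fully to the latent, and the frames differ only through their positional queries.

```python
import math

def linear(x, weight, bias):
    """y = W x + b for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def positional_encoding(pos, dim):
    """Standard sinusoidal positional encoding for one position."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def decode(z, num_frames, out_weight, out_bias):
    dim = len(z)
    frames = []
    for t in range(num_frames):
        query = positional_encoding(t, dim)          # query = position of frame t
        # Cross-attention over one key/value token: the softmax over a single
        # score is 1.0, so the attended value is the latent itself.
        attended = z
        hidden = [q + a for q, a in zip(query, attended)]   # residual connection
        frames.append(linear(hidden, out_weight, out_bias)) # map to pose space
    return frames

# Toy sizes: latent dimension 8, 5 frames, 24 x 6 = 144 pose features per frame.
z = [0.1] * 8
out_weight = [[0.1] * 8 for _ in range(24 * 6)]
out_bias = [0.0] * (24 * 6)
motion = decode(z, 5, out_weight, out_bias)
```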

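The auto-encoder is trained to represent motion by reconstructing the input sequence, with L2 losses on joint orientations, positions, and velocities:

$$\mathcal{L}_{recon} = \frac{1}{|p|T}\sum_{i=1}^{T}\lVert p_i - \hat{p}_i\rVert^2 + \frac{1}{|v|T}\sum_{i=1}^{T}\lVert v_i - \hat{v}_i\rVert^2 + \frac{1}{|p|(T-1)}\sum_{i=1}^{T-1}\lVert (p_{i+1} - p_i) - (\hat{p}_{i+1} - \hat{p}_i)\rVert^2$$

Here $p_i$ and $v_i$ are the orientations and mesh vertices at frame $i$, and hats mark the decoder's reconstruction.

First term: reconstruction loss on the sequence itself,

Second term: position loss (computed on the mesh vertices $v$),

Third term: velocity loss.

Given text-motion and image-motion pairs, $(p_{1:T}, t)$ and $(p_{1:T}, s)$ respectively,

the motion latent $z_p$ is tied to the CLIP text embedding and the CLIP image embedding using cosine distance:

$$\mathcal{L}_{text} = 1 - \cos(\mathrm{CLIP}_{text}(t), z_p), \qquad \mathcal{L}_{image} = 1 - \cos(\mathrm{CLIP}_{image}(s), z_p)$$

The text comes from the dataset labels and the images are rendered from the motion itself, so the overall loss is

$$\mathcal{L} = \mathcal{L}_{recon} + \lambda_{text}\mathcal{L}_{text} + \lambda_{image}\mathcal{L}_{image}$$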
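The cosine alignment terms and the way they combine with the reconstruction loss can be sketched as follows. This is a plain-Python illustration with toy 2-D vectors in place of real CLIP embeddings; the function names are hypothetical, and the default weights of 0.01 are the values the paper reports for both alignment terms.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(clip_embedding, z_motion):
    """Cosine distance between a CLIP embedding and the motion latent."""
    return 1.0 - cosine(clip_embedding, z_motion)

def total_loss(recon, z_motion, text_emb, image_emb,
               lambda_text=0.01, lambda_image=0.01):
    """Reconstruction loss plus weighted text and image alignment losses."""
    return (recon
            + lambda_text * alignment_loss(text_emb, z_motion)
            + lambda_image * alignment_loss(image_emb, z_motion))

# Toy example: the text embedding is aligned with the latent, the image one is not.
loss = total_loss(recon=0.5, z_motion=[1.0, 0.0],
                  text_emb=[2.0, 0.0], image_emb=[0.0, 1.0])
```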

Training details

We train a transformer auto-encoder with 8 layers for each encoder and decoder as described in Section 3. We align it with the CLIP-ViT-B/32 frozen model.

Both λ values are set to 0.01 throughout our experiments.

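Reading the code

The loss code has two parts which, matching the description above, are the reconstruction loss on the motion itself and the CLIP-related losses. We start with the reconstruction loss: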

loss func

Part 1: reconstruction loss

In this part, self.lambdas covers the losses listed in parameters["losses"], which is parsed from model_name and contains three entries:

rc

rcxyz

vel

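LOSSES in get_model.py lists every supported loss; only the three above are selected in the end, and velxyz is excluded.

This part therefore has three loss weights (lambdas) in total: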
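How a list of losses might be read out of a model name can be sketched like this. The function name, the returned keys, the model-name string, and the contents of `SUPPORTED_LOSSES` are all assumptions made for illustration; check the repository for the actual parser.

```python
# Assumed list of supported losses (illustrative; see get_model.py for the real one).
SUPPORTED_LOSSES = ["rc", "rcxyz", "vel", "velxyz"]

def parse_model_name(model_name):
    """Split '<type>_<architecture>_<loss>_<loss>...' into its parts."""
    model_type, architecture, *losses = model_name.split("_")
    unknown = [name for name in losses if name not in SUPPORTED_LOSSES]
    if unknown:
        raise ValueError(f"unsupported losses: {unknown}")
    return {"modeltype": model_type, "archiname": architecture, "losses": losses}

# Hypothetical model name selecting the three losses discussed here.
parameters = parse_model_name("motionclip_transformer_rc_rcxyz_vel")
```

With this naming scheme, leaving `velxyz` out of the model name is all it takes to exclude that loss.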

lambda_rc

lambda_rcxyz

lambda_vel

~~lambda_velxyz (excluded, not used)~~

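For each of these three losses, get_loss_function looks up the concrete loss function.

All three turn out to be MSE losses. The batch carries output, x_xyz and output_xyz; put simply, the three terms compare the motion sequence, the joint positions, and the velocity of the motion sequence.

Part 2: CLIP loss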
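The weighted sum of the three MSE terms can be sketched in plain Python. The batch keys follow the names mentioned above (`x` is assumed to be the input sequence); sequences are represented as lists of frames rather than tensors, and the lambda values are placeholders, not the repository's settings.

```python
def mse(a, b):
    """Mean squared error over two equally shaped lists of frames."""
    diffs = [(x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)

def velocity(frames):
    """Frame-to-frame differences of a sequence."""
    return [[n - c for c, n in zip(cur, nxt)] for cur, nxt in zip(frames, frames[1:])]

def reconstruction_loss(batch, lambdas):
    terms = {
        "rc": mse(batch["x"], batch["output"]),                      # motion sequence
        "rcxyz": mse(batch["x_xyz"], batch["output_xyz"]),           # positions
        "vel": mse(velocity(batch["x"]), velocity(batch["output"])), # velocities
    }
    total = sum(lambdas[name] * terms[name] for name in lambdas)
    return total, terms

# Toy batch: 3 frames with 2 features each (hypothetical values).
batch = {
    "x": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    "output": [[0.0, 0.0], [1.0, 1.0], [2.0, 3.0]],
    "x_xyz": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    "output_xyz": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
}
lambdas = {"rc": 1.0, "rcxyz": 1.0, "vel": 1.0}  # placeholder weights
total, terms = reconstruction_loss(batch, lambdas)
```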

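For the CLIP loss, the model first computes the CLIP features of the corresponding text or image and normalizes them to get features_norm. It then computes logits for every motion in the batch from the motion embedding, which has the same embedding size as features_norm; ground_truth is an index sequence of length batch_size. CrossEntropyLoss is then applied to these logits.

Three kinds of loss appear to have been tried here, and the cosine loss is the one used in the end, so the other two can be skipped.

This matches what the paper describes.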
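The cross-entropy variant can be sketched in plain Python to show what the logits and ground_truth look like. Averaging the motion-to-feature and feature-to-motion directions follows the usual CLIP convention and is an assumption here, as are the function names and the `logit_scale` placeholder.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]

def clip_ce_loss(motion_latents, clip_features, logit_scale=10.0):
    z = [normalize(v) for v in motion_latents]
    f = [normalize(v) for v in clip_features]
    # Entry [i][j]: scaled cosine similarity of motion i and feature j.
    logits_per_motion = [[logit_scale * sum(a * b for a, b in zip(zi, fj)) for fj in f]
                         for zi in z]
    logits_per_feature = [list(col) for col in zip(*logits_per_motion)]
    ground_truth = list(range(len(z)))  # motion i should match feature i
    loss_m = sum(cross_entropy(row, t)
                 for row, t in zip(logits_per_motion, ground_truth)) / len(z)
    loss_f = sum(cross_entropy(row, t)
                 for row, t in zip(logits_per_feature, ground_truth)) / len(z)
    return (loss_m + loss_f) / 2.0

# Toy batch of two pairs: correctly paired versus swapped features.
motions = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_ce_loss(motions, [[2.0, 0.0], [0.0, 3.0]])
loss_swapped = clip_ce_loss(motions, [[0.0, 3.0], [2.0, 0.0]])
```

The loss is small when each motion is most similar to its own text or image feature and large when the pairing is wrong, which is the contrastive behaviour this variant relies on.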