CLIP Series Study (5): PyramidCLIP | Yixin’s Blog

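Personal subjective score for this work: 6.5. Rationale: as a follow-up to CLIP, it resembles most of its peer works in that it innovates on task complexity. Its main contribution is to distinguish coarse-grained descriptions from fine-grained ones, but the angle of innovation is fairly narrow overall, and the soft-label mechanism it introduces is less convincing than the authors' later work (soft-labeled clip). The paper reads clearly and balances its emphasis well. The one shortcoming is that few later works have followed up on it.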

PyramidCLIP, 2022.05, Tencent Youtu Lab × Shanghai Jiao Tong University, NeurIPS 2022

https://arxiv.org/pdf/2204.14095

Our main contributions are summarized as follows:
(i) We propose PyramidCLIP for more accurate image-text alignment for vision-language model pre-training, which effectively constructs two input pyramids at both sides of the visual encoder and linguistic encoder, and then align the visual elements and linguistic elements via peer-level semantics alignment and cross-level relation alignment.
(ii) We soften the loss term of negative samples during the contrast process to ease the strict constraint, so as to alleviate the negative effect caused by local similarities.
(iii) Extensive experiments demonstrate the effectiveness of PyramidCLIP, which achieves SoTA on several downstream tasks.

In short, the work aligns images and text at multiple granularities, and replaces the hard labels used during alignment with soft labels.

The training code has not been released; only basic model code is available.

Technical details

During training, the image in each image-text pair is converted into two views: a local view and a global view.

The local view is produced by random cropping at different ratios, but I could not find the corresponding implementation in the released code.
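Since the released code does not include the cropping step, here is a minimal sketch of how the two views could be sampled. The function names (`random_crop_box`, `make_views`) and the area ranges (0.9 to 1.0 for the global view, 0.5 to 0.9 for the local view) are assumptions chosen for illustration, not values confirmed by the released code.

```python
import random


def random_crop_box(width, height, scale_range, rng):
    """Sample a crop box that covers a random fraction of the image area.

    The crop keeps the aspect ratio of the original image (a simplification).
    Returns (left, top, crop_width, crop_height) in pixels.
    """
    scale = rng.uniform(*scale_range)      # fraction of the image area to keep
    side = scale ** 0.5                    # per-axis shrink factor
    crop_w = max(1, round(width * side))
    crop_h = max(1, round(height * side))
    left = rng.randint(0, width - crop_w)  # randint is inclusive on both ends
    top = rng.randint(0, height - crop_h)
    return left, top, crop_w, crop_h


def make_views(width, height, rng,
               global_scale=(0.9, 1.0),    # assumed range, hypothetical
               local_scale=(0.5, 0.9)):    # assumed range, hypothetical
    """Return one crop box for the global view and one for the local view."""
    return {
        "global": random_crop_box(width, height, global_scale, rng),
        "local": random_crop_box(width, height, local_scale, rng),
    }


if __name__ == "__main__":
    print(make_views(640, 480, random.Random(0)))
```

The global box keeps almost the whole image, while the local box keeps a smaller region, which is the distinction the two views rely on.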

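First: the global view and the text summarization capture global information.

Second: the local view and the detailed text form another pair.

Third: the ROI features obtained from object detection, together with their categories.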

The first two both pass through the image encoder and the text encoder, after which the contrastive loss is computed.

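For the third, the image goes through a pretrained object detector and ROI feature extraction. The ROI features then pass through a linear embedding module, which is intended to match the output of f(1) so that the result can serve as the input to f2. The detected object labels are concatenated to form a new text.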
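The text side of this third level can be sketched in a few lines. The helper name `build_object_text`, the comma separator, and the de-duplication of repeated labels are all assumptions made for illustration; the post only states that the detected object texts are concatenated.

```python
def build_object_text(labels, separator=", "):
    """Join detected object labels into a single text string.

    Repeated labels are dropped (case-insensitively) while the original
    order is kept. The de-duplication is an assumption of this sketch.
    """
    seen = set()
    unique = []
    for label in labels:
        cleaned = label.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return separator.join(unique)


if __name__ == "__main__":
    # Hypothetical detector output for one image.
    print(build_object_text(["dog", "frisbee", "grass", "dog"]))
```

The resulting string would then be tokenized and fed to the text encoder like any other caption.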

Some cross losses are then also computed between the different pairs.

[Figure: PyramidCLIP network architecture]

soft label

[Figure: soft-label formula]

Here the hard 1/0 labels for matched/unmatched pairs are smoothed to some degree. The smoothing is controlled by a coefficient α = 0.2, and N is the mini-batch size.
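The smoothing described above can be illustrated with a small, dependency-free sketch. It assumes the common label-smoothing form in which the matched pair receives 1 − α and the remaining α is spread evenly over the N − 1 unmatched pairs; the exact formula in the paper may differ, and the function names are hypothetical.

```python
import math


def softened_targets(n, alpha=0.2):
    """Build an n x n target matrix: 1 - alpha on the diagonal (matched pairs),
    alpha / (n - 1) elsewhere (unmatched pairs). Each row sums to 1."""
    if n < 2:
        raise ValueError("need at least two samples in the mini-batch")
    off_diagonal = alpha / (n - 1)
    return [[1.0 - alpha if i == j else off_diagonal for j in range(n)]
            for i in range(n)]


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity logits."""
    row_max = max(row)
    log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
    return [x - log_sum_exp for x in row]


def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and soft target rows."""
    total = 0.0
    for logit_row, target_row in zip(logits, targets):
        log_probs = log_softmax(logit_row)
        total -= sum(t * lp for t, lp in zip(target_row, log_probs))
    return total / len(logits)


if __name__ == "__main__":
    # Hypothetical image-to-text similarity logits for a mini-batch of 3.
    logits = [[5.0, 1.0, 0.5], [0.8, 4.0, 1.2], [0.3, 0.9, 6.0]]
    print(soft_cross_entropy(logits, softened_targets(3, alpha=0.2)))
```

With α = 0 the targets reduce to the usual one-hot labels, so this covers the hard-label contrastive loss as a special case. A full contrastive objective would typically apply the same loss in both the image-to-text and text-to-image directions.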

Finally, with this smoothed loss design, the total loss consists of three parts: the two peer losses (global image vs. text summary, and local image vs. text) and the cross loss (between the global and local levels).

Author:Yixin Huang
URL:https://yixinhuang.cn/article/clip_pyramidclip
Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!
