Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Paper Review/Vision 2021. 8. 9. 11:35

1. Introduction

- Proposes a versatile convolution-free Transformer backbone network for pixel-level dense prediction tasks in computer vision

* Convolutional Neural Network (CNN)

- The dominant approach for almost every task in computer vision

- The receptive field grows as depth increases

* What PVT & ViT have in common

- Pure Transformer models without convolutional operations

* Vision Transformer (ViT)

- ViT [10] uses a convolution-free Transformer for image classification

- The output sequence of ViT has the same length as the input, i.e., it is single-scale (columnar structure)

- Splits the image into coarse image patches

- Because of limited resources, the output is coarse-grained and relatively low-resolution

ex) patch size of 16 or 32

i.e., a 16-stride or 32-stride output

- Hard to apply directly to dense prediction tasks that require high-resolution or multi-scale feature maps

* Pyramid Vision Transformer (PVT)

- A versatile backbone applicable to many downstream tasks, including pixel-level dense prediction and image-level prediction

(ex. object detection, semantic segmentation)

- Fine-grained image patches (4x4 pixels per patch) for high resolution

- A progressive shrinking pyramid makes it possible to generate multi-scale feature maps

- A spatial-reduction attention (SRA) layer reduces the computation/memory cost

- Together these cut resource consumption and let PVT learn multi-scale, high-resolution feature maps
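
To make the patch-size trade-off concrete, here is a small sketch that counts tokens and query-key pairs for full self-attention at different patch sizes. The 224x224 input size and the helper name `num_tokens` are assumptions chosen for illustration; they do not come from the post.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patches (tokens) an image is split into."""
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)


# 224x224 is an assumed, commonly used input size (not stated in the post).
H = W = 224
for name, patch in [("ViT, patch 32", 32),
                    ("ViT, patch 16", 16),
                    ("PVT stage 1, patch 4", 4)]:
    n = num_tokens(H, W, patch)
    # Full self-attention compares every token with every token: n * n pairs.
    print(f"{name}: {n} tokens, {n * n} attention pairs")
```

Going from 16x16 to 4x4 patches multiplies the token count by 16 and the number of attention pairs by 256, which is why fine-grained patches need something like SRA to stay affordable.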

* Advantages of PVT

- Always produces a global receptive field (by attention among all small patches)

(in a traditional CNN backbone, the receptive field only grows as depth increases)

- Compared with ViT, the pyramid structure lets it plug into many representative dense prediction pipelines

(ex. RetinaNet, Mask R-CNN)

- Combining PVT with a Transformer decoder designed for another task yields a convolution-free pipeline

(ex. PVT+DETR for object detection)

2. Related Work

-

-

-

3. Pyramid Vision Transformer (PVT)

- Goal: introduce a pyramid structure into the Transformer so that it can generate multi-scale feature maps for dense prediction tasks

Figure 2. Overall architecture of the proposed Pyramid Vision Transformer (PVT)

* Hyperparameters

* Overall Architecture

- Consists of 4 stages that generate feature maps at different scales

- All stages share a similar structure (a patch embedding layer followed by L_i Transformer encoder layers)

- Input image size at the first stage: H x W x 3

- Number of patches: H*W/(4^2), size of each patch: 4 x 4 x 3

- The flattened patches are fed to a linear projection to obtain embedded patches (size: H*W/(4^2) x C_1)

- A position embedding is added to the embedded patches

- They pass through a Transformer encoder with L_1 layers

- The output is reshaped into a feature map F_1 of size H/4 x W/4 x C_1

- The feature map of the previous stage is the input to the next stage

- Repeating the same process yields F_2, F_3, F_4 (strides of 8, 16, 32 pixels with respect to the input image; F_1 has stride 4)

- Feature pyramid: {F_1, F_2, F_3, F_4}
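
The stage-by-stage shrinking can be checked with a few lines of arithmetic. This is a minimal sketch, not the authors' code: the per-stage patch sizes (4, 2, 2, 2) follow from the strides 4, 8, 16, 32 listed above, while the channel widths (64, 128, 320, 512), the 224x224 input, and the function name `pvt_stage_shapes` are assumptions for illustration.

```python
def pvt_stage_shapes(height, width,
                     patch_sizes=(4, 2, 2, 2),
                     channels=(64, 128, 320, 512)):
    """Return the (H_i, W_i, C_i) shape of F_1..F_4 for a given input size.

    patch_sizes: per-stage P_i (derived from strides 4, 8, 16, 32).
    channels:    per-stage C_i (assumed values, for illustration only).
    """
    shapes = []
    h, w = height, width
    for p, c in zip(patch_sizes, channels):
        assert h % p == 0 and w % p == 0, "input must be divisible by P_i"
        h, w = h // p, w // p  # each stage shrinks H and W by a factor P_i
        shapes.append((h, w, c))
    return shapes


for i, (h, w, c) in enumerate(pvt_stage_shapes(224, 224), start=1):
    print(f"F_{i}: {h} x {w} x {c} (stride {224 // h})")
```

For a 224x224 input this gives 56x56, 28x28, 14x14 and 7x7 maps, the same strides a ResNet-style backbone feeds into detectors such as RetinaNet.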

* Feature Pyramid for Transformer

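- The patch embedding layers use a progressive shrinking strategy to control the scale of the feature maps

- At stage i, with patch size P_i,

the input feature map F_{i-1} of size H_{i-1} x W_{i-1} x C_{i-1}

is divided into H_{i-1}*W_{i-1}/(P_i^2) patches

- Each patch is flattened and linearly projected to a C_i-dimensional embedding, which gives

embedded patches of shape H_{i-1}/P_i x W_{i-1}/P_i x C_i

- The height and width are P_i times smaller than those of the input

- In this way, the size of the feature map can be adjusted flexibly at each stage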
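
The flatten-then-project step can be sketched with NumPy as below. This only illustrates the description above under stated assumptions: the function name `patch_embed`, the random weights, and the 32x32x3 toy input are hypothetical, and the position embedding and any normalization are left out.

```python
import numpy as np


def patch_embed(feature_map, patch_size, weight, bias):
    """Split an (H, W, C) map into P x P patches and project each to C_out dims."""
    h, w, c = feature_map.shape
    p = patch_size
    assert h % p == 0 and w % p == 0
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C)
    patches = feature_map.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    # Flatten every patch into one row of length P*P*C.
    patches = patches.reshape((h // p) * (w // p), p * p * c)
    embedded = patches @ weight + bias            # linear projection to C_out
    return embedded.reshape(h // p, w // p, -1)   # back to a 2D feature map


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))              # toy input, H x W x 3
P, C_out = 4, 64                                  # assumed P_1 and C_1
W_proj = rng.standard_normal((P * P * 3, C_out)) * 0.02
b_proj = np.zeros(C_out)

out = patch_embed(x, P, W_proj, b_proj)
print(out.shape)  # (8, 8, 64): height and width are P times smaller
```

Calling the same function again on `out` with a patch size of 2 would give the next, coarser level of the pyramid.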

* Transformer Encoder

- At stage i, the Transformer encoder has L_i encoder layers

- Each encoder layer consists of an attention layer and a feed-forward layer

- Proposes spatial-reduction attention (SRA) to replace the traditional multi-head attention (MHA) in the encoder

Figure 3. Multi-head attention vs. spatial-reduction attention (SRA)

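- Reduces the spatial size of Key and Value before the attention operation

- This greatly reduces the computation/memory cost

- SRA(Q, K, V) = Concat(head_0, ..., head_{N_i}) W^O, where head_j = Attention(Q W_j^Q, SR(K) W_j^K, SR(V) W_j^V)

- N_i: number of heads in the Transformer encoder

- d_head = C_i / N_i

- Spatial-reduction operation: SR(x) = Norm(Reshape(x, R_i) W^S)

- R_i: reduction ratio of the attention layer at stage i

- Reshape(x, R_i):

reshapes an input x of size (H_i*W_i) x C_i

into a sequence of size (H_i*W_i/R_i^2) x (R_i^2*C_i)

- Linear projection that reduces the dimension of the input sequence to C_i: W^S of size (R_i^2*C_i) x C_i

- Norm(): layer normalization

- Attention(): Attention(q, k, v) = Softmax(q k^T / sqrt(d_head)) v

- (R_i)^2 times lower computational/memory cost than MHA

- Can handle larger input feature maps/sequences with limited resources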
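
The following NumPy sketch shows one way the spatial reduction could work for a single head. It is an illustration under assumptions, not the paper's implementation: grouping tokens by R x R spatial neighbourhoods is one reading of Reshape(x, R_i), the layer normalization has no learnable scale or shift, and all function names, weights and sizes are made up for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis (no learnable parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def spatial_reduce(x, h, w, r, w_s):
    """SR(x) = Norm(Reshape(x, R) W^S): (h*w, c) -> (h*w / r^2, c)."""
    c = x.shape[1]
    # Assumption: each r x r neighbourhood of tokens becomes one row of length r*r*c.
    grid = x.reshape(h // r, r, w // r, r, c).transpose(0, 2, 1, 3, 4)
    grouped = grid.reshape((h // r) * (w // r), r * r * c)
    return layer_norm(grouped @ w_s)


def sra_single_head(x, h, w, r, w_q, w_k, w_v, w_s):
    """Single-head self-attention whose keys/values are spatially reduced."""
    q = x @ w_q                              # queries keep all h*w tokens
    reduced = spatial_reduce(x, h, w, r, w_s)
    k, v = reduced @ w_k, reduced @ w_v      # only h*w / r^2 keys and values
    scores = q @ k.T / np.sqrt(q.shape[1])   # scaled by sqrt(d_head)
    return softmax(scores) @ v, scores.shape


rng = np.random.default_rng(0)
H_i, W_i, C_i, R_i = 8, 8, 16, 4             # toy sizes, chosen for illustration
x = rng.standard_normal((H_i * W_i, C_i))
w_q, w_k, w_v = (rng.standard_normal((C_i, C_i)) * 0.1 for _ in range(3))
w_s = rng.standard_normal((R_i * R_i * C_i, C_i)) * 0.1

out, score_shape = sra_single_head(x, H_i, W_i, R_i, w_q, w_k, w_v, w_s)
print(out.shape, score_shape)  # (64, 16) (64, 4)
```

The score matrix is 64 x 4 instead of the 64 x 64 that plain attention would need, i.e. R_i^2 = 16 times smaller, while the output still has one row per input token.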

* Model Details

- Following the design rules of ResNet:

1) use a small number of output channels in shallow stages

2) concentrate the major computational resources in the intermediate stages

Table 1. Detailed settings of PVT series

- As the network gets deeper, the hidden dimension increases and the output resolution decreases

- The major computational resources are concentrated in Stage 3

4. Applied to Downstream tasks

-

-

-

5. Experiments

- Compared against the representative CNN backbones ResNet [15] and ResNeXt [56]

5.1. Image Classification

- Experiments on the ImageNet dataset

- Same data augmentation as DeiT

- Uses label-smoothing regularization

- With a similar number of parameters and computation budget, PVT models outperform traditional CNN backbones

- With similar or lower complexity, PVT models perform on par with Transformer-based models (ViT, DeiT)

- This is expected: the pyramid structure mainly benefits dense prediction tasks and adds little for image classification

5.2. Object Detection

- Evaluated on the challenging COCO benchmark

- 2 standard detectors: RetinaNet & Mask R-CNN

- Backbones are initialized with weights pre-trained on ImageNet

- Input images are resized

- For object detection with RetinaNet, PVT variants achieve strong results at a similar number of parameters

- For object detection, PVT is an excellent alternative to CNN backbones

- For instance segmentation with Mask R-CNN, PVT also performs well

5.3. Semantic Segmentation

- ADE20K: 150 fine-grained semantic categories

- Semantic FPN

- Weights pre-trained on ImageNet

- PVT consistently outperforms ResNet and ResNeXt with a similar number of parameters

- Thanks to its global attention mechanism, PVT extracts better features than CNN backbones

5.4. Pure Transformer Dense Prediction

-

-

-

-

-

-

5.5. Ablation Study

- Ablation studies are run on the ImageNet and COCO datasets

* Pyramid Structure

- Important when applying a Transformer to dense prediction tasks

- The earlier ViT is a columnar framework, so

(1) with coarse image patches: the output feature map has low resolution and detection performance is poor

(2) with fine-grained image patches: it runs out of GPU memory

- This paper uses a progressive shrinking pyramid instead

- shallow stages process high-resolution feature maps

- deep stages process low-resolution feature maps

- The proposed approach performs better

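* Deeper vs. Wider

- Deep model: PVT-Medium

- Wide model: the hidden dimensions of PVT-Small multiplied by a scale factor of 1.4

- The two models have an equivalent number of parameters

- The deep model consistently outperforms the wide model

- So when designing PVT, going deeper is the better choice!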

* Pre-trained Weights

- Most dense prediction models rely on a backbone whose weights are pre-trained on ImageNet

- In the first graph, the red line is with pre-trained weights and the blue line is without

- PVT-based models also converge faster and reach better performance with pre-trained weights

- In the graph below, the PVT-based model (red line) converges faster than the ResNet-based model (green line)

* Computation Cost

- As the input scale increases, the growth rate of GFLOPs is ViT > PVT > ResNet

- PVT is therefore better suited to medium-resolution inputs

- Reducing the input scale gives higher speed while still keeping good performance

6. Conclusions and Future Work

- PVT is a pure Transformer backbone for dense prediction tasks

- A progressive shrinking pyramid and a spatial-reduction attention layer are developed to obtain multi-scale feature maps under limited computational/memory resources

- Extensive experiments show that PVT is stronger than well-designed CNN backbones on object detection and semantic segmentation

This paper:

Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., & Shao, L. (2021). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. http://arxiv.org/abs/2102.12122

[10] : Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. Int. Conf. Learn. Representations, 2021.

[15] : Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.

[56] : Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
