Innovative Transformer-iN-Transformer Network Enhances Vision Tasks
Chapter 1: Introduction to Transformer-iN-Transformer
A recent study conducted by researchers at Huawei, ISCAS, and UCAS introduces the Transformer-iN-Transformer (TNT) network architecture. This innovative model demonstrates superior performance compared to traditional vision transformers, particularly in preserving local information and enhancing visual recognition capabilities.
Transformers emerged in 2017 and quickly established themselves as the leading architecture for natural language processing (NLP) thanks to their computational efficiency and scalability. Their potential is now also being tapped for computer vision (CV) applications such as image recognition, object detection, and image processing.
Section 1.1: Limitations of Existing Visual Transformers
Current visual transformers typically process an input image as a sequence of patches, neglecting the local structure inside each patch. This limitation adversely affects their performance on visual recognition tasks. The TNT architecture seeks to overcome this issue by maintaining both patch-level and pixel-level representations.
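To make the two representations concrete, the snippet below (not code from the paper) splits a toy image into 16x16 patches and further splits each patch into 4x4 sub-patches; the sizes are assumptions chosen to mirror typical TNT settings.

```python
import torch

# Hypothetical sizes for illustration: a 224x224 RGB image split into
# 16x16 patches, and each patch further split into 4x4 sub-patches.
img = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patch_size, pixel_size = 16, 4

# Patch-level view: 196 patches of 3*16*16 = 768 raw values each.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768])

# Pixel-level view: each patch becomes a sequence of 16 sub-patches
# of 3*4*4 = 48 raw values, preserving the structure inside the patch.
pixels = patches.reshape(1, 196, 3, patch_size, patch_size)
pixels = pixels.unfold(3, pixel_size, pixel_size).unfold(4, pixel_size, pixel_size)
pixels = pixels.permute(0, 1, 3, 4, 2, 5, 6).reshape(1, 196, -1, 3 * pixel_size * pixel_size)
print(pixels.shape)  # torch.Size([1, 196, 16, 48])
```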
Subsection 1.1.1: CNNs vs. Transformer Models
While convolutional neural networks (CNNs) still dominate CV, transformer-based models have shown impressive results on visual tasks without relying on image-specific inductive biases. A key development in applying transformers to image recognition is the Vision Transformer (ViT), which splits an image into a sequence of patches and converts each patch into an embedding. Although ViT can process images effectively with minimal modifications, it still overlooks the local structure inside each patch.
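For intuition, here is a minimal PyTorch sketch of ViT-style patch embedding. It is a generic re-implementation rather than any official code, and the image size, patch size, and embedding width are illustrative defaults.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal ViT-style patch embedding: split an image into non-overlapping
    patches and project each flattened patch to a fixed-width embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        b = x.shape[0]
        x = self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(b, -1, -1)        # prepend a learnable class token
        return torch.cat([cls, x], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```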
Section 1.2: The Architecture of TNT
Similar to the ViT method that inspired it, TNT divides an image into sequences of patches. However, it enhances this process by reshaping each patch into a (super) pixel sequence. By applying linear transformations to both patches and pixels, TNT generates embeddings that are subsequently processed through a series of TNT blocks for representation learning. Each TNT block features an outer transformer block that captures global relationships among patch embeddings and an inner transformer block that focuses on extracting local structural information from pixel embeddings. This design allows TNT to effectively capture spatial details by projecting pixel embeddings into the patch embedding space. Ultimately, classification is performed using a class token processed through a Multi-Layer Perceptron (MLP) head.
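The sketch below captures this inner/outer structure in simplified PyTorch, using standard transformer encoder layers rather than the paper's exact blocks; the dimensions loosely follow the TNT-S configuration, but all names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TNTBlock(nn.Module):
    """Simplified sketch of one TNT block (not the authors' implementation):
    an inner transformer layer models pixel embeddings within each patch,
    their flattened output is projected and added to the patch embeddings,
    and an outer transformer layer then models relations across patches."""
    def __init__(self, patch_dim=384, pixel_dim=24, pixels_per_patch=16, num_heads=6):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(pixel_dim, nhead=4,
                                                dim_feedforward=4 * pixel_dim,
                                                batch_first=True)
        self.proj = nn.Linear(pixels_per_patch * pixel_dim, patch_dim)
        self.outer = nn.TransformerEncoderLayer(patch_dim, nhead=num_heads,
                                                dim_feedforward=4 * patch_dim,
                                                batch_first=True)

    def forward(self, pixel_emb, patch_emb):
        # pixel_emb: (B * num_patches, pixels_per_patch, pixel_dim)
        # patch_emb: (B, num_patches + 1, patch_dim), including the class token
        pixel_emb = self.inner(pixel_emb)                         # local structure within patches
        b, n = patch_emb.shape[0], patch_emb.shape[1] - 1
        local = self.proj(pixel_emb.reshape(b, n, -1))            # fold pixel info into patch space
        patch_emb = torch.cat([patch_emb[:, :1],
                               patch_emb[:, 1:] + local], dim=1)  # class token left unchanged
        patch_emb = self.outer(patch_emb)                         # global relations across patches
        return pixel_emb, patch_emb

pixel_emb = torch.randn(2 * 196, 16, 24)
patch_emb = torch.randn(2, 197, 384)
pixel_emb, patch_emb = TNTBlock()(pixel_emb, patch_emb)
print(patch_emb.shape)  # torch.Size([2, 197, 384])
```

Stacking several such blocks and feeding the final class token to an MLP head yields the classification pipeline described above.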
Chapter 2: Experimental Evaluation of TNT
The researchers undertook thorough evaluations on various visual benchmarks to assess how effectively TNT models both global and local structural information within images, as well as its feature representation learning capabilities. They utilized the ImageNet ILSVRC 2012 dataset for image classification tasks and also conducted transfer learning experiments to analyze TNT's generalization performance. The TNT architecture was benchmarked against other recent transformer-based models like ViT and DeiT, alongside CNN models such as ResNet, RegNet, and EfficientNet.
The first video, "Transformers in Vision: From Zero to Hero," delves into the evolution and implementation of transformer models in visual tasks, highlighting their significance in modern AI development.
The second video, "Stanford CS25: V1 I Transformers in Vision: Tackling Problems in Computer Vision," explores the challenges faced in computer vision and how transformers are addressing these issues.
Results show that TNT-S achieved a top-1 accuracy of 81.3 percent, surpassing the DeiT baseline by 1.5 percentage points. TNT also outperformed several other visual transformer models and notable CNNs such as ResNet and RegNet, although it did not reach the performance level of EfficientNet. These findings suggest that while the TNT architecture leads among visual transformers, it has yet to match the current state-of-the-art CNN methods.
For further insights, the paper titled "Transformer in Transformer" can be found on arXiv.
Author: Hecate He | Editor: Michael Sarazen
Stay updated on the latest news and breakthroughs in AI by subscribing to our popular newsletter, Synced Global AI Weekly.