LW-DETR (Light-Weight Detection Transformer)
LW-DETR, a Light-Weight DEtection TRansformer, is a real-time object detection model designed to outperform both conventional convolutional (YOLO-style) detectors and earlier transformer-based (DETR) methods in the speed/accuracy trade-off. It was introduced in the paper LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection by Chen et al. and first released in this repository.
Disclaimer: This model was originally contributed by stevenbucaille in 🤗 Transformers.
Model description
LW-DETR is an end-to-end object detection model that uses a Vision Transformer (ViT) backbone as its encoder, a simple convolutional projector, and a shallow DETR decoder. The core philosophy is to leverage the power of transformers while implementing several efficiency-focused techniques to achieve real-time performance.
Key Architectural Details:
- ViT Encoder: Uses a plain ViT architecture. To reduce the quadratic cost of global self-attention, it interleaves window attention with global attention across layers (see the sketch after this list).
- Window-Major Organization: The feature map is kept in window-major order for attention computation, which removes the costly memory permutations otherwise needed when switching between global and window attention, lowering inference latency.
- Feature Aggregation: It aggregates features from multiple levels (intermediate and final layers) of the ViT encoder to create richer input for the decoder.
- Projector: A C2f block (from YOLOv8) connects the encoder and decoder. For the larger versions (large/xlarge), it outputs two-scale features ($1/8$ and $1/32$ of the input resolution) to the decoder.
- Shallow DETR Decoder: It uses a computationally efficient 3-layer transformer decoder (instead of the standard 6 layers), incorporating deformable cross-attention for faster convergence and lower latency.
- Object Queries: It uses a mixed-query selection scheme to form the object queries from both learnable content queries and generated spatial queries (based on top-K features from the Projector).
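The interleaved attention scheme and the window-major layout from the first two bullets can be illustrated with a small PyTorch sketch. Everything below (the even/odd interleaving pattern, the window size, and the bare attention layers without MLPs or normalization) is an illustrative assumption, not the LW-DETR implementation:

```python
import torch
import torch.nn as nn

class InterleavedAttentionSketch(nn.Module):
    """Toy illustration of interleaved window/global self-attention over a
    window-major token layout. Not the actual LW-DETR encoder."""

    def __init__(self, dim=256, num_heads=8, num_layers=6, window_size=4):
        super().__init__()
        self.window_size = window_size
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x, h, w):
        # x: (batch, h * w, dim), stored window-major, i.e. the tokens of each
        # window_size x window_size window are contiguous along the sequence.
        b, n, c = x.shape
        ws = self.window_size
        num_windows = (h // ws) * (w // ws)
        for i, attn in enumerate(self.layers):
            if i % 2 == 0:
                # Window attention: because the layout is window-major, folding
                # windows into the batch dimension is a free reshape, with no
                # permutation between the global and window layouts.
                xw = x.reshape(b * num_windows, ws * ws, c)
                out, _ = attn(xw, xw, xw)
                x = x + out.reshape(b, n, c)
            else:
                # Global attention over all h * w tokens.
                out, _ = attn(x, x, x)
                x = x + out
        return x

# Toy usage: a 16x16 feature map split into 4x4 windows.
feats = torch.randn(2, 16 * 16, 256)
encoded = InterleavedAttentionSketch()(feats, h=16, w=16)
print(encoded.shape)  # torch.Size([2, 256, 256])
```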
Training Details:
- IoU-aware Classification Loss (IA-BCE loss): Supervises the classification branch with a soft target $t = s^{\alpha} u^{1-\alpha}$, where $s$ is the predicted classification score and $u$ is the IoU between the predicted box and its matched ground-truth box (see the sketch after this list).
- Group DETR: Uses a Group DETR strategy (13 parallel weight-sharing decoders) for faster training convergence without affecting inference speed.
- Pretraining: Uses a two-stage pretraining strategy on Objects365: first, the ViT encoder is pretrained with a Masked Image Modeling (MIM) method (CAEv2); then the encoder is retrained and the projector and decoder are trained with detection supervision. This pretraining provides a significant performance boost (on average $\approx 5.5$ mAP).
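As a concrete illustration of the IA-BCE target above, the sketch below builds $t = s^{\alpha} u^{1-\alpha}$ for a few matched queries and feeds it to a binary cross-entropy loss. The value of $\alpha$, the detached score, and the helper name are illustrative assumptions rather than the reference implementation:

```python
import torch
import torch.nn.functional as F

def ia_bce_target(scores, ious, alpha=0.25):
    """Soft target t = s^alpha * u^(1-alpha), where s is the predicted
    classification score and u is the IoU with the matched ground-truth box.
    alpha=0.25 is an assumed value for illustration."""
    return scores.detach().pow(alpha) * ious.pow(1.0 - alpha)

# Toy example: class logits and IoUs for three matched (query, ground-truth) pairs.
logits = torch.tensor([2.0, 0.5, -1.0])
ious = torch.tensor([0.9, 0.6, 0.3])

targets = ia_bce_target(logits.sigmoid(), ious)
loss = F.binary_cross_entropy_with_logits(logits, targets)
print(targets, loss)
```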
How to use
You can use the raw model for object detection. See the model hub for all available LW-DETR models.
Here is how to use this model:
```python
from transformers import AutoImageProcessor, LwDetrForObjectDetection
import torch
from PIL import Image
import requests

# load an example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("stevenbucaille/lwdetr_large_60e_coco")
model = LwDetrForObjectDetection.from_pretrained("stevenbucaille/lwdetr_large_60e_coco")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# convert outputs (bounding boxes and class logits) to COCO API format
# and only keep detections with score > 0.7
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```
This should output:
```
Detected cat with confidence 0.95 at location [345.88, 24.42, 639.33, 372.3]
Detected cat with confidence 0.943 at location [8.09, 54.36, 317.32, 474.24]
Detected remote with confidence 0.929 at location [40.54, 73.22, 175.62, 117.03]
Detected couch with confidence 0.866 at location [0.25, 1.86, 639.57, 474.72]
```
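The boxes returned by `post_process_object_detection` are in `(xmin, ymin, xmax, ymax)` pixel coordinates, so they can be drawn directly on the original image. Here is a minimal visualization sketch using PIL (the output filename is arbitrary):

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    xmin, ymin, xmax, ymax = box.tolist()
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=3)
    draw.text((xmin, ymin), f"{model.config.id2label[label.item()]}: {score.item():.2f}", fill="red")
image.save("detections.png")
```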
Currently, both the image processor and the model are available for PyTorch only.
Training data
The LW-DETR models are trained/finetuned on the following datasets:
- Pretraining: Primarily conducted on Objects365, a large-scale, high-quality dataset for object detection.
- Finetuning: Final training is performed on the standard COCO 2017 object detection dataset.
BibTeX entry and citation info
```bibtex
@article{chen2024lw,
  title={LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection},
  author={Chen, Qiang and Su, Xiangbo and Zhang, Xinyu and Wang, Jian and Chen, Jiahui and Shen, Yunpeng and Han, Chuchu and Chen, Ziliang and Xu, Weixiang and Li, Fanrong and others},
  journal={arXiv preprint arXiv:2406.03459},
  year={2024}
}
```