Description: |
Part-level features obtained by uniform partitioning have attracted much attention in person re-identification. However, standard uniform part partitions may lead to within-part inconsistency across different samples, as shown in Figure 1. Attention mechanisms (e.g., refined part pooling) have been proposed to refine the part division and improve consistency. Unfortunately, such mechanisms adopt single-headed convolutional structures and fail to fuse fine-grained part information. Moreover, convolution-based schemes can maintain local positional information but cannot effectively preserve the relative positions between parts. This paper proposes a new CNN-Transformer hybrid architecture called Person Retrieval with Conv-Transformer (PRCT). We integrate multi-head self-attention and positional embedding modules, which are the core ingredients of the non-convolutional Transformer, with a CNN-based part-feature extractor to maintain more precise within-part consistency in feature aggregation. With PRCT, we can effectively eliminate part misalignments when matching different samples. We conduct extensive evaluations on the MSMT17, DukeMTMC-ReID, and Market-1501 datasets and obtain state-of-the-art performance. |
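
To make the idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation, of the two ingredients it names: uniform horizontal part pooling over a CNN feature map, followed by multi-head self-attention over the part features with an added positional embedding. All names (`uniform_part_pool`, `multi_head_self_attention`), the tensor sizes, and the randomly initialised weights are assumptions chosen for illustration; a real model would learn these weights and use a trained CNN backbone.

```python
import numpy as np

def uniform_part_pool(feature_map, num_parts):
    """Average-pool a (C, H, W) feature map into horizontal stripes -> (num_parts, C)."""
    stripes = np.array_split(feature_map, num_parts, axis=1)  # split along height
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over P part tokens of width D."""
    p, d = tokens.shape
    head_dim = d // num_heads

    def split(x):  # (P, D) -> (heads, P, head_dim)
        return x.reshape(p, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, P, P)
    mixed = (attn @ v).transpose(1, 0, 2).reshape(p, d)           # merge heads
    return mixed @ w_o, attn

# Hypothetical sizes: 64 channels, 24x8 feature map, 6 parts, 4 heads.
rng = np.random.default_rng(0)
C, H, W, P, HEADS = 64, 24, 8, 6, 4

feature_map = rng.standard_normal((C, H, W))    # stand-in for a CNN backbone output
parts = uniform_part_pool(feature_map, P)       # (P, C) part-level features
pos_embed = 0.02 * rng.standard_normal((P, C))  # stand-in for a learned positional embedding

# Random projection matrices; in practice these are learned parameters.
w_q, w_k, w_v, w_o = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(4))

refined, attn = multi_head_self_attention(parts + pos_embed, w_q, w_k, w_v, w_o, HEADS)
print(refined.shape, attn.shape)
```

Each head produces its own P x P attention map, so different heads can relate parts in different ways, while the positional embedding gives every part token an index-dependent offset that encodes where the stripe sits in the body.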