Description: |
Positional Encoding (PE) plays a vital role in a Transformer's ability to capture the order of sequential information, allowing it to overcome the permutation-equivariance property of self-attention. Recent state-of-the-art Transformer-based scene text recognition methods have leveraged a 2D form of PE with fixed sinusoidal frequencies, known as 2SPE, to better encode the 2D spatial dependencies of characters in a scene text image. These 2SPE-based Transformer frameworks have outperformed Recurrent Neural Network (RNN)-based methods, particularly at recognizing text of arbitrary shapes; however, their fixed frequencies are not tailored to the type of data and classification task at hand. In this paper, we extend a recent Learnable Sinusoidal frequencies PE (LSPE) from 1D to 2D, hereafter referred to as 2LSPE, and study how to adaptively choose the sinusoidal frequencies from the input training data. Moreover, we show how to apply the proposed Transformer architecture to scene text recognition. We compare our method against 11 state-of-the-art methods and show that it outperforms them in over 50% of the standard tests while never falling below the second-best performer, and that it outperforms all other methods on the irregular-text datasets (i.e., text in neither a horizontal nor a vertical layout). Experimental results demonstrate that the proposed method achieves higher word recognition accuracy (WRA) than two recent Transformer-based methods and eleven state-of-the-art RNN-based techniques on four challenging irregular-text recognition datasets, while maintaining the highest WRA values on the regular-text datasets.
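
To make the encoding concrete, below is a minimal sketch of a 2D sinusoidal PE whose frequencies are trainable parameters, in the spirit of 2LSPE. It is an illustration under stated assumptions, not the paper's exact formulation: the frequencies are initialized to the standard fixed sinusoidal schedule (as in 2SPE), half of the channels are assigned to each spatial axis, and the class name LearnableSinusoidalPE2D and its arguments are hypothetical.

```python
import torch
import torch.nn as nn

class LearnableSinusoidalPE2D(nn.Module):
    """Sketch of a 2D positional encoding with learnable sinusoidal
    frequencies (2LSPE-style). The channel split and the frequency
    initialization are assumptions, not the paper's exact recipe."""

    def __init__(self, d_model: int, base: float = 10000.0):
        super().__init__()
        assert d_model % 4 == 0, "d_model must be divisible by 4"
        d_axis = d_model // 2  # channels devoted to each spatial axis
        # Start from the fixed 2SPE frequency schedule; training then
        # adapts these values to the data instead of keeping them fixed.
        freqs = 1.0 / base ** (torch.arange(0, d_axis, 2).float() / d_axis)
        self.freq_h = nn.Parameter(freqs.clone())  # height-axis frequencies
        self.freq_w = nn.Parameter(freqs.clone())  # width-axis frequencies

    def forward(self, h: int, w: int) -> torch.Tensor:
        dev = self.freq_h.device
        ys = torch.arange(h, device=dev).float()[:, None] * self.freq_h  # (h, d/4)
        xs = torch.arange(w, device=dev).float()[:, None] * self.freq_w  # (w, d/4)
        pe_h = torch.cat([ys.sin(), ys.cos()], dim=-1)  # (h, d/2)
        pe_w = torch.cat([xs.sin(), xs.cos()], dim=-1)  # (w, d/2)
        # Broadcast each axis encoding over the other axis, then concatenate
        # along channels: one d_model-dim vector per 2D feature-map position.
        return torch.cat(
            [pe_h[:, None, :].expand(h, w, -1),
             pe_w[None, :, :].expand(h, w, -1)],
            dim=-1,
        )  # (h, w, d_model)
```

In this sketch the resulting tensor would be added to the backbone's 2D feature map before the Transformer encoder, in the same place a fixed 2SPE would sit; because the frequency vectors are nn.Parameters, backpropagation through the sin/cos terms adapts them to the training data.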