Improvement of Tradition Dance Classification Process Using Video Vision Transformer based on Tubelet Embedding.

Author: Mulyanto, Edy; Yuniarno, Eko Mulyanto; Putra, Oddy Virgantara; Hafidz, Isa; Priyadi, Ardyono; Purnomo, Mauridhi H.
Subject:
Source: International Journal of Intelligent Engineering & Systems; 2024, Vol. 17 Issue 4, p530-545, 16p
Abstract: Image processing has extensively addressed object detection, classification, clustering, and segmentation challenges. At the same time, the use of computers with complex video datasets has spurred various strategies for classifying videos automatically, particularly for detecting traditional dances. This research proposes an advancement in classifying traditional dances by implementing a Video Vision Transformer (ViViT) that relies on tubelet embedding. The authors used IDEEH-10, a dataset of videos showcasing traditional dances, and the ViViT artificial neural network model for video classification. The video representation is generated by projecting spatiotemporal tokens onto the transformer layer. An embedding strategy is then used to improve the classification accuracy for traditional dance videos. The proposed concept treats a video as a sequence of tubelets mapped into tubelet embeddings. Tubelet management adds a tubelet attention (TA) layer, a cross-attention (CA) layer, and management of tubelet duration and scale. In the test results, the proposed approach classifies traditional dance videos better than the LSTM, GRU, and RNN methods, with or without data balancing. Experiments with 5 folds showed a loss between 0.003 and 0.011, with an average loss of 0.0058. The experiments also produced an accuracy between 98.68 and 100 percent, for an average accuracy of 99.216 percent. This result is the best among the compared methods. ViViT with tubelet embedding achieves high accuracy with low loss, so it can be used for dance video classification. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
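
Note on tubelet embedding: the abstract describes treating a video as a sequence of tubelets mapped into tubelet embeddings. The sketch below illustrates only that general idea and is not the authors' implementation. It assumes a video stored as a NumPy array of shape (frames, height, width, channels), non-overlapping tubelets, and a random projection matrix standing in for weights that would normally be learned. The function names `extract_tubelets` and `embed_tubelets`, the tubelet size, and the embedding dimension are hypothetical choices made for illustration. The TA and CA layers mentioned in the abstract are not shown.

```python
import numpy as np


def extract_tubelets(video, tubelet=(2, 4, 4)):
    """Cut a video into non-overlapping tubelets and flatten each one.

    video:   array of shape (T, H, W, C)
    tubelet: (t, h, w) extent of one tubelet in frames, rows, columns
    returns: array of shape (num_tubelets, t * h * w * C)
    """
    T, H, W, C = video.shape
    t, h, w = tubelet
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")
    nt, nh, nw = T // t, H // h, W // w
    # Split every axis into (number of tubelets, size within a tubelet).
    x = video.reshape(nt, t, nh, h, nw, w, C)
    # Move the tubelet indices to the front and the within-tubelet axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per tubelet, holding all of its pixel values.
    return x.reshape(nt * nh * nw, t * h * w * C)


def embed_tubelets(tokens, dim=16, seed=0):
    """Project flattened tubelets to `dim` features.

    Assumption: the projection is random here; in a trained model it is learned.
    """
    rng = np.random.default_rng(seed)
    projection = rng.normal(0.0, 0.02, size=(tokens.shape[1], dim))
    return tokens @ projection


# Usage: a toy clip of 4 frames, 8 x 8 pixels, 3 channels.
video = np.arange(4 * 8 * 8 * 3, dtype=float).reshape(4, 8, 8, 3)
tokens = extract_tubelets(video, tubelet=(2, 4, 4))
embeddings = embed_tubelets(tokens, dim=16)
print(tokens.shape, embeddings.shape)
```

With these toy sizes the clip yields 2 x 2 x 2 = 8 tubelets, each covering 2 frames of a 4 x 4 pixel patch. The resulting token sequence is what a transformer encoder would consume, usually after positional information is added.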