Description: |
A Hybrid Multidimensional Deep Convolutional Neural Network (HMDCNN) topology is proposed for multimodal recognition of speech, faces, lips, and human gesture behavior. Here, hybridization means the joint use of 2D and 3D convolutional neural networks within a single multimodal architecture. The research aims to improve the understanding of complex dynamic scenes. The basic unit of the proposed hybrid system is a deep neural network topology that combines 2D and 3D convolutional neural networks (CNNs) for each modality with the proposed intermediate-level feature fusion subsystem. This feature-map fusion method is based on a scaling procedure that uses a specific combination of pooling operations with non-square kernels, which allows different types of modalities to be merged. |
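
The scaling idea behind the fusion step can be illustrated with a minimal sketch. The abstract does not give the pooling type, the kernel sizes, or how the scaled maps are combined, so everything below is an assumption made for illustration: max pooling, non-overlapping kernels, fusion by stacking the scaled maps as channels, and the hypothetical names `max_pool2d`, `fuse`, `audio_map`, and `visual_map` with made-up sizes. It is not the authors' implementation.

```python
def max_pool2d(fmap, kernel_h, kernel_w):
    """Non-overlapping max pooling over a 2D feature map given as a list of rows."""
    height, width = len(fmap), len(fmap[0])
    if kernel_h < 1 or kernel_w < 1 or height % kernel_h or width % kernel_w:
        raise ValueError("map size must be a positive multiple of the kernel size")
    return [
        [
            max(
                fmap[r + dr][c + dc]
                for dr in range(kernel_h)
                for dc in range(kernel_w)
            )
            for c in range(0, width, kernel_w)
        ]
        for r in range(0, height, kernel_h)
    ]


def fuse(fmap_a, fmap_b, target_h, target_w):
    """Scale both maps to (target_h, target_w) by pooling, then stack them as channels.

    Assumption: fusion is plain channel stacking; the source does not say.
    """
    fused = []
    for fmap in (fmap_a, fmap_b):
        kernel_h, rem_h = divmod(len(fmap), target_h)
        kernel_w, rem_w = divmod(len(fmap[0]), target_w)
        if rem_h or rem_w:
            raise ValueError("map size must be a multiple of the target size")
        # The kernel is non-square whenever the two scale factors differ.
        fused.append(max_pool2d(fmap, kernel_h, kernel_w))
    return fused  # layout: [channel][row][column]


# Hypothetical sizes: an 8x12 audio feature map and a 4x4 visual feature map.
audio_map = [[r * 12 + c for c in range(12)] for r in range(8)]
visual_map = [[1.0] * 4 for _ in range(4)]

fused = fuse(audio_map, visual_map, target_h=4, target_w=4)
print(len(fused), len(fused[0]), len(fused[0][0]))  # prints: 2 4 4
```

In this sketch the audio map is reduced with a 2x3 kernel while the visual map passes through a 1x1 kernel, so maps with different aspect ratios end up on a common 4x4 grid and can be merged.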