Popis: |
This paper presents the concept of a 4D multimodal speaker model (4D-MSM) for asynchronous remote speech diagnosis. Recording and archiving diagnostically significant articulation material remains a challenge in computer-aided speech diagnosis. We therefore propose a workflow for preparing and storing reliable, easily interpretable multimodal pronunciation data. According to our assumptions, data acquisition should be non-invasive and comfortable for both the patient and the therapist, should not interfere with the articulation process, and should provide the essential data at high quality. We developed and employed a dedicated device that captures a 15-channel, spatially distributed audio signal and a stable stereo-vision stream from two cameras focused on the lower part of the face. Our data-preprocessing framework covers digital beamforming of the multichannel audio signal, audio-video synchronization, and segmentation of words in the audio signal. We then use the stereo data to compute and refine the depth map and to generate point clouds. Simultaneously, we delineate the mouth region in the video frames using a dedicated semi-automated segmentation algorithm. The point clouds are then textured with the camera images, onto which the segmented mouth regions are superimposed. Finally, we add the audio track to constitute the 4D-MSM. In the paper, we present the concept and a detailed specification of the model and report experiments that justify the methodology. The proposed 4D-MSMs may be employed in remote speech diagnosis to objectify and archive diagnoses, conduct asynchronous consultations, and document therapy progress.