Description: |
Text-guided image processing has made tremendous progress in recent years. Most existing methods rely on visual-language pre-training models for text-guided image processing. However, when these models are applied to text-guided fine-grained attribute face image processing (e.g., editing a smiling face from showing teeth to a closed-mouth smile), they perform poorly because existing visual-language pre-training models learn only limited fine-grained semantic knowledge. To alleviate this problem, we propose GrainedCLIP, a novel visual-language pre-training model based on fine-grained facial attribute features. Building on GrainedCLIP, we further propose DiffusionGrainedCLIP, a new model for text-guided fine-grained attribute face image processing. Experimental results show that GrainedCLIP outperforms existing methods, achieving 12.61 R@1 in text-to-image retrieval and 12.17 R@1 in image-to-text retrieval on the FFHQ dataset. Furthermore, compared with state-of-the-art text-guided face image processing methods, DiffusionGrainedCLIP improves semantic consistency by 55.37% and face identity preservation by 49.38% on the FFHQ dataset.