Research and Exploration
Semantics-driven High-fidelity Ancient Ceramic Image Generation by Integrating Instance Segmentation and Diffusion Models

XIAO Yong, LI Busheng, YU Yanlin, ZHOU Wenbo, YANG Lihua, QIU Wangren, XIAO Zhuohao

(Jingdezhen Ceramic University, Jingdezhen 333403, Jiangxi, China)

Extended abstract:

[Background and purposes] Ancient ceramic image generation is an important foundation for cultural heritage digitization and virtual restoration, yet generic text-to-image models still show obvious limitations in vessel-structure control and decorative semantic expression. Ancient ceramics usually involve complex contours, fine-grained ornamentation and strong dynastic style dependence, while Chinese collection descriptions are often long and noisy, which increases the difficulty of controllable generation. This study aimed to develop a controllable image generation framework that can simultaneously preserve vessel geometry and satisfy semantic requirements in Chinese domain descriptions.

[Methods] We propose a high-fidelity ancient ceramic image generation framework that integrates instance segmentation with diffusion models. GroundingDINO and SAM2 are first employed to detect and segment ceramic objects. Instead of directly using raw binary masks or applying only ordinary morphological smoothing, the proposed HiRes-APD-style mask regularization strategy improves boundary stability through supersampling-based smoothing and area-consistent resampling, while a gray-scale soft mask is further constructed for boundary transition and blending. The regularized structural mask and text prompt are then jointly fed into a ControlNet-guided Stable Diffusion model to generate decorative patterns within the vessel region. In addition, domain fine-tuning is applied to improve the model's ability to represent vessel forms, glaze styles and decorative semantics in the ancient ceramic domain.
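
The mask regularization step described above can be illustrated with a minimal sketch. The abstract does not give the details of the HiRes-APD-style procedure, so everything below is an assumption made for illustration and not the authors' implementation: the function names (`upsample`, `box_blur`, `downsample`, `regularize_mask`), the supersampling factor `k`, the blur radius `r`, and the use of a plain box filter are all hypothetical. The sketch supersamples a binary mask, smooths it, averages it back to the original resolution to obtain a gray-scale soft mask, and then selects as many foreground pixels as the original mask had, so that the area is preserved.

```python
def upsample(mask, k):
    """Nearest-neighbour supersampling: each pixel becomes a k x k block."""
    return [[float(v) for v in row for _ in range(k)]
            for row in mask for _ in range(k)]


def box_blur(img, r):
    """Mean filter of radius r; pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    n = (2 * r + 1) ** 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    s += img[yy][xx]
            out[y][x] = s / n
    return out


def downsample(img, k):
    """Average each k x k block back to one pixel (gray-scale soft mask)."""
    h, w = len(img) // k, len(img[0]) // k
    return [[sum(img[y * k + dy][x * k + dx]
                 for dy in range(k) for dx in range(k)) / (k * k)
             for x in range(w)] for y in range(h)]


def regularize_mask(mask, k=4, r=2):
    """Return (hard, soft): an area-preserving binary mask and a soft mask.

    Assumed procedure (not taken from the paper): smooth at k-times
    resolution, then keep the pixels with the highest soft values until
    the original foreground area is reached.
    """
    target_area = sum(1 for row in mask for v in row if v)
    soft = downsample(box_blur(upsample(mask, k), r), k)
    ranked = sorted(((v, y, x) for y, row in enumerate(soft)
                     for x, v in enumerate(row)), key=lambda t: -t[0])
    hard = [[0] * len(mask[0]) for _ in mask]
    for _, y, x in ranked[:target_area]:
        hard[y][x] = 1
    return hard, soft
```

In this sketch the hard mask would play the role of the structural condition, while the soft mask would serve only for boundary blending. A real pipeline would more likely rely on an image library for resampling and filtering.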

[Results] On 30 validation samples, the proposed method achieves an average Mask IoU of 0.8371 and an average CLIP Score of 0.984±0.051. Comparative experiments indicate that ControlNet is the key module for achieving high structural consistency, while domain fine-tuning further improves text-image semantic alignment and decorative style expression. Ablation experiments further suggest that the upper bound of structural controllability is still mainly limited by the accuracy of the front-end structural condition.
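
For reference, the Mask IoU reported above is conventionally the intersection-over-union between two binary masks. The abstract does not say which masks are compared or how the average is taken, so the sketch below assumes a comparison between a mask extracted from the generated image and the input structural mask, averaged over samples. The names `mask_iou` and `mean_mask_iou` are hypothetical, and treating two empty masks as a perfect match is a convention chosen here.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    if len(pred) != len(target) or any(
            len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p, t = bool(p), bool(t)
            inter += p and t   # pixel is foreground in both masks
            union += p or t    # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_mask_iou(pairs):
    """Average IoU over an iterable of (pred, target) mask pairs."""
    scores = [mask_iou(p, t) for p, t in pairs]
    if not scores:
        raise ValueError("no mask pairs given")
    return sum(scores) / len(scores)
```

A value of 1.0 means the two masks coincide exactly, and a value of 0.0 means they do not overlap at all.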

[Conclusions] The proposed method effectively coordinates structural constraints and semantic guidance in ancient ceramic image generation. It may provide a controllable image generation approach for the digital display and creative design of ancient ceramics, as well as for related methodological research.

Key words: ancient ceramics; text-to-image generation; semantics-driven; instance segmentation; ControlNet; domain fine-tuning

