Arxiv URL: https://arxiv.org/abs/2304.06720
Authors: Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang
The paper proposes a method for text-to-image generation using rich-text prompts that support text attributes such as font family, size, color, and footnotes. Compared to plain text, the method gives more precise control over colors, styles, and object details in the synthesized image. The paper reports that the method outperforms strong baselines in quantitative evaluations.
Key Insights & Learnings:
- Plain text has limitations in accurately describing desired outputs, especially for specifying continuous quantities and creating detailed text prompts for complex scenes.
- Rich text editors offer unique solutions for incorporating conditional information separate from the text, such as font color, size, style, and footnotes.
- The proposed method decomposes a rich-text prompt into a short plain-text prompt and multiple region-specific prompts that include text attributes.
- The method achieves more precise color rendering, more distinct styles, and more accurate details than plain-text methods.
- The proposed method outperforms strong baselines in quantitative evaluations.
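
To make the decomposition idea from the list above concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation: the `Span` class, the `decompose` function, and the way attributes are turned into prompt strings are all hypothetical assumptions, and the sketch does not touch the diffusion model or cross-attention maps.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    """One run of rich text with optional attributes (hypothetical structure)."""
    text: str
    color: Optional[str] = None      # e.g. an RGB hex string
    style: Optional[str] = None      # assumed to come from the font family
    footnote: Optional[str] = None   # longer description attached to the span


def decompose(spans: List[Span]) -> Tuple[str, List[Tuple[str, str]]]:
    """Split rich text into a short plain prompt and region-specific prompts.

    Returns (plain_prompt, [(target_words, region_prompt), ...]).
    The prompt templates below are assumptions made for illustration.
    """
    # The plain prompt is just the text with attributes dropped
    # and whitespace normalized.
    plain_prompt = " ".join("".join(s.text for s in spans).split())

    region_prompts = []
    for s in spans:
        if not (s.color or s.style or s.footnote):
            continue  # unattributed text contributes no region prompt
        target = s.text.strip()
        # A footnote, if present, replaces the short span text.
        prompt = s.footnote.strip() if s.footnote else target
        if s.color:
            prompt = f"{prompt}, colored {s.color}"
        if s.style:
            prompt = f"{prompt}, in the style of {s.style}"
        region_prompts.append((target, prompt))
    return plain_prompt, region_prompts


if __name__ == "__main__":
    example = [
        Span("a "),
        Span("church", color="#FF5733"),
        Span(" beside a "),
        Span("garden", style="Claude Monet", footnote="a garden full of roses"),
    ]
    plain, regions = decompose(example)
    print(plain)
    for target, prompt in regions:
        print(f"{target!r} -> {prompt!r}")
```

In this sketch each region prompt stays tied to the words it came from, so a downstream step could, in principle, associate it with the image region for those words.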
Interfaces for generative AI are an area where a lot of development is happening. Papers like this one expand the horizon of what's possible. Do check out the demo for this paper!
Terms Mentioned: text-to-image generation, rich text, font family, font size, font color, footnote, RGB, diffusion process, cross-attention maps, image editing, view synthesis
Technologies / Libraries Mentioned: PyTorch