Language-based Colorization of Scene Sketches
Changqing Zou#1,2,  Haoran Mo#1 (joint first authors),  Chengying Gao*1,  Ruofei Du3,  Hongbo Fu4
Sun Yat-sen University1,  Huawei Noah's Ark Lab2
Google3,  City University of Hong Kong4

Accepted to SIGGRAPH Asia 2019

Given a scene sketch, our system automatically produces a colorized cartoon image by progressively coloring foreground object instances and the background following user-specified language-based instructions.
Abstract
Being natural, touchless, and fun-embracing, language-based inputs have been demonstrated to be effective for various tasks, from image generation to literacy education for children. This paper, for the first time, presents a language-based system for interactive colorization of scene sketches, based on semantic comprehension. The proposed system is built upon deep neural networks trained on a large-scale repository of scene sketches and cartoon-style color images with text descriptions. Given a scene sketch, our system allows users, via language-based instructions, to interactively localize and colorize specific foreground object instances to meet various colorization requirements in a progressive way. We demonstrate the effectiveness of our approach via comprehensive experimental results, including alternative studies, comparisons with state-of-the-art methods, and generalization user studies. Given the unique characteristics of language-based inputs, we envision a combination of our interface with a traditional scribble-based interface for a practical multimodal colorization system, benefiting various applications.
Methodology

A.   System Overview

Given an input scene sketch and text-based colorization instructions, our system supports two-mode interactive colorization using three models: an instance matching model, a foreground colorization model, and a background colorization model. Foreground objects and background regions can be colorized in either order, as illustrated by the sketch below.
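The following is a minimal, hypothetical Python sketch of how the three models could be chained for one interactive step at inference time. The callables instance_matcher, fg_colorizer, and bg_colorizer and the function colorize_step are illustrative placeholders under assumed interfaces, not the released API.

    def colorize_step(canvas, sketch, instruction, mode,
                      instance_matcher, fg_colorizer, bg_colorizer):
        """Apply one language-based instruction to the current canvas.

        mode is either "foreground" or "background"; the two modes can be
        used in any order.
        """
        if mode == "background":
            # Background mode: colorize the background regions directly.
            return bg_colorizer(canvas, sketch, instruction)
        # Foreground mode: first localize the referred object instance(s) ...
        mask = instance_matcher(sketch, instruction)   # soft binary mask in [0, 1]
        # ... then colorize the masked object(s) and composite them onto the canvas.
        colored = fg_colorizer(sketch, instruction, mask)
        return canvas * (1 - mask) + colored * mask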

B.1   Instance Matching Model

This network is trained in an end-to-end manner to predict a binary mask for the object instance(s) referred to by the instruction. At inference time, the predicted binary mask is fused with the instance segmentation results produced by Mask R-CNN to obtain the final matching result.
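As an illustration of the fusion step, the snippet below gives one plausible post-processing rule based on overlap-ratio thresholding: every Mask R-CNN instance that is sufficiently covered by the text-driven binary mask is kept. The function name and the threshold value are assumptions for illustration only, not the exact rule used in the paper.

    import numpy as np

    def fuse_with_instances(binary_mask, instance_masks, overlap_thresh=0.5):
        """binary_mask: (H, W) bool array from the matching network.
        instance_masks: list of (H, W) bool arrays from Mask R-CNN."""
        selected = np.zeros_like(binary_mask, dtype=bool)
        for inst in instance_masks:
            area = inst.sum()
            inter = np.logical_and(binary_mask, inst).sum()
            # Keep the instance if most of it is covered by the matching mask.
            if area > 0 and inter / area >= overlap_thresh:
                selected |= inst
        return selected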

B.2   Foreground Colorization Model

This network is able to colorize objects from different categories. The generator has a U-Net architecture built on MRU blocks, with skip connections between mirrored layers and an embedded RMI fusion module consisting of LSTM text encoders and multimodal LSTMs (mLSTMs). For conciseness, it is referred to as the FG-MRU-RMI network in the paper.
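To make the text-image fusion idea concrete, the PyTorch sketch below shows a simplified way of conditioning bottleneck features on an LSTM-encoded instruction. It replaces the MRU-based encoder with a generic feature map and the mLSTM with a plain sentence embedding broadcast spatially, so it is an assumption-laden approximation in the spirit of the RMI module (Liu et al., ICCV 2017), not the exact FG-MRU-RMI architecture; layer sizes are illustrative.

    import torch
    import torch.nn as nn

    class TextImageFusion(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=256, feat_ch=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fuse = nn.Conv2d(feat_ch + hidden_dim, feat_ch, kernel_size=1)

        def forward(self, feat_map, tokens):
            # feat_map: (B, C, H, W) bottleneck features of the sketch encoder.
            # tokens:   (B, T) word indices of the colorization instruction.
            _, (h_n, _) = self.text_lstm(self.embed(tokens))
            text = h_n[-1]                                    # (B, hidden_dim)
            text = text[:, :, None, None].expand(-1, -1, *feat_map.shape[2:])
            return self.fuse(torch.cat([feat_map, text], dim=1))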

B.3   Background Colorization Model

This network consists of an image encoder built on residual blocks (Res-Blocks), a fusion module, a two-branch decoder, and a Res-Block-based convolutional discriminator. It is referred to as the BG-RES-RMI-SEG network in the paper.
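The sketch below illustrates the two-branch decoder idea only: from a shared fused feature map, one branch predicts the colorized background image and the other predicts a segmentation map separating background regions from foreground objects. Channel counts, upsampling depth, and the number of segmentation classes are assumptions rather than the paper's exact configuration.

    import torch.nn as nn

    class TwoBranchDecoder(nn.Module):
        def __init__(self, in_ch=256, num_seg_classes=2):
            super().__init__()
            def up_block(cin, cout):
                return nn.Sequential(
                    nn.Upsample(scale_factor=2, mode='nearest'),
                    nn.Conv2d(cin, cout, 3, padding=1),
                    nn.ReLU(inplace=True))
            # Shared upsampling trunk.
            self.trunk = nn.Sequential(up_block(in_ch, 128), up_block(128, 64))
            # Branch 1: RGB colorization of the background.
            self.color_head = nn.Sequential(nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
            # Branch 2: segmentation of background vs. foreground regions.
            self.seg_head = nn.Conv2d(64, num_seg_classes, 3, padding=1)

        def forward(self, fused_feat):
            x = self.trunk(fused_feat)
            return self.color_head(x), self.seg_head(x)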

Datasets

We have built three large-scale datasets for language-based scene sketch colorization:
  1. MATCHING dataset: 38k groups of text-based instance segmentation data for scene sketches.
  2. FOREGROUND dataset: 4k groups of text-based sketch object colorization data.
  3. BACKGROUND dataset: 20k groups of text-based background colorization data for scene sketches.
Results

For more results, please refer to the main paper and the supplementary material.
Fast Forward Video

BibTeX
@article{zouSA2019sketchcolorization,
    title   = {Language-based Colorization of Scene Sketches},
    author  = {Zou, Changqing and Mo, Haoran and Gao, Chengying and Du, Ruofei and Fu, Hongbo},
    journal = {ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH Asia 2019)},
    year    = {2019},
    volume  = {38},
    number  = {6},
    pages   = {233:1--233:16}
}

Related Work
Changqing Zou, Qian Yu, Ruofei Du, Haoran Mo, Yi-Zhe Song, Tao Xiang, Chengying Gao, Baoquan Chen and Hao Zhang. SketchyScene: Richly-Annotated Scene Sketches. ECCV, 2018. [Paper] [Webpage] [Code]

Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu and Xiaodong Liu. Language-Based Image Editing with Recurrent Attentive Models. CVPR, 2018. [Paper] [Code]

Wengling Chen and James Hays. SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis. CVPR, 2018. [Paper] [Code]

Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu and Alan Yuille. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV, 2017. [Paper] [Code]