CARTOONDIFF: TRAINING-FREE CARTOON IMAGE GENERATION WITH DIFFUSION TRANSFORMER MODELS

Feihong He1, Gang Li2,3, Lingyu Si2, Leilei Yan1, Shimeng Hou4, Hongwei Dong2, Fanzhang Li1

1School of Computer Science and Technology, Soochow University
2Institute of Software, Chinese Academy of Sciences
3University of Chinese Academy of Sciences
4Northwestern Polytechnical University


Abstract

Image cartoonization has attracted significant interest in the field of image generation. However, most existing image cartoonization techniques require re-training models or using cartoon-style images as references. In this paper, we present CartoonDiff, a novel reference-free and training-free sampling approach that generates cartoonized images using diffusion transformer models. Specifically, we decompose the reverse process of diffusion models into high-frequency and low-frequency signal denoising stages. We then implement image cartoonization by normalizing the high-frequency signal of the noisy image at specific denoising steps. CartoonDiff requires no additional reference images, complex model designs, or tedious tuning of multiple parameters. Extensive experimental results demonstrate the effectiveness of CartoonDiff. The project page is available at https://cartoondiff.github.io/.
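The core operation described in the abstract, normalizing the high-frequency signal of an intermediate image, can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the FFT-based low/high split, the cutoff `radius`, and the rescaling rule are all assumptions made for the sketch.

```python
import numpy as np

def normalize_high_freq(x, radius=8):
    """Illustrative sketch: split a 2-D image into low/high-frequency
    components with a circular FFT mask, then rescale the high-frequency
    residual before recombining. `radius` (hypothetical) sets the
    low-frequency cutoff in frequency space."""
    H, W = x.shape
    F = np.fft.fftshift(np.fft.fft2(x))
    yy, xx = np.ogrid[:H, :W]
    dist = np.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    low_mask = dist <= radius
    # keep only the low-frequency band, invert back to image space
    low = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real
    high = x - low
    # normalize the high-frequency residual to unit scale
    high = high / (np.abs(high).max() + 1e-8)
    return low + high
```

Since the DC component stays in the low-frequency part, the image mean is preserved while fine texture is flattened, which is the qualitative effect cartoonization aims for.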

An overview of the proposed method.

Methodology overview and model structure diagram. (a) illustrates classifier-free guidance, which extrapolates the model's predictions toward a specific class under conditional guidance. (b) illustrates our analysis of the denoising process of the diffusion model, which we divide into low-frequency and high-frequency signal denoising stages according to the relative frequency information of the generated images. (c) presents our modifications to the DiT [1] architecture.
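Panel (a) refers to classifier-free guidance. In its standard form, the guided noise estimate is obtained by extrapolating the conditional prediction away from the unconditional one. A minimal sketch (the function name and guidance scale `w` are illustrative):

```python
def classifier_free_guidance(eps_uncond, eps_cond, w=4.0):
    """Standard classifier-free guidance: extrapolate the conditional
    noise prediction away from the unconditional prediction by scale w.
    w = 1 recovers the purely conditional estimate; w > 1 strengthens
    class guidance."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

With `w = 1` the conditional prediction is returned unchanged; larger `w` pushes samples toward the conditioned class, which is the extrapolation panel (a) depicts.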

The Influence of the σ Parameter on Cartoonization

Observing the image results, we find that better cartoonization is achieved when the hyperparameter σ is between 200 and 300.
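One way to read σ is as the timestep below which the high-frequency normalization is switched on during reverse sampling. The loop below is a hedged sketch under that assumption: `model_step` stands in for one DiT denoising update, and the roll-based blur used as a low-frequency proxy is a simplified placeholder, not the paper's exact operation.

```python
import numpy as np

def cartoon_sampling_loop(model_step, x_T, T=1000, sigma=250):
    """Illustrative reverse-diffusion loop. Once the timestep t drops
    below `sigma` (assumed to mark the high-frequency denoising stage),
    a simple high-frequency normalization is applied to the sample."""
    x = x_T
    for t in range(T - 1, -1, -1):
        x = model_step(x, t)
        if t < sigma:
            # crude low-frequency estimate: 4-neighbour average
            low = 0.25 * (np.roll(x, 1, 0) + np.roll(x, -1, 0)
                          + np.roll(x, 1, 1) + np.roll(x, -1, 1))
            high = x - low
            # normalize the high-frequency residual to unit scale
            x = low + high / (np.abs(high).max() + 1e-8)
    return x
```

Under this reading, σ in [200, 300] controls how many of the final, high-frequency denoising steps are affected: too small and textures survive, too large and the normalization interferes with low-frequency structure.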

The main experimental results

Based on DiT XL/2 [1], with the hyperparameter σ set to 250, we present DiT's original generations alongside the results produced by CartoonDiff. In each pair, the left image is the original DiT output, while the right image is its cartoonization by CartoonDiff.

Comparison with other cartoon generation works

In the comparative experiment between CartoonDiff and Back-D, the small image in the upper-left corner of each panel shows the original model output, while the larger image shows the output after cartoonization.