Reasoning Photo Retouching

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

A research system for automatic, style-driven, and parameterized photo retouching with an interpretable reasoning pipeline and a fully differentiable retouch renderer.

0.5B VLM core
1M+ AetherRetouch dataset
3 Retouching modes
Yihong Guo Youwei Lyu Jiajun Tang Yizhuo Zhou Hongliang Wang Jinwei Chen Changqing Zou Qingnan Fan

On-Device Demo

Running on an iPhone 13 Pro Max, fully on-device with no cloud processing. (Demo videos play at 3x speed.)

Auto Mode

Style Mode

Param Mode

Overview

Abstract

Reasoning photo retouching has gained significant traction. It requires models to analyze image defects, explain their reasoning, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, which creates optimization barriers, and they suffer from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. We also develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Finally, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment.

Highlights

  • Lightweight design for controllable, interpretable mobile deployment
  • Free-resolution input for flexible retouching across diverse image sizes
  • Fully differentiable renderer for direct pixel-level training
  • Unified support for auto, style, and parameter retouching
  • AetherRetouch-1M+ for large-scale professional supervision

Visual Results

Comparisons across three retouching modes

The comparisons show how the model reshapes lighting, palette, and local color relationships while preserving detail. Auto mode analyzes image defects on its own and generates reasoning-aware enhancements without any user prompt. Style mode translates stylistic text prompts into visual adjustments by formulating a structured retouching plan and control latents. Param mode executes exact pixel-level modifications based on professional operational parameters. (Note: the full reasoning text produced by the model has been truncated in this display to prioritize visual clarity.)

Mode 01

Auto Mode

Image-only automatic retouching

Mode 02

Style Mode

Prompt-guided stylistic retouching

Mode 03

Param Mode

Instruction and parameter driven retouching

Method

A compact pipeline with explicit control over retouching intent

VeraRetouch model structure pipeline
01

Reasoning Brain

A 0.5B vision-language model reads the image and optional user request, then produces an interpretable retouching plan.

02

Disentangled Controls

Internal latents separate lighting, global color, and specific-color adjustments for finer retouch behavior.

03

Differentiable Rendering

The renderer replaces external editing software so the whole system can be trained end to end at pixel level.
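
To make the idea of disentangled controls feeding a differentiable renderer concrete, here is a minimal sketch in plain Python. It is not the VeraRetouch renderer: the three operations (an exposure gain for lighting, saturation scaling for global color, and a Gaussian soft mask for a specific color) and all control names are assumptions chosen for illustration. Finite differences stand in for the automatic differentiation a real training setup would use. The point is only that every step is smooth, so a pixel-level loss can drive the controls directly.

```python
import math

LUMA = (0.2126, 0.7152, 0.0722)  # Rec. 709 luma weights


def render_pixel(rgb, controls):
    """Apply three groups of smooth adjustments to one RGB pixel (hypothetical ops)."""
    # 1) Lighting: exposure in stops, a smooth multiplicative gain.
    gain = 2.0 ** controls["exposure"]
    r, g, b = (c * gain for c in rgb)

    # 2) Global color: scale saturation around the pixel's luma.
    luma = LUMA[0] * r + LUMA[1] * g + LUMA[2] * b
    sat = controls["saturation"]
    r, g, b = (luma + sat * (c - luma) for c in (r, g, b))

    # 3) Specific color: shift pixels near a target color via a Gaussian soft mask.
    tr, tg, tb = controls["target_rgb"]
    dist2 = (r - tr) ** 2 + (g - tg) ** 2 + (b - tb) ** 2
    weight = math.exp(-dist2 / (2.0 * controls["sigma"] ** 2))
    dr, dg, db = controls["shift_rgb"]
    # No hard clamp here: clipping would zero the gradient outside [0, 1].
    return (r + weight * dr, g + weight * dg, b + weight * db)


def loss(image, reference, controls):
    """Mean over pixels of the squared RGB error between render and reference."""
    total = 0.0
    for src, ref in zip(image, reference):
        out = render_pixel(src, controls)
        total += sum((o - t) ** 2 for o, t in zip(out, ref))
    return total / len(image)


def numeric_grad(image, reference, controls, key, eps=1e-5):
    """Central finite difference of the loss with respect to one scalar control."""
    hi = dict(controls, **{key: controls[key] + eps})
    lo = dict(controls, **{key: controls[key] - eps})
    return (loss(image, reference, hi) - loss(image, reference, lo)) / (2 * eps)


def fit(image, reference, controls, keys=("exposure", "saturation"), lr=0.1, steps=200):
    """Plain gradient descent on the selected scalar controls."""
    controls = dict(controls)
    for _ in range(steps):
        grads = {k: numeric_grad(image, reference, controls, k) for k in keys}
        for k in keys:
            controls[k] -= lr * grads[k]
    return controls
```

Calling `fit` on an image and a reference rendered with known controls should move `exposure` and `saturation` toward the values that produced the reference. An external editor that only returns a finished file offers no such gradient signal, which is the optimization barrier the differentiable renderer is meant to remove.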

Dataset

AetherRetouch Dataset

To support large-scale reasoning photo retouching, VeraRetouch introduces AetherRetouch-1M+, a million-scale dataset designed for professional-quality enhancement. The dataset covers diverse scenes, lighting conditions, portrait and landscape content, and rich retouching targets across auto, style, and parameter-driven workflows.

AetherRetouch is organized into three complementary parts: Auto-Retouch pairs for image-only enhancement, Style-Retouch pairs for prompt-guided stylistic editing, and Param-Retouch examples for explicit parameter-driven control. Together, these three subsets provide broad supervision for reasoning, controllability, and visual generalization across real retouching scenarios.
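
One way to picture the three subsets is as a single record type whose required fields depend on the mode. The sketch below is an assumption for illustration only: the released dataset's actual schema, field names, and file layout are not described here, and `RetouchSample`, `Mode`, and the example paths and prompt are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Mode(Enum):
    AUTO = "auto"    # image-only enhancement
    STYLE = "style"  # prompt-guided stylistic editing
    PARAM = "param"  # explicit parameter-driven control


@dataclass(frozen=True)
class RetouchSample:
    """Hypothetical supervision record: an input image, a target, and mode-specific guidance."""
    mode: Mode
    input_path: str
    target_path: str
    prompt: Optional[str] = None               # stylistic or instruction text
    params: Optional[Dict[str, float]] = None  # named retouching parameters

    def __post_init__(self):
        # Auto samples carry no user guidance at all.
        if self.mode is Mode.AUTO and (self.prompt or self.params):
            raise ValueError("auto samples take neither a prompt nor params")
        # Style samples are defined by their text prompt.
        if self.mode is Mode.STYLE and not self.prompt:
            raise ValueError("style samples need a prompt")
        # Param samples need explicit parameters; an instruction prompt is optional.
        if self.mode is Mode.PARAM and not self.params:
            raise ValueError("param samples need params")
```

For example, `RetouchSample(Mode.STYLE, "in.jpg", "out.jpg", prompt="warm film look")` is accepted, while the same call without a prompt raises `ValueError`.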

Representative examples from the AetherRetouch dataset

The data construction pipeline below mainly illustrates how the Auto-Retouch subset is built through an inverse degradation process. Starting from high-quality retouched references, the pipeline synthesizes realistic low-quality inputs to form supervision pairs for differentiable planning and rendering. This strategy enables scalable collection while preserving strong retouch targets and realistic visual degradation patterns.
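
The inverse degradation idea can be sketched in a few lines: take a well-retouched reference, apply randomly sampled degradations, and keep the degraded image as the input and the reference as the target. The degradation operations and sampling ranges below are assumptions for illustration, not the ones used to build AetherRetouch, and images are represented as plain lists of RGB tuples to keep the example dependency-free.

```python
import random


def degrade(rgb, exposure, saturation):
    """Apply a hypothetical exposure and saturation degradation to one RGB pixel."""
    gain = 2.0 ** exposure
    r, g, b = (c * gain for c in rgb)
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b
    # Clamp to [0, 1]: this step only synthesizes data, so it need not be differentiable.
    return tuple(min(1.0, max(0.0, luma + saturation * (c - luma))) for c in (r, g, b))


def make_pair(reference, rng):
    """Build one (input, target) supervision pair from a high-quality reference image."""
    # Sampling ranges are made up for this sketch.
    params = {
        "exposure": rng.uniform(-1.5, 1.5),   # under- or over-exposure in stops
        "saturation": rng.uniform(0.4, 1.2),  # mostly washed-out color
    }
    degraded = [degrade(pixel, **params) for pixel in reference]
    return {"input": degraded, "target": reference, "degradation": params}
```

Because the target is the original reference rather than a synthesized image, the supervision signal stays at professional quality no matter how many degraded inputs are generated from it.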

AetherRetouch data construction pipeline

Citation

Use VeraRetouch in your research

@article{guo2026veraretouch,
  title={VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching},
  author={Guo, Yihong and Lyu, Youwei and Tang, Jiajun and Zhou, Yizhuo and Wang, Hongliang and Chen, Jinwei and Zou, Changqing and Fan, Qingnan},
  journal={arXiv preprint arXiv:2604.27375},
  year={2026}
}