🎉 Accepted to ECCV 2026

SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation

Grounding scientific image generation in Peirce's Semiotic Triad.

Zhiyuan Ma1,*, Zhengfeng Shi1,*, Yuning An2, Peize Li3, Jiabao Wei4, Ruijie Li5, Junhao Xiao6, Jianjun Li1,†, Bowen Zhou7
1 Huazhong University of Science and Technology  ·  2 Harbin Engineering University  ·  3 King's College London  ·  4 Beijing Institute of Technology
5 Dept. of Informatics, King's College London  ·  6 Central China Normal University  ·  7 Tsinghua University
* Equal contribution   † Corresponding author
Overview of SciIR
Figure 1. Overview of SciIR. (a) Keyword word cloud and track distribution. (b) Example figures across domains. (c) SciIR-Bench results, comparing Intrinsic Reasoning vs. Instruction Following.
82K
Image–text pairs
800
Benchmark instances
65
Scientific aspects
35% → 43%
SciIR-Bench gain
Abstract

What is SciIR?

Inspired by Peirce's Semiotic Triad, SciIR formalizes scientific reasoning into three dimensions: Entity Structure (Icon), Scientific Process (Index), and Scientific Law (Symbol). It is backed by the SciIR-82k dataset (80k+ Sci-RCoT pairs) and the SciIR-Bench Atomic Checklist. Fine-tuning on SciIR-82k yields Qwen-Image-SciIR, lifting the SciIR-Bench score from 35% to 43%.
A Principled Taxonomy

Peirce's Semiotic Triad

Icon

Entity Structure

Geometric hierarchy and spatial alignment of scientific entities.

structure & topology
Index

Scientific Process

Causal and temporal chains: state transitions and workflows.

process & causality
Symbol

Scientific Law

Abstract laws, such as energy conservation and molecular valence.

law & principle
Contributions

Highlights

🧠

A Principled Taxonomy

Scientific correctness decomposed into Icon, Index, and Symbol dimensions.

📚

SciIR-82k Dataset

>80,000 image–text pairs with Sci-RCoT reasoning annotations.

🔬

SciIR-Bench Benchmark

The first benchmark to score multidimensional correctness via a verifiable Atomic Checklist.

🚀

Qwen-Image-SciIR

Open-source baseline boosting SciIR-Bench from 35% → 43%.

SciIR-82k

The Construction Pipeline

A multi-stage pipeline that reverse-engineers reasoning from published figures.

SciIR-82k construction pipeline
Figure 2. The SciIR-82k pipeline. Corpus construction (YOLO11, InternVL3.5) → semiotic stratification → reasoning-driven Sci-RCoT annotation (Qwen3).
STAGE 01

Corpus Construction

Decompose multi-panel figures into subfigures, standardize to 1024×1024, and filter via VLM.

YOLO11 · InternVL3.5
STAGE 02

Semiotic Stratification

Score each sample's relevance to the three tracks and route to targeted annotation.

Qwen3-VL
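One way to read the routing step is as a thresholded multi-label assignment. The sketch below is only an illustration: the 0-to-1 score scale, the 0.5 threshold, the fallback to the top-scoring track, and the names `TRACKS` and `route_sample` are assumptions, not details taken from the SciIR pipeline.

```python
from typing import Dict, List

# Track names follow the three semiotic dimensions (Icon, Index, Symbol).
TRACKS = ("entity_structure", "scientific_process", "scientific_law")


def route_sample(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the tracks whose relevance score reaches the threshold.

    Assumption: scores lie in [0, 1] and the threshold is hypothetical.
    If no track qualifies, fall back to the single best-scoring track.
    """
    selected = [t for t in TRACKS if scores.get(t, 0.0) >= threshold]
    if not selected:
        selected = [max(TRACKS, key=lambda t: scores.get(t, 0.0))]
    return selected
```

Under these assumptions, a figure scored high on both structure and law would be sent to both annotation tracks, while a figure with uniformly low scores would still receive one track.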
STAGE 03

Reasoning-Driven Annotation

Reverse-engineer the Sci-RCoT from ground-truth images, then distill a concise prompt.

Qwen3-VL · Qwen3-Max

Dataset Statistics

Distribution of figures by discipline
(a) Distribution by discipline. SciIR-82k spans a broad range of scientific fields.
Term count distribution across tracks
(b) Term-count distribution across the three semiotic reasoning tracks.
🤗 Get SciIR-82k on Hugging Face
SciIR-Bench

A Fine-grained, Verifiable Benchmark

Measuring whether models faithfully instantiate structured scientific content.

A SciIR-Bench evaluation instance
Figure 3. An evaluation instance from SciIR-Bench. A prompt covering all four tracks guides various models to generate images. Each output is scrutinized by Gemini-3-Pro using a dimension-specific atomic checklist.
DESIGN 01

Four-fold Track Grouping

N=200 per group: one holistic group plus pairwise intersections of the three tracks.
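The group count can be checked with a short enumeration. This sketch assumes the holistic group combines all three semiotic tracks; the names `TRACKS`, `PER_GROUP`, and `build_groups` are illustrative only.

```python
from itertools import combinations

TRACKS = ("Entity Structure", "Scientific Process", "Scientific Law")
PER_GROUP = 200  # N=200 per group, as stated in the design


def build_groups():
    """One holistic group plus every pairwise intersection of the three tracks."""
    groups = [TRACKS]                         # holistic: all three tracks together
    groups += list(combinations(TRACKS, 2))   # the three pairwise intersections
    return groups


groups = build_groups()
total_instances = len(groups) * PER_GROUP
```

Three tracks give three pairs, so with the holistic group there are four groups, and 4 × 200 matches the 800 benchmark instances reported for SciIR-Bench.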

DESIGN 02

IF vs. IR Stratification

Instruction Following (dense prompt) vs. Intrinsic Reasoning (abstract prompt).

DESIGN 03

Atomic Checklist

Term-driven extraction → atomic questioning → refereeing, with a strict veto on hallucinations.
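A minimal sketch of how a checklist with a veto might be scored is given here. It assumes each atomic question yields a yes/no verdict and that a detected hallucination zeroes the instance score; the actual penalty and aggregation used by SciIR-Bench are not specified in this page, and `checklist_score` is a hypothetical name.

```python
from typing import List


def checklist_score(answers: List[bool], hallucination_detected: bool) -> float:
    """Fraction of atomic questions passed, with a veto on hallucination.

    Assumption: the veto sets the score to zero. An empty checklist also
    scores zero to avoid dividing by zero.
    """
    if hallucination_detected or not answers:
        return 0.0
    return sum(answers) / len(answers)
```

With this reading, an image that passes three of four atomic questions scores 0.75, while any image flagged for hallucination scores 0 regardless of how many questions it passes.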

Explore the SciIR-Bench Leaderboard
Our Model

Qwen-Image-SciIR

Decoupling scientific reasoning from visual synthesis via two LoRA modules.

Qwen-Image-SciIR architecture
Figure 4. Qwen-Image-SciIR architecture. A Reasoning Planner (LoRA-tuned Qwen2.5-7B-Instruct) infers a comprehensive Sci-RCoT from the input prompt, which is then consumed by the Visual Generator (LoRA-tuned Qwen-Image-2512) to synthesize the final image.
🧩

Reasoning Planner

Qwen2.5-7B-Instruct (LoRA r=64, α=16) infers the Sci-RCoT from the prompt.

🎨

Visual Generator

Qwen-Image-2512 (LoRA r=32) renders the image from the Sci-RCoT at 1024×1024.
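To make the Reasoning Planner's LoRA hyperparameters concrete, the sketch below computes the update scaling and the number of added parameters for a single weight matrix. It assumes the standard LoRA convention, in which the low-rank update is scaled by α/r (not the rank-stabilized variant), and the 4096 × 4096 layer size is a hypothetical value chosen only for illustration.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Standard LoRA scales the low-rank update B @ A by alpha / rank."""
    return alpha / rank


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters added to one d_out x d_in weight: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# r=64 and alpha=16 are the values quoted for the Reasoning Planner.
planner_scale = lora_scaling(rank=64, alpha=16)

# Hypothetical 4096 x 4096 projection, for illustration only.
extra_params = lora_param_count(4096, 4096, rank=64)
```

Under that convention the planner's update is scaled by 0.25, and the hypothetical layer gains 524,288 trainable parameters against roughly 16.8 million frozen ones.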

Experiments

Results on SciIR-Bench

Accuracy Score (%) for Intrinsic Reasoning (IR), Instruction Following (IF), and overall performance across four tracks.

Qwen-Image-2512
35%
→
Qwen-Image-SciIR
43%
Largest gains on Scientific Process (+16%) and Entity Structure (+9%).
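The overall 35% to 43% change can be expressed either in absolute percentage points or as a relative improvement. The short calculation below shows both for the overall score only; the helper name `gain` is illustrative, and the page does not state which convention the per-track figures use.

```python
def gain(baseline: float, tuned: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = tuned - baseline
    relative = absolute / baseline * 100
    return absolute, relative


abs_gain, rel_gain = gain(35.0, 43.0)
```

For the overall score this gives 8 percentage points, or a relative improvement of about 22.9% over the Qwen-Image-2512 baseline.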
SciIR-Bench evaluation results table
Table 1. Evaluation on SciIR-Bench. Accuracy Score (%) across four tracks: Scientific Law (SL), Entity Structure (ES), Scientific Process (SP), and Text. Scores are reported for IR, IF, and their average.
Qualitative comparison of generated results
Figure 5. Qualitative comparison. Qwen-Image-SciIR shifts from a generic artistic style toward precise scientific illustration, minimizing structural omissions, broken causal links, and domain-prior violations produced by the baseline.
Reference

Citation

If you find SciIR useful for your research, please consider citing our work.

BibTeX
@inproceedings{sciir2026,
  title     = {SciIR: A Large-scale Training Dataset and Benchmark
               for Scientific Image Reasoning Generation},
  author    = {Ma, Zhiyuan and Shi, Zhengfeng and An, Yuning and
               Li, Peize and Wei, Jiabao and Li, Ruijie and
               Xiao, Junhao and Li, Jianjun and Zhou, Bowen},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}