SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation

**Figure 1. Overview of SciIR.** (a) Keyword word cloud and track distribution. (b) Example figures across domains. (c) SciIR-Bench results, comparing *Intrinsic Reasoning* vs. *Instruction Following*.

82K

Image–text pairs

800

Benchmark instances

Scientific aspects

35%→43%

SciIR-Bench gain

Abstract

What is SciIR?

Inspired by Peirce's Semiotic Triad, SciIR formalizes scientific reasoning along three dimensions: Entity Structure (Icon), Scientific Process (Index), and Scientific Law (Symbol). It is backed by the SciIR-82k dataset (80k+ image–text pairs with Sci-RCoT annotations) and the SciIR-Bench Atomic Checklist. Fine-tuning yields Qwen-Image-SciIR, which lifts the SciIR-Bench score from 35% to 43%.

A Principled Taxonomy

Peirce's Semiotic Triad

Icon

Entity Structure

Geometric hierarchy and spatial alignment of scientific entities.

structure & topology

Index

Scientific Process

Causal and temporal chains — state transitions and workflows.

process & causality

Symbol

Scientific Law

Abstract laws — energy conservation, molecular valence.

law & principle

Contributions

Highlights

🧠 A Principled Taxonomy: Scientific correctness decomposed into Icon, Index, and Symbol dimensions.
📚 SciIR-82k Dataset: >80,000 image–text pairs with Sci-RCoT reasoning annotations.
🔬 SciIR-Bench Benchmark: The first benchmark to score multidimensional correctness via a verifiable Atomic Checklist.
🚀 Qwen-Image-SciIR: Open-source baseline boosting SciIR-Bench from 35% → 43%.

SciIR-82k

The Construction Pipeline

A multi-stage pipeline that reverse-engineers reasoning from published figures.

**Figure 2. The SciIR-82k pipeline.** Corpus construction (YOLO11, InternVL3.5) → semiotic stratification → reasoning-driven Sci-RCoT annotation (Qwen3).

STAGE 01

Corpus Construction

Decompose multi-panel figures into subfigures, standardize to 1024×1024, and filter via VLM.

YOLO11, InternVL3.5
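The page does not say how subfigures are brought to 1024×1024. As one plausible reading, the sketch below computes an aspect-preserving resize with centered padding (letterboxing). The helper name `letterbox_geometry` and the padding strategy are assumptions for illustration, not the authors' implementation.

```python
TARGET = 1024  # side length stated for standardized subfigures


def letterbox_geometry(width: int, height: int, target: int = TARGET):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving fit.

    Hypothetical helper: the source only states the 1024x1024 target size.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale so the longer side exactly fills the target canvas.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Center the resized image on the square canvas.
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


print(letterbox_geometry(2048, 1024))  # -> (1024, 512, 0, 256)
```

A plain stretch to 1024×1024 would be simpler, but it distorts axis ratios and geometric layouts, which is why this sketch assumes padding instead.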

STAGE 02

Semiotic Stratification

Score each sample's relevance to the three tracks and route to targeted annotation.

Qwen3-VL
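A minimal sketch of the routing step, assuming the VLM returns one relevance score in [0, 1] per track. The threshold value, the score range, and the fallback to the single best track are illustrative assumptions; the page only states that samples are scored and routed.

```python
TRACKS = ("entity_structure", "scientific_process", "scientific_law")


def route_sample(scores: dict, threshold: float = 0.5) -> list:
    """Return the tracks whose relevance score meets the threshold.

    Hypothetical routing rule: if no track clears the threshold, fall back
    to the single highest-scoring track so every sample gets annotated.
    """
    missing = [t for t in TRACKS if t not in scores]
    if missing:
        raise KeyError(f"missing track scores: {missing}")
    selected = [t for t in TRACKS if scores[t] >= threshold]
    if not selected:
        selected = [max(TRACKS, key=lambda t: scores[t])]
    return selected


example = {"entity_structure": 0.9, "scientific_process": 0.2, "scientific_law": 0.6}
print(route_sample(example))  # -> ['entity_structure', 'scientific_law']
```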

STAGE 03

Reasoning-Driven Annotation

Reverse-engineer the Sci-RCoT from ground-truth images, then distill a concise prompt.

Qwen3-VL, Qwen3-Max

Dataset Statistics

**(a) Distribution by discipline.** SciIR-82k spans a broad range of scientific fields.

**(b) Term-count distribution** across the three semiotic reasoning tracks.

🤗 Get SciIR-82k on Hugging Face

SciIR-Bench

A Fine-grained, Verifiable Benchmark

Measuring whether models faithfully instantiate structured scientific content.

**Figure 3. An evaluation instance from SciIR-Bench.** A single prompt covering all four tracks is given to several models. Each generated image is then judged by Gemini-3-Pro against a dimension-specific atomic checklist.

DESIGN 01

Four-fold Track Grouping

N=200 per group: one holistic group plus pairwise intersections of the three tracks.
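The grouping can be enumerated as a quick sanity check. This sketch assumes the holistic group combines all three semiotic tracks. The variable names are illustrative.

```python
from itertools import combinations

TRACKS = ("Entity Structure", "Scientific Process", "Scientific Law")
PER_GROUP = 200  # N stated per group

# One holistic group (all three tracks) plus every pairwise intersection.
groups = [TRACKS] + list(combinations(TRACKS, 2))
total = PER_GROUP * len(groups)

for group in groups:
    print(" + ".join(group))
print(len(groups), total)  # -> 4 800
```

Four groups of 200 instances give 800 in total, matching the benchmark size reported at the top of the page.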

DESIGN 02

IF vs. IR Stratification

Instruction Following (dense prompt) vs. Intrinsic Reasoning (abstract prompt).

DESIGN 03

Atomic Checklist

Term-driven extraction → atomic questioning → refereeing, with a strict veto on hallucinations.
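The scoring rule behind the checklist is not spelled out on this page. The sketch below shows one possible reading: the score is the fraction of atomic questions passed, and any hallucination flagged by the referee zeroes the instance. The `ChecklistItem` fields and the zeroing behavior are assumptions, not the benchmark's published definition.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str                 # atomic yes/no question derived from a term
    passed: bool                  # referee verdict for this question
    hallucination: bool = False   # referee flagged fabricated content


def checklist_score(items: list) -> float:
    """Fraction of passed items, or 0.0 if any item triggers the veto.

    Assumed interpretation of the "strict veto on hallucinations".
    """
    if not items:
        raise ValueError("checklist must contain at least one item")
    if any(item.hallucination for item in items):
        return 0.0
    return sum(item.passed for item in items) / len(items)


items = [
    ChecklistItem("Is the reactant drawn before the product?", True),
    ChecklistItem("Are all four stages labeled?", False),
]
print(checklist_score(items))  # -> 0.5
```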

Explore the SciIR-Bench Leaderboard

Our Model

Qwen-Image-SciIR

Decoupling scientific reasoning from visual synthesis via two LoRA modules.

🧩

Reasoning Planner

Qwen2.5-7B-Instruct (LoRA r=64, α=16) infers the Sci-RCoT from the prompt.

🎨

Visual Generator

Qwen-Image-2512 (LoRA r=32) renders the image from the Sci-RCoT at 1024×1024.
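The two adapter settings can be written down as a small config sketch. `LoraSpec` is a hypothetical container, not a library class. It applies the standard LoRA convention of scaling the low-rank update by α / r, and leaves α unset for the generator because the page does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoraSpec:
    base_model: str
    rank: int
    alpha: Optional[float] = None  # None where the source does not state alpha

    @property
    def scaling(self) -> Optional[float]:
        """Standard LoRA update scale alpha / r, or None if alpha is unknown."""
        return None if self.alpha is None else self.alpha / self.rank


planner = LoraSpec("Qwen2.5-7B-Instruct", rank=64, alpha=16)
generator = LoraSpec("Qwen-Image-2512", rank=32)

print(planner.scaling)    # -> 0.25
print(generator.scaling)  # -> None
```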

Experiments

Results on SciIR-Bench

Accuracy Score (%) for Intrinsic Reasoning (IR), Instruction Following (IF), and overall performance across four tracks.

Qwen-Image-2512

35%

→

Qwen-Image-SciIR

43%

Largest gains on Scientific Process (+16%) and Entity Structure (+9%).

**Table 1. Evaluation on SciIR-Bench.** Accuracy Score (%) across four tracks: Scientific Law (SL), Entity Structure (ES), Scientific Process (SP), and Text. Scores are reported for IR, IF, and their average.

**Figure 5. Qualitative comparison.** Qwen-Image-SciIR shifts from a generic artistic style toward precise scientific illustration, reducing the structural omissions, broken causal links, and domain-prior violations produced by the baseline.

Reference

Citation

If you find SciIR useful for your research, please consider citing our work.

BibTeX

@inproceedings{sciir2026,
  title     = {SciIR: A Large-scale Training Dataset and Benchmark
               for Scientific Image Reasoning Generation},
  author    = {Ma, Zhiyuan and Shi, Zhengfeng and An, Yuning and
               Li, Peize and Wei, Jiabao and Li, Ruijie and
               Xiao, Junhao and Li, Jianjun and Zhou, Bowen},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}