Baseline Models

To facilitate rapid onboarding, we recommend the following baselines.

Vision–Language Similarity Baseline

CLIP / SigLIP

Use cosine similarity between image and caption embeddings (a minimal sketch follows the purpose list below).

Purpose:

  • Detect global semantic mismatch
  • Provide a fast initial benchmark
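
A minimal sketch of this baseline using the Hugging Face transformers CLIP API (a SigLIP variant via SiglipModel is analogous); the checkpoint name is one reasonable choice, not a fixed requirement:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    MODEL_ID = "openai/clip-vit-base-patch32"  # placeholder checkpoint
    model = CLIPModel.from_pretrained(MODEL_ID).eval()
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    def pair_similarity(image_path: str, caption: str) -> float:
        """Cosine similarity between the image and caption embeddings."""
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return float((img * txt).sum(dim=-1))

Pairs scoring below a threshold tuned on a validation split can be flagged as globally mismatched; the threshold itself is dataset-dependent.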

Vision-Only Baseline

DINOv2 Features

Extract patch features, pool them separately over the foreground and background regions, and compare the pooled embeddings (see the sketch after the purpose list below).

Purpose:

  • Measure foreground–background visual inconsistency
  • Analyze scene composition
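
A minimal sketch, assuming DINOv2 loaded from torch.hub and a binary foreground mask supplied with each sample (mask availability is an assumption about the dataset, not a DINOv2 feature):

    import torch
    import torch.nn.functional as F

    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

    def fg_bg_similarity(image: torch.Tensor, fg_mask: torch.Tensor) -> float:
        """image: (3, H, W), ImageNet-normalized, H and W divisible by 14.
        fg_mask: (H, W) binary mask; both regions assumed non-empty."""
        with torch.no_grad():
            feats = model.forward_features(image.unsqueeze(0))
        patches = feats["x_norm_patchtokens"][0]  # (N, D), N = (H/14)*(W/14)
        h, w = image.shape[1] // 14, image.shape[2] // 14
        # Downsample the pixel mask to the patch grid (nearest neighbor).
        keep = F.interpolate(fg_mask[None, None].float(), size=(h, w)).flatten() > 0.5
        fg = patches[keep].mean(dim=0)    # pooled foreground embedding
        bg = patches[~keep].mean(dim=0)   # pooled background embedding
        return float(F.cosine_similarity(fg, bg, dim=0))

Low similarity between the pooled embeddings is one possible signal of a composited foreground; how it maps to a decision is left to the experimenter.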

Multimodal Reasoning Baseline

Multimodal LLM (Qwen2-VL)

Prompt the model with:

“Is this image–caption pair plausible?”

A prompting sketch follows the purpose list below.

Purpose:

  • Test reasoning ability
  • Establish an upper-bound reference
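
A sketch using the transformers integration of Qwen2-VL (requires a recent transformers release); the checkpoint, prompt wording, and token budget are choices, not requirements:

    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder checkpoint
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    def judge_pair(image_path: str, caption: str) -> str:
        image = Image.open(image_path).convert("RGB")
        messages = [{"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": f'Caption: "{caption}"\n'
                                     "Is this image–caption pair plausible? "
                                     "Answer yes or no."},
        ]}]
        prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(text=[prompt], images=[image],
                           return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=10)
        # Decode only the newly generated tokens.
        new_tokens = out[:, inputs["input_ids"].shape[1]:]
        return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

Parsing the free-form answer into a binary label (e.g., checking for a leading "yes"/"no") is deliberately left simple here.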

Manipulation-Aware Baseline

HAMMER

Run the pretrained detector on FG–BG data and characterize its local manipulation detection failure modes (an evaluation-harness sketch follows the purpose list below).

Purpose:

  • Study out-of-distribution behavior
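
HAMMER does not ship as a packaged library, so the sketch below treats the released inference code as a black box behind a hypothetical hammer_score hook; the function name, signature, and 0.5 threshold are placeholders:

    from sklearn.metrics import roc_auc_score

    def hammer_score(image_path: str, caption: str) -> float:
        """Hypothetical wrapper: return a scalar manipulation score from the
        released HAMMER checkpoint for one image-caption pair."""
        raise NotImplementedError("wire the released HAMMER inference code here")

    def evaluate_ood(pairs):
        """pairs: iterable of (image_path, caption, label), with label 1 for
        FG-BG-inconsistent samples and 0 for consistent ones."""
        scores, labels, failures = [], [], []
        for image_path, caption, label in pairs:
            s = hammer_score(image_path, caption)
            scores.append(s)
            labels.append(label)
            if (s >= 0.5) != bool(label):  # placeholder decision threshold
                failures.append((image_path, caption, s, label))
        return roc_auc_score(labels, scores), failures

Since the detector was trained for local manipulations, a low AUROC here is informative in itself, and the collected failures support qualitative out-of-distribution analysis.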

Reporting Guidelines

Baseline reports should include:

  • Hardware used
  • Inference time
  • Training data
  • Hyperparameters
  • Failure cases
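
One possible report stub, expressed as a Python dataclass; the field names are suggestions matching the list above:

    from dataclasses import dataclass, field

    @dataclass
    class BaselineReport:
        baseline: str               # e.g. "CLIP ViT-B/32, zero-shot"
        hardware: str               # e.g. "1x A100 80GB"
        inference_time_ms: float    # mean per image-caption pair
        training_data: str          # "none (zero-shot)" or dataset name
        hyperparameters: dict = field(default_factory=dict)
        failure_cases: list = field(default_factory=list)  # sample IDs or paths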