GRAB

A Challenging GRaph Analysis Benchmark for Large Multimodal Models

1University of Cambridge, 2The University of Hong Kong
ICCV 2025

Overall performance on GRAB. Our benchmark proves challenging for frontier LMMs.
The highest performing model, Claude 3.5 Sonnet 🥇, attains an accuracy of just 21.0%.

Overview

Large multimodal models (LMMs) have exhibited proficiency across many visual tasks. Although numerous benchmarks exist to evaluate model performance, they increasingly have insufficient headroom and are unfit to evaluate the next generation of frontier LMMs.

To overcome this, we present GRAB, a challenging benchmark focused on the tasks human analysts typically perform when interpreting figures. Such tasks include estimating the means, intercepts, or correlations of functions and data series, and performing transforms.

We evaluate a suite of 20 LMMs on GRAB via exact matching, finding it to be a challenging benchmark, with the current best model scoring just 21.0%.
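Exact-match scoring can be sketched as follows. The normalization steps (case folding, whitespace stripping, and canonicalizing numeric strings) are illustrative assumptions, not the benchmark's official procedure:

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string (assumed normalization, for illustration)."""
    s = answer.strip().lower()
    # Render numeric answers canonically so e.g. "2.50" matches "2.5".
    try:
        s = repr(float(s))
    except ValueError:
        pass
    return s

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions exactly matching their reference after normalization."""
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)
```

Under this scheme a prediction scores either 0 or 1 per question, with no partial credit.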

Motivated by the recent development of reasoning models, we also introduce GRAB-Lite, a lightweight, task-balanced 500-question subset of GRAB, and evaluate leading frontier LMMs on it.
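A task-balanced subset like GRAB-Lite can be built by stratified sampling, drawing an equal number of questions per task. The `"task"` field name and per-task count below are assumptions for illustration, not the construction procedure used for GRAB-Lite:

```python
import random

def task_balanced_subset(questions, per_task=100, seed=0):
    """Sample up to `per_task` questions from each task (stratified sampling).

    `questions` is assumed to be a list of dicts with a "task" key; with
    five tasks and per_task=100 this yields a 500-question subset.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    by_task = {}
    for q in questions:
        by_task.setdefault(q["task"], []).append(q)
    subset = []
    for task, qs in sorted(by_task.items()):
        subset.extend(rng.sample(qs, min(per_task, len(qs))))
    return subset
```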

Leaderboards

| Rank | Model | Properties | Functions | Series | Transforms | Real | Overall |
|------|-------|------------|-----------|--------|------------|------|---------|
| 1 | Claude 3.5 Sonnet 🥇 | 41.8 | 15.5 | 11.0 | 10.0 | 19.6 | 21.0 |
| 2 | Gemini 1.5 Pro 🥈 | 34.2 | 11.4 | 13.3 | 6.5 | 20.3 | 18.8 |
| 3 | Gemini 1.5 Flash 🥉 | 28.5 | 11.5 | 8.4 | 9.0 | 17.1 | 16.1 |
| 4 | GPT-4o | 24.7 | 10.8 | 9.2 | 3.5 | 17.3 | 14.9 |
| 5 | Claude 3 Sonnet | 15.3 | 8.6 | 4.5 | 4.8 | 12.4 | 10.3 |
| 6 | Reka Flash | 13.2 | 10.1 | 6.3 | 3.9 | 10.0 | 9.5 |
| 7 | GPT-4 Turbo | 18.5 | 8.5 | 4.9 | 3.5 | 7.5 | 9.2 |
| 8 | Claude 3 Haiku | 14.2 | 6.6 | 8.8 | 3.9 | 9.2 | 9.1 |
| 9 | TransCore-M | 7.9 | 9.2 | 7.6 | 3.9 | 8.2 | 7.9 |
| 10 | Yi-VL-6b | 5.6 | 8.6 | 7.1 | 4.2 | 9.7 | 7.7 |
| 11 | LLaVA-1.5 13b | 5.0 | 7.7 | 8.4 | 3.9 | 8.9 | 7.3 |
| 12 | CogVLM-Chat | 7.0 | 4.9 | 5.1 | 3.9 | 10.5 | 7.2 |
| 13 | GPT-4o mini | 15.8 | 6.8 | 5.7 | 2.9 | 4.0 | 7.1 |
| 14 | LLaVA-1.5 7b | 4.7 | 7.5 | 6.5 | 4.8 | 8.5 | 6.9 |
| 15 | Yi-VL-34b | 7.6 | 5.9 | 5.5 | 2.3 | 7.5 | 6.4 |
| 16 | Qwen-VL-Chat | 10.2 | 6.6 | 5.1 | 2.9 | 4.6 | 6.1 |
| 17 | OmniLMM-3b | 6.7 | 4.9 | 4.1 | 4.5 | 6.2 | 5.5 |
| 18 | Reka Core | 1.7 | 0.0 | 4.3 | 0.3 | 1.3 | 1.5 |
| – | Gemini 1.0 Pro Vision | 20.2 | 5.8 | 6.9 | 6.1 | | |
| – | Reka Edge | 11.8 | 8.7 | 11.6 | 1.9 | | |

🎉 To add your GRAB results, please contact us by email.

GRAB and GRAB-Lite Datasets

GRAB consists of 3284 questions centered around high-quality synthetic graphs. There are five key tasks in GRAB, with questions covering 23 different graph properties.

  • Properties focuses on the analysis of features of individual functions and series
  • Functions requires computing the mean of properties across multiple functions
  • Series requires computing the mean of properties across multiple series
  • Transforms involves determining the properties of a function after it has undergone a series of transforms
  • Real involves determining the properties of functions and series on real graphs drawn on whiteboards or paper, or on synthetic graphs embedded in various computing environments or corrupted with random noise
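As an illustration of the Functions task, consider a hypothetical question (not drawn from the benchmark): given several linear functions plotted on one graph, report the mean of a property such as the gradient. With functions y = m·x + c described by assumed (m, c) pairs:

```python
# Hypothetical Functions-task example: mean gradient of several plotted lines.
# Each function is y = m*x + c, represented by its (m, c) coefficients.
functions = [(2.0, 1.0), (4.0, -3.0), (0.0, 5.0)]  # assumed example data

mean_gradient = sum(m for m, _ in functions) / len(functions)
print(mean_gradient)  # → 2.0
```

The model must perform this kind of estimate-then-aggregate reasoning purely from the rendered figure, without access to the underlying coefficients.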

Additional Experimental Results

BibTeX

@article{roberts2024grab,
      title        = {GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models},
      author       = {Jonathan Roberts and Kai Han and Samuel Albanie},
      year         = {2024},
      journal      = {arXiv preprint arXiv:2408.11817}
}