GRAB

A Challenging GRaph Analysis Benchmark for Large Multimodal Models

1University of Cambridge, 2The University of Hong Kong

Overall performance on GRAB. Our benchmark proves challenging for frontier LMMs.
The highest performing model, Claude 3.5 Sonnet 🥇, attains an accuracy of just 21.7%.

Overview

Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous benchmarks exist to evaluate model performance, they increasingly have insufficient headroom and are unfit to evaluate the next generation of frontier LMMs.

To overcome this, we present GRAB, a challenging benchmark focused on the tasks human analysts might typically perform when interpreting figures. Such tasks include estimating the means, intercepts or correlations of functions and data series, and performing transforms.

We evaluate a suite of 20 LMMs on GRAB via exact matching, finding it to be a challenging benchmark, with the current best model scoring just 21.7%.
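Exact-match scoring of this kind can be sketched in a few lines. The snippet below is a minimal illustration rather than the GRAB evaluation code: the function name `exact_match_accuracy`, the whitespace stripping, and the sample answers are all assumptions made for the example.

```python
def exact_match_accuracy(predictions, answers):
    """Return the percentage of predictions that exactly match the reference answers.

    Assumption: both inputs are lists of strings, and surrounding whitespace is
    stripped before comparison. The benchmark's own normalisation may differ.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical model outputs and reference answers (not taken from GRAB).
preds = ["3", "2.5", "-1", "7"]
refs = ["3", "2", "-1", "4"]
print(exact_match_accuracy(preds, refs))  # 50.0
```

Because the comparison is exact, an answer such as "2.5" earns no credit against a reference of "2", which is one reason scores under this metric can be low.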

GRAB Leaderboard

Accuracy scores on GRAB (2170 questions)

Rank  Model                  Overall  Properties  Functions  Series  Transforms
1     Claude 3.5 Sonnet 🥇       21.7        41.8       15.5    11.0        10.0
2     Gemini 1.5 Pro 🥈          18.1        34.2       11.4    13.3         6.5
3     Gemini 1.5 Flash 🥉        15.6        28.5       11.5     8.4         9.0
4     GPT-4o                    13.6        24.7       10.8     9.2         3.5
5     Gemini 1.0 Pro Vision     10.5        20.2        5.8     6.9         6.1
6     GPT-4 Turbo               10.0        18.5        8.5     4.9         3.5
7     Reka Edge                  9.4        11.8        8.7    11.6         1.9
8     Reka Flash                 9.3        13.2       10.1     6.3         3.9
9     Claude 3 Sonnet            9.2        15.3        8.6     4.5         4.8
10    Claude 3 Haiku             9.0        14.2        6.6     8.8         3.9
11    GPT-4o mini                8.7        15.8        6.8     5.7         2.9
12    TransCore-M                7.6         7.9        9.2     7.6         3.9
13    Qwen-VL-Chat               6.8        10.2        6.6     5.1         2.9
14    Yi-VL-6b                   6.7         5.6        8.6     7.1         4.2
15    LLaVA-1.5 13b              6.5         5.0        7.7     8.4         3.9
16    LLaVA-1.5 7b               6.0         4.7        7.5     6.5         4.8
17    Yi-VL-Chat 34b             5.8         7.6        5.9     5.5         2.3
18    CogVLM-Chat                5.4         7.0        4.9     5.1         3.9
19    OmniLMM-3b                 5.2         6.7        4.9     4.1         4.5
20    Reka Core                  1.5         1.7        0.0     4.3         0.3

🎉 To add your results to the leaderboard, please contact the authors by email.

GRAB Dataset

GRAB consists of 2170 questions centered around high-quality synthetic graphs. The questions are split across four key tasks and cover 23 different graph properties.

  • Properties focuses on the analysis of features of individual functions and series
  • Functions requires computing the mean of properties across multiple functions
  • Series requires computing the mean of properties across multiple series
  • Transforms involves determining the properties of a function after it has undergone a series of transforms
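To make the task descriptions above concrete, the sketch below works through a toy version of a Functions-style question (the mean of a property across several functions) and a Transforms-style question (a property after a sequence of transforms). It is purely illustrative: the linear functions, the choice of y-intercept as the property, and the helper names are assumptions, not items drawn from GRAB.

```python
from statistics import mean

# Each hypothetical linear function y = m*x + c is stored as a (slope, intercept) pair.
functions = [(2.0, 1.0), (-1.0, 4.0), (0.5, -2.0)]


def mean_y_intercept(funcs):
    """Functions-style question: mean y-intercept across several functions."""
    return mean(c for _, c in funcs)


def shift_up(func, k):
    """Translate y = m*x + c vertically by k, giving y = m*x + (c + k)."""
    m, c = func
    return (m, c + k)


def reflect_in_x_axis(func):
    """Reflect y = m*x + c in the x-axis, giving y = -m*x - c."""
    m, c = func
    return (-m, -c)


print(mean_y_intercept(functions))  # 1.0

# Transforms-style question: shift y = 2x + 1 up by 3, then reflect in the x-axis.
transformed = reflect_in_x_axis(shift_up((2.0, 1.0), 3.0))
print(transformed)  # (-2.0, -4.0)
```

In the benchmark itself the model must read such quantities off a rendered graph rather than from coefficients, so the arithmetic shown here is only the final step of the task.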

Experimental Results

BibTeX

@article{roberts2024grab,
      title        = {GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models},
      author       = {Jonathan Roberts and Kai Han and Samuel Albanie},
      year         = {2024},
      journal      = {arXiv preprint arXiv:2408.11817}
    }