SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering

Dongguk University, South Korea
* Corresponding author
CVPR 2026

SEA is designed to quantify how efficiently a sketch abstracts visual concepts while preserving recognizability. As shown on the left, high SEA scores are assigned to sketches that are simple yet still easy to identify, whereas low SEA scores correspond to sketches that are either ambiguous or overly detailed. To support this evaluation, we introduce CommonSketch, a multimodal dataset of hand-drawn sketches paired with element-level commonsense annotations and fine-grained captions. Together, SEA and CommonSketch provide a foundation for element-aware analysis and evaluation of abstraction efficiency in sketch understanding and generation.

Abstract

A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches.

To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy.

To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark for systematically evaluating element-level sketch understanding across vision-language models.

CommonSketch

CommonSketch is a semantically annotated sketch dataset for element-aware evaluation of sketch abstraction. It contains 23,100 single-object, instance-level human-drawn sketches across 300 object classes and 14 semantic categories. For each class, CommonSketch defines a set of externally visible, drawable commonsense elements, and each sketch is paired with a natural-language caption and binary element-presence annotations.

The dataset was collected under a standardized tablet-based drawing protocol and validated through caption-based label verification. Candidate commonsense elements were extracted with GPT-4o and refined by human annotators to retain only visually observable components. These annotations enable VQA-style evaluation of whether vision-language models recognize the class-defining elements that make sketches identifiable.
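
As a rough illustration of the VQA-style evaluation described above, the following Python sketch compares a model's yes/no answers with the binary element-presence annotations of a single sketch. The question template, the `answer_fn` callable (assumed to wrap a vision-language model already bound to the sketch image), and the zebra elements are hypothetical and are not taken from the CommonSketch release.

```python
from typing import Callable, Dict


def element_question(class_name: str, element: str) -> str:
    # Hypothetical prompt template; the source does not give the exact wording.
    return f"Does this sketch of a {class_name} show {element}? Answer yes or no."


def element_accuracy(
    class_name: str,
    gold: Dict[str, bool],
    answer_fn: Callable[[str], str],
) -> float:
    """Fraction of elements whose predicted presence matches the annotation."""
    if not gold:
        raise ValueError("gold annotations must not be empty")
    correct = 0
    for element, present in gold.items():
        reply = answer_fn(element_question(class_name, element))
        # Any reply starting with "yes" counts as "element present".
        predicted = reply.strip().lower().startswith("yes")
        correct += int(predicted == present)
    return correct / len(gold)


# Stand-in for a real model: it only ever "sees" stripes.
def stub_answer(question: str) -> str:
    return "Yes" if "stripes" in question else "No"


# Illustrative annotations for one zebra sketch (assumed element names).
gold_zebra = {"stripes": True, "mane": True, "tail": False}
accuracy = element_accuracy("zebra", gold_zebra, stub_answer)
```

Passing the model as a callable keeps the scoring logic independent of any particular vision-language model or inference API.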

CommonSketch supports SEA by providing the commonsense visual elements needed for element-aware abstraction evaluation. These annotations make it possible to assess not only whether a sketch is recognizable, but also which class-defining elements are preserved or omitted.

  • 23,100 human-drawn, single-object sketches
  • 300 object classes across 14 semantic categories
  • Natural-language captions for each sketch
  • Class-wise commonsense element inventories
  • Per-sketch binary element-presence annotations
  • Benchmark support for element-level sketch VQA and abstraction evaluation
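
One way to picture a single CommonSketch entry is as a small record that holds the fields listed above. The Python sketch below is only an assumed in-memory layout: the field names, the file path, the category label, and the zebra elements are illustrative and do not describe the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchRecord:
    """Assumed layout for one annotated sketch (field names are hypothetical)."""
    image_path: str
    class_name: str            # one of the 300 object classes
    category: str              # one of the 14 semantic categories
    caption: str               # natural-language caption
    element_presence: Dict[str, bool] = field(default_factory=dict)

    def present_elements(self) -> List[str]:
        # Elements annotated as drawn in this sketch, in annotation order.
        return [name for name, present in self.element_presence.items() if present]

    def omitted_elements(self) -> List[str]:
        # Elements from the class inventory that the sketch leaves out.
        return [name for name, present in self.element_presence.items() if not present]


# Illustrative record; values are made up for the example.
record = SketchRecord(
    image_path="sketches/zebra_0001.png",
    class_name="zebra",
    category="animal",
    caption="A zebra standing in profile with bold stripes.",
    element_presence={"stripes": True, "four legs": True, "mane": False, "tail": True},
)
```

Keeping presence as a per-element boolean map mirrors the binary annotations and makes it easy to list which class-defining elements a sketch preserves or omits.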
Figure 2a. CommonSketch data construction pipeline.
Figure 2b. Average commonsense elements by category.
Figure 2c. Class distribution across 14 categories and an example element set (zebra).

SEA: Sketch Evaluation Metric for Abstraction Efficiency

SEA is a reference-free metric for evaluating how efficiently a sketch conveys its semantic identity. Rather than comparing a sketch to a reference image or relying only on recognition accuracy, SEA measures whether the sketch remains recognizable while using a compact set of class-defining visual elements.

Given a sketch and its target class, SEA combines three signals: the class recognizability P from a zero-shot classifier, the class-specific commonsense element set E extracted by an LLM, and the number of visually grounded elements V detected by a VLM-based visual question answering module. The normalized visual ratio v = V / |E|, where |E| is the number of elements in E, represents how much semantic visual information is expressed in the sketch.
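
The normalized visual ratio can be sketched in a few lines of Python. This is a minimal illustration that assumes V counts the elements of E for which the VQA module answered "yes"; the function name and the zebra elements are hypothetical.

```python
from typing import Dict, Iterable


def visual_ratio(elements: Iterable[str], detected: Dict[str, bool]) -> float:
    """Return v = V / |E| for one sketch.

    elements: the class-specific commonsense element set E.
    detected: per-element VQA decision (True means the element is visible).
    """
    unique = list(dict.fromkeys(elements))  # drop duplicates, keep order
    if not unique:
        raise ValueError("element set E must not be empty")
    # Elements missing from `detected` are treated as not grounded.
    grounded = sum(1 for name in unique if detected.get(name, False))
    return grounded / len(unique)


# Illustrative values: two of four zebra elements are detected.
v = visual_ratio(
    ["stripes", "mane", "tail", "four legs"],
    {"stripes": True, "tail": True, "mane": False},
)
```

Because V can never exceed |E| under this assumption, v always lies in [0, 1].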

SEA computes a latent efficiency signal by balancing a reward and a penalty:

Z = reward(P, v) − penalty(P, v)
SEA = tanh(αZ) ∈ (−1, 1)

The reward increases when a sketch is highly recognizable while using fewer visual elements. The penalty increases when the sketch is either unrecognizable or overly detailed. As a result, SEA assigns high scores to sketches that preserve semantic recognizability with minimal yet sufficient visual detail.
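
The structure of the score, Z = reward(P, v) − penalty(P, v) followed by tanh(αZ), can be illustrated with the Python sketch below. The page does not spell out the reward and penalty terms or the value of α, so `toy_reward`, `toy_penalty`, and the default `alpha=1.0` are placeholders chosen only to match the qualitative behavior described above; they are not the paper's definitions. The sketch also assumes that P and v both lie in [0, 1].

```python
import math
from typing import Callable

Term = Callable[[float, float], float]


def sea_score(p: float, v: float, reward: Term, penalty: Term, alpha: float = 1.0) -> float:
    """Squash the latent signal Z = reward - penalty with tanh(alpha * Z)."""
    if not (0.0 <= p <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("p and v are assumed to lie in [0, 1]")
    z = reward(p, v) - penalty(p, v)
    return math.tanh(alpha * z)


def toy_reward(p: float, v: float) -> float:
    # Placeholder: grows with recognizability and with fewer expressed elements.
    return p * (1.0 - v)


def toy_penalty(p: float, v: float) -> float:
    # Placeholder: grows when the sketch is unrecognizable or overly detailed.
    return (1.0 - p) + p * v


# Three illustrative (P, v) pairs loosely matching the three cases on this page.
unrecognizable = sea_score(0.1, 0.2, toy_reward, toy_penalty)
over_detailed = sea_score(0.9, 1.0, toy_reward, toy_penalty)
efficient = sea_score(0.9, 0.3, toy_reward, toy_penalty)
```

With these placeholder terms, the efficient sketch scores above both the unrecognizable and the over-detailed one, which is the ordering the metric is meant to produce; the actual magnitudes depend on the real reward, penalty, and α.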

SEA computation pipeline. SEA combines class recognizability, commonsense visual elements, and VLM-based element detection to compute abstraction efficiency.

Case 1: Unrecognizable Sketch

A sketch with low class recognizability receives a low SEA score even if it is simple.

Case 2: Over-detailed Sketch

A sketch can be recognizable but still penalized when it expresses too many visual elements.

Case 3: Abstraction-efficient Sketch

SEA rewards sketches that remain recognizable while preserving only a compact set of informative class-defining elements.

SEA distinguishes three outcomes: abstraction failure (unrecognizable sketches), incomplete abstraction (over-detailed sketches), and abstraction-efficient sketching. A sketch with low recognizability receives a low score regardless of simplicity. A highly detailed sketch can also be penalized if it expresses more visual elements than necessary. The highest scores are assigned to sketches that remain recognizable while preserving only the most informative class-defining elements.

Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2026-RS-2020-II201789), and the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2026-RS-2023-00254592) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

This study involved human participants. All procedures were approved by the Institutional Review Board (IRB) of Dongguk University (Approval No. DUIRB-2025-05-08) and were conducted in accordance with institutional ethical guidelines.

BibTeX

@article{park2026sea,
  title={SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering},
  author={Park, Jiho and Choi, Sieun and Seo, Jaeyoon and Sohn, Minho and Kim, Yeana and Kim, Jihie},
  journal={arXiv preprint arXiv:2603.28363},
  year={2026}
}