A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite its expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy do not capture abstraction, the defining property of sketches.
To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score that reflects semantic retention under visual economy.
To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark for systematically evaluating element-level sketch understanding across vision-language models.
CommonSketch is a semantically annotated sketch dataset for element-aware evaluation of sketch abstraction. It contains 23,100 single-object, instance-level human-drawn sketches across 300 object classes and 14 semantic categories. For each class, CommonSketch defines a set of externally visible, drawable commonsense elements, and each sketch is paired with a natural-language caption and binary element-presence annotations.
The dataset was collected under a standardized tablet-based drawing protocol and validated through caption-based label verification. Candidate commonsense elements were extracted with GPT-4o and refined by human annotators to retain only visually observable components. These annotations enable VQA-style evaluation of whether vision-language models recognize the class-defining elements that make sketches identifiable.
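As a rough illustration of how the binary element-presence annotations could drive a VQA-style check, the Python sketch below scores a model's yes/no answers against one annotated sketch. The record layout, field names, the `element_accuracy` helper, and the bicycle example are all hypothetical; they are not the released CommonSketch schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class SketchRecord:
    """Hypothetical layout for one annotated sketch (field names are assumptions)."""
    class_name: str
    caption: str
    # Maps each class-level element to True (drawn) or False (omitted).
    element_present: dict[str, bool] = field(default_factory=dict)


def element_accuracy(record: SketchRecord, predicted: dict[str, bool]) -> float:
    """Fraction of annotated elements where the model's yes/no answer matches the label.

    An element the model did not answer is treated as a "no".
    """
    if not record.element_present:
        raise ValueError("record has no element annotations")
    correct = sum(
        predicted.get(name, False) == label
        for name, label in record.element_present.items()
    )
    return correct / len(record.element_present)


# Made-up example values, for illustration only.
record = SketchRecord(
    class_name="bicycle",
    caption="a bicycle with two wheels and a frame",
    element_present={"wheels": True, "frame": True, "handlebar": False, "pedals": False},
)
vlm_answers = {"wheels": True, "frame": True, "handlebar": True, "pedals": False}
accuracy = element_accuracy(record, vlm_answers)
print(f"element-level accuracy: {accuracy:.2f}")
```

Here the model wrongly reports a handlebar, so it matches three of the four annotated elements.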
CommonSketch supports SEA by providing the commonsense visual elements needed for element-aware abstraction evaluation. These annotations make it possible to assess not only whether a sketch is recognizable, but also which class-defining elements are preserved or omitted.
SEA is a reference-free metric for evaluating how efficiently a sketch conveys its semantic identity. Rather than comparing a sketch to a reference image or relying only on recognition accuracy, SEA measures whether the sketch remains recognizable while using a compact set of class-defining visual elements.
Given a sketch and its target class, SEA combines three signals: the class recognizability P from a zero-shot classifier, the class-specific commonsense element set E extracted by an LLM, and the number of visually grounded elements V detected by a VLM-based visual question answering module. The normalized visual ratio v = V / |E|, where |E| is the number of elements in E, represents how much semantic visual information is expressed in the sketch.
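The normalized visual ratio can be sketched in a few lines of Python: count the elements of E that the VQA module reports as present, then divide by the size of E. The `visual_ratio` function, the way answers are stored, and the example elements are assumptions for illustration; the recognizability P and the final SEA score are not computed here.

```python
def visual_ratio(element_set: set[str], vqa_answers: dict[str, bool]) -> float:
    """Return v = V / |E|, where V counts elements the VQA module answered "yes" to.

    Elements missing from vqa_answers are counted as not present.
    """
    if not element_set:
        raise ValueError("element set E must not be empty")
    grounded = sum(1 for element in element_set if vqa_answers.get(element, False))
    return grounded / len(element_set)


# Hypothetical element set and VQA answers for a single sketch.
elements = {"wheels", "frame", "handlebar", "pedals", "seat"}
answers = {"wheels": True, "frame": True, "handlebar": False, "pedals": False, "seat": False}
v = visual_ratio(elements, answers)
print(f"v = {v:.1f}")
```

With two of five elements detected, v is 0.4. A lower v means the sketch expresses fewer of its class-defining elements.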
SEA computes a latent efficiency signal by balancing a reward and a penalty:
The reward increases when a sketch is highly recognizable while using fewer visual elements. The penalty increases when the sketch is either unrecognizable or overly detailed. As a result, SEA assigns high scores to sketches that preserve semantic recognizability with minimal yet sufficient visual detail.
A sketch with low class recognizability receives a low SEA score even if it is simple.
A sketch can be recognizable but still penalized when it expresses too many visual elements.
SEA rewards sketches that remain recognizable while preserving only a compact set of informative class-defining elements.
SEA distinguishes abstraction failure, incomplete abstraction, and abstraction-efficient sketching. A sketch with low recognizability receives a low score regardless of simplicity. A highly detailed sketch can also be penalized if it expresses more visual elements than necessary. The highest scores are assigned to sketches that remain recognizable while preserving only the most informative class-defining elements.
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2026-RS-2020-II201789), and the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2026-RS-2023-00254592) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
This study involved human participants. All procedures were approved by the Institutional Review Board (IRB) of Dongguk University (Approval No. DUIRB-2025-05-08) and were conducted in accordance with institutional ethical guidelines.
@article{park2026sea,
title={SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering},
author={Park, Jiho and Choi, Sieun and Seo, Jaeyoon and Sohn, Minho and Kim, Yeana and Kim, Jihie},
journal={arXiv preprint arXiv:2603.28363},
year={2026}
}