CoMemo: LVLMs Need
Image Context with Image Memory

1Shanghai Artificial Intelligence Laboratory, 2Tsinghua University, 3The Chinese University of Hong Kong
ICML 2025

Left: Evaluation results of three architectures trained with the same data and model size (2B).
Right: Comparison of three types of LVLM architectures. Method (a) uses an image encoder to align visual features with the LLM's continuous token representation space. Method (b) employs mixin layers with cross-attention to update the LLM's hidden states based on visual features. Method (c) constructs a dual-path structure that enables the model to focus more on visual content during generation.

Large Vision-Language Models (LVLMs) have achieved significant breakthroughs, with the dominant approach aligning visual representations to the text representation space of Large Language Models and generating with the LLM decoder. However, certain design choices inherited from LLMs are suboptimal for multimodal long-context and long-generation tasks. To address this, we introduce CoMemo, a novel architecture featuring a dual-path visual attention mechanism and RoPE-DHR, a positional encoding scheme that retains two-dimensional image information and mitigates the remote decay of positional encoding.

Design Thinking for CoMemo

1. Why Do LVLMs Tend to Get “Lost in the Middle”?

"Lost in the middle" describes the issue where LVLMs lose critical information from the middle of the context as its length increases. Our findings indicate that this phenomenon is associated with the gradient and attention allocation strategies employed during model training.


Left: Heatmap of results for the NIAH (Needle-In-A-Haystack) evaluation on the MileBench benchmark. The depth percentage indicates the position of the target information (the needle) relative to the entire sequence.
Right: Average gradients and attention weights assigned to tokens at the corresponding positions, computed over 1,000 samples.
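
The attention side of this analysis can be reproduced with a few lines of standard tooling. Below is a minimal sketch (not the paper's analysis script; the checkpoint name and sequence length are placeholders) of how the average attention received by each position can be accumulated over a set of samples with a Hugging Face-style causal LM.

```python
# Sketch: average attention received by each token position, over samples and layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-lvlm-checkpoint"  # placeholder, not a released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

def position_attention_profile(texts, max_len=512):
    totals = torch.zeros(max_len)
    counts = torch.zeros(max_len)
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=max_len)
        with torch.no_grad():
            out = model(**inputs)
        # out.attentions: one (batch, heads, query, key) tensor per layer
        attn = torch.stack(out.attentions).mean(dim=(0, 1, 2))  # (query, key)
        received = attn.mean(dim=0)  # average attention each key position receives
        n = received.shape[0]
        totals[:n] += received
        counts[:n] += 1
    return totals / counts.clamp(min=1)
```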

2. Remote Decay in LVLMs with DHR

Dynamic High Resolution (DHR) effectively enhances the comprehension abilities of LVLMs, especially on OCR-related tasks. However, the extended context produced by this technique induces remote decay, degrading performance on long-context tasks.


Remote decay estimation for InternVL2-2B. The relative distance is the difference between absolute position IDs; in RoPE, each token's position ID increments by 1 along the input sequence.
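
To make the decay concrete, here is a toy illustration (an assumption-level example with head dimension 64 and base 10000, not the exact estimator behind the figure): when the query/key content is held constant, the RoPE attention logit reduces to a sum of cosines over the rotary frequencies, whose magnitude shrinks as the relative distance between the two positions grows.

```python
import numpy as np

def rope_logit_envelope(rel_dist, dim=64, base=10000.0):
    """Attention logit for all-ones q/k separated by `rel_dist` positions."""
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # RoPE rotary frequencies theta_i
    return 2.0 * np.cos(rel_dist * freqs).sum() # 2 * sum_i cos(d * theta_i)

for d in [0, 64, 256, 1024, 4096, 16384]:
    print(d, round(rope_logit_envelope(d), 2))
# The logit is maximal at d = 0 and its magnitude generally shrinks (with
# oscillation) as d grows, so tokens far from the current query, e.g. the extra
# tile tokens introduced by DHR, receive systematically smaller attention scores.
```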

3. The Balance Between the Two Pathways

Our findings reveal that the image information fed to the two visual pathways and the number of training steps significantly influence their balance. Inappropriate visual allocation strategies or insufficient training steps can result in suboptimal model performance.


Balancing experiments. “1k”, “2k”, and “4k” denote the number of pretraining steps. All scores are evaluated after fine-tuning the pretrained checkpoint corresponding to the x-axis.

Method

We propose CoMemo, an architecture that employs a dual-path visual mechanism. One path, the context path, integrates visual information with the textual context for contextual reasoning, while the other, the memory path, maintains continuous focus on visual information. Additionally, we introduce RoPE-DHR, a compression-based positional encoding scheme that preserves the two-dimensional structure of images. To balance the dual paths, we propose a three-stage training strategy.
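
The sketch below gives a simplified reading of the dual-path idea (it is not the released implementation; the layer composition, dimensions, and tanh gate are illustrative assumptions). The context path keeps projected image tokens inside the LLM sequence and runs ordinary causal self-attention, while the memory path adds a gated cross-attention over the same image features so visual content stays in focus during long generation.

```python
import torch
import torch.nn as nn

class DualPathDecoderLayer(nn.Module):
    def __init__(self, d_model=2048, n_heads=16):
        super().__init__()
        # Context path: self-attention over [image tokens; text tokens]
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Memory path: cross-attention from hidden states to image features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed, opened by training
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, hidden, image_feats, attn_mask=None):
        # hidden:      (B, seq, d) — text tokens plus inlined image tokens
        # image_feats: (B, n_img, d) — output of the shared encoder/projector
        h = self.norm1(hidden)
        ctx, _ = self.self_attn(h, h, h, attn_mask=attn_mask)   # context path
        img = self.norm2(image_feats)
        mem, _ = self.cross_attn(h, img, img)                   # memory path
        hidden = hidden + ctx + torch.tanh(self.gate) * mem
        return hidden + self.mlp(self.norm3(hidden))
```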


Left: The computation process of RoPE-DHR. Colors are assigned according to the mapping of position IDs in RoPE.
Right: Framework of CoMemo. Both paths share the same encoder and projector.
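
As a schematic illustration of the position-ID compression (an assumption-level sketch, not the reference implementation), one reading of RoPE-DHR is: thumbnail tokens take ordinary sequential position IDs, and every high-resolution tile token reuses the ID of the thumbnail token covering the same spatial region, so the DHR tiles add no extra positional span while the 2D correspondence is preserved.

```python
def rope_dhr_position_ids(start_id, thumb_hw=(16, 16), grid_tiles=(2, 2)):
    """Return (thumbnail_ids, tile_ids) position-ID grids.

    thumb_hw:   thumbnail token grid, e.g. 16x16 tokens
    grid_tiles: how many high-resolution tiles the image is split into (rows, cols)
    """
    th, tw = thumb_hw
    rows, cols = grid_tiles
    # Thumbnail tokens: consecutive position IDs in raster order
    thumb_ids = [[start_id + r * tw + c for c in range(tw)] for r in range(th)]
    # Each tile has the same token grid as the thumbnail; a tile token at (r, c)
    # inside tile (tr, tc) maps back to the thumbnail cell covering that region,
    # and inherits that cell's position ID instead of extending the sequence.
    tile_ids = {}
    for tr in range(rows):
        for tc in range(cols):
            grid = [[thumb_ids[(tr * th + r) // rows][(tc * tw + c) // cols]
                     for c in range(tw)] for r in range(th)]
            tile_ids[(tr, tc)] = grid
    return thumb_ids, tile_ids

# Example: a 2x2 DHR split whose thumbnail starts at position ID 10
thumb, tiles = rope_dhr_position_ids(start_id=10)
```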

Main Results

To comprehensively evaluate CoMemo, we conducted experiments with the same model size (2B parameters) and training data (InternVL2's training data). We selected benchmarks covering the following tasks: Caption, Long-Context, Long-Generation, Multi-Image, Math, General VQA, and OCR.


Table 1: Results on the Generation and Math benchmarks. The highest scores are highlighted in bold.


Table 2: Results on the Multi-Image and Long-Context benchmarks. The highest scores are highlighted in bold. ¹LVLM-X's single-image token compression reduces the average context length by 50% (e.g., 32k→16k).


Table 3: Results on the General VQA and OCR-related benchmarks. The highest scores are highlighted in bold.


BibTeX
