CoMemo: LVLMs Need
Image Context with Image Memory

1Shanghai Artificial Intelligence Laboratory, 2Tsinghua University, 3The Chinese University of Hong Kong
ICML 2025

Left: Evaluation results of three architectures trained with the same data and model size (2B).
Right: Comparison of three types of LVLM architectures. Method (a) uses an image encoder to align visual features with the LLM's continuous token representation space. Method (b) employs mixin layers with cross-attention to update the LLM's hidden states based on visual features. Method (c) constructs a dual-path structure that enables the model to focus more on visual content during generation.

Large Vision-Language Models (LVLMs) have achieved significant breakthroughs, with the dominant approach aligning visual representations to the text representation space of Large Language Models and generating with the LLM decoder. However, certain design choices inherited from LLMs are suboptimal for multimodal long-context and long-generation tasks. To address this, we introduce CoMemo, a novel architecture featuring a dual-path visual attention mechanism and RoPE-DHR, a positional encoding scheme that retains two-dimensional image information and mitigates the remote decay of positional encoding.

Design Thinking for CoMemo

1. Why Do LVLMs Tend to Get “Lost in the Middle”?

"Lost in the middle" describes the issue where LVLMs lose critical information from the middle of the context as its length increases. Our findings indicate that this phenomenon is associated with the gradient and attention allocation strategies employed during model training.


Left: Heatmap of results for the NIAH (Needle-In-A-Haystack) evaluation on the MileBench benchmark. The depth percentage indicates the position of the target information (the needle) relative to the entire sequence.
Right: Average gradients and attention weights assigned to tokens at the corresponding positions, computed over 1,000 samples.
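
The attention side of this analysis can be reproduced with a few lines of standard tooling. Below is a minimal sketch (not the paper's analysis script; the checkpoint name and sequence length are placeholders) of how the average attention received by each position can be accumulated over a set of samples with a Hugging Face-style causal LM.

```python
# Sketch: average attention received by each token position, over samples and layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-lvlm-checkpoint"  # placeholder, not a released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

def position_attention_profile(texts, max_len=512):
    totals = torch.zeros(max_len)
    counts = torch.zeros(max_len)
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=max_len)
        with torch.no_grad():
            out = model(**inputs)
        # out.attentions: one (batch, heads, query, key) tensor per layer
        attn = torch.stack(out.attentions).mean(dim=(0, 1, 2))  # (query, key)
        received = attn.mean(dim=0)  # average attention each key position receives
        n = received.shape[0]
        totals[:n] += received
        counts[:n] += 1
    return totals / counts.clamp(min=1)
```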

2. Remote Decay in LVLMs with DHR

Dynamic High Resolution (DHR) effectively enhances the comprehension abilities of LVLMs, especially on OCR-related tasks. However, the extended context produced by this technique induces remote decay, degrading performance on long-context tasks.


Remote decay estimation for InternVL2-2B. The relative distance is the difference between absolute position IDs; in RoPE, each token's position ID increments by 1 along the input sequence.
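
To make the decay concrete, here is a toy illustration (an assumption-level example with head dimension 64 and base 10000, not the exact estimator behind the figure): when the query/key content is held constant, the RoPE attention logit reduces to a sum of cosines over the rotary frequencies, whose magnitude shrinks as the relative distance between the two positions grows.

```python
import numpy as np

def rope_logit_envelope(rel_dist, dim=64, base=10000.0):
    """Attention logit for all-ones q/k separated by `rel_dist` positions."""
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # RoPE rotary frequencies theta_i
    return 2.0 * np.cos(rel_dist * freqs).sum() # 2 * sum_i cos(d * theta_i)

for d in [0, 64, 256, 1024, 4096, 16384]:
    print(d, round(rope_logit_envelope(d), 2))
# The logit is maximal at d = 0 and its magnitude generally shrinks (with
# oscillation) as d grows, so tokens far from the current query, e.g. the extra
# tile tokens introduced by DHR, receive systematically smaller attention scores.
```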

3. The Balance Between the Two Pathways

Our findings reveal that the image information fed to the two visual pathways and the number of training steps significantly influence their balance. Inappropriate visual allocation strategies or insufficient training steps can result in suboptimal model performance.


Balancing experiments. “1k”, “2k”, and “4k” denote the number of pretraining steps. All scores are evaluated after fine-tuning the pretrained checkpoint corresponding to the x-axis.

Method

We propose CoMemo, an architecture that employs a dual-path visual mechanism. One path, the context path, integrates visual information with the textual context for contextual reasoning, while the other, the memory path, maintains continuous focus on visual information. Additionally, we introduce RoPE-DHR, a compression-based positional encoding scheme that preserves the two-dimensional structure of images. To balance the dual paths, we propose a three-stage training strategy.
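
The sketch below gives a simplified reading of the dual-path idea (it is not the released implementation; the layer composition, dimensions, and tanh gate are illustrative assumptions). The context path keeps projected image tokens inside the LLM sequence and runs ordinary causal self-attention, while the memory path adds a gated cross-attention over the same image features so visual content stays in focus during long generation.

```python
import torch
import torch.nn as nn

class DualPathDecoderLayer(nn.Module):
    def __init__(self, d_model=2048, n_heads=16):
        super().__init__()
        # Context path: self-attention over [image tokens; text tokens]
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Memory path: cross-attention from hidden states to image features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed, opened by training
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, hidden, image_feats, attn_mask=None):
        # hidden:      (B, seq, d) — text tokens plus inlined image tokens
        # image_feats: (B, n_img, d) — output of the shared encoder/projector
        h = self.norm1(hidden)
        ctx, _ = self.self_attn(h, h, h, attn_mask=attn_mask)   # context path
        img = self.norm2(image_feats)
        mem, _ = self.cross_attn(h, img, img)                   # memory path
        hidden = hidden + ctx + torch.tanh(self.gate) * mem
        return hidden + self.mlp(self.norm3(hidden))
```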


Left: The computation process of RoPE-DHR. Colors are assigned according to the mapping of position IDs in RoPE.
Right: Framework of CoMemo. Both paths share the same encoder and projector.
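
As a schematic illustration of the position-ID compression (an assumption-level sketch, not the reference implementation), one reading of RoPE-DHR is: thumbnail tokens take ordinary sequential position IDs, and every high-resolution tile token reuses the ID of the thumbnail token covering the same spatial region, so the DHR tiles add no extra positional span while the 2D correspondence is preserved.

```python
def rope_dhr_position_ids(start_id, thumb_hw=(16, 16), grid_tiles=(2, 2)):
    """Return (thumbnail_ids, tile_ids) position-ID grids.

    thumb_hw:   thumbnail token grid, e.g. 16x16 tokens
    grid_tiles: how many high-resolution tiles the image is split into (rows, cols)
    """
    th, tw = thumb_hw
    rows, cols = grid_tiles
    # Thumbnail tokens: consecutive position IDs in raster order
    thumb_ids = [[start_id + r * tw + c for c in range(tw)] for r in range(th)]
    # Each tile has the same token grid as the thumbnail; a tile token at (r, c)
    # inside tile (tr, tc) maps back to the thumbnail cell covering that region,
    # and inherits that cell's position ID instead of extending the sequence.
    tile_ids = {}
    for tr in range(rows):
        for tc in range(cols):
            grid = [[thumb_ids[(tr * th + r) // rows][(tc * tw + c) // cols]
                     for c in range(tw)] for r in range(th)]
            tile_ids[(tr, tc)] = grid
    return thumb_ids, tile_ids

# Example: a 2x2 DHR split whose thumbnail starts at position ID 10
thumb, tiles = rope_dhr_position_ids(start_id=10)
```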

Main Results

To comprehensively evaluate CoMemo, we conducted experiments with the same model size (2B parameters) and training data (InternVL2's training data). We selected benchmarks covering the following tasks: Caption, Long-Context, Long-Generation, Multi-Image, Math, General VQA, and OCR.


Table 1: Results on the Generation and Math benchmarks. The highest scores are highlighted in bold.


Table 2: Results on the Multi-Image and Long-Context benchmarks. The highest scores are highlighted in bold. ¹LVLM-X's single-image token compression reduces the average context length by 50% (e.g., 32k→16k).


Table 3: Results on the General VQA and OCR-related benchmarks. The highest scores are highlighted in bold.


BibTeX
