"Lost in the middle" describes the issue where LVLMs lose critical information from the middle of the context as its length increases. Our findings indicate that this phenomenon is associated with the gradient and attention allocation strategies employed during model training.
Dynamic High Resolution (DHR) effectively enhances the comprehension abilities of LVLMs, especially on OCR-related tasks. However, the additional context tokens introduced by this technique can induce remote decay, leading to suboptimal performance on long-context tasks.
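To make this concrete, the sketch below estimates how DHR tiling inflates the visual token count. The 448-pixel tile size, 256 tokens per tile, and tile cap are illustrative assumptions in the spirit of common DHR implementations, not the exact configuration used in our experiments.

```python
import math

def dhr_token_count(width: int, height: int,
                    tile_size: int = 448,        # assumed tile resolution
                    tokens_per_tile: int = 256,  # assumed tokens per encoded tile
                    max_tiles: int = 12) -> int:
    """Rough estimate of visual tokens under Dynamic High Resolution tiling.

    The image is split into a grid of tiles (plus one global thumbnail), and
    each tile is encoded independently, so the visual token count grows
    roughly linearly with image area.
    """
    cols = max(1, math.ceil(width / tile_size))
    rows = max(1, math.ceil(height / tile_size))
    num_tiles = min(cols * rows, max_tiles)
    # +1 accounts for the global thumbnail view used alongside the tiles.
    return (num_tiles + 1) * tokens_per_tile

# A 1344x896 document image yields 7x the visual tokens of a single 448px view,
# which is what stretches the LLM context and triggers remote decay.
print(dhr_token_count(448, 448))    # 512  (1 tile + thumbnail)
print(dhr_token_count(1344, 896))   # 1792 (6 tiles + thumbnail)
```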
Our findings reveal that the allocation of image information between the two visual pathways and the number of training steps significantly influence the balance between the pathways. Inappropriate visual allocation strategies or insufficient training steps can result in suboptimal model performance.
We propose CoMemo, an architecture that employs a dual-path visual mechanism. One path, the context path, integrates visual information into the textual context for contextual reasoning, while the other, the memory path, maintains continuous attention to the visual information. Additionally, we introduce RoPE-DHR, a compression-based positional encoding scheme that preserves the two-dimensional structure of images. To balance the dual paths, we propose a three-stage training strategy.
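As a rough illustration of the dual-path idea, the sketch below shows one decoder block in which visual tokens participate in ordinary causal self-attention alongside text (context path), while a gated cross-attention re-attends to the visual features at every layer (memory path). The layer layout, dimensions, and tanh gate are assumptions chosen for clarity, not CoMemo's exact implementation.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """One decoder block with a context path and a memory path (illustrative)."""

    def __init__(self, d_model: int = 2048, n_heads: int = 16):
        super().__init__()
        # Context path: standard causal self-attention over the [visual; text] sequence.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Memory path: cross-attention from hidden states to the visual features,
        # gated so the new path can be introduced gradually during training.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed; learned during training
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, hidden, visual_memory, causal_mask=None):
        # Context path: visual tokens already live inside `hidden`, interleaved with text.
        h = self.norm1(hidden)
        ctx, _ = self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        hidden = hidden + ctx
        # Memory path: re-attend to the raw visual features regardless of where the
        # image sits in the sequence, counteracting the "lost in the middle" effect.
        m = self.norm2(hidden)
        mem, _ = self.cross_attn(m, visual_memory, visual_memory, need_weights=False)
        hidden = hidden + torch.tanh(self.gate) * mem
        return hidden + self.mlp(self.norm3(hidden))

# Toy usage: 1 sample, 32 sequence tokens (visual + text), 256 visual memory tokens.
block = DualPathBlock()
out = block(torch.randn(1, 32, 2048), torch.randn(1, 256, 2048))
print(out.shape)  # torch.Size([1, 32, 2048])
```

The zero-initialized gate keeps the memory path inert at the start of training, which is one way a staged training schedule can introduce the second path without destabilizing the pretrained context path.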
To comprehensively evaluate CoMemo, we conducted experiments under identical model size (2B parameters) and training data (InternVL2's training data) settings. We selected benchmarks covering the following tasks: Caption, Long-Context, Long Generation, Multi-Image, Math, General VQA, and OCR.