Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Yi Zhang1,3, Bolin Ni3, Xin-Sheng Chen2, Heng-Rui Zhang2, Yongming Rao3, Houwen Peng3*, Qinglin Lu3, Han Hu3, Meng-Hao Guo2†, Shi-Min Hu2
1Beihang University, 2Tsinghua University, 3Tencent Hunyuan Team
*Project lead. †Corresponding author.

πŸ“„ Abstract

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities.

Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, our data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model, on Honey-Data-15M.

Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

βš™οΈ The HoneyPipe Pipeline

To address the challenges of data noise and the reasoning gap in open-source datasets, we developed HoneyPipe, an automated and reproducible workflow built on our DataStudio framework. It systematically transforms a vast, raw data pool into a high-quality, dual-level Chain-of-Thought (CoT) dataset suitable for supervised fine-tuning (SFT).

Figure 1: The HoneyPipe data curation pipeline with five key stages: data aggregation, noise filtering, short CoT enrichment, long CoT enrichment, and fidelity verification.

The pipeline consists of five key stages (a schematic code sketch follows the list):

  • Data Aggregation and Deduplication: We start by assembling ~24 million image-text pairs from diverse sources and perform rigorous deduplication to maximize data diversity and processing efficiency.
  • Noise and Irrelevance Filtering: This stage uses both rule-based and model-based operators to purge noisy data, removing samples with formatting issues, low-quality images, or image-instruction mismatches.
  • Short CoT Enrichment: For instructions requiring moderate reasoning, we use powerful MLLMs (Qwen2.5-VL-72B/32B) to generate explicit, step-by-step explanations, creating a corpus of ~12.2 million short CoT samples.
  • Long CoT Enrichment Loop: For the most complex instructions, we leverage top proprietary MLLMs to generate detailed, multi-step solutions, yielding a high-quality set of ~2.7 million long CoT pairs.
  • Fidelity Verification: Throughout the enrichment process, a verifier model (LLM-as-a-Judge) performs semantic comparisons to ensure the correctness and consistency of the generated CoT responses.
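To make the staged flow above concrete, here is a minimal sketch of how such a curation pipeline can be composed. This is an illustrative reimplementation under assumed interfaces, not the released DataStudio API; every name in it (`Sample`, the stage functions, `run_pipeline`) is hypothetical.

```python
# Minimal sketch of a HoneyPipe-style staged curation flow.
# All names below (Sample, the stage functions, run_pipeline) are
# hypothetical illustrations; they are NOT the DataStudio API.
from dataclasses import dataclass, field


@dataclass
class Sample:
    image_path: str
    instruction: str
    response: str
    meta: dict = field(default_factory=dict)


def deduplicate(batch):
    """Stage 1 (in part): exact-match dedup on (image, instruction)."""
    seen, out = set(), []
    for s in batch:
        key = (s.image_path, s.instruction)
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out


def rule_filter(batch):
    """Stage 2: rule-based cleaning; a real pipeline adds model-based
    checks for low-quality images and image-instruction mismatches."""
    return [s for s in batch if s.instruction.strip() and s.response.strip()]


def enrich_cot(batch, generate, level="short"):
    """Stages 3-4: rewrite answers into step-by-step explanations with an
    MLLM; `generate` stands in for a call to e.g. Qwen2.5-VL."""
    for s in batch:
        s.response = generate(s.image_path, s.instruction)
        s.meta["cot"] = level
    return batch


def verify(batch, judge):
    """Stage 5: LLM-as-a-Judge keeps only responses whose final answer is
    semantically consistent with the original ground truth."""
    return [s for s in batch if judge(s)]


def run_pipeline(batch, stages):
    for stage in stages:
        batch = stage(batch)
    return batch
```

In the real pipeline each stage is an operator over a distributed data pool, and the enrichment and verification stages call large models; the sketch only fixes the control flow: filter first, enrich next, and keep a sample only if the verifier accepts its rewritten response.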

🍯 Honey-Data-15M

The primary output of our pipeline is Honey-Data-15M, a large-scale, multimodal SFT dataset with 15 million meticulously curated samples. It is designed to serve as a new cornerstone for the fully open MLLM community. A defining feature is its enrichment with dual-level CoT reasoningβ€”approximately 12.2 million short CoT samples and 2.7 million long CoT samplesβ€”which provides tailored reasoning depth across a wide spectrum of critical domains like "General" visual understanding and "STEM" for symbolic reasoning.
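Assuming the corpus can be exported to JSONL with a field marking each sample's CoT level (the file name and the `cot_level` field below are assumptions; consult the released dataset card for the actual schema), a quick way to inspect the dual-level split is:

```python
# Sketch: tally short vs. long CoT samples in a JSONL export of the corpus.
# "honey_data_15m.jsonl" and the "cot_level" field are assumed names, not
# the published schema.
import json
from collections import Counter

counts = Counter()
with open("honey_data_15m.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        counts[record.get("cot_level", "unknown")] += 1  # e.g. "short" / "long"

total = sum(counts.values())
for level, n in counts.most_common():
    print(f"{level}: {n:,} samples ({n / total:.1%})")
```

On the full corpus this should report roughly 12.2 million short and 2.7 million long CoT samples, matching the composition described above.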

Figure 2: Category distribution of the Honey-Data-15M dataset.

Figure 3: Data collection of Honey-Data-15M: a detailed breakdown of the dataset's composition across seven major categories. The number of samples (in thousands) is listed for each source; an asterisk (*) marks sources containing long CoT responses.

🐝 Overall Results for Bee-8B

To validate Honey-Data-15M, we trained Bee-8B, an 8B-parameter model built on Qwen3-8B, on the full dataset. Bee-8B establishes a new performance bar for fully open models, particularly in factual accuracy and complex reasoning, and proves highly competitive with recent semi-open models. These results confirm our core thesis: high-quality data curation is critical for creating models that can rival leading semi-open counterparts.

| Task | Benchmark | LLaVA-OneVision-7B* | Molmo-7B-D* | Qwen2.5-VL-7B† | Keye-VL-8B† | InternVL3.5-8B† | Bee-8B-SFT* | Bee-8B-RL* |
|---|---|---|---|---|---|---|---|---|
| General VQA | AI2D | 81.4 | 81.0 | 84.3 | **86.7** | 84.0 | 83.8 | _85.3_ |
| | BLINK (val) | 48.2 | 49.7 | _56.4_ | 52.0 | **59.5** | 52.5 | 55.0 |
| | CountBench | — | 84.8 | 74.1 | 78.0 | — | _90.5_ | **93.0** |
| | HallusionBench (avg) | 31.6 | 46.4 | 52.9 | **67.0** | 54.5 | _59.8_ | 58.2 |
| | MMBench-CN (dev) | — | — | 81.3 | **92.0** | — | 81.2 | _84.2_ |
| | MMBench-EN (dev) | 80.8 | — | 82.1 | **91.5** | — | 83.0 | _85.5_ |
| | MMMU (val) | 48.8 | 45.3 | 58.6 | _71.4_ | **73.4** | 66.8 | 66.1 |
| | MMMU-Pro (standard) | 29.5 | — | 34.7 | 47.1 | — | _50.4_ | **50.7** |
| | MMStar | 61.7 | 56.1 | 63.9 | **75.5** | 69.3 | 69.0 | _71.4_ |
| | MMT-Bench (val) | 59.3 | 56.3 | 63.6 | 65.9 | _66.7_ | 64.6 | **67.0** |
| | MMVet | 57.5 | 41.5 | 67.1 | 79.0 | 83.1 | _83.3_ | **83.9** |
| | MMVP | — | — | 73.3 | 79.0 | — | _80.7_ | **82.0** |
| | POPE (avg) | 88.4 | **89.0** | 86.4 | 86.0 | _88.7_ | 84.0 | 84.8 |
| | RealWorldQA | 66.3 | _70.7_ | 68.5 | 67.7 | 67.5 | 70.1 | **73.1** |
| | VisuLogic | — | — | 20.0 | _25.6_ | — | 24.4 | **26.5** |
| | VLMs are Blind | 39.2 | — | 37.4 | **57.1** | — | 55.8 | _56.5_ |
| Table & Chart & OCR | CharXiv (DQ) | — | — | 73.9 | 77.7 | 72.2 | _84.7_ | **84.8** |
| | CharXiv (RQ) | — | — | 42.5 | 45.4 | 44.4 | _55.3_ | **57.3** |
| | ChartQA (test) | 80.0 | 84.1 | **87.3** | 86.3 | _86.7_ | _86.7_ | 86.1 |
| | DocVQA (val) | — | — | **95.5** | _88.5_ | — | 87.2 | 87.0 |
| | InfoVQA (val) | — | — | **81.4** | 67.4 | — | 72.3 | _72.9_ |
| | OCRBench | 62.2 | 65.6 | **86.4** | _85.1_ | 84.0 | 83.1 | 82.5 |
| | SEED-Bench2-Plus | 65.4 | 67.6 | _70.4_ | 69.4 | **70.8** | 67.7 | 68.5 |
| Math & Reasoning | DynaMath (worst) | 9.0 | — | 21.0 | 37.3 | 37.7 | **41.3** | _40.5_ |
| | LogicVista | 33.3 | — | 44.1 | 54.8 | _57.3_ | 56.8 | **61.3** |
| | MathVerse (vision-only) | 26.2 | 4.2 | 25.1 | 59.8 | 61.5 | _61.9_ | **67.0** |
| | MathVision | 18.3 | 16.2 | 25.4 | 46.0 | **56.8** | 46.8 | _50.0_ |
| | MathVista (mini) | 63.2 | 51.6 | 68.2 | _80.7_ | 78.4 | 78.6 | **81.4** |
| | WeMath | 20.9 | — | 35.2 | **60.7** | 57.0 | 55.0 | _59.8_ |

Table 1: Performance comparison of Bee-8B with other fully open (*) and semi-open (†) models across various benchmarks. The best score for each benchmark is shown in bold, the second-best in italics.
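For readers who want to try the model, the sketch below shows one plausible way to load Bee-8B for image QA with Hugging Face `transformers`. The repo id `Open-Bee/Bee-8B-RL` is a placeholder, and the exact processor interface depends on the released model code; treat this as a template, not the official usage.

```python
# Sketch: loading Bee-8B for image QA with Hugging Face transformers.
# The repo id is a placeholder; custom MLLMs typically require
# trust_remote_code=True and may expose a slightly different interface.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Open-Bee/Bee-8B-RL"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the chart and reason step by step."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```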

BibTeX

@misc{zhang2025beehighqualitycorpusfullstack,
      title={Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs},
      author={Yi Zhang and Bolin Ni and Xin-Sheng Chen and Heng-Rui Zhang and Yongming Rao and Houwen Peng and Qinglin Lu and Han Hu and Meng-Hao Guo and Shi-Min Hu},
      year={2025},
      eprint={2510.13795},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13795},
}