
Tar is a unified multimodal large language model (LLM) developed by ByteDance, designed to integrate visual understanding and generation within a shared discrete semantic framework. Using the Text-Aligned Tokenizer (TA-Tok), Tar converts images into discrete tokens aligned with an LLM's vocabulary, enabling efficient cross-modal processing without modality-specific adaptations.

Key Features and Functionality:

- Text-Aligned Tokenizer (TA-Tok): transforms images into discrete tokens using a codebook derived from an LLM's vocabulary, giving text and visual data a unified representation.
- Unified Multimodal Processing: handles cross-modal input and output through a shared interface, removing the need for separate designs per modality.
- Scale-Adaptive Encoding and Decoding: balances computational cost against visual detail, producing high-quality visual outputs without excessive resource consumption.
- Generative De-Tokenizer: employs both autoregressive and diffusion-based models to decode visual tokens back into high-fidelity images.
- Advanced Pre-Training Tasks: strengthen modality fusion, improving performance on both visual understanding and generation tasks.

Primary Value and User Solutions:

Tar addresses the challenge of integrating visual and textual data by providing a unified framework that simplifies cross-modal tasks. This integration yields faster convergence and greater training efficiency, benefiting applications that must process text and images together. By eliminating modality-specific designs, Tar streamlines development and improves the performance of multimodal applications.
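The core TA-Tok idea, mapping continuous image features onto a discrete codebook tied to an LLM's vocabulary, can be sketched as nearest-neighbor vector quantization. Everything below is illustrative: the sizes, the random stand-in codebook, and the `text_aligned_tokenize` helper are assumptions for the sketch, not Tar's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real TA-Tok dimensions are not stated in the text above.
vocab_size = 512    # LLM vocabulary entries reused as visual codes
embed_dim = 64      # shared embedding dimension
num_patches = 16    # patch features produced by a vision encoder

# Codebook derived from an LLM's token embeddings (random stand-in here).
codebook = rng.normal(size=(vocab_size, embed_dim))

# Continuous patch features, as a vision encoder would emit them.
patch_features = rng.normal(size=(num_patches, embed_dim))

def text_aligned_tokenize(features, codebook):
    """Assign each patch feature the id of its nearest codebook entry."""
    # Squared Euclidean distance between every feature and every code,
    # computed via broadcasting: (num_patches, vocab_size).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)  # one discrete token id per patch

tokens = text_aligned_tokenize(patch_features, codebook)
print(tokens.shape)  # one token id per image patch
```

Because each image becomes a short sequence of ids drawn from the same vocabulary space as text, the LLM can consume and emit them with its ordinary next-token machinery; the generative de-tokenizer then maps the ids back to pixels.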