Here’s what caught our eye last week.
Instruction Hierarchy
- This paper proposes giving different priority levels to instructions from different sources. For example, the system-level prompt has higher priority than the user prompt, which has higher priority than tool outputs. Training a model to follow this type of hierarchy results in better resilience to jailbreaking attacks.
- Paper
Multi-head MoE
- Splits token embeddings into sub-token embeddings, which get routed to different experts, and merged afterwards.
- Paper
Retrieval Heads
- This paper identifies attention heads that activate when retrieving data from long contexts. Deactivating these “retrieval heads” increases hallucinations much more than deactivating other heads. They observe this behavior across many LLMs.
- Paper
Better Synthetic Data
- Obtain better synthetic data by transforming existing datasets into a format tailored for a specific task.
- Paper
CatLIP
- Faster pre-training of CLIP-level vision models, by training just a vision model with classification loss, instead of a vision model + text model using contrastive loss. A classification loss is enabled by extracting nouns from each image caption and mapping them to WordNet synsets.
- Paper
SpaceByte
- Use specific transformer blocks only for specific input tokens (like spaces), to improve the speed of byte-level-token models.
- Paper
LayerSkip
- Faster LLMs with a combination of layer dropout and self-speculation decoding (early-exit inference + verification using all layers).
- Paper
OpenELM
- New fully open-source LLM (dataset, code, and weights publicly available) that outperforms other fully open-source LLMs.
- Paper
Phi-3
- New open-source model. Phi-3 mini (3.8 billion parameters) rivals Mixtral 8x7B
- Paper
Snowflake Arctic
Stay up to date
Interested in future weekly updates? Stay up to date by joining our Slack Community!