Here’s what caught our eye over the past two weeks.
New Models
Sora
- OpenAI’s text-to-video model, capable of generating high-fidelity videos up to a minute long.
Stable Diffusion 3
- The latest iteration of Stable Diffusion text-to-image models, ranging in size from 800M to 8B parameters.
- The models use a diffusion transformer architecture and accept multimodal input.
- Announcement post.
Mistral Large
- Mistral’s new flagship model outperforms competing models (except GPT-4) on multiple benchmarks.
- Has a 32k-token context window and strong multilingual capabilities.
- Announcement post.
Nemotron-4 15B
- A 15B parameter model by Nvidia, trained on 8 trillion tokens.
- Trained on 384 DGX H100 nodes, where each node contains 8 H100 80 GB GPUs.
- Paper.
Gemma
- 2B and 7B open source models by Google that outperform similarly sized models on 11 out of 18 tasks.
- Trained on 2 trillion and 6 trillion tokens respectively; both use a 256k vocabulary.
- Announcement post.
Gemini 1.5
- Multimodal model by Google. The publicly available version has a 1 million token context length.
- Has been criticized for the way it responds to certain requests, notably its image generation of people.
- Twitter summary.
Large World Model
- Video and language model with a 1 million token context length. Uses an optimized version of RingAttention.
- Project page.
SDXL-Lightning
- State-of-the-art 1-step open-source diffusion model for text-to-image generation.
- Paper.
LoRA Land
- 25 Mistral models finetuned on different tasks.
- Project page.
PALO
- Multimodal model covering 10 languages. Model and code to be released soon.
- Paper.
MobileLLM
- 125M and 350M parameter LLMs with state-of-the-art performance among similarly sized models.
- Paper.
New Datasets
A Touch, Vision, and Language Dataset for Multimodal Alignment
Aria Everyday Activities Dataset
- Egocentric multimodal dataset. Data includes 3D point clouds, trajectories, and speech transcriptions.
- Project page.
Cosmopedia
- A synthetic dataset of textbooks, blog posts, and stories generated by Mixtral-8x7B-Instruct, totaling over 25 billion tokens.
New Research
Universal Manipulation Interface
- Trains robot policies from demonstrations collected with hand-held grippers operated by humans.
- Project page.
LoRA+
- Use different learning rates for LoRA’s \(A\) and \(B\) matrices for better performance and faster convergence.
- Paper and GitHub repo.
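The change amounts to putting the \(A\) and \(B\) matrices in separate optimizer parameter groups. A minimal pure-Python sketch; the `lora_A`/`lora_B` naming and the 16x ratio are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch of the LoRA+ idea: LoRA learns a low-rank update W + B @ A, and
# LoRA+ gives the B matrices a learning rate a fixed ratio larger than A's.
# The parameter names and the 16x default ratio are illustrative assumptions.

def make_param_groups(lora_params, base_lr, lr_ratio=16.0):
    """Split LoRA parameters into groups with different learning rates."""
    groups = []
    for name, param in lora_params.items():
        lr = base_lr * lr_ratio if name.endswith("lora_B") else base_lr
        groups.append({"name": name, "param": param, "lr": lr})
    return groups

# Toy parameters standing in for real LoRA matrices.
params = {"layer0.lora_A": [[0.1]], "layer0.lora_B": [[0.0]]}
groups = make_param_groups(params, base_lr=1e-4)
```

In a real training framework the same structure would be handed to the optimizer as per-parameter groups, one group per learning rate.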
Genie
- A model that can generate interactive visual environments (specifically platformer games), trained entirely on videos.
- Paper.
Neural Network Diffusion
- Diffusion models that generate the parameters of other neural networks.
- Paper.
LongRoPE
- Extends the context window of pretrained LLMs to over 2 million tokens, while fine-tuning on sequences no longer than 256k tokens.
- Paper.
Chain-of-Thought Reasoning Without Prompting
- Introduces “chain-of-thought decoding”: instead of only greedy decoding, branch on the top-k first tokens, decode each path, and select the final answer from the most confident decoded path.
- Paper.
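The selection step can be sketched without a real model. Everything below (the candidate paths and their per-token probability pairs) is made-up toy data; confidence is measured as the top-1 vs. top-2 probability margin averaged over a path's answer tokens:

```python
# Sketch of chain-of-thought decoding with toy data: branch on the top-k
# first tokens, decode each path, then pick the answer from the path whose
# answer tokens have the largest average top-1 vs. top-2 probability margin.

def confidence(token_probs):
    # average margin between the top-1 and top-2 token probabilities
    return sum(p1 - p2 for p1, p2 in token_probs) / len(token_probs)

def cot_decode(paths):
    """paths: list of (answer, [(p_top1, p_top2), ...]) decoded candidates."""
    return max(paths, key=lambda path: confidence(path[1]))[0]

paths = [
    ("5", [(0.50, 0.45)]),                # greedy path: low confidence
    ("8", [(0.90, 0.05), (0.80, 0.10)]),  # branched CoT path: high confidence
]
answer = cot_decode(paths)  # selects "8"
```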
Repetition Improves Language Model Embeddings
- Obtains higher-quality embeddings by including the input twice in the same sequence and pooling the embeddings from the second occurrence.
- Paper.
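The pooling trick can be sketched with a stand-in encoder; the `encode` callable and the mean pooling below are illustrative assumptions, not the paper's exact prompt setup:

```python
# Sketch of "echo" embeddings with a stand-in encoder: feed the input twice
# in one sequence and mean-pool only the second occurrence, whose tokens can
# attend back to the full first copy.

def echo_embed(encode, tokens):
    """encode: token list -> one vector per token (stand-in for a real model)."""
    vecs = encode(tokens + tokens)       # input repeated within one sequence
    second = vecs[len(tokens):]          # keep only the second occurrence
    dim = len(second[0])
    return [sum(v[i] for v in second) / len(second) for i in range(dim)]

# Toy encoder whose "embedding" is just the token position.
fake_encode = lambda toks: [[float(i)] for i in range(len(toks))]
emb = echo_embed(fake_encode, ["a", "b"])  # pools positions 2 and 3
```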
How to Train Data-Efficient LLMs
- Ask an LLM to rate the quality of each sample in a dataset, then train on only the top-rated samples.
- Paper.
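The filtering loop itself is simple. In the sketch below the LLM rater is replaced by a toy scoring function, and the 25% keep fraction is an illustrative assumption:

```python
# Sketch of quality-based data selection: score every sample, keep only the
# top-rated fraction. The `rate` callable stands in for an LLM quality rater.

def select_top_samples(samples, rate, keep_fraction=0.25):
    scored = sorted(samples, key=rate, reverse=True)
    k = max(1, int(len(samples) * keep_fraction))
    return scored[:k]

samples = ["great explanation", "spam spam spam", "ok", "noise"]
toy_rate = lambda s: len(set(s.split()))  # stand-in "quality" score
kept = select_top_samples(samples, toy_rate)  # keeps the top 25%
```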
Stay up to date
Interested in future weekly updates? Stay up to date by joining our Slack Community!