Here’s what caught our eye over the past two weeks.
New Models
Sora
- OpenAI’s text-to-video model, capable of generating high-fidelity videos up to a minute long.
Stable Diffusion 3
- The latest iteration of Stable Diffusion text-to-image models, ranging in size from 800M to 8B parameters.
- The models use a diffusion transformer architecture and accept multimodal input.
- Announcement post.
Mistral Large
- Mistral’s new flagship model outperforms competing models (except GPT-4) on multiple benchmarks.
- Has a 32k-token context window and strong multilingual capabilities.
- Announcement post.
Nemotron-4 15B
- A 15B parameter model by Nvidia, trained on 8 trillion tokens.
- Trained on 384 DGX H100 nodes, where each node contains 8 H100 80 GB GPUs.
- Paper.
Gemma
- 2B and 7B open source models by Google that outperform similarly sized models on 11 out of 18 tasks.
- Trained on 2 trillion and 6 trillion tokens respectively; both use a 256k vocabulary.
- Announcement post.
Gemini 1.5
- Multimodal model by Google. The publicly available version has a 1 million token context length.
- Has been criticized for the way it responds to certain requests, notably its image generation of people.
- Twitter summary.
Large World Model
- Video and language model with a 1 million token context length. Uses an optimized version of RingAttention.
- Project page.
SDXL-Lightning
- State-of-the-art 1-step open-source diffusion model for text-to-image generation.
- Paper.
LoRA Land
- 25 Mistral models finetuned on different tasks.
- Project page.
PALO
- Multimodal model covering 10 languages. Model and code to be released soon.
- Paper.
MobileLLM
- 125M and 350M parameter LLMs with state-of-the-art performance among similarly sized models.
- Paper.
New Datasets
A Touch, Vision, and Language Dataset for Multimodal Alignment
Aria Everyday Activities Dataset
- Egocentric multimodal dataset. Data includes 3D point clouds, trajectories, and speech transcriptions.
- Project page.
Cosmopedia
- A synthetic dataset of textbooks, blog posts, and stories generated by Mixtral-8x7B-Instruct, totaling over 25 billion tokens.
New Research
Universal Manipulation Interface
- Trains robot policies from demonstrations collected with hand-held grippers operated by humans.
- Project page.
LoRA+
- Use different learning rates for LoRA’s \(A\) and \(B\) matrices for better performance and faster convergence.
- Paper and GitHub repo.
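The change amounts to putting the \(A\) and \(B\) matrices in separate optimizer parameter groups. A minimal pure-Python sketch; the `lora_A`/`lora_B` naming and the 16x ratio are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch of the LoRA+ idea: LoRA learns a low-rank update W + B @ A, and
# LoRA+ gives the B matrices a learning rate a fixed ratio larger than A's.
# The parameter names and the 16x default ratio are illustrative assumptions.

def make_param_groups(lora_params, base_lr, lr_ratio=16.0):
    """Split LoRA parameters into groups with different learning rates."""
    groups = []
    for name, param in lora_params.items():
        lr = base_lr * lr_ratio if name.endswith("lora_B") else base_lr
        groups.append({"name": name, "param": param, "lr": lr})
    return groups

# Toy parameters standing in for real LoRA matrices.
params = {"layer0.lora_A": [[0.1]], "layer0.lora_B": [[0.0]]}
groups = make_param_groups(params, base_lr=1e-4)
```

In a real training framework the same structure would be handed to the optimizer as per-parameter groups, one group per learning rate.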
Genie
- A model that can generate interactive visual environments (specifically platformer games), trained entirely on videos.
- Paper.
Neural Network Diffusion
- Diffusion models that generate the parameters of other neural networks.
- Paper.
LongRoPE
- Extends the context window of pretrained LLMs to over 2 million tokens, while fine-tuning on sequences no longer than 256k tokens.
- Paper.
Chain-of-Thought Reasoning Without Prompting
- Introduces “chain-of-thought decoding”: instead of only greedy decoding, branch on the top-k first tokens, decode each path, and select the final answer from the most confident decoded path.
- Paper.
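The selection step can be sketched without a real model. Everything below (the candidate paths and their per-token probability pairs) is made-up toy data; confidence is measured as the top-1 vs. top-2 probability margin averaged over a path's answer tokens:

```python
# Sketch of chain-of-thought decoding with toy data: branch on the top-k
# first tokens, decode each path, then pick the answer from the path whose
# answer tokens have the largest average top-1 vs. top-2 probability margin.

def confidence(token_probs):
    # average margin between the top-1 and top-2 token probabilities
    return sum(p1 - p2 for p1, p2 in token_probs) / len(token_probs)

def cot_decode(paths):
    """paths: list of (answer, [(p_top1, p_top2), ...]) decoded candidates."""
    return max(paths, key=lambda path: confidence(path[1]))[0]

paths = [
    ("5", [(0.50, 0.45)]),                # greedy path: low confidence
    ("8", [(0.90, 0.05), (0.80, 0.10)]),  # branched CoT path: high confidence
]
answer = cot_decode(paths)  # selects "8"
```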
Repetition Improves Language Model Embeddings
- Obtains higher-quality embeddings by including the input twice in the same sequence and pooling the embeddings from the second occurrence.
- Paper.
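The pooling trick can be sketched with a stand-in encoder; the `encode` callable and the mean pooling below are illustrative assumptions, not the paper's exact prompt setup:

```python
# Sketch of "echo" embeddings with a stand-in encoder: feed the input twice
# in one sequence and mean-pool only the second occurrence, whose tokens can
# attend back to the full first copy.

def echo_embed(encode, tokens):
    """encode: token list -> one vector per token (stand-in for a real model)."""
    vecs = encode(tokens + tokens)       # input repeated within one sequence
    second = vecs[len(tokens):]          # keep only the second occurrence
    dim = len(second[0])
    return [sum(v[i] for v in second) / len(second) for i in range(dim)]

# Toy encoder whose "embedding" is just the token position.
fake_encode = lambda toks: [[float(i)] for i in range(len(toks))]
emb = echo_embed(fake_encode, ["a", "b"])  # pools positions 2 and 3
```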
How to Train Data-Efficient LLMs
- Ask an LLM to rate the quality of each sample in a dataset, then train on only the top-rated samples.
- Paper.
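The filtering loop itself is simple. In the sketch below the LLM rater is replaced by a toy scoring function, and the 25% keep fraction is an illustrative assumption:

```python
# Sketch of quality-based data selection: score every sample, keep only the
# top-rated fraction. The `rate` callable stands in for an LLM quality rater.

def select_top_samples(samples, rate, keep_fraction=0.25):
    scored = sorted(samples, key=rate, reverse=True)
    k = max(1, int(len(samples) * keep_fraction))
    return scored[:k]

samples = ["great explanation", "spam spam spam", "ok", "noise"]
toy_rate = lambda s: len(set(s.split()))  # stand-in "quality" score
kept = select_top_samples(samples, toy_rate)  # keeps the top 25%
```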
Stay up to date
Interested in future weekly updates? Stay up to date by joining our Slack Community!