Fast Inference from Transformers via Speculative Decoding Transformer Models - Search Videos

How to Quadruple LLM Decoding Performance with Speculative Decoding (SpD) and Microscaling (MX) Formats on Qualcomm® Cloud AI 100

Faster LLMs: Accelerate Inference with Speculative Decoding

As AI labs race to train and deploy new frontier models, existing models become more affordable with better tokenomics. ✨ "Everybody's trying to get to the next frontier. And every time they get to the next frontier, the last generation AI tokens, the cost starts to decline about a factor of 10x every year," said NVIDIA CEO Jensen Huang in a recent keynote. Model optimization techniques such as speculative decoding and multi-token prediction, combined with inference serving platforms like NVIDIA…

12.4K views · 2 months ago

Facebook · NVIDIA AI

AI Explained: Speculative decoding with vLLM

1.1K views · 1 month ago

The vLLM Update That Triples AI Speed #AI

97 views · 1 month ago

YouTube · Vijayakumar J

LLM Explained: How Transformers Predict Your Next Word

117 views · 1 month ago

YouTube · Code & Capital

IBM Granite 4.0 1B Speech: Compact Multilingual Speech AI Built for Edge Deployment

128 views · 4 weeks ago

NVIDIA's VP of AI Explains Why They Give Away Their Best Model…

1.2K views · 3 weeks ago

YouTube · Superintelligence

How Does AI Autocomplete Work? LSP, FIM & Model Inference Explai…

7 views · 1 month ago

YouTube · CodeSprint Lab

This AI Trick Gives You 3x Speed For FREE

YouTube · The AI Century

Beyond Speculative Decoding: Jacobi Forcing in LLMs

224 views · 1 month ago

YouTube · Tales Of Tensors

Speculation is all you need: Intro to Speculative Decoding for High Per…

16 views · 1 month ago

Speculative Decoding: 2-3x Faster LLMs for Free

1 view · 1 week ago

YouTube · The AI Century

AI Frontiers: 101 ML Papers from Nov 21, 2025 - Efficiency, Safety …

15 views · 4 months ago

YouTube · AI Frontiers

How LLM Inference Actually Works

3 views · 1 week ago

YouTube · Deep Dive AI

EP5: Speculative Decoding with Nadav Timor

116 views · 7 months ago

YouTube · The Information Bottleneck

Why Can AI Answer in an Instant!? 4 Reasons AI Is Fast

1.5K views · 2 weeks ago

YouTube · パンダー_パンダー

Inference Optimization: Making AI Faster & Cheaper (Latency, Throu…

33 views · 3 weeks ago

This Repo Makes LLMs 24x Faster — And Most AI Companies Use It …

963 views · 3 weeks ago

YouTube · GithubTrends

Transformer models: Encoder-Decoders

105.6K views · Jun 14, 2021

YouTube · Hugging Face

Speculative Decoding Explained

7.8K views · Dec 21, 2023

YouTube · Trelis Research

How chatgpt works

22.1K views · Feb 9, 2023

YouTube · Lucidate

Accelerating AI Model Performance (APAC)

336 views · 5 months ago

YouTube · Microsoft Reactor

ChatGPT-5 Architecture Explained

17.2K views · 7 months ago

YouTube · ResDevEng

Set Block Decoding: Faster LLM Inference

58 views · 7 months ago

YouTube · AI Research Roundup

Deep Dive: Optimizing LLM inference

47K views · Mar 11, 2024

YouTube · Julien Simon

Discrete Diffusion VLA: Faster Action Decoding

128 views · 7 months ago

YouTube · AI Research Roundup

LLM System Design Interview: How to Optimise Inference Latency

474 views · 4 months ago

YouTube · Peetha Academy

MacBook Neo Local AI Test – LLM Benchmarks & MLX Performance!

17K views · 1 month ago

YouTube · Bijan Bowen

The Engineering Behind Instant AI Responses

1.5K views · 3 months ago