<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2025-10-05T08:56:46+08:00</updated><id>/feed.xml</id><title type="html">Kelvin’s blog</title><subtitle>~~~~~</subtitle><author><name>Kelvin Han</name></author><entry><title type="html">Diffusion Language Models – Part Four (Post-training with Reinforcement Learning)</title><link href="/articles/25/Diffusion_LM_P4" rel="alternate" type="text/html" title="Diffusion Language Models – Part Four (Post-training with Reinforcement Learning)" /><published>2025-10-04T09:00:00+08:00</published><updated>2025-10-04T09:00:00+08:00</updated><id>/articles/25/Diffusion_LM_P4</id><content type="html" xml:base="/articles/25/Diffusion_LM_P4"><![CDATA[<p><label for="marginfigure-rlhf-good" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-rlhf-good" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/rlhf.gif" /><br />Source: PNGs in GIF generated with ChatGPT.</span></p>

<p><span class="newthought">Recently, the post-training of large language models (LLMs) with reinforcement learning</span> (RL) has been an important source for the significant progress we are seeing in LLM capabilities (for reasoning, agents/tool-use, planning etc).</p>

<p><label for="marginnote-posttrain" class="margin-toggle"> ⊕</label><input type="checkbox" id="marginnote-posttrain" class="margin-toggle" checked="" /><span class="marginnote">The <em>post-training</em> of an LLM comes after <em>pretraining</em> (which is when LLMs are trained on next-token prediction over web-scale text). It “polishes” the LLM into the useful models we are used to interacting with. This <em><a href="https://tokens-for-thoughts.notion.site/post-training-101" title="Post-training 101: A hitchhikers guide into LLM post-training">blog post</a></em> by two Meta Superintelligence Labs (MSL) researchers gives a good overview of the post-training phase. Much of what is in their blog post would also apply to post-training for diffusion language models (DLMs).</span></p>

<p>This became especially apparent earlier this year when DeepSeek surprised (and <a href="https://www.cnbc.com/2025/01/27/nvidia-falls-10percent-in-premarket-trading-as-chinas-deepseek-triggers-global-tech-sell-off.html" title="Nvidia drops nearly 17% as China’s cheaper AI model DeepSeek sparks global tech sell-off">moved markets</a>) with the release of their R1 model <a href="https://www.nature.com/articles/s41586-025-09422-z" title="DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning">(Guo et al, 2025)</a>, an auto-regressive LLM (AR-LLM). Their model was post-trained with an efficient RL algorithm (GRPO, see below) in a way that unlocked “thinking” for improved performance on reasoning tasks.<label for="sidenote-reasoning" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-reasoning" class="margin-toggle" checked="" /><span class="sidenote">The term ‘reasoning’ with respect to LLMs (such models are also referred to as Large Reasoning Models, or LRMs) is still being settled upon. Notably, there are differences between the “thinking” traces produced by LRMs and what we might generally accept as reasoning by humans. A good overview of this can be found in <a href="https://arxiv.org/pdf/2504.09762v1" title="(How) Do reasoning models reason?">(Kambhampati et al, 2025)</a> and this <em><a href="https://x.com/rao2z/status/1966969679739768982" title="">list</a></em>.</span> Prior to this, however, RL post-training was already crucial for aligning AR-LLM generations towards users’ preferred forms/styles of text and conversation, as well as for meeting safety and security requirements.</p>

<p>In this post, I examine similar RL methods for diffusion language models (DLMs), which will be key for pushing DLMs to parity (or more) with existing AR-LLMs in terms of capabilities. In line with the previous posts of this series, I will focus on Masked-DLMs (which are the main focus of current research); on the RL side, my focus will be on online policy-gradient algorithms,<label for="sidenote-policy" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-policy" class="margin-toggle" checked="" /><span class="sidenote">Policy-gradient algorithms are named for how: (i) the <span style="color: red">policy</span> (the mechanism for generating <span style="color: red">trajectories</span> i.e. sequences of tokens when used in the context of LLM RL post-training) is in the form of a parameterized model, and (ii) the policy’s parameters are learned by following the gradient of some function with respect to those parameters. (<em>The definitions of the terms in <span style="color: red">red</span> can be found <a href="#terminology">below</a>.</em>) They can be categorised as being <em>online</em> or <em>offline</em>, where the “on”/“off” relates to whether the policy is learning from trajectories coming from itself (“on”) or not (“off”; e.g. from another model’s distribution). <br /><br /><ins><em>Sidenote:</em></ins> offline methods, such as DPO <a href="https://arxiv.org/abs/2305.18290" title="Direct Preference Optimization: Your Language Model is Secretly a Reward Model">(Rafailov et al, 2023)</a>, were instrumental for aligning LLMs to human preferences (e.g. used in the training for Llama 3 <a href="https://arxiv.org/pdf/2407.21783" title="The Llama 3 Herd of Models">(Llama Team, AI @ Meta, 2024)</a> models). For the interested, similar methods have been proposed for diffusion models: e.g. 
VRPO used to post-train the original LLaDA Masked-DLM <a href="https://arxiv.org/abs/2502.09992" title="Large Language Diffusion Models">(Nie et al, 2025)</a> to give LLaDA 1.5 <a href="https://arxiv.org/abs/2505.19223" title="LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models">(Zhu et al, 2025)</a>, as well as ones for continuous diffusion such as Diffusion-DPO <a href="https://arxiv.org/abs/2311.12908" title="Diffusion model alignment using direct preference optimization">(Wallace et al, 2023)</a> and DSPO <a href="https://openreview.net/forum?id=xyfb9HHvMe" title="Direct Score Preference Optimization for Diffusion Model Alignment">(Zhu et al, 2025)</a>.</span> namely: Group Relative Policy Optimization (<strong>GRPO</strong>) <a href="https://arxiv.org/abs/2402.03300" title="DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models">(Shao et al, 2024)</a>, <a href="https://arxiv.org/abs/2501.12948" title="DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning">(DeepSeek AI, 2025)</a> and Proximal Policy Optimization (<strong>PPO</strong>) <a href="https://arxiv.org/abs/1707.06347" title="Proximal Policy Optimization Algorithms">(Schulman et al, 2017)</a>, which (i) are being used in post-training AR-LLMs today; and (ii) have been found to reach better performance compared to offline algorithms <a href="" title="Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback">(Ivison et al, 2024)</a>.</p>

<p>The outline of this post is as follows: I will start by setting the scene with <em><a href="#️-1-paint-a-picture-of-rl-post-training-of-llms-in-5-minutes">1. an accessible introduction to RL post-training using policy-gradient algorithms</a></em>, followed by outlining <em><a href="#️-2-can-we-reuse-the-ppogrpo-methods-that-worked-for-ar-llms">2. the main challenge for DLM post-training with such methods</a></em>. I will then highlight <em><a href="#-3-what-have-been-proposed-for-dlms">3. some proposed approaches for RL post-training of DLMs</a></em>. If you are familiar with RL post-training for LLMs, you could skip directly to <a href="#️-2-can-we-reuse-the-ppogrpo-methods-that-worked-for-ar-llms">section 2</a>. Otherwise, to fully benefit from this post, going through my earlier posts <a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P1">Part 1</a>, <a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P2">Part 2</a> and <a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P3">Part 3</a> for some background on DLMs first would probably be useful.</p>

<h3 id="️-1-paint-a-picture-of-rl-post-training-of-llms-in-5-minutes">🖼️ 1. Paint a picture of RL post-training of LLMs in 5 minutes?</h3>
<p>Before discussing policy-gradient RL for DLMs, let’s get to some common ground with an introduction to such methods, as well as a sense of how we are using them with AR-LLMs currently. I always find analogies help us to better grasp complex topics, so let’s start with one:</p>

<p><span style="color: #000080; font-family: Candara">Imagine you are a parent of a child Jesse, and you want them to learn to give the right answer (let’s refer to this as <em><strong>o</strong></em>, for output) to this question: <em>“Levy has two apples in his pocket, Alex has two apples in her bag. They have a picnic and eat one of the apples. How many apples do they have left?”</em> (essentially: <em>“What does 2+2-1 equal?”</em>); let’s refer to this question as <strong>\(q_{k}\)</strong>. The idea is that it is best to have Jesse give <em>“3”</em> (or similar) as the final answer whenever they encounter <strong>\(q_{k}\)</strong> (or a similar problem). One way to help Jesse learn this could be to (i) pose <strong>\(q_{k}\)</strong> to Jesse multiple times, then (ii) have Jesse give an answer each time (let’s call each of these <strong>\(o^{k}_{i}\)</strong>), and then (iii) tell Jesse for each <strong>\(o^{k}_{i}\)</strong> whether it is a good answer.</span></p>
<div style="background-color: #249ae9ff; max-width: 50%; color: white; padding: 20px; border-radius: 8px; margin: 10px;">
  <h3 style="margin: 0 0 15px 0; width: 100%;font-family: Candara">Some possible answers of Jesse's</h3>   
  <p style="margin: 0; width: 100%; text-align: left; font-size: 16px; font-family: Candara">   
  💬 <em>o</em><sub>1</sub>: I know this, the answer is 3! 
  <br />
  💬 <em>o</em><sub>2</sub>: I don't know, the answer is 3? 
  <br />
  💬 <em>o</em><sub>3</sub>: I love chicken nuggets! I will never eat apples!
  <br />
  💬 <em>o</em><sub>4</sub>: Levy has 2 apples and they eat 1 so there has to be 3 apples left. 
  <br />  
  💬 <em>o</em><sub>5</sub>: Three!
  <br />  
  💬 <em>o</em><sub>6</sub>: They have 2 plus 2 apples so that is 4 apples. They eat one, so 4 minus 1, that means they have 3 apples left. Duh!
  </p>
</div>

<p>We can also see from above that some of the answers that Jesse might come up with could be better than others; in terms of correctness (in the final answer, and in the reasoning) as well as in style.<label for="sidenote-Jesse" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-Jesse" class="margin-toggle" checked="" /><span class="sidenote">Some answers are clearly off (i.e. <em>o</em><sub>3</sub>). Some (i.e. <em>o</em><sub>4</sub>) give the correct final answer, but have a wrong reasoning process for getting to it. Some might be nearly identical, but one amongst them is preferred over another, e.g. <em>o</em><sub>1</sub> versus <em>o</em><sub>2</sub> (for a more confident Jesse). Others might be quite different yet one is slightly preferred over the other, e.g. <em>o</em><sub>5</sub> versus <em>o</em><sub>6</sub> (depending on whether we prefer a Jesse that gives their reasoning along with the answer, even if in a sassy way).</span> Therefore, we can expect to have some preferences between each <strong>\(o^{k}_{i}\)</strong>, and hence we might want to steer Jesse’s mind such that whenever Jesse encounters <strong>\(q_{k}\)</strong>, ideally Jesse gives the <strong>\(o^{k}_{i}\)</strong> that is most preferable.</p>

<p><ins>In essence, what we want to do for Jesse is similar to what we want to do with LLMs using policy-gradient RL post-training!</ins> i.e. we want an LLM to learn, by updating its parameters, that when presented with a certain prompt <strong>\(q_{k}\)</strong> (or similar) it should generate responses that are preferred (achieve the highest reward). This is done by getting the LLM to give higher likelihoods to the sequence of tokens in the higher-scoring <strong>\(o^{k}_{i}\)</strong>.</p>
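<p><em>To make this concrete, here is a toy sketch (not any paper's actual method): the "policy" is just a softmax over Jesse's six canned answers, trained with a REINFORCE-style update so that probability mass shifts towards the higher-reward answers. The reward values and hyperparameters are invented for illustration.</em></p>

```python
import math
import random

# Toy illustration (not a real LLM): the "policy" is a softmax over the
# six canned answers o_1..o_6 for the single prompt q_k. The rewards
# below are invented scores reflecting the preferences discussed above.
random.seed(0)
logits = [0.0] * 6                         # the policy's parameters
rewards = [0.9, 0.6, 0.0, 0.3, 0.8, 1.0]   # assumed rewards for o_1..o_6

def probs(logits):
    """Softmax: turns logits into a probability distribution over answers."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(p):
    """Draw one answer index from the policy's distribution."""
    r, acc = random.random(), 0.0
    for i, p_i in enumerate(p):
        acc += p_i
        if r < acc:
            return i
    return len(p) - 1

learning_rate = 0.5
for _ in range(2000):
    p = probs(logits)
    i = sample(p)                                      # generate a trajectory
    baseline = sum(p_j * r_j for p_j, r_j in zip(p, rewards))
    advantage = rewards[i] - baseline                  # reward vs the average
    # REINFORCE: d(log softmax_i)/d(logit_j) = 1[j == i] - p[j]
    for j in range(6):
        logits[j] += learning_rate * advantage * ((1.0 if j == i else 0.0) - p[j])

p_final = probs(logits)
print(max(range(6), key=p_final.__getitem__))  # index of the now-preferred answer
```

<p><em>After training, the policy's probability mass concentrates on the high-reward answers; a real LLM does the analogous thing over its token-level distributions rather than over six canned strings.</em></p>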

<div style="background-color: #AFEEEE; max-width: 50%; color: black; padding: 20px; border-radius: 8px; margin: 10px;" id="terminology">
  <h3 style="margin: 0 0 15px 0; width: 100%;">Some terminology</h3>   
  <p style="margin: 0; width: 100%; text-align: left; font-size: 16px;">
  Before proceeding, let's set the definitions of some key RL terms first. Each of these terms is also associated with a concept (in brackets and <span style="color: #000080">blue</span> below) from the Jesse example, so as to connect them with RL on AR-LLMs.</p>
  <br />
  <p style="margin: 0; width: 100%; text-align: left; font-size: 16px;">   
  ▪️ <span style="color: red">"state"</span>: information about the current situation at a given moment in time; 
  <br />
  ▪️ <span style="color: red">"action"</span>: a decision/choice that can be taken at the point of a certain state; 
  <br />
  ▪️ <span style="color: red">"trajectory"</span>: a sequence of states and actions that can be taken (<span style="color: #000080"><em>o<sup>k</sup><sub>i</sub></em></span>); 
  <br />
  ▪️ <span style="color: red">"policy"</span>: some model that can give us trajectories (<span style="color: #000080">Jesse</span>); 
  <br />
  ▪️ <span style="color: red">"reward"</span>: feedback on a trajectory, i.e. what can be gotten if the trajectory is taken (<span style="color: #000080">whether <em>o<sup>k</sup><sub>i</sub></em> is good or bad/how good or how bad</span>); 
  <br />
  ▪️ <span style="color: red">"reward model"</span>: some method/model giving the reward for a trajectory (<span style="color: #000080">you!</span>); 
  <br />
  ▪️ <span style="color: red">"advantage"</span>: how much better taking action <em>a<sub>t</sub></em> at state <em>s<sub>t</sub></em> is compared to the average of all actions possible. 
  <br />
</p>
</div>

<p><label for="marginnote-example" class="margin-toggle"> ⊕</label><input type="checkbox" id="marginnote-example" class="margin-toggle" checked="" /><span class="marginnote">To make the definitions more concrete let’s shift the example with Jesse above to an AR-LLM: Let’s say we are at the point in time (<em>state</em>) where the AR-LLM has just processed the prompt \(q_{k}\) fed to it. Let’s call this state \(s_{0}\). For the sake of this example, let us assume that the AR-LLM can only ever give answers to \(q_{k}\) from the 6 examples above (i.e. <em>o</em><sub>1</sub> to <em>o</em><sub>6</sub>). If we prefer <em>o</em><sub>6</sub> the most, then the <em>action</em> we want from the AR-LLM immediately after \(s_{0}\) is to return the word “They” <em>(in the next-token prediction set-up of AR-LLMs, this means striving to give this word the highest probability)</em>. The objective is to have the AR-LLM learn to return a sequence (i.e. <em>trajectory</em>) of state-action decisions so as to give an answer that obtains as high a reward as possible. Note that the learning for the policy also involves cases such as these: if the action chosen was to return “I” after \(s_{0}\), then the AR-LLM should learn that at such \(s_{1}\), the word “know” should have the highest probability (this applies if we prefer <em>o</em><sub>1</sub> over all the other answers (<em>o</em><sub>2</sub> and <em>o</em><sub>3</sub>) that start with “I”); and so on and so forth…</span></p>

<p>In practice, we achieve this by getting the LLM (the policy) to generate a diverse set of answers for a given prompt \(q_{k}\) by using a sufficiently high sampling <a href="https://huggingface.co/blog/how-to-generate#:~:text=the%20so%2Dcalled-,temperature,-of%20the%20softmax" title="How to generate text: using different decoding methods for language generation with Transformers">temperature</a>. The LLM then learns, via the feedback from the rewards of these different experiences (i.e. pairs of <strong>\(q_{k}, o^{k}_{i}\)</strong>), which is the best answer to give.</p>
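<p><em>For a quick feel of what temperature does, here is a minimal sketch (the logits are made-up numbers): dividing logits by a temperature \(T\) before the softmax flattens the distribution for high \(T\) (more exploration) and sharpens it for low \(T\) (more exploitation).</em></p>

```python
import math

# Minimal sketch of temperature in sampling (logits are made-up numbers).
def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.2]                        # hypothetical next-token logits

sharp = softmax_with_temperature(logits, 0.5)   # low T: near-greedy, exploits
flat = softmax_with_temperature(logits, 2.0)    # high T: flatter, explores
print(sharp[0] > flat[0])                       # True: low T concentrates mass
```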

<p style="size: 22pt; font-weight: bold;">PPO and GRPO briefly: efficient &amp; stable training</p>
<p>(<em>Bear with me, just a little more common ground… 😅, so that we can situate the next section properly.</em>) In this section, I zoom in to focus on two aspects shared by the PPO and GRPO algorithms; a sense of these aspects is necessary for me to be able to explain the key points of the subsequent sections.<label for="sidenote-fuller" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-fuller" class="margin-toggle" checked="" /><span class="sidenote">I give a very general view here, but there is a fair bit more behind both algorithms; for a fuller understanding of them take a look at the following resources to start: <a href="https://yugeten.github.io/posts/2025/01/ppogrpo/" title="A vision researcher’s guide to some RL stuff: PPO &amp; GRPO">this post by Jimmy Shi</a>, <a href="https://rlhfbook.com" title="Reinforcement Learning from Human Feedback">this series by Nathan Lambert</a> and <a href="https://huggingface.co/blog/deep-rl-ppo" title="Proximal Policy Optimization (PPO)">this HuggingFace RL course unit</a>.</span></p>

<p>▪️ A major preoccupation for RL training in general (i.e. including PPO/GRPO) is to find some balance between <strong>exploration</strong> (i.e. generating diverse answers to receive useful feedback for learning) and <strong>exploitation</strong> (i.e. leveraging useful knowledge the policy has learned from past encounters, e.g. from Jesse’s <em>o</em><sub>5</sub> which gets a good reward).<label for="sidenote-stability" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-stability" class="margin-toggle" checked="" /><span class="sidenote">The trade-off is as follows: ▪️ allowing more exploration (i.e. via generating the <strong>\(o^{k}_{i}\)</strong> trajectories by sampling with high temperature) results in very sparse signals (to go the extreme: imagine that for every <strong>\(q_{k}\)</strong>, we have to generate all the possible combinations of words in English almost all of which would have very low reward with respect to <strong>\(q_{k}\)</strong>) and wastes compute; whereas, on the other hand, ▪️ relying on already learned knowledge (e.g. generating <strong>\(o^{k}_{i}\)</strong> by sampling with low temperature) may keep the policy around poor/sub-optimal outputs i.e. does not allow it to reach an optimal <strong>\(o^{k}_{i}\)</strong>.</span> When applying PPO/GRPO to AR-LLMs, the bottleneck is the generating of trajectories (due to the generation process being auto-regressive) and it typically takes up most of the training run-time. 
Hence, it is typical to reuse the same set of sampled trajectories for a few more update steps<label for="sidenote-mu" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-mu" class="margin-toggle" checked="" /><span class="sidenote">This is <em>“K epochs”</em> in Algorithm 1 of the PPO paper <a href="https://arxiv.org/abs/1707.06347" title="Proximal Policy Optimization Algorithms">(Schulman et al, 2017)</a> and <code class="language-plaintext highlighter-rouge">num_ppo_epochs</code> in the <a href="https://huggingface.co/docs/trl/main/en/ppo_trainer#trl.PPOConfig">TRL implementation</a>; the \(\mu\) hyperparameter in Algorithm 1 of the DeepSeek Math (GRPO) paper <a href="https://arxiv.org/abs/2402.03300" title="DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models">(Shao et al, 2024)</a> and <code class="language-plaintext highlighter-rouge">num_iterations</code> in the <a href="https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig">TRL implementation</a>.</span> to squeeze more learning out of them. <em>Hereon, I will use the term <strong>\(\mu\)-updates</strong> to refer to these update steps.</em> Think of it this way: going through one round of (\(q_{k}, o^k_1... o^k_6\)) with Jesse might help them get a little closer to giving the most preferred output, but it might not be sufficient… so we repeat with multiple rounds of (\(q_{k}, o^k_1... o^k_6\)) to help Jesse learn.<label for="sidenote-onoff" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-onoff" class="margin-toggle" checked="" /><span class="sidenote"><ins><em>Sidenote:</em></ins> While PPO and GRPO are recognised as online methods, a case could be made that these subsequent \(\mu\)-updates after the first step/epoch are at least <em>slightly off-policy</em> <a href="https://arxiv.org/abs/2505.17508" title="On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning">(Zhang et al, 2025)</a>… especially when \(\mu\) is set to a large number.</span></p>
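<p><em>The sample-once, update-a-few-times pattern can be sketched schematically as below. All class and method names are invented stand-ins: <code class="language-plaintext highlighter-rouge">generate()</code> stands for the expensive decoding run, <code class="language-plaintext highlighter-rouge">update()</code> for a comparatively cheap gradient step.</em></p>

```python
# Schematic of the sample-once, update-mu-times pattern ("K epochs" in PPO,
# mu in GRPO). DummyPolicy only counts the expensive vs cheap operations.
class DummyPolicy:
    def __init__(self):
        self.generations = 0   # count of (expensive) trajectory generations
        self.updates = 0       # count of (cheaper) gradient update steps

    def generate(self, prompt):
        self.generations += 1
        return f"sampled answer to {prompt}"

    def reward(self, prompt, output):
        return 1.0             # placeholder for a reward model / verifier

    def update(self, prompt, group, rewards):
        self.updates += 1      # one mu-update reusing the same trajectories

def rl_outer_loop(policy, prompts, group_size=6, mu=4):
    for q in prompts:
        group = [policy.generate(q) for _ in range(group_size)]  # sample once
        rewards = [policy.reward(q, o) for o in group]
        for _ in range(mu):                                      # reuse mu times
            policy.update(q, group, rewards)

policy = DummyPolicy()
rl_outer_loop(policy, ["q1", "q2"])
print(policy.generations, policy.updates)  # 12 generations, 8 update steps
```

<p><em>The point of the pattern: the generation counter grows with the group size, the update counter with \(\mu\), so raising \(\mu\) extracts more gradient steps without paying for more (slow) decoding.</em></p>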

<p>▪️ Another major preoccupation (for policy-gradient methods in general) is achieving <strong>stable training</strong> to facilitate successful policy learning.<label for="sidenote-variance" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-variance" class="margin-toggle" checked="" /><span class="sidenote">Since we typically train across diverse problems \(q_k\) that each have their own reward distributions, this adds to the variance in the gradient estimates (which is already present between trajectories of a given \(q_k\)); therefore, when taking update steps, large updates can overfit the policy to some problems at the expense of others, leading to instability and hindering overall learning.</span> Hence, one of the design principles in PPO <a href="https://arxiv.org/abs/1707.06347" title="Proximal Policy Optimization Algorithms">(Schulman et al, 2017)</a> was to ensure stability across update steps. This was done by adding the following to the training objective of vanilla policy-gradient methods (e.g. REINFORCE <a href="https://dl.acm.org/doi/10.1007/BF00992696" title="Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning">(Williams, 1992)</a>): (i) a <strong>KL-regularisation</strong> term;<label for="sidenote-kl" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-kl" class="margin-toggle" checked="" /><span class="sidenote">The <em><a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a></em> is a measure of how close/apart one distribution (\(P\)) is to another (\(Q\)); it is an asymmetric measure, so the KL of \(P || Q\) is not the same as the KL of \(Q || P\).</span> and (ii) the use of clipping as a floor/ceiling on the update. These help avoid updates to the policy that veer too far from the "trusted" zone of some reference policy that has already been established (e.g. from explorations in previous updates, or an initial SFT-ed policy). Since GRPO is actually based upon PPO,<label for="sidenote-grpomods" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-grpomods" class="margin-toggle" checked="" /><span class="sidenote">Doing away with the need for a separate memory- and compute-heavy value model to assess advantage, replacing it with a group-based advantage estimation.</span> a similar objective to PPO can also be found there.<label for="marginfigure-ppogrpo" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-ppogrpo" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/grpo.png" /><br />Image: PPO and GRPO; their similarities and differences – source: <a href="https://arxiv.org/abs/2402.03300" title="DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models">(Shao et al, 2024)</a>. Note that there are variants of PPO that permit generating multiple trajectories, computing their rewards and advantages in one pass (similar to the GRPO figure) but still needing a value model. </span></p>
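<p><em>The clipping mechanism can be sketched in a few lines; the probabilities and the advantage below are made-up numbers, with \(\epsilon\) the clip range.</em></p>

```python
# Sketch of the per-token clipped surrogate shared by PPO and GRPO
# (probabilities and the advantage below are made-up numbers).
def clipped_term(p_new, p_old, advantage, eps=0.2):
    ratio = p_new / p_old                             # importance weight
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))   # confine to [1-eps, 1+eps]
    # Taking the min makes the objective pessimistic: the policy gains
    # nothing from pushing the ratio beyond the clip boundary.
    return min(ratio * advantage, clipped * advantage)

print(clipped_term(0.9, 0.3, 1.0))    # ratio 3.0 is clipped down to 1.2
print(clipped_term(0.1, 0.5, -1.0))   # negative advantage: floored at -0.8
```

<p><em>Either way the advantage points, the clipped term caps how much any single update can profit from moving the policy far from where the trajectories were sampled.</em></p>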

<p><em>It is these – the <strong>KL regularisation term</strong> and the <strong>\(\mu\)-updates</strong> – that present some challenges to overcome (as well as opportunities to leverage, as in <strong>diffu-GRPO</strong>) for the use of PPO/GRPO on DLMs, and we will discuss these next… (Note: The rest of this post will go into the weeds on these points and will be more technical.)</em></p>

<h3 id="️-2-can-we-reuse-the-ppogrpo-methods-that-worked-for-ar-llms">♻️ 2. Can we reuse the PPO/GRPO methods that worked for AR-LLMs?</h3>
<p>The short answer is… <span style="color: blue">broadly, yes</span> but with the need for some <span style="color: blue">non-trivial modifications</span> to address the issue of how to obtain the likelihood for trajectories from a Masked-DLM. These likelihoods are needed in two places in the PPO/GRPO objective: (i) for an <a href="https://en.wikipedia.org/wiki/Importance_sampling">importance sampling</a> weight, as well as (ii) an estimate for the KL-regularisation term.<label for="sidenote-grpo-obj" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-grpo-obj" class="margin-toggle" checked="" /><span class="sidenote">We use the GRPO objective to illustrate (<em>with clipping omitted to reduce clutter in the equation</em>): 
\(\begin{aligned}
L_{GRPO}(\theta) = - \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} 
\\
\bigg[ {\color{red}\frac{\pi_\theta(o_{i,t}|x, o_{i,&lt;t})}{{\left[ \pi_\theta(o_{i,t}|x, o_{i,&lt;t}) \right]}_{\text{no grad}}}} \hat{A}_{i,t}  {\color{blue} - \beta D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}} ] \bigg]
\end{aligned}\)
<br />
where \(\pi_{\theta}\) is the current policy and \(\pi_{ref}\) is either the initial (typically obtained via SFT) or some earlier-update \(\pi_{\theta}\). 
<br />
As we can see:
<br />
▪️ the per-token likelihoods (obtained twice, once with gradients through the policy \(\pi_{\theta}\) and another without gradients) are used in the first term (in <span style="color: red">red</span>). The ratio of these corresponds to an <a href="https://en.wikipedia.org/wiki/Importance_sampling">importance sampling</a> on the advantages \(\hat{A}_{i,t}\) (to address that trajectories are coming slightly off-policy in the <strong>μ-updates</strong> steps);
<br />
▪️ the <em>KL-regularisation term</em> is in <span style="color: blue">blue</span>; and this is where the sequence-level likelihoods are used. In practice, this KL estimate is implemented via this form: <code class="language-plaintext highlighter-rouge">KL</code> \(= e^r - r - 1\) where \(r = \log( \pi_{ref}(o_{i,t}|x, o_{i,&lt;t}) / \pi_{\theta}(o_{i,t}|x, o_{i,&lt;t}) )\). The per-token likelihoods for \(\pi_{\theta}\) above can be reused, and only the ones from \(\pi_{ref}\) need to be computed here. For a concrete feel: see the implementation in <a href="https://github.com/huggingface/trl/blob/e086f073cf6dee30acc2d3fe357db21e1901c2be/trl/trainer/grpo_trainer.py#L1719">TRL</a>. 
<br /><br /><ins><em>Sidenote:</em></ins> Recent studies <a href="https://arxiv.org/abs/2505.17508" title="On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning">(Zhang et al, 2025)</a> and <a href="" title="On a few pitfalls in KL divergence gradient estimation for RL">(Tang et al, 2025)</a> have established that there are non-trivial differences arising from a set of fine-grained choices in the method and implementation of the KL divergence estimate. <span style="color: blue"><strong>Note that these have implications for online RL of DLMs due to the need to compute these estimates there (see <em><a href="#dlm_estimate">below</a></em>). To my mind, these two pieces are recommended reading for RL on DLMs.</strong></span> 
<br /><br /><ins><em>Sidenote:</em></ins> If the beta (β) coefficient, which controls the amount of KL-regularisation in PPO/GRPO, is set to be zero, then there is no need for the sequence-level likelihoods. Empirically, there have been reports recently that the KL-regularisation may not be necessary for AR-LLMs (quite likely under certain training setups i.e. hyperparameter setting, modeling choice where the encountered KL divergences between \(\pi_{\theta}\) and \(\pi_{ref}\) are low). See for e.g. <a href="https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/" title="CWM: An Open-Weights LLM for Research on Code Generation with World Models">(Copet et al, 2025)</a>; page 13 of paper.</span></p>
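<p><em>The \(e^r - r - 1\) form of the KL estimate (mentioned in the sidenote above) can be sketched numerically. Note that in implementations such as TRL's, the ratio inside the estimator is reference-over-current, so that the estimate of \(KL(\pi_{\theta} \| \pi_{ref})\) from \(\pi_{\theta}\)'s own samples is non-negative and unbiased; the probability values below are toy numbers for a single token position.</em></p>

```python
import math

# The "k3"-style estimator behind the e^r - r - 1 form: with tokens sampled
# from the current policy pi_theta, the per-token ratio pi_ref / pi_theta
# yields a non-negative estimate of KL(pi_theta || pi_ref).
def kl_k3(p_theta, p_ref):
    ratio = p_ref / p_theta
    return ratio - math.log(ratio) - 1.0   # >= 0; zero iff the policies agree

print(kl_k3(0.5, 0.5))   # 0.0 when the two policies agree on this token
print(kl_k3(0.7, 0.5))   # positive whenever they differ
```

<p><em>Unlike the raw log-ratio (which can be negative per token), this estimator is always non-negative, which keeps the per-token penalty well-behaved.</em></p>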

<p>Computing these likelihoods for trajectories is easy for AR-LLMs because of how they factorise sequence probabilities at the token level; i.e. at each step, the AR-LLM predicts from its vocabulary the most likely token to generate. As a result, it is very easy to compute what an AR-LLM thinks is the likelihood of any sequence of tokens (by the chain rule, i.e. simply summing the log-probabilities for each token of the sequence).<label for="sidenote-factorise" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-factorise" class="margin-toggle" checked="" /><span class="sidenote">See also footnote 27 in the <em><a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P2#:~:text=lower%20bound%20(ELBO)-,Unlike,-AR%2DLLMs%20which" title="Diffusion Language Models -- Part Two (What kinds are there and how is one trained?)">second post</a></em> of this series.</span></p>
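<p><em>The chain-rule computation amounts to just a few lines; the per-token conditional probabilities below are invented for a 4-token answer, whereas a real AR-LLM returns all of them in one teacher-forced forward pass.</em></p>

```python
import math

# Chain rule for an AR-LLM: log p(o|q) = sum over t of log p(o_t | q, o_<t).
# The per-token conditional probabilities below are made-up numbers.
per_token_probs = [0.5, 0.8, 0.9, 0.6]

sequence_logprob = sum(math.log(p) for p in per_token_probs)
sequence_prob = math.exp(sequence_logprob)   # equals 0.5 * 0.8 * 0.9 * 0.6

print(round(sequence_prob, 6))  # 0.216
```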

<p>However, this is <span id="dlm_estimate">not the case 😵‍💫</span> for Masked-DLMs (and discrete diffusion models generally). Although we do get probabilities for tokens at each step of the diffusion generation process (which is what allows us to decide which token to unmask into), each of these steps is a denoising one <ins>that depends on all its preceding steps</ins>. In other words, computing sequence probabilities for DLMs requires going through multiple denoising steps (from \(T\) to 0). Having to do this for every sampled trajectory during online RL training with PPO/GRPO is very computationally expensive, and will be significantly worse for very long sequences.<label for="sidenote-efficient" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-efficient" class="margin-toggle" checked="" /><span class="sidenote">Although the efficient DLM methods I covered in the previous post (such as Block Diffusion <a href="https://openreview.net/forum?id=tyEyYT267x" title="Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models">(Arriola et al, 2025)</a>) can help alleviate this, the increase in computation required – compared to what is required in AR-LLMs – will still be substantial.</span> As such, there is a need to establish ways to efficiently, yet as accurately as possible, estimate these likelihoods with the DLM. This is the focus of much current research, which we will look into in the next section.</p>
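<p><em>A deliberately crude back-of-the-envelope sketch of the cost gap (the step count \(T\) and group size below are arbitrary illustrative numbers): an AR-LLM scores a whole trajectory in one teacher-forced forward pass, while a faithful Masked-DLM evaluation runs one denoising pass per step.</em></p>

```python
# Toy cost model (illustrative numbers only) for why trajectory scoring is
# the pain point: one teacher-forced pass for an AR-LLM versus one pass per
# denoising step, from step T down to 1, for a Masked-DLM.
def scoring_passes_ar():
    return 1                 # one pass scores every token via the chain rule

def scoring_passes_dlm(num_denoising_steps):
    return num_denoising_steps

group_size, T = 8, 128       # 8 trajectories per prompt, 128 denoising steps
print(group_size * scoring_passes_ar(), group_size * scoring_passes_dlm(T))  # 8 1024
```

<p><em>And since PPO/GRPO must score every sampled trajectory at every update, this multiplicative gap is exactly what the estimation methods in the next section try to avoid.</em></p>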

<h3 id="-3-what-have-been-proposed-for-dlms">💡 3. What have been proposed for DLMs?</h3>
<p>This section outlines two research trends in online RL algorithms for Masked-DLMs. Both started with <strong>diffu-GRPO</strong> <a href="https://arxiv.org/abs/2504.12216?" title="d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning">(Zhao et al, 2025)</a>, which landed in Q1 this year and was the first work to explore a way of bringing online RL algorithms to Masked-DLMs. The initial wave of research, beginning with <strong>diffu-GRPO</strong>, is distinguished by its contributions on ways to estimate likelihoods with Masked-DLMs; a more recent wave (released in the last month or so) begins to explore extensions for Masked-DLMs with semi-autoregressive generation, for longer generations and more efficiency (e.g. with KV caching).</p>

<p style="font-size: 22pt; font-weight: bold">Efficient likelihood estimation for online RL on Masked-DLMs</p>
<p>Each of the three pieces of work covered here proposes a way to perform this likelihood estimation. Note that although they were formulated for GRPO, it should be possible to leverage their likelihood-estimation approaches in a PPO setup as well.</p>

<p>◼️ <span id="diffugrpo"><strong>diffu-GRPO</strong></span> <a href="https://arxiv.org/abs/2504.12216?" title="d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning">(Zhao et al, 2025)</a>: estimates the per-token likelihoods of a trajectory by simply unmasking in one step.<label for="sidenote-diffugrpo" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-diffugrpo" class="margin-toggle" checked="" /><span class="sidenote">In practical terms, this is done as follows: for a given prompt \(q_{k}\), append to it a fully-masked continuation (i.e. of the maximum sequence generation length) and pass it through the Masked-DLM; the output is the estimated per-token probability distribution (conditioned on the prompt \(q_{k}\)).</span> As noted above, such one-/few-step unmasking does not reflect the multi-step denoising in Masked-DLMs – hence, and quite importantly, their proposal hinges on (i) the \(\mu\)-updates typically (but not mandatorily) used in GRPO, and (ii) a random masking of the <em>prompt \(q_{k}\)</em> portion of the input (i.e. input = \(q_{k}\) + fully masked continuation). At each of the \(\mu\)-steps, the mask positions are re-randomised but the masking rate is always fixed at 15%.<label for="sidenote-d1masking" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-d1masking" class="margin-toggle" checked="" /><span class="sidenote">See Appendix A of the paper: <em>“In gradient update iterations, each token in the prompt is randomly masked with a probability pmask = 0.15 for log-probability estimation.”</em></span> We can see this as obtaining slightly varied likelihood estimates over a set of inputs closely resembling the prompt \(q_{k}\), which according to the authors <em>“acts as a form of regularization for policy optimization”</em>. As for estimating the sequence-level likelihood of a trajectory: the authors assume a <a href="https://en.wikipedia.org/wiki/Mean-field_theory">mean-field decomposition</a> (i.e. a complex conditional distribution is approximated by a product of localised, independent distributions), allowing them to simply sum the trajectory’s per-token log probabilities to get this estimate. At least two pieces of empirical support are available for <strong>diffu-GRPO</strong>: (i) <a href="https://arxiv.org/abs/2504.12216?" title="d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning">(Zhao et al, 2025)</a> reported consistently stronger performance on four different math and puzzle/planning logical benchmarks;<label for="sidenote-diffugrpo-results" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-diffugrpo-results" class="margin-toggle" checked="" /><span class="sidenote">See Table 1 and Figure 5 of their <a href="https://arxiv.org/abs/2504.12216?" title="d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning">paper</a>.</span> and (ii) the same approach for obtaining per-token and sequence likelihoods was also adopted by <strong>IGPO</strong> <a href="https://arxiv.org/pdf/2509.10396" title="Inpainting-Guided Policy Optimization for Diffusion Large Language Models">(Zhao et al, 2025b)</a> and tested successfully on reasoning benchmarks there.
<br /><br /></p>
<figure><figcaption><span>Image: likelihood estimation approach in <strong>diffu-GRPO</strong> via one-step denoising (varied mask on prompt tokens across \(\mu\)-updates) – source: <a href="https://arxiv.org/abs/2504.12216?" title="d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning">(Zhao et al, 2025)</a>.<br /><br /></span></figcaption><img src="/assets/img/diffugrpo.png" /></figure>
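<p>A minimal sketch of the <strong>diffu-GRPO</strong> estimate (plain Python with a hypothetical <code>MASK</code> token id; the Masked-DLM forward pass itself is assumed, so only the input construction and the mean-field sum are shown): the prompt is masked at a fixed 15% rate, the continuation is fully masked, and the sequence-level estimate is the plain sum of per-token log-probabilities.</p>

```python
import random

MASK = -1  # hypothetical mask-token id

def diffu_grpo_input(prompt, completion, p_mask=0.15, rng=random):
    # Build the single input used for the one-step likelihood estimate:
    # a perturbed prompt (each token masked with probability p_mask,
    # re-randomised at every mu-update) + a fully masked continuation.
    noisy_prompt = [MASK if rng.random() < p_mask else tok for tok in prompt]
    return noisy_prompt + [MASK] * len(completion)

def sequence_logprob(per_token_logprobs):
    # Mean-field assumption: sum the per-token log-probabilities read off
    # from ONE forward pass of the Masked-DLM on the input above.
    return sum(per_token_logprobs)
```

One forward pass per prompt per \(\mu\)-step makes this the cheapest estimator discussed here; the re-randomised prompt mask is what supplies the regularising variation across \(\mu\)-updates.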

<p>◼️ <strong>coupled-GRPO</strong> <a href="https://arxiv.org/abs/2506.20639" title="DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation">(Gong et al, 2025)</a>: departs from <strong>diffu-GRPO</strong> in that the masking is applied to the continuation portion (i.e. the \(o^k_i\)) of <em>each trajectory</em>. They also use two samples of <em>each trajectory</em> at every \(\mu\)-step for the estimation (compared to <strong>diffu-GRPO</strong>’s use of only one sample of <em>each prompt \(q_{k}\)</em> at every \(\mu\)-step, which requires much less computation). Each of the two samples masks different, but paired (hence the “coupled” in the name), parts of the continuation.<label for="sidenote-coupled" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-coupled" class="margin-toggle" checked="" /><span class="sidenote">This involves sampling a random timestep \(t\), and then setting the other \(\hat{t}\) so that \(t + \hat{t} = T\), the terminal timestep (i.e. 1.0). Then, what is masked for \(t\) is not masked for \(\hat{t}\) and vice-versa.</span> This ensures that (i) every token in a trajectory is involved exactly once in the estimation, giving <em>“each token a non-zero learning signal”</em>, and (ii) the process more closely mimics the denoising generation of Masked-DLMs (where probabilities are produced on partially masked continuations). 
<strong>coupled-GRPO</strong> was used by <a href="https://arxiv.org/abs/2506.20639" title="DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation">(Gong et al, 2025)</a> in their training of DiffuCoder, a code-generation-focused Masked-DLM; they state that <strong>coupled-GRPO</strong> was formulated in response to their finding that <strong>diffu-GRPO</strong>’s likelihood approximation does not <em>“yield a stable reward improvement…, probably because code tasks demand higher token-level generation accuracy than math tasks”</em>. They go on to show that <strong>coupled-GRPO</strong> achieves stronger performance through stabler rewards (see the left and centre charts of Figure 7 in their paper) than <strong>diffu-GRPO</strong> on coding tasks, as well as than another baseline in which the coupling of the (\(t, \hat{t}\)) masks is removed. Interestingly, they also found that <strong>coupled-GRPO</strong> requires sampling trajectories at a higher temperature to succeed (see the right chart of Figure 7 in their paper), which is congruent with similar recent findings on online RL for AR-LLMs <a href="https://arxiv.org/abs/2505.24864" title="ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models">(Liu et al, 2025)</a>. 
<br /><br /></p>
<figure><figcaption><span>Image: likelihood estimation approach in <strong>coupled-GRPO</strong> to balance coverage and reduce variance – source: <a href="https://arxiv.org/abs/2506.20639" title="DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation">(Gong et al, 2025)</a>.</span></figcaption><img src="/assets/img/coupled-grpo.png" /></figure>
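<p>The coupling can be sketched as follows (plain Python; an illustrative sketch, not the authors’ code): sample a timestep \(t\), set its partner to \(1 - t\), and make the two continuation masks exact complements so that every token is scored in exactly one of the pair.</p>

```python
import random

def coupled_masks(seq_len, rng=random):
    # Sample t uniformly in (0, 1); the partner timestep satisfies
    # t + t_hat = 1 (the terminal timestep T). Positions masked under t
    # are left unmasked under t_hat, and vice-versa.
    t = rng.random()
    mask_a = [rng.random() < t for _ in range(seq_len)]
    mask_b = [not m for m in mask_a]  # exact complement of mask_a
    return t, 1.0 - t, mask_a, mask_b
```

The complementarity is what buys the full-coverage property: two forward passes per trajectory, and each token contributes to exactly one of them.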

<p>◼️ <strong>uni-GRPO</strong> <a href="https://arxiv.org/abs/2505.15809" title="MMaDA: Multimodal Large Diffusion Language Models">(Yang et al, 2025)</a>: was used in the training procedure for MMaDA, a Masked-DLM with multi-modal (vision and text) capabilities. It is similar to <strong>coupled-GRPO</strong> in that it also masks the continuation to obtain the likelihood estimates. Specifically, the noise level for each \(\mu\)-step update is randomly sampled from a uniform distribution (instead of the fixed 15% of the prompt in <strong>diffu-GRPO</strong>, which also meant that the same timestep (i.e. \(T\)) was used across all samples there). Only one sample of <em>each trajectory</em> is taken at every \(\mu\)-step (i.e. more computation than <strong>diffu-GRPO</strong> but less than <strong>coupled-GRPO</strong>). In a departure from the other two approaches, the per-token likelihood is computed over the masked tokens (i.e. this relates to the ELBO of the Masked-DLM),<label for="sidenote-elbo" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-elbo" class="margin-toggle" checked="" /><span class="sidenote">See Equation 3 of the MMaDA paper for <strong>uni-GRPO</strong>; and compare with Equation 4 of the DiffuCoder paper for <strong>coupled-GRPO</strong>.</span> and the sequence-level likelihood <em>“is then approximated by averaging over masked tokens”</em> (see Equation 4 in the paper). Taking the ELBO as the estimate is quite a meaningful departure from the <strong>diffu-GRPO</strong> and <strong>coupled-GRPO</strong> approaches; although it has a theoretical connection to the Masked-DLM training objective, it is not clear that it provides a better estimate for online RL training than the case where all tokens are considered (as in <strong>coupled-GRPO</strong>). Nonetheless, it is clear that <strong>uni-GRPO</strong> outperforms <strong>diffu-GRPO</strong> (likely due to the larger-sized sampling, i.e. every trajectory at every \(\mu\)-step).<label for="sidenote-compare" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-compare" class="margin-toggle" checked="" /><span class="sidenote">See comparisons of their performance in Figure 3 in §5.2 of the MMaDA paper and in Table 1 of the IGPO paper, which also leverages <strong>diffu-GRPO</strong>.</span></p>
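<p>The two points that distinguish <strong>uni-GRPO</strong> can be sketched as below (plain Python; stand-in code, not the MMaDA implementation): the masking rate is drawn uniformly at each \(\mu\)-step, and the sequence-level estimate averages log-probabilities over the masked tokens only.</p>

```python
import random

def sample_masking_rate(rng=random):
    # uni-GRPO draws the noise level for each mu-step uniformly from (0, 1),
    # instead of a fixed 15% rate (diffu-GRPO) or a coupled pair (coupled-GRPO).
    return rng.random()

def masked_avg_logprob(per_token_logprobs, mask):
    # ELBO-style sequence estimate: only positions that were masked in this
    # sample contribute, and the estimate averages over them.
    masked = [lp for lp, m in zip(per_token_logprobs, mask) if m]
    return sum(masked) / len(masked) if masked else 0.0
```

Note the contrast with the mean-field sum over all continuation tokens used by the other two estimators: here, unmasked tokens contribute nothing to the estimate.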

<p style="font-size: 22pt; font-weight: bold">Other proposals for Masked-DLMs</p>

<p>◼️ <strong>wd1</strong> <a href="https://arxiv.org/abs/2507.08838" title="wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models">(Tang et al, 2025)</a>: proposes a few modifications to the GRPO objective which allow it to be used on a Masked-DLM with likelihood evaluation through one policy only (the current policy \(\pi_{\theta}\)). This is desirable as it is much more computationally efficient than the approaches above, which also needed likelihood evaluations for the policy before the \(\mu\)-update (\(\pi_{old}\)) and for the reference policy (\(\pi_{ref}\)). Briefly, their approach hinges on shifting from (i) applying importance sampling to the advantage (as in the original PPO; see above) to (ii) applying a reverse KL-divergence penalty.<label for="sidenote-wd1" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-wd1" class="margin-toggle" checked="" /><span class="sidenote">See §3.1 and Equation 3 of their <a href="https://arxiv.org/abs/2507.08838" title="wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models">paper</a>.</span> This enables their derivation of an expression of the GRPO objective that only needs likelihood estimates from \(\pi_{\theta}\).<label for="sidenote-wd1-obj" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-wd1-obj" class="margin-toggle" checked="" /><span class="sidenote">Obtaining their expression of the GRPO objective also involves shifting the trajectory sampling (from \(\pi_{old}\)) to a geometric mixture of \(\pi_{old}\) and \(\pi_{ref}\) (see §3.1 of their paper).</span> They report up to 16% better performance over <strong>diffu-GRPO</strong> on math and logic/puzzle planning benchmarks, even without an SFT phase (which is, on the other hand, needed for <strong>diffu-GRPO</strong> to reach reasonable performance).<label for="sidenote-wd1-results" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-wd1-results" class="margin-toggle" checked="" /><span class="sidenote"><ins><em>Sidenote:</em></ins> A case could be made that the settings <a href="https://arxiv.org/abs/2507.08838" title="wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models">(Tang et al, 2025)</a> use for comparison with <strong>diffu-GRPO</strong> might not be fully like-for-like. Their wd1 objective is derived assuming the \(\beta\)-controlled KL-regularisation term (see above) is included in the GRPO objective (see their Equation 5). While the final expression of the wd1 objective does away with an explicit KL-regularisation term, the controlling \(\beta\) remains embedded throughout the objective (their Equations 9 and 6). Yet in practice \(\beta\) is set to 0 (see “Implementation” in §4 of their paper) for the model trained with wd1 in their experiments, which in effect leaves out any KL-regularisation. On the other hand, the \(\beta\) from the original <strong>diffu-GRPO</strong> paper (0.04) was kept there (see Table 5 of Appendix B.4). Perhaps it would also be helpful to understand how <strong>diffu-GRPO</strong> performs without KL-regularisation applied.</span> Notably, however, it does not appear that this approach leads to stronger empirical outcomes when compared with <strong>uni-GRPO</strong>.<label for="sidenote-wd1-less" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-wd1-less" class="margin-toggle" checked="" /><span class="sidenote">Compare the reported scores on GSM8K and MATH500 by <a href="https://arxiv.org/pdf/2509.10396" title="Inpainting-Guided Policy Optimization for Diffusion Large Language Models">(Zhao et al, 2025b)</a> (refer to Table 1) and <a href="https://arxiv.org/abs/2507.08838" title="wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models">(Tang et al, 2025)</a> (refer to Table 3; look at the 256-length results, as the other paper used a 256-length setting – see Appendix A there).</span></p>

<!-- Note however, a slight discrepancy: the __diffuGRPO__ results reported by <a href='https://arxiv.org/pdf/2509.10396' title='Inpainting-Guided Policy Optimization for DiffusionLarge Language Models'>(Zhao et al, 2025b)</a> is actually higher than the figures report by <a href='https://arxiv.org/abs/2507.08838' title='wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models'>(Tang et al, 2025)</a>. -->

<p>◼️ <strong>IGPO</strong> <a href="https://arxiv.org/pdf/2509.10396" title="Inpainting-Guided Policy Optimization for Diffusion Large Language Models">(Zhao et al, 2025b)</a>: shares the same first author as <strong>diffu-GRPO</strong> and, as mentioned above, uses the same likelihood-estimation approach. The novelty here is a procedure that leverages the inpainting capabilities inherent in DLMs (see my first <a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P1#:~:text=strategies%20such%20as-,infilling,-.%20In%20the%20GIF">post</a>) to improve the training efficiency and efficacy of GRPO. In GRPO,<label for="sidenote-refer" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-refer" class="margin-toggle" checked="" /><span class="sidenote">Refer to the image of PPO vs GRPO in the margins above for a sense.</span> when the reward of every sampled trajectory in the group for a given prompt \(q_{k}\) (e.g. a math problem) is zero, there is no useful signal for the model to update its parameters. <strong>IGPO</strong>’s proposal assumes access to ground-truth or sufficiently high-quality reasoning traces for \(q_{k}\), and uses a segment of such a trace whenever a zero-reward group is encountered. Specifically, by “seeding” a fragment of the reasoning trace amongst the masked tokens, we get a chance to steer the Masked-DLM towards generating a trajectory of good quality,<label for="sidenote-hint" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-hint" class="margin-toggle" checked="" /><span class="sidenote">It is akin to hinting to Jesse <em>“They have 2 plus 2 apples so that is 4 apples…“</em></span> which is then swapped with a zero-reward trajectory from the group. 
Training in this way to avoid zero-reward update steps led to stabler learning and enabled improvements over <strong>diffu-GRPO</strong> on math and planning benchmarks; importantly, it also outperforms <strong>uni-GRPO</strong>, which requires more samples for the likelihood estimation (per-trajectory per \(\mu\)-update vs per-prompt per \(\mu\)-update).<label for="sidenote-igporesults" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-igporesults" class="margin-toggle" checked="" /><span class="sidenote">See Table 1 of their paper <a href="https://arxiv.org/pdf/2509.10396" title="Inpainting-Guided Policy Optimization for Diffusion Large Language Models">(Zhao et al, 2025b)</a>.</span></p>
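<p>The mechanism can be sketched as below (plain Python with a hypothetical <code>MASK</code> token id; the <code>frac</code> knob controlling how much of the trace is revealed is illustrative, not the paper’s exact schedule): detect a zero-reward group, then plant a fragment of a known-good trace into the masked canvas before re-sampling.</p>

```python
import random

MASK = -1  # hypothetical mask-token id

def needs_inpainting(group_rewards):
    # A zero-reward group yields a zero advantage for every trajectory,
    # hence no learning signal for that prompt.
    return all(r == 0 for r in group_rewards)

def igpo_seed_canvas(prompt, trace, gen_len, frac=0.25, rng=random):
    # "Seed" a fragment of the ground-truth reasoning trace amongst the
    # masked completion tokens; the Masked-DLM then inpaints around the
    # revealed hint tokens when the trajectory is re-sampled.
    canvas = [MASK] * gen_len
    span = min(len(trace), gen_len)
    n_hint = max(1, int(frac * span))
    for pos in rng.sample(range(span), n_hint):
        canvas[pos] = trace[pos]
    return prompt + canvas
```

The resulting trajectory then replaces one of the zero-reward trajectories in the group, restoring a non-degenerate advantage signal.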

<p>◼️ <strong>TraceRL</strong> <a href="https://arxiv.org/abs/2509.06949" title="Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models">(Wang et al, 2025)</a>: encapsulates some of the latest developments on a few fronts of Masked-DLM research. To summarise, they propose the use of a value model (another DLM) to manage the variance across updates (<em>à la</em> PPO). In addition, they leverage the semi-autoregressive approach of Fast-dLLM <a href="https://arxiv.org/abs/2505.22618" title="Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding">(Wu et al, 2025)</a>, which I discussed in my previous <a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P3#:~:text=There%20are%20a-,number,-of%20approaches%20proposed">post</a>, which denoises blocks of tokens auto-regressively and gains efficiency via approximated KV caches. This speeds up trajectory sampling, easing a major bottleneck, especially for problems best solved with lengthy reasoning traces. Putting these extensions together required special treatment (e.g. their §4.3), and this work is notable for putting forward a proposed solution for doing so. They report impressive performance on math benchmarks (87.4 on GSM8K and 94.2 on MATH500) that outperforms the other methods listed above in this section,<label for="sidenote-tracerl" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-tracerl" class="margin-toggle" checked="" /><span class="sidenote">See Table 2 of their paper <a href="https://arxiv.org/abs/2509.06949" title="Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models">(Wang et al, 2025)</a>; and compare against the results reported in the other papers.</span> as well as on coding. 
Helpfully, the authors released the TraDo series of <a href="https://huggingface.co/collections/Gen-Verse/trado-series-68beb6cd6a26c27cde9fe3af">4B/8B-parameter Masked-DLMs</a> that they trained with this approach, alongside their codebase.</p>
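<p>For a feel of where the sampling speed-up comes from, here is a toy sketch (plain Python; <code>denoise_block</code> is a stand-in for the Masked-DLM’s multi-step denoiser) of the semi-autoregressive pattern that TraceRL builds on: blocks are produced left-to-right, tokens within a block are denoised together, and completed blocks become (cacheable) context for the next block.</p>

```python
def semi_autoregressive_decode(denoise_block, num_blocks, block_size):
    # Blocks are generated autoregressively (left-to-right over blocks),
    # while tokens WITHIN a block are denoised in parallel. Completed
    # blocks can be cached (approximate KV cache), so each new block
    # conditions on them cheaply instead of being re-denoised.
    context = []
    for _ in range(num_blocks):
        block = denoise_block(context, block_size)  # multi-step denoising inside
        context.extend(block)                       # frozen, cacheable context
    return context
```

A dummy denoiser is enough to see the block ordering: each block is filled in only after all earlier blocks are fixed, which is what makes the cached context valid.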

<h3 id="-4-whats-next">🤔 4. What’s next?</h3>
<p>To sum up, in this post I started with a general overview of online RL post-training for LLMs. With some common ground established, I highlighted the main challenge in extending existing methods for AR-LLMs to Masked-DLMs: how to efficiently and accurately estimate the token and sequence likelihoods needed for the PPO/GRPO objective. Finally, I outlined recent work bringing PPO/GRPO to Masked-DLMs, focusing on how each proposes to address this estimation challenge.</p>

<p>This wraps up the first round on the topics I intended to cover in this series; the next posts – probably slightly further out – would look at all the areas I have discussed so far, but with multi-modality in consideration.</p>]]></content><author><name>Kelvin Han</name></author><category term="discrete" /><category term="diffusion" /><summary type="html"><![CDATA[⊕Source: PNGs in GIF generated with ChatGPT. Recently, the post-training of large language models (LLMs) with reinforcement learning (RL) has been an important source for the significant progress we are seeing in LLM capabilities (for reasoning, agents/tool-use, planning etc). ⊕The post-training of an LLM comes after pretraining (which is when LLMs are trained on next-token prediction over web-scale text). It “polishes” the LLM into the useful models we are used to interacting with. This blog post by two Meta Super Intelligence (MSL) researchers gives a good overview of the post-training phase. Much of what is in their blog post would also apply to post-training for diffusion language models (DLMs). This became especially apparent earlier this year when DeepSeek surprised (and moved markets) with the release of their R1 model (Guo et al, 2025), an auto-regressive LLM (AR-LLM). Their model was post-trained with an efficient RL algorithm (GRPO, see below) in a way that unlocked “thinking” for improved performance on reasoning tasks.The term ‘reasoning’ with respect to LLMs, also referred to as Large Reasoning Models or LRMs, is still being settled upon. However, there are notable differences between the “thinking” traces produced by LRMs and what we might generally accept as reasoning by humans. A good overview of this can be found in (Kambhampati et al, 2025) and this list. Prior to this, however, RL post-training was already crucial for aligning AR-LLM generations towards users’ preferred forms/styles of text and conversation, as well as for meeting safety and security requirements. 
In this post, I examine similar RL methods for diffusion language models (DLMs); which will be key for pushing DLMs to parity (or more) with existing AR-LLMs in terms of capabilities. In line with the previous posts of this series, I will focus on Masked-DLMs (which are the keenest focus of current research); on the RL side, my focus will be on online policy-gradient algorithms,Policy-gradient algorithms are named for how: (i) the policy (the mechanism for generating trajectories i.e. sequences of tokens when used in the context of LLM RL post-training) is in the form of a parameterized model, and (ii) the policy’s parameters are learned by following the gradient of some function with respect to those parameters. (The definitions of the terms in red can be found below.) They can be categorised as being online or offline, where the “on”/”off” relates to whether the policy is learning from trajectories coming from itself (“on”) or not (“off”; e.g. from another model’s distribution). Sidenote: offline methods, such as DPO (Rafailov et al, 2023), were instrumental for aligning LLM to human preferences (e.g. used in the training for Llama 3 (Llama Team, AI @ Meta, 2024) models). For the interested, similar methods have been proposed for diffusion models: e.g. VRPO used to post-train the original LLaDA Masked-DLM (Nie et al, 2025) to give LLaDA 1.5 (Zhu et al), as well as ones for continuous diffusion such as Diffusion-DPO (Wallace et al, 2023) and DSPO (Zhu et al, 2025). namely: Group Relative Policy Optimization (GRPO) (Shao et al, 2024), (DeepSeek AI, 2025) and Proximal Policy Optimization (PPO) (Schulman et al, 2017), which (i) are being used in post-training AR-LLM today; and (ii) have been found to reach better performance compared to offline algorithms (Ivison et al, 2024). The outline of this post is as follows: I will start by setting the scene with 1. an accessible introduction to RL post-training using policy-gradient algorithms, followed by outlining 2. 
the main challenge for DLM post-training with such methods. I will then highlight 3. some proposed approaches for RL post-training of DLMs. If you are familiar with RL post-training for LLMs, you could just skip directly to section 2. Otherwise, to fully benefit from this post, going through my earlier posts Part 1, Part 2 and Part 3 for some background to DLMs first would probably be useful. 🖼️ 1. Paint a picture of RL post-training of LLMs in 5 minutes? Before discussing policy-gradient RL for DLMs, let’s get to some common ground with an introduction to such methods, as well as a sense of how we are using them with AR-LLMs currently. I always find analogies help us to better grasp complex topics, so let’s start with one: Imagine you are a parent of a child Jesse, and you want them to learn to give the right answer (let’s refer to this as o, for output) to this question: “Levy has two apples in his pocket, Alex has two apples in her bag. They have a picnic and eat one of the apples. How many apples do they have left?” (essentially: “What is 2+2-1 equals to?”); let’s refer to this question as \(q_{k}\). The idea is that it is best to have Jesse give “3” (or similar) as the final answer whenever they encounter \(q_{k}\) (or a similar problem). One way to help Jesse learn this could be to (i) pose \(q_{k}\) to Jesse multiple times, then (ii) have Jesse give an answer each time (let’s call each of this \(o^{k}_{i}\)), and then (iii) tell Jesse for each \(o^{k}_{i}\) whether it is a good answer. Some possible answers of Jesse's 💬 o1: I know this, the answer is 3! 💬 o2: I don't know, the answer is 3? 💬 o3: I love chicken nuggets! I will never eat apples! 💬 o4: Levy has 2 apples and they eat 1 so there has to be 3 apples left. 💬 o5: Three! 💬 o6: They have 2 plus 2 apples so that is 4 apples. They eat one, so 4 minus 1, that means they have 3 apples left. Duh! 
We can also see from above that some of the answers that Jesse might come up with could be better than others; in terms of correctness (in the final answer, and in the reasoning) as well as for style.Some answers are clearly off (i.e. o3). Some (i.e. o4) give the correct final answer, but have a wrong reasoning process for getting to it. Some might be nearly identical, but one amonsgt them is preferred over another, e.g. o1 versus o2 (for a more confident Jesse). Others might be quite different yet slightly preferred over another, e.g. o5 versus o6 (depending on whether we prefer to have a Jesse that will give a reasoning along with their answer, but in a sassy way). Therefore, we can expect to have some preferences between each \(o^{k}_{i}\), and hence we might want to steer Jesse’s mind such that whenever Jesse encounters \(q_{k}\), ideally Jesse gives \(o^{k}_{i}\) that is most preferable. In essence, what we want to do for Jesse is similar to what we want to do with LLMs using policy-gradient RL post-training! i.e. we want an LLM to learn, by updating its parameters, that when presented with a certain prompt \(q_{k}\) (or similar) it should generate responses that are preferred (achieve highest reward). This is done via getting the LLM to give higher likelihoods for the sequence of tokens in the higher-scoring \(o^{k}_{i}\). Some terminology Before proceeding, let's set the definitions of some key RL terms first. Each of these terms are also associated with concepts (in brackets and blue below) from the Jesse example, so as to connect them with RL on AR-LLMs. ▪️ "state": information about the current situation at a given moment in time; ▪️ "action": a decision/choice that can be taken at the point of a certain state; ▪️ "trajectory": a sequence of states and actions that can be taken (oki); ▪️ "policy": some model that can give us trajectories (Jesse); ▪️ "reward": feedback on a trajectory, i.e. 
what can be gotten if the trajectory is taken (whether oki is good or bad/how good or how bad); ▪️ "reward model": some method/model giving the reward for a trajectory (you!). ▪️ "advantage": how much better taking action at at state st is compared to the average of all actions possible. ⊕To make the definitions more concrete let’s shift the example with Jesse above to an AR-LLM: Let’s say we are at the point in time (state) where the AR-LLM has just processed the prompt \(q_{k}\) fed to it. Let’s call this state \(s_{0}\). For the sake of this example, let us assume that the AR-LLM can only ever give answers to \(q_{k}\) from the 6 examples above (i.e. o1 to o6). If we prefer o6 the most, then the action we want from the AR-LLM immediately after \(s_{0}\) is to return the word “They” (in the next-token prediction set-up of AR-LLMs, this means striving to give this word the highest probability). The objective is to have the AR-LLM learn to return a sequence (i.e. trajectory) of state-action decisions so as to give an answer that obtains as high a reward as possible. Note that the learning for the policy also involves cases such as these: if the action chosen was to return “I” after \(s_{0}\), then the AR-LLM should learn that at such \(s_{1}\), the word “know” should have the highest probability (applies if we prefer o1 over all the other answers (o2 and o3) that start with “I”). and so on and so forth… In practice, we achieve this by getting the LLM (the policy) to generate a diverse set of answers for a given prompt \(q_{k}\) by using a sufficiently high sampling temperature. The LLM learns via the feedback from the rewards of different experiences (i.e. pairs of \(q_{k}, o^{k}_{i}\)) which is the best answer to give. PPO and GRPO briefly: efficient &amp; stable training (Bear with me, just a little more common ground… 😅, so that we can situate the next section properly.) 
In this section, I zoom in to focus on two aspects shared by the PPO and GRPO algorithms; a sense of these aspects are necessary for me to be able to explain the key points of the subsequent sections.I give a very general view here, but there is a fair bit more behind both algorithms; for a fuller understanding of them take a look at the following resources to start: this post by Jimmy Shi, this series by Nathan Lambert and this HuggingFace RL course unit. ▪️ A major preoccupation for RL training in general (i.e. including PPO/GRPO) is to find some balance between exploration (i.e. generating diverse answers to receive useful feedback for learning) and exploitation (i.e. leveraging useful knowledge the policy has learned from past encounters, e.g. from Jesse’s o5 which gets a good reward).The trade-off is as follows: ▪️ allowing more exploration (i.e. via generating the \(o^{k}_{i}\) trajectories by sampling with high temperature) results in very sparse signals (to go the extreme: imagine that for every \(q_{k}\), we have to generate all the possible combinations of words in English almost all of which would have very low reward with respect to \(q_{k}\)) and wastes compute; whereas, on the other hand, ▪️ relying on already learned knowledge (e.g. generating \(o^{k}_{i}\) by sampling with low temperature) may keep the policy around poor/sub-optimal outputs i.e. does not allow it to reach an optimal \(o^{k}_{i}\). When applying PPO/GRPO to AR-LLMs, the bottleneck is the generating of trajectories (due to the generation process being auto-regressive) and it typically takes up most of the training run-time. Hence, it is typical to reuse the same set of sampled trajectories for a few more update stepsThis is “K epochs” in Algorithm 1 of the PPO paper (Schulman et al, 2017) and num_ppo_epochs in the TRL implementation; the \(\mu\) hyperparameter in Algorithm 1 of the DeepSeek Math (GRPO) paper (Shao et al, 2024) and num_iterations in the TRL implementation. 
to squeeze more learning out of them. Hereon, I will use the term \(\mu\)-updates to refer to these update steps. Think of it in this way: although going through one round of (\(q_{k}, o^k_1... o^k_6\)) with Jesse might help them get a little closer to giving the most preferred output, but it might not be sufficient… so we repeat with multiple rounds of (\({k}, o^k_1... o^k_6\)) to help Jesse learn.Sidenote: While PPO and GRPO are recognised as online methods, a case could be made that these subsequent \(\mu\)-updates after the first step/epoch, are at least slightly off-policy (Zhang et al, 2025)… especially when \(\mu\) is set to a large number. ▪️ Another major preoccupation (for policy-gradient methods in general) is achieving stable training to facilitate successful policy learning.Since we typically train across diverse problems \(q_k\) that each have their own reward distributions, this adds to the variance in the gradient estimates (which is already present between trajectories of a given \(q_k\)); therefore, when taking update steps, large updates can overfit the policy to some problems at the expense of others, leading to instability and hindering overall learning. Hence, one of the design principles in PPO (Schulman et al, 2017) was to ensure stability across update steps. This was done by adding the following to the training objective of vanilla policy-gradient methods (e.g. REINFORCE (Williams, 1992)): (i) a KL-regularisation term;The KL divergence is a measure of how close/apart one distribution (\(P\)) is to another (\(Q\)); it is an asymmetric measure; so KL of \(P || Q\) is not the same as KL of (Q||P). and (ii) the use of clipping as a floor/ceiling on the update. These help avoid updates to the policy that veer too far from some "trusted" zone of some reference policy that has already been established (for e.g. from explorations in previous updates, or an initial SFT-ed policy). 
Since GRPO is actually based upon PPO (doing away with the need for a separate memory- and compute-heavy value model to assess advantage, replacing it with a group-based advantage estimation), a similar objective to PPO can also be found there. ⊕Image: PPO and GRPO; their similarities and differences – source: (Shao et al, 2024). Note that there are variants of PPO that permit generating multiple trajectories and computing their rewards and advantages in one pass (similar to the GRPO figure), while still needing a value model. It is these – the KL-regularisation term and the \(\mu\)-updates – that present some challenges to overcome (as well as opportunities to leverage, as in diffuGRPO) for the use of PPO/GRPO on DLMs, and we will discuss these next… (Note: the rest of this post will go into the weeds on these points and will be more technical.) ♻️ 2. Can we reuse the PPO/GRPO methods that worked for AR-LLMs? The short answer is… broadly yes, but with the need for some non-trivial modifications to address the issue of how to obtain the likelihoods of trajectories from a Masked-DLM. These likelihoods are needed in two places in the PPO/GRPO objective: (i) in an importance sampling weight, and (ii) in an estimate of the KL-regularisation term. We use the GRPO objective to illustrate (with clipping omitted to reduce clutter in the equation): \(\begin{aligned} L_{GRPO}(\theta) = - \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \\ \bigg[ {\color{red}\frac{\pi_\theta(o_{i,t}|x, o_{i,&lt;t})}{{\left[ \pi_\theta(o_{i,t}|x, o_{i,&lt;t}) \right]}_{\text{no grad}}}} \hat{A}_{i,t} {\color{blue} - \beta D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}} ] \bigg] \end{aligned}\) where \(\pi_{\theta}\) is the current policy and \(\pi_{ref}\) is either the initial policy (typically obtained via SFT) or some earlier-update \(\pi_{\theta}\).
As we can see: ▪️ the per-token likelihoods (obtained twice, once with gradients through the policy \(\pi_{\theta}\) and once without gradients) are used in the first term (in red). The ratio of these corresponds to an importance sampling weight on the advantages \(\hat{A}_{i,t}\) (to account for the trajectories being slightly off-policy in the \(\mu\)-update steps); ▪️ the KL-regularisation term is in blue; and this is where the sequence-level likelihoods are used. In practice, this KL estimate is implemented via the form: KL \(= e^r - r - 1\) where \(r = \log( \pi_{ref}(o_{i,t}|x, o_{i,&lt;t}) / \pi_{\theta}(o_{i,t}|x, o_{i,&lt;t}) )\). The per-token likelihoods for \(\pi_{\theta}\) above can be reused, and only the ones from \(\pi_{ref}\) need to be computed here. For a concrete feel: see the implementation in TRL. Sidenote: recent studies ((Zhang et al, 2025) and (Tang et al, 2025)) have established that there are non-trivial differences arising from a set of fine-grained method and implementation choices for the KL divergence estimate. Note that these have implications for online RL of DLMs due to the need to compute these estimates there (see below). To my mind, these two pieces are recommended reading for RL on DLMs. Sidenote: if the beta (\(\beta\)) coefficient, which controls the amount of KL-regularisation in PPO/GRPO, is set to zero, then there is no need for the sequence-level likelihoods. Empirically, there have been recent reports that KL-regularisation may not be necessary for AR-LLMs (quite likely under certain training setups, i.e. hyperparameter settings and modeling choices where the encountered KL divergences between \(\pi_{\theta}\) and \(\pi_{ref}\) are low); see e.g. (Copet et al, 2025), page 13 of the paper. Computing these likelihoods for trajectories is easy for AR-LLMs because of how they factorise sequence probabilities at the token level; i.e. at each step, the AR-LLM predicts from its vocabulary the most likely token to generate.
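For a concrete (hedged) sketch of that KL estimate: the snippet below computes the \(e^r - r - 1\) form from per-token log-probs, with the ratio direction as implemented in TRL and the DeepSeekMath (GRPO) paper, i.e. \(r = \log(\pi_{ref}/\pi_{\theta})\); the arrays are placeholders for model outputs.

```python
import numpy as np

def kl_k3(logp_theta, logp_ref):
    # r = log(pi_ref / pi_theta); the estimator e^r - r - 1 is always
    # non-negative and is exactly zero where the two policies agree.
    r = logp_ref - logp_theta
    return np.exp(r) - r - 1

# Placeholder per-token probabilities under the two policies.
logp_theta = np.log(np.array([0.5, 0.3, 0.2]))
logp_ref   = np.log(np.array([0.4, 0.4, 0.2]))
per_token_kl = kl_k3(logp_theta, logp_ref)
```

Note that the last token, where the two policies agree, contributes exactly zero to the penalty.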
As a result, it is very easy to compute what an AR-LLM thinks is the likelihood of any sequence of tokens (by the chain rule, i.e. simply summing the log-probabilities of each token of the sequence). See also footnote 27 in the second post of this series. However, this is not the case 😵‍💫 for Masked-DLMs (and discrete diffusion models generally). Although we do get probabilities for tokens at each step of the diffusion generation process (which is what allows us to decide which tokens to unmask), each of these steps is a denoising one that depends on all its preceding steps. In other words, computing sequence probabilities for DLMs requires going through multiple denoising steps (from \(T\) to 0). Having to do this for every sampled trajectory during online RL training with PPO/GRPO is very computationally expensive, and will be significantly worse for very long sequences. Although the efficient DLM methods I covered in the previous post (such as Block Diffusion (Arriola et al, 2025)) can help alleviate this, the increase in computation required – compared to what is required for AR-LLMs – will still be substantial. As such, there is a need to establish ways to efficiently, yet as accurately as possible, estimate these likelihoods with the DLM. This is the focus of much current research and is what we will look into in the next section. 💡 3. What has been proposed for DLMs? This section outlines two research trends in bringing online RL algorithms to Masked-DLMs. All of these started with diffuGRPO (Zhao et al, 2025), which landed in Q1 this year and was the first work to explore a way of bringing online RL algorithms to Masked-DLMs. diffuGRPO and the initial wave of research are distinguished by their main contributions on ways to estimate likelihoods with Masked-DLMs; another, more recent wave (released in the last month or so) begins to explore extensions for Masked-DLMs with semi-autoregressive generation for longer generations and with more efficiency (e.g.
with KV caching). Efficient likelihood estimation for online RL on Masked-DLMs Each of the three pieces of work mentioned here proposed a way to do the likelihood estimation. Note that although they were formulated for GRPO, it should be possible to leverage their likelihood approaches in a PPO setup. ◼️ diffu-GRPO (Zhao et al, 2025): estimates the per-token likelihood of a trajectory by simply doing the unmasking in one step. In practical terms, this is done as follows: for a given prompt \(q_{k}\), append a fully-masked continuation (i.e. of max sequence generation length) and pass it through the Masked-DLM; the output is the estimated per-token probability distribution (conditioned on the prompt \(q_{k}\)). As noted above, such one-/few-step unmasking does not reflect the multi-step denoising in Masked-DLMs – hence, and quite importantly, their proposal hinges on (i) the \(\mu\)-updates typically (but not mandatorily) used in GRPO, and (ii) a random masking of the prompt \(q_{k}\) portion of the input (i.e. input = \(q_{k}\) + fully masked continuation). At each of the \(\mu\)-steps, the mask is randomised but the masking rate is always fixed at 15%. See Appendix A of the paper: “In gradient update iterations, each token in the prompt is randomly masked with a probability pmask = 0.15 for log-probability estimation.” We can see this as obtaining slightly varied likelihood estimates for a set of inputs closely resembling the prompt \(q_{k}\), which according to the authors “acts a form of regularization for policy optimization”. As for estimating the sequence-level likelihood of a trajectory: the authors assume a mean-field decomposition (i.e. that a series of localised independent distributions can usefully approximate a complex conditional distribution), allowing them to simply sum the trajectory’s per-token log-probabilities to get this estimate.
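My reading of this one-step estimation, as a minimal numpy sketch (the `model` and all helper names here are hypothetical stand-ins, not the authors' code):

```python
import numpy as np

MASK = -1                       # stand-in id for the [MASK] token
rng = np.random.default_rng(0)

def one_step_logprobs(model, prompt, trajectory, p_mask=0.15):
    # Randomly mask ~15% of the prompt (re-randomised at each mu-step).
    prompt = np.asarray(prompt)
    noised = prompt.copy()
    noised[rng.random(prompt.shape) < p_mask] = MASK
    # Append a fully masked continuation and do ONE forward pass.
    x = np.concatenate([noised, np.full(len(trajectory), MASK)])
    logits = model(x)                                   # (len(x), vocab)
    logp = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    cont = logp[len(prompt):]
    # Per-token log-probs of the trajectory's tokens...
    per_token = cont[np.arange(len(trajectory)), np.asarray(trajectory)]
    # ...and the mean-field sequence-level estimate: their sum.
    return per_token, per_token.sum()

# A dummy "model": uniform logits over a 10-token vocabulary.
dummy_model = lambda x: np.zeros((len(x), 10))
per_token, seq_logp = one_step_logprobs(dummy_model, [1, 2, 3, 4], [5, 6])
```

A real Masked-DLM would of course return learned logits; the key points are the single forward pass over prompt + fully masked continuation, the re-randomised 15% prompt mask, and the mean-field sum for the sequence-level estimate.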
At least two pieces of empirical support are available for diffu-GRPO: (i) (Zhao et al, 2025) reported consistently stronger performance on four different math and puzzle/planning logical problems (see Table 1 and Figure 5 of their paper); and (ii) the same approach to obtaining per-token and sequence likelihoods was adopted by IGPO (Zhao et al, 2025b) and tested successfully on reasoning benchmarks there. Image: likelihood estimation approach in diffuGRPO via one-step denoising (varied mask on prompt tokens across \(\mu\)-updates) – source: (Zhao et al, 2025). ◼️ coupled-GRPO (Gong et al, 2025): departs from diffu-GRPO in that they apply the masking to the continuation portion (i.e. the \(o^k_i\)) of each trajectory. They also use two samples of each trajectory at every \(\mu\)-step for the estimation (compared to diffu-GRPO’s use of only one sample of each prompt \(q_{k}\) at every \(\mu\)-step, which requires much less computation). Each of the two samples masks different, but paired (hence the “coupled” in the name), parts of the continuation. This involves sampling a random timestep \(t\), and then setting the other timestep \(\hat{t}\) so that \(t + \hat{t} = T\), the terminal timestep (i.e. 1.0); then, what is masked at \(t\) is not masked at \(\hat{t}\), and vice-versa. This ensures that (i) every token in a trajectory is involved exactly once in the estimation, giving “each token a non-zero learning signal”, and (ii) the process more closely mimics the denoising generation in Masked-DLMs (where probabilities are produced on partially masked continuations). coupled-GRPO was used by (Gong et al, 2025) in their training of DiffuCoder, a code-generation-focused Masked-DLM; they stated that coupled-GRPO was formulated in response to their finding that diffu-GRPO’s likelihood approximation methods do not “yield a stable reward improvement…, probably because code tasks demand higher token-level generation accuracy than math tasks”.
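The coupled masking can be sketched as follows (my reading of the scheme; not the authors' code):

```python
import numpy as np

# Sample a timestep t, pair it with t_hat = T - t, and build two
# complementary masks over the continuation so that every token is
# masked in exactly one of the two forward passes.

rng = np.random.default_rng(0)

def coupled_masks(cont_len, T=1.0):
    t = rng.uniform(0, T)                  # noise level for the first pass
    mask_t = rng.random(cont_len) < t / T  # masked positions at t
    mask_t_hat = ~mask_t                   # complementary mask at T - t
    return mask_t, mask_t_hat

m1, m2 = coupled_masks(16)
# Each token's log-prob is read from the pass in which it is masked, so
# every token contributes a (non-zero) learning signal.
```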
They go on to show that coupled-GRPO leads to stronger performance through stabler rewards (see the left and centre charts of Figure 7 in their paper) over diffu-GRPO for coding tasks, as well as over another baseline where they remove the coupling of the (\(t, \hat{t}\)) masks. Interestingly, they also found that coupled-GRPO required sampling trajectories at a higher temperature for success (see the right chart of Figure 7 in their paper), which is congruent with similar recent findings on online RL for AR-LLMs (Liu et al, 2025). Image: likelihood estimation approach in Coupled-GRPO to balance coverage and reduce variance – source: (Gong et al, 2025). ◼️ uni-GRPO (Yang et al, 2025): was applied in the training procedure for MMaDA, a Masked-DLM with multi-modal (vision and text) capabilities. It is similar to coupled-GRPO in that it also masks the continuation to obtain the likelihood estimates. Specifically, the noise level for each \(\mu\)-step update is randomly sampled from a uniform distribution (instead of the fixed 15% masking of the prompt in diffu-GRPO, which also meant that the same timestep (i.e. \(T\)) was used across all samples there). Only one sample of each trajectory at every \(\mu\)-step is taken (i.e. more computation than diffu-GRPO but less than coupled-GRPO). In a departure from the other two approaches, the per-token likelihood is computed over the masked tokens (i.e. this relates to the ELBO of the Masked-DLM; see Equation 3 of the MMaDA paper for uni-GRPO, and compare with Equation 4 of the DiffuCoder paper for coupled-GRPO), and the sequence-level likelihood “is then approximated by averaging over masked tokens” (see Equation 4 in the paper).
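As a minimal sketch of the uni-GRPO-style estimate as I read it (placeholder arrays, hypothetical names): the noise level is drawn uniformly per \(\mu\)-step, only *masked* continuation positions are scored, and their log-probs are averaged for the sequence-level estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def unigrpo_seq_logprob(per_token_logp):
    t = rng.uniform(0.0, 1.0)                       # uniform noise level
    masked = rng.random(len(per_token_logp)) < t    # masked positions
    if not masked.any():                            # keep >= 1 masked token
        masked[rng.integers(len(per_token_logp))] = True
    return per_token_logp[masked].mean()            # average over masked tokens

# Placeholder per-token log-probs for one trajectory.
logp = np.log(np.array([0.2, 0.4, 0.1, 0.3]))
seq_estimate = unigrpo_seq_logprob(logp)
```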
Taking the ELBO as the estimate is quite a meaningful departure from the diffu-GRPO and coupled-GRPO approaches, and although it has a theoretical connection to the Masked-DLM training objective, it is not clear that it provides a better estimate for online RL training compared to the case where all tokens are considered (as in coupled-GRPO); nonetheless, it is clear that uni-GRPO outperforms diffu-GRPO (likely due to the larger-sized sampling, i.e. every trajectory at every \(\mu\)-step). See comparisons of their performance in Figure 3 in §5.2 of the MMaDA paper and Table 1 of the IGPO paper, which also leverages diffu-GRPO. Other proposals for Masked-DLMs ◼️ wd1 (Tang et al, 2025): proposes a few modifications to the GRPO objective which allow it to be used on a Masked-DLM with likelihood evaluation through one policy only (the current policy \(\pi_{\theta}\)). This is desirable as it is much more computationally efficient than the approaches above, which also needed to evaluate likelihoods for the policy before the \(\mu\)-update (\(\pi_{old}\)) and for the reference policy (\(\pi_{ref}\)). Briefly, their approach hinges on shifting from (i) applying the importance sampling weight to the advantage (as in the original PPO; see above) to (ii) applying a reverse KL-divergence penalty (see §3.1 and Equation 3 of their paper). This enables their derivation of an expression of the GRPO objective that only needs likelihood estimates from \(\pi_{\theta}\). (Obtaining their expression of the GRPO objective also involves shifting the trajectory sampling from \(\pi_{old}\) to a geometric mixture of \(\pi_{old}\) and \(\pi_{ref}\); see §3.1 of their paper.)
They report obtaining up to 16% better performance over diffuGRPO on math and logic/puzzle planning benchmarks, even without having to do an SFT phase (which is, on the other hand, needed for diffuGRPO to reach reasonable performance). Sidenote: a case could be made that the settings (Tang et al, 2025) use for comparison with diffuGRPO might not be fully like-for-like. Their wd1 objective is derived assuming the \(\beta\)-controlled KL-regularisation term (see above) is included in the GRPO objective (see their Equation 5). While the final expression of the wd1 objective does away with an explicit KL-regularisation term, the controlling \(\beta\) remains embedded throughout the wd1 objective (their Equations 9 and 6). Yet in practice \(\beta\) is set to 0 (see “Implementation” in §4 of their paper) for the model trained with wd1 in their experiments; in effect, this leaves out any consideration of KL-regularisation. On the other hand, the \(\beta\) from the original diffuGRPO paper (0.04) was kept (see Table 5 of Appendix B.4). Perhaps it would be helpful to also understand how diffuGRPO performs without the KL-regularisation applied. Notably, however, it does not appear that this approach leads to stronger empirical outcomes when compared with uniGRPO. (Compare the scores reported on GSM8K and MATH500 by (Zhao et al, 2025b) (refer to Table 1) and (Tang et al, 2025) (refer to Table 3; look at the 256-length results, as the other paper used a 256-length setting – see Appendix A there).) ◼️ IGPO (Zhao et al, 2025b): shares the same first author as diffu-GRPO, and as mentioned above, uses the same likelihood estimation approach as diffu-GRPO. The novelty here is a procedure that leverages the inpainting capabilities inherent in DLMs (see my first post) to optimise the training efficiency and efficacy of GRPO. In GRPO (refer to the image of PPO vs GRPO in the margins above for a sense),
when the reward for the entire group of sampled trajectories for a given prompt \(q_{k}\) (e.g. a math problem) is zero, there is no useful signal for the model to update its parameters. IGPO’s proposal assumes access to ground-truth or sufficiently high-quality reasoning traces for \(q_{k}\), and uses a segment of these reasoning traces when such zero-reward groups are encountered. Specifically, by “seeding” a fragment of the reasoning trace amongst the masked tokens (akin to hinting to Jesse “They have 2 plus 2 apples so that is 4 apples…“), we get a chance to steer the Masked-DLM towards generating a trajectory of good quality, which is then swapped with a zero-reward trajectory from the group. Training in this way to avoid zero-reward update steps led to stabler learning and enabled improvements on math and planning benchmarks over diffuGRPO; importantly, it also outperforms uniGRPO, which requires more samples to be taken for the likelihood estimation (per-trajectory per \(\mu\)-update vs per-prompt per \(\mu\)-update). See Table 1 of their paper (Zhao et al, 2025b). ◼️ TraceRL (Wang et al, 2025): encapsulates some of the latest developments on a few fronts in Masked-DLM research. To summarise, they propose the use of a value model (another DLM) to manage the variance across updates (à la PPO). In addition, they leverage the semi-autoregressive approach of Fast-dLLM (Wu et al, 2025), which I discussed in my previous post, that denoises blocks of tokens auto-regressively and efficiently via the use of approximated KV caches. This speeds up trajectory sampling, easing a major bottleneck, especially for problems that are best solved with lengthy reasoning traces. Putting these extensions together required special treatment (e.g. their §4.3), and this work is notable for putting forward a proposed solution for doing so.
They report impressive performance on math benchmarks (87.4 on GSM8K and 94.2 on MATH500) that outperforms the other methods listed above in this section (see Table 2 of their paper (Wang et al, 2025), and compare against the results reported in the other papers), as well as on coding benchmarks. Helpfully, the authors released the TraDo series of 4B/8B-parameter Masked-DLMs that they trained with this approach, alongside their codebase. 🤔 4. What’s next? To sum up: in this post, I started with a general overview of online RL post-training for LLMs. With some common ground established, I highlighted the main challenge in extending existing methods for AR-LLMs to Masked-DLMs: the issue of how to efficiently and accurately estimate the token and sequence likelihoods needed for the PPO/GRPO objective. Finally, I gave an outline of recent work bringing PPO/GRPO to Masked-DLMs, focusing on how they propose to address this estimation challenge. This wraps up the first round of topics I intended to cover in this series; the next posts – probably slightly further out – will look at all the areas I have discussed so far, but with multi-modality in consideration.]]></summary></entry><entry><title type="html">Diffusion Language Models – Part Three (Generating with DLMs; through some art &amp;amp; linguistics)</title><link href="/articles/25/Diffusion_LM_P3" rel="alternate" type="text/html" title="Diffusion Language Models – Part Three (Generating with DLMs; through some art &amp;amp; linguistics)" /><published>2025-09-01T09:00:00+08:00</published><updated>2025-09-01T09:00:00+08:00</updated><id>/articles/25/Diffusion_LM_P3</id><content type="html" xml:base="/articles/25/Diffusion_LM_P3"><![CDATA[<p><span class="newthought">Generating text with DLMs is quite different from doing so with AR-LLMs</span>, and in my earlier posts <a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P1">here</a> and <a
href="https://hankelvin.github.io/articles/25/Diffusion_LM_P2">here</a> I sketched a brief outline of how generation works for Masked-DLMs (using a Wheel of Fortune analogy). In this post, I will go a little deeper into the generation process and examine a number of its limitations/challenges, together with what has recently been proposed to address them. This post also starts with a slightly different flavour: a light detour from the so-far technical posts into some interesting artwork I saw recently, which carries associations to DLMs and brought out connections to ideas in linguistics. I thought that starting this way could help ground the technical aspects of DLMs in some visual concepts, which might aid in understanding them. <label for="marginnote-delay" class="margin-toggle"> ⊕</label><input type="checkbox" id="marginnote-delay" class="margin-toggle" checked="" /><span class="marginnote"><em>This post is a little delayed as I was engaged in some community duties, and it also took some time to investigate some of the works covered in this post more deeply.</em></span></p>

<h3 id="️-1-seeing-dlms-through-art-and-linguistics">🖼️ 1. Seeing DLMs through art and linguistics</h3>
<p>In July, I visited the <a href="https://www.instagram.com/hemanchong/" title="Heman Chong's Instagram">Heman Chong</a> retrospective at the <a href="https://www.singaporeartmuseum.sg/">Singapore Art Museum</a>, and encountered a piece of work that made me smile<label for="marginnote-heman" class="margin-toggle"> ⊕</label><input type="checkbox" id="marginnote-heman" class="margin-toggle" checked="" /><span class="marginnote">In fact, quite a few of Chong’s pieces at the retrospective brought a smile to my mind; it was a very pleasant visit for being thought-provoking on a number of levels. I do think his work deserves more local (Singaporean) appreciation (they are highly incisive commentaries, many with complex multi-layered abstractions of remarkable spareness that slowly unfurl in your mind; and are world-class with a Singaporean flavour to them).</span> and also immediately made me think of DLMs.</p>

<p><br />
<label for="marginfigure-heman" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-heman" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/IMG_5426.JPG" /><br />Call for the Dead, 2020,  Screenprint and acrylic on linen, Collection of the artist<br /><em>“While on residency at STPI (Singapore Tyler Print Institute) in 2020, Heman Chong read and then redacted John le Carré’s first book, Call for the Dead… Erasing everything except for its verbs, Chong’s Call for the Dead leaves us only with a sense of something having happened and the awareness that the text holds secrets not meant for us.”</em></span></p>
<figure><figcaption><span></span></figcaption><img src="/assets/img/IMG_5425.JPG" /></figure>

<p>The work (close-up in the image above, and a wide-shot of the hanging in the right margin) is titled <a href="https://www.stpi.com.sg/exhibitions/heman-chong-peace-prosperity-and-friendship-with-all-nations/" title="STPI -- Heman Chong: Peace Prosperity And Friendship With All Nations">“Call for the Dead”</a> and is the text of a John le Carré novel that Chong had meticulously blacklined throughout to mask every single word except for the verbs. What immediately struck me was the liminal quality of the work: at once filled with meaning yet poised for more – the state of the text permits/carries potential, like a sequence midway between complete masking and unmasking in the forward/reverse process. What also struck me as I got close to the work was how, despite only the verbs remaining, I could still reasonably make out the narrative arc of the text (the numbers and lengths of the masked words did help in the decoding), and that brought to mind Davidsonian<label for="sidenote-davidson" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-davidson" class="margin-toggle" checked="" /><span class="sidenote">The view that sentence meaning is quantified over events denoted by verbs; verbs whose arguments are filled by the participants of the event; e.g. the sentence <em>“Chong masked words from the book”</em> can be represented as: \(∃e(mask(Chong, words, e)\) ∧ \(from(words, book, e))\) or equivalently <em>“There exists some event \(e\) where a masking action by Chong on words takes place, and the words are from the book in \(e\)“</em>. See <a href="https://user.phil-fak.uni-duesseldorf.de/~filip/Davidson.67.pdf" title="Filip, Hana. _Lexical Semantics of Verbs: The Davidsonian event argument._ 2013">(Filip, 2013)</a> for a nice set of notes on Davidsonian event semantics.
<br /><ins><em>Sidenote:</em></ins> this is a nice paper where visual question answering, scene graphs and Davidsonian semantics come together for interpretable verification of the contents of generated outputs <a href="https://arxiv.org/abs/2310.18235" title="Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation">(Cho et al, 2024)</a></span>  and neo-Davidsonian event semantics! This was also the inspiration for wondering in the previous post which set of word types a DLM would be most confident in unmasking at the start of the reverse process.<label for="sidenote_likely" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote_likely" class="margin-toggle" checked="" /><span class="sidenote">For conditional generation, a reasonable hypothesis is that this set might be a mix of verbs and proper nouns (i.e. entities in the context/prompts), but it would be interesting to verify this over a few Masked-DLMs to better understand what/how a DLM might be learning and generating.</span> Sometimes there can be connections between art, linguistics and computer science in quite beautiful ways.<label for="sidenote-calder" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-calder" class="margin-toggle" checked="" /><span class="sidenote">Another is Alexander Calder’s <a href="https://calder.org/works/hanging-mobile/untitled-c-1942-7" title="Untitled, 1942">mobiles</a>, whose elements hang in balance, suspended in place like constituent syntax trees of an utterance. 
And another is <a href="https://www.nationalgallery.sg/sg/en/learn-about-art/magazine/lin-hsin-hsin-speed-of-thought.html" title="Lin Hsin Hsin @ speed of thought">Lin Hsin Hsin</a>, a trailblazing artist (on so many levels: a female multi-disciplinary artist formally trained in mathematics and computer science, practising from the 1970s in Singapore) whom I recently learned about at the latest iteration of the National Gallery’s <a href="https://www.nationalgallery.sg/sg/en/exhibitions/singapore-stories.html" title="Singapore Stories: Pathways and Detours in Art">permanent exhibition</a> of Singapore art.</span></p>

<h3 id="-2-masked-dlms-since-we-mask-randomly-in-the-forward-process-why-not-just-unmask-randomly-in-the-reverse-process-too">🎲 2. Masked-DLMs: Since we mask randomly in the forward process, <br />why not just unmask randomly in the reverse process too?</h3>
<p>We mask tokens randomly in the forward process for training Masked-DLMs, and this is what allows the DLM to learn so that we can do any-order/non-autoregressive generation in the reverse process. Therefore, it is natural to also think of doing the unmasking in the reverse process in a similarly random manner i.e. at each timestep \(t\), randomly choose a K-sized<label for="sidenote-k" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-k" class="margin-toggle" checked="" /><span class="sidenote">i.e. K = sequence length / number of denoising steps</span> set of still-masked index positions to be unmasked by the model <a href="https://proceedings.neurips.cc/paper/2021/file/958c530554f78bcd8e97125b70e6973d-Paper.pdf" title="Structured Denoising Diffusion Models in Discrete State-Spaces">(Austin et al, 2021)</a>.</p>
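<p>As a minimal sketch (with a stand-in <code>model</code>; illustrative only, not any paper’s implementation), this random-order reverse process looks like:</p>

```python
import numpy as np

# Random-order unmasking: at each denoising step, pick K = L / steps
# still-masked positions uniformly at random and commit the model's most
# probable token there. `model` is a stand-in returning (L, vocab) logits.

MASK = -1
rng = np.random.default_rng(0)

def random_order_unmask(model, L=16, steps=4, vocab=10):
    x = np.full(L, MASK)
    K = L // steps
    for _ in range(steps):
        still_masked = np.flatnonzero(x == MASK)
        chosen = rng.choice(still_masked, size=min(K, len(still_masked)),
                            replace=False)
        logits = model(x)                       # one denoising pass
        x[chosen] = logits[chosen].argmax(-1)   # unmask K positions at once
        # Note: tokens unmasked within the same step cannot "see" each
        # other when they are chosen, so they may be mutually incompatible.
    return x

dummy_model = lambda x: rng.normal(size=(len(x), 10))
out = random_order_unmask(dummy_model)
```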

<p>However, this is <ins>sub-optimal</ins>, especially if we follow the “classical” Masked-DLM formulation (i) where, once a token is unmasked, it stays fixed in that state, and (ii) if we take large steps to unmask much more than a single token at each step; together, these give poorer generation quality. A key reason is that, when taking steps of \(\gt\) 1 tokens in the reverse process, <ins>the dependencies among the tokens being unmasked are not taken into account</ins>, due to how Masked-DLM models are parameterised.<label for="sidenote-multidim" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-multidim" class="margin-toggle" checked="" /><span class="sidenote">For details, see the paragraph under the “Multi-dimension” subheader in §2.1 of <a href="https://arxiv.org/pdf/2406.03736" title="Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data">(Ou et al, 2024)</a> and §3.3 of <a href="https://arxiv.org/abs/2310.16834" title="Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution">(Lou et al, 2023)</a>.</span> Hence, if 100 tokens have to be unmasked in one step, it is quite possible that some of them will be incompatible with each other. The worst-case scenario is when incompatible tokens unmasked in the same step sit next to, or close to, each other in sequence position, leaving little leeway to recover from the incompatibility in later steps. Such errors compound and can lead to increasingly less coherent text.</p>

<p>The most obvious solution would be to take small steps (the safest being a step of only one token at a time). Another easy solution could be to take smaller steps initially, to try to avoid early irreversible clashes, before increasing the step size over time. However, these simple fixes are inadequate,<label for="sidenote-simple" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-simple" class="margin-toggle" checked="" /><span class="sidenote">The former entirely removes a major appeal of DLMs, namely their potential for “parallel decoding” to achieve faster generation; the latter does not address the token-independence issue in a principled manner – it would resurface at the later, larger steps and hence could still hurt generation quality.</span> and the following are two more sophisticated directions that have been proposed:</p>

<p>◼️ <ins><em>top-k confidence:</em></ins> Since Masked-DLMs parameterise the probability distribution over the vocabulary for each sequence position (\(0 \leq i \leq L\), where \(L\) is the length of the sequence to be generated), it makes sense to adopt a “top-k” strategy over the model’s highest confidence at each position.<label for="sidenote-topk" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-topk" class="margin-toggle" checked="" /><span class="sidenote">Note that this is different from the top-k sampling in AR-LLMs, where for a given token to be generated it is sampled from the set of top-k tokens with the highest probabilities; here it refers to the top-k sequence positions with the highest probability values in their distributions.</span> This was used in <a href="https://arxiv.org/pdf/2302.05737" title="A Reparameterized Discrete Diffusion Model for Text Generation">(Zheng et al, 2024)</a> and recently systematically studied in <a href="https://arxiv.org/abs/2502.06768" title="Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions">(Kim et al, 2025)</a>. However, the top-k strategy loses its effectiveness in instances where the model has similar confidence (i.e. uncertainty) at some positions, and to address that <a href="https://arxiv.org/abs/2502.06768" title="Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions">(Kim et al, 2025)</a> proposed “top-k margin confidence”.<label for="sidenote-topkm" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-topkm" class="margin-toggle" checked="" /><span class="sidenote">Here, the margin of the top-2 most probable tokens at each position is computed and is used to select the top-k subset for unmasking. 
This approach leaves out sequence positions that may have a high top probability value but where the model is uncertain between its most probable tokens.</span> Margin confidence has been shown to improve generation quality meaningfully, and these methods have been incorporated as options in the LLaDA <a href="https://arxiv.org/abs/2502.09992" title="Large Language Diffusion Models">(Nie et al, 2024)</a> and Dream <a href="https://hkunlp.github.io/blog/2025/dream" title="Dream 7B">(Ye et al, 2025)</a> scripts.</p>
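<p>A small sketch of the two selection rules over hypothetical per-position probability distributions (not the LLaDA/Dream code itself):</p>

```python
import numpy as np

# Selecting which K masked positions to unmask next. Plain top-k
# confidence ranks positions by their single highest token probability;
# top-k margin confidence (Kim et al, 2025) ranks them by the gap
# between the two most probable tokens at each position.

def topk_confidence(probs, masked_idx, k):
    conf = probs[masked_idx].max(axis=-1)
    return masked_idx[np.argsort(conf)[::-1][:k]]

def topk_margin(probs, masked_idx, k):
    top2 = np.sort(probs[masked_idx], axis=-1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]          # p(top-1) - p(top-2)
    return masked_idx[np.argsort(margin)[::-1][:k]]

# Position 0 has the higher peak probability, but position 1 has the
# larger margin between its two best tokens.
probs = np.array([[0.50, 0.45, 0.05],
                  [0.48, 0.12, 0.40]])
idx = np.array([0, 1])
```

<p>On this toy input, plain confidence would unmask position 0 first, while margin confidence would prefer position 1, where the model is less torn between its top two candidates.</p>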

<p>◼️ <ins><em>allowing corrections:</em></ins> Another direction is to allow corrections to be made to the already unmasked tokens, instead of keeping them fixed entirely. The corrections could come (i) via remasking, or (ii) using a setup with a specialised model.</p>

<blockquote>
  <p>◽️ The former, also referred to as “forward-backward” correction, involves picking a subset of the already unmasked tokens – early approaches picked these by random sampling – and returning them (hence the “backward”) to the masked state. Doing so allows another shot at unmasking them to appropriate tokens, since the model can now take into consideration the other tokens already unmasked (including the ones unmasked in the same step). Notably, it has been shown by <a href="https://arxiv.org/abs/2503.00307" title="Remasking discrete diffusion models with inference-time scaling">(Wang et al, 2025)</a> that special training procedures are not necessary to use such remasking.<label for="sidenote-remdm" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-remdm" class="margin-toggle" checked="" /><span class="sidenote">Their work outlines a set of proofs showing that an already trained Masked-DLM can be used with remasking in a theoretically-supported manner, subject to the Masked-DLM having been trained with a negative ELBO tighter than the one for ReMDM that they specify (see §3.2 of their paper).</span> Note also that remasking permits a reverse process with more steps than the sequence length \(L\), which has analogues to test-time scaling in AR-LLMs <a href="https://arxiv.org/abs/2408.03314" title="Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters">(Snell et al, 2024)</a>.<label for="sidenote-tts" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-tts" class="margin-toggle" checked="" /><span class="sidenote">Without remasking, unmasked tokens stay fixed; hence once all of the sequence positions are unmasked, further steps have no effect. With remasking, this constraint is removed, which can be seen as a form of test-time scaling for DLMs; this shares parallels with how additional generation steps in AR-LLMs, e.g. “think” tokens (that do not count towards the answer), have been shown to improve AR-LLMs’ answer performance.</span> The drawbacks of remasking are that: (i) using random sampling for tokens to remask, or “uninformed correction”, is not optimal as it does not directly target the tokens that need correction and could even miss them<label for="sidenote-informed" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-informed" class="margin-toggle" checked="" /><span class="sidenote">To this, <a href="https://arxiv.org/abs/2503.00307" title="Remasking discrete diffusion models with inference-time scaling">(Wang et al, 2025)</a> also examined a few ways of selecting the tokens to be remasked (“informed correction”), including a customisable approach (<em>ReMDM-conf</em>) that (1) is based on the confidence the model had when unmasking a token, and (2) can be activated at points of the reverse process where remasking is most helpful.</span>, and (ii) remasking does add inference overhead<label for="sidenote-overhead" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-overhead" class="margin-toggle" checked="" /><span class="sidenote">Given some fixed number of tokens to unmask per step, setting a portion to remask at each step increases the number of steps required.</span>.</p>
</blockquote>
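<p>A minimal sketch of a single “backward” (remasking) move, contrasting uninformed random remasking with a confidence-based selection in the spirit of <em>ReMDM-conf</em> (the function signature and the idea of storing per-token unmasking confidences are illustrative assumptions, not the paper’s exact procedure):</p>

```python
import numpy as np

def forward_backward_step(tokens, conf, mask_id, n_remask, rng=None):
    """Return some already-unmasked tokens to the [MASK] state.

    tokens:   (L,) current sequence, with mask_id where still masked
    conf:     (L,) confidence the model had when each token was unmasked
    n_remask: how many unmasked tokens to send back to [MASK]
    rng:      if given, remask uniformly at random ("uninformed correction");
              otherwise remask the lowest-confidence tokens ("informed").
    """
    unmasked = np.flatnonzero(tokens != mask_id)
    if rng is not None:
        picked = rng.choice(unmasked, size=n_remask, replace=False)
    else:
        # lowest-confidence unmasked positions first
        picked = unmasked[np.argsort(conf[unmasked])[:n_remask]]
    out = tokens.copy()
    out[picked] = mask_id
    return out
```

The informed variant directly targets the tokens the model was least sure about when it committed to them, which is exactly what uniform random remasking can miss.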

<blockquote>
  <p>◽️ On the other hand, the second approach, such as the one proposed by <a href="https://arxiv.org/abs/2407.21243" title="Informed Correctors for Discrete Diffusion Models">(Zhao et al, 2024)</a>, involves a separate model designed to identify certain unmasked tokens and predict their direct transition to another non-mask token. The appeal here is that this does away with the superfluous transition back to the mask token before transitioning to another prediction for the token. However, needing a separate model comes with the drawback that it adds to <strong>both</strong> training and inference overhead.</p>
</blockquote>

<p>Nonetheless, it should be possible to combine these approaches for Masked-DLM generation without negative effects on quality, i.e. use top-k confidence to select which positions to unmask as well as allow for correction with remasking.</p>

<h3 id="️-3-are-there-alternatives-to-independent-token-level-unmasking">🖇️ 3. Are there alternatives to independent token-level unmasking?</h3>
<p>Since large-step parallel decoding in Masked-DLMs is sub-optimal (due to the independence of the tokens unmasked at each such step; see above), it acts as a limit to faster decoding (i.e. generating with fewer denoising steps). As such, some works have examined how to mitigate this, including two that share some similarities with <strong>draft-model based speculative decoding</strong> in AR-LLMs <a href="https://arxiv.org/abs/2211.17192" title="Fast inference from transformers via speculative decoding">(Leviathan et al, 2023)</a>,<label for="sidenote" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote" class="margin-toggle" checked="" /><span class="sidenote">Broadly, this approach involves using a small AR-LLM (of the same architecture but with fewer parameters, hence faster to run; the ‘draft model’) to auto-regressively generate multiple (\(K\)) tokens ahead of the larger and more capable AR-LLM used to generate the output (the ‘target model’). The \(K\) draft tokens are put through the target model (any speed-up is due to this: probabilities for multiple tokens are computed in parallel instead of one at a time) and the draft tokens are accepted up to the token before the predictions (based on probabilities) of the draft and target LLMs depart, after which one more token is generated from the target model; if there is no departure, all tokens are accepted. Generation then continues with the draft model proposing the next \(K\) tokens. <br /><ins><em>Sidenote:</em></ins> Here’s a nice <a href="https://research.google/blog/looking-back-at-speculative-decoding/" title="Looking back at speculative decoding">blog post</a> recently written by the authors of <a href="https://arxiv.org/abs/2211.17192" title="Fast inference from transformers via speculative decoding">(Leviathan et al, 2023)</a>; the AI search summaries on Google’s search results page are served with the help of speculative decoding! This vLLM <a href="https://blog.vllm.ai/2024/10/17/spec-decode.html" title="How Speculative Decoding Boosts vLLM Performance by up to 2.8x">blog post</a> is also a good read.</span> in the sense that all these approaches leverage two models (one larger, and one usually smaller and more efficient) to improve generation speed and/or quality.</p>
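<p>To make the accept-until-departure structure of the sidenote concrete, here is a simplified <em>greedy</em> variant of the acceptance rule (the actual algorithm in (Leviathan et al, 2023) uses a stochastic acceptance test on the two models’ probabilities; this sketch only captures the control flow):</p>

```python
def accept_draft(draft_tokens, target_preds):
    """Greedy-acceptance sketch of speculative decoding.

    draft_tokens: the K tokens proposed by the small draft model
    target_preds: the target model's own (greedy) prediction at each of
                  those K positions, computed in one parallel pass
    Returns the accepted prefix; on a departure the target's token is
    substituted, so the output always advances by at least one token.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_preds):
        if d != t:
            return accepted + [t]   # departure: keep the target's token
        accepted.append(d)
    return accepted  # all K accepted; the draft model proposes the next K
```

The speed-up comes from the fact that `target_preds` for all K positions is obtained in a single forward pass of the target model, rather than K sequential ones.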

<p>An example of this is <a href="https://arxiv.org/abs/2410.01949" title="Discrete Copula Diffusion">(Liu et al, 2024)</a>’s Discrete Copula Diffusion (<strong>DCD</strong>), an approach where the probabilities of a Masked-DLM (<a href="" title="">SEDD Absorb</a>) are augmented with those of a much smaller AR-LLM <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf" title="Language models are unsupervised multitask learners">(GPT-2 small)</a>, adding information about the joint distributions between tokens. This “enhanced” distribution enables better generation quality in the form of lower generative perplexity while allowing the model to take fewer denoising steps.<label for="sidenote-copula" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-copula" class="margin-toggle" checked="" /><span class="sidenote">This is inspired by the concept of a copula model in statistics, which parameterises a joint distribution using information from known marginals; and marginals are what we have with the per-step token-level independent distributions from the Masked-DLM. Note, however, that there are many details in this work that might be worth looking into further, one of which is how an off-the-shelf pretrained AR-LLM (GPT-2 small) can provide the joint distribution information on noisy (masked) data.<br /><ins><em>Sidenote:</em></ins> I found this book by <a href="https://www.nowpublishers.com/article/Details/ECO-005" title="Copula Modeling: An Introduction for Practitioners">(Trivedi &amp; Zimmer, 2007)</a> helpful as a primer on copula modeling.</span> Whilst this approach requires that the Masked-DLM and the supporting AR-LLM share the same tokenizer, this requirement is met by works in the vein of <a href="" title="Dream 7B: Diffusion Large Language Models">Dream (Ye et al, 2025)</a> and <a href="https://arxiv.org/abs/2508.15487" title="Scaling Diffusion Language Models via Adaptation from Autoregressive Models">DiffuLlama (Gong et al, 2025)</a>, which convert AR-LLMs into Masked-DLMs.</p>
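<p>Purely for intuition, and emphatically <em>not</em> DCD’s actual combination rule (which is derived from copula arguments rather than a fixed mixing weight): one illustrative way to fold dependency information from an AR model into the Masked-DLM’s independent marginals is a renormalised geometric mixture of the two log-distributions:</p>

```python
import numpy as np

def combine_distributions(log_p_dlm, log_p_ar, alpha=0.5):
    """Illustrative geometric mixture of two per-position distributions.

    log_p_dlm: (L, V) log-probs from the Masked-DLM (independent marginals)
    log_p_ar:  (L, V) log-probs from a small AR model over the same positions,
               carrying inter-token dependency information
    alpha:     hypothetical mixing weight (DCD derives its combination from
               copula modeling instead of a fixed alpha)
    """
    mix = alpha * log_p_dlm + (1 - alpha) * log_p_ar
    # renormalise so each position is a valid distribution again
    mix -= np.logaddexp.reduce(mix, axis=-1, keepdims=True)
    return mix
```

The point of the sketch is only that the combined distribution can down-weight token choices the AR model finds jointly implausible, even when the DLM’s marginals rate them highly.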

<p><br /></p>
<figure><figcaption><span>Image: Discrete Copula Diffusion – source: <a href="https://arxiv.org/abs/2410.01949" title="Discrete Copula Diffusion">(Liu et al, 2024)</a></span></figcaption><img src="/assets/img/dcd.png" /></figure>

<p>Another is <a href="https://openreview.net/forum?id=sL2F9YCMXf" title="Energy-Based Diffusion Language Models for Text Generation">(Xu et al, 2025)</a>’s Energy-based Diffusion Language Model (<strong>EDLM</strong>), which shares some similarities in the use of a separate energy-based model (EBM) for obtaining information about token dependencies; one of the two ways they propose to obtain such EBMs is similarly through the use of AR-LLMs. <em>Tangentially related here is work such as <a href="https://aclanthology.org/2025.naacl-long.601.pdf" title="Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion">(Christopher et al, 2025)</a> that goes in the other direction, that is, using DLMs to help speculative decoding for AR-LLMs.</em></p>

<h3 id="-4-the-sequence-lengths-have-to-be-fixed-and-caching-cant-be-done">📐 4. The sequence lengths have to be fixed and caching can’t be done?</h3>

<p>Another limitation of initial Masked-DLMs (and all DLM variants for that matter) is that generation length is a hyperparameter that has to be set in advance (i.e. some length needs to be specified, which is then used to set the noised input, and then the denoising steps can take place over it). This hard constraint is very limiting; for outputs that turn out to be shorter, it leads to wasted computation. More significantly, if the ideal output happens to require more tokens than the length specified, this results in reduced generation quality (through truncation, or via throwing off coherence as the model contorts to denoise within the specified length). At the same time, another issue relates to how the standard KV caching used in AR-LLMs<label for="sidenote-kvcache" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-kvcache" class="margin-toggle" checked="" /><span class="sidenote">See this excellent post by <a href="https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms" title="Understanding and Coding the KV Cache in LLMs from Scratch">Sebastian Raschka</a></span>, which helps speed up generation there, is not directly transplantable to DLMs (due to how tokens can be unmasked anywhere across the sequence at each step, and how the state of the entire sequence at step \(t\) is needed for predicting the distributions of the following denoising step).</p>

<p>There are a number of approaches proposed to address the former (via variable length decoding) as well as the latter (through block-style diffusion). I highlight two recent pieces of work that combine them: (1) <strong>BD3-LMS</strong> proposed by <a href="https://openreview.net/forum?id=tyEyYT267x" title="Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models">(Arriola et al, 2025)</a>, and (2) <strong>Fast-dLLM</strong> proposed by <a href="https://arxiv.org/abs/2505.22618" title="Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding">(Wu et al, 2025)</a>. Both introduce a notion of semi-autoregressive generation using blocks; blocks are processed autoregressively up to the point where an end-of-sequence token is generated, and within each block the typical diffusion denoising steps are carried out (see GIF below). For Fast-dLLM, the authors retain the typical DLM bidirectional attention across all blocks during generation and therefore have to rely on approximations for the caching, approximations which they support with analysis of the typical attention patterns across denoising steps. Notably, <a href="https://arxiv.org/abs/2505.16933" title="LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning">LLaDA-V</a> – the multi-modal (vision &amp; language) version of LLaDA – has Fast-dLLM integrations built into its <a href="https://github.com/ML-GSAI/LLaDA-V">implementation</a>. On the other hand, the authors of BD3-LMS use a special attention mask that makes attention causal across blocks, which then allows them to cache the KV computation of earlier blocks (akin to how it is done in AR-LLMs at the token level); as a result their approach does not need to rely on approximate caches, which in theory should give better results than Fast-dLLM.</p>
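<p>The block-wise, semi-autoregressive loop that both works share can be sketched as follows; the <code>denoise_block</code> callable stands in for the per-block reverse diffusion (conditioned on the finished prefix) and its signature is a hypothetical simplification:</p>

```python
def block_diffusion_generate(denoise_block, block_size, eos_id, max_blocks=8):
    """Semi-autoregressive generation sketch in the spirit of BD3-LMS /
    Fast-dLLM: blocks are produced left-to-right, each block is filled in
    by the usual Masked-DLM denoising steps conditioned on the finished
    prefix, and generation stops once a block emits an end-of-sequence token.

    denoise_block(prefix, block_size) -> list of block_size token ids
    """
    out = []
    for _ in range(max_blocks):
        block = denoise_block(out, block_size)
        if eos_id in block:
            out.extend(block[:block.index(eos_id)])  # keep tokens before EOS
            break
        out.extend(block)
    return out
```

This is how the fixed-length constraint is relaxed: the total length is no longer set up front, only the block size is, and the loop simply stops when a block produces the end-of-sequence token.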

<p>~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~</p>
<figure><figcaption><span></span></figcaption><img src="/assets/img/bd3lms-ar.gif" /></figure>
<p>~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~</p>
<figure><figcaption><span></span></figcaption><img src="/assets/img/bd3lms-mdlm.gif" /></figure>
<p>~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~</p>
<figure><figcaption><span>Image: comparing auto-regressive generation with AR-LLMs (top), vanilla DLM denoising (middle), and BD3-LMS’ semi-autoregressive generation (bottom) – source: <a href="https://m-arriola.com/bd3lms/" title="Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models">(Arriola et al, 2025)</a></span></figcaption><img src="/assets/img/bd3lms-sar.gif" /></figure>
<p>~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~</p>

<h3 id="-4-whats-next">👉 5. What’s next?</h3>
<p>This post went into a few of the key issues and considerations faced in the generation (denoising) process of Masked-DLMs, including looking at (i) how informed denoising strategies are used to help improve generation quality, (ii) how step-wise token-level independence is a limitation for faster generation speed, together with some of the (early) ways proposed to address this, and (iii) how block diffusion approaches such as BD3-LMS enable variable length generation as well as KV caching similar to what’s done in AR-LLMs for efficient inference (which are particularly important for long sequence generation with Masked-DLMs). In the next post, I will take a look into reinforcement learning (RL) for Masked-DLMs, covering what the key challenges there are, followed by recent proposals for addressing them; the focus will be on policy-gradient approaches similar to PPO and GRPO, which have recently been instrumental in AR-LLM post-training.</p>]]></content><author><name>Kelvin Han</name></author><category term="discrete" /><category term="diffusion" /><summary type="html"><![CDATA[Generating text with DLMs is quite different from doing so with AR-LLMs, and in my earlier posts here and here I have sketched a brief outline of how it works for Masked-DLMs (using a Wheel of Fortune analogy). In this post, I will go a little deeper into the generation process and examine a number of limitations/challenges there, together with what has been recently proposed for addressing them. This post also starts with a slightly different flavour, a light detour from the so-far technical posts with a foray covering interesting artwork I saw recently, with associations to DLMs and which brought out connections to ideas in linguistics; I was thinking that starting this way could help ground the technical aspects of DLMs to some visual concepts which might aid in understanding these technical aspects. 
⊕This post is a little delayed as I was engaged in some community duties and it also took some time to take a deeper investigation into some of the works covered in this post. 🖼️ 1. Seeing DLMs through art and linguistics In July, I visited the Heman Chong retrospective at the Singapore Art Museum, and encountered a piece of work that made me smile ⊕In fact, quite a few of Chong’s pieces at the retrospective brought a smile to my mind; it was a very pleasant visit for being thought-provoking on a number of levels. I do think his work deserves more local (Singaporean) appreciation (they are highly incisive commentaries, many with complex multi-layered abstractions of remarkable spareness that slowly unfurl in your mind; and are world-class with a Singaporean flavour to them). and also immediately made me think of DLMs. ⊕Call for the Dead, 2020, Screenprint and acrylic on linen, Collection of the artist“While on residency at STPI (Singapore Tyler Print Institute) in 2020, Heman Chong read and then redacted John le Carré’s first book, Call for the Dead… Erasing everything except for its verbs, Chong’s Call for the Dead leaves us only with a sense of something having happened and the awareness that the text holds secrets not meant for us.” The work (close-up in the image above, and a wide-shot of the hanging in the right margin) is titled “Call for the Dead” and is the text of a John le Carre novel that Chong had meticulously blacklined throughout to mask every single word except for the verbs. What immediately struck me was how there is a liminal quality to the work: at once filled with meaning yet poised for more – the state of the text shown permits/carries potential, similar to being in the midst of complete masking/unmasking in the forward/reverse process. 
What also struck me as I got close to the work was how, despite there only being verbs remaining, I could still reasonably make out the narrative arc of the text (the numbers and lengths of the masked words did help in the decoding), and that brought to mind DavidsonianThe view that sentence meaning is quantified over events denoted by verbs; verbs whose arguments are filled by the participants of the event; for e.g. the sentence “Chong masked words from the book” can be represented as: \(∃e(mask(Chong, words, e)\) ∧ \(from(words, book, e))\) or equivalently “There exists some event \(e\) where a masking action by Chong on words takes place, and the words are from the book in \(e\)“. See (Filip, 2013) for a nice set of notes on Davidsonian event semantics. Sidenote: this is a nice paper where visual question answering, scene graphs and Davidsonian semantics come together for interpretable verification of the contents of generated outputs (Cho et al, 2024) and neo-Davidsonian event semantics! This was also the inspiration for wondering in the previous post which set of word types a DLM would be most confident in unmasking at the start of the reverse process.For conditional generation, a reasonable hypothesis is that this set might be a mix of verbs and proper nouns (i.e. entities in the context/prompts), but it would be interesting to verify this over a few Masked-DLMs to better understand what/how a DLM might be learning and generating. Sometimes there can be connections between art, linguistics and computer science in quite beautiful ways.Another is Alexander Calder’s mobiles, whose elements hang in balance, suspended in place like constituent syntax trees of an utterance. 
And another is Lim Hsin Hsin a trailblazing (on so many levels: female multi-disciplinary artist formally trained in mathematics and computer science practising from the 1970s in Singapore) artist whom I recently learned about at the latest iteration of the National Gallery’s permanent exhibition of Singapore art. 🎲 2. Masked-DLMs: Since we mask randomly in the forward process, why not just unmask randomly in the reverse process too? We mask tokens randomly in the forward process for training Masked-DLMs, and this is what allows the DLM to learn so that we can do any-order/non-autoregressive generation in the reverse process. Therefore, it is natural to also think of doing the unmasking in the reverse process in a similarly random manner i.e. at each timestep \(t\), randomly choose a K-sizedi.e. K = sequence length / number of denoising steps set of still-masked index positions to be unmasked by the model (Austin et al, 2021). However, this is sub-optimal; especially if we follow the “classical” Masked-DLM formulation (i) where, once a token is unmasked it stays fixed in that state, and (ii) if we take large steps to unmask much more than a single token at each step, both of which would come together to give poorer generation quality. A key reason for this is because, when taking steps of \(\gt\) 1 tokens in the reverse process, the dependence of the tokens being unmasked are not taken into account, due to how Masked-DLM models are parameterised.For details, see the paragraph under the “Multi-dimension” subheader in §2.1 of (Ou et al, 2024) and §3.3 of (Lou et al, 2023). Hence, if 100 tokens have to be unmasked at a step, it is quite possible that some of them could be incompatible with each other. The worst case scenario is when the incompatible tokens unmasked in the same step are located next to or close to each other in terms of sequence positions, which would leave little leeway for recovery in the later steps from the incompatibility. 
Such errors compound and could lead to increasingly less coherent text. The most obvious solution for this would be to take small steps (and the safest would be to take a step of only one token at a time). Another easy solution could be to take smaller steps initially to try to avoid early irreversible clashes before increasing the step size over time. However these simple fixes are inadequate,The former entirely removes a major appeal of DLMs, which is its potential for “parallel decoding” to achieve faster generation speed; whereas the latter does not fully address the token independence issue in a principled manner – it would surface at later larger steps and hence could still lead to issues with generation quality. and the following are two more sophisticated directions that have been proposed: ◼️ top-k confidence: Since Masked-DLMs parameterise the probability distribution over the vocabulary for each sequence position (\(0 \leq i \leq L\), where \(L\) is the length of the sequence to be generated), it makes sense to adopt a “top-k” strategy over the model’s highest confidence at each position.Note that this is different from the top-k sampling in AR-LLMs, where for a given token to be generated it is sampled from the set of top-k tokens with the highest probabilities; here it refers to the top-k sequence positions with the highest probability values in their distributions. This was used in (Zheng et al, 2024) and recently systematically studied in (Kim et al, 2025). However, the top-k strategy loses its effectiveness in instances where the model has similar confidence (i.e. uncertainty) at some positions, and to address that (Kim et al, 2025) proposed “top-k margin confidence”.Here, the margin of the top-2 most probable tokens at each position is computed and is used to select the top-k subset for unmasking. 
This approach leaves out sequence positions that may have a high probability value in it, but where the model has uncertainty between its most probable tokens. This approach has been shown to improve generation quality meaningfully and these methods have been incorporated as options in the LLaDA (Nie et al, 2024) and Dream (Ye et al, 2025) scripts. ◼️ allowing corrections: Another direction is to allow corrections to be made to the already unmasked tokens, instead of keeping them fixed entirely. The corrections could come (i) via remasking, or (ii) using a setup with a specialised model. ◽️ The former, also referred to as “forward-backward” correction, involves picking a subset of the already unmasked tokens – early approaches picked these by randomly sampling – and returning (hence the “backward”) them to the masked state. Doing so allows another shot at unmasking them to appropriate tokens, since the model can now take into consideration the other tokens already unmasked (including the ones in the same step as it). Notably, it has been shown by (Wang et al, 2025) that it is not necessary for special training procedures to use such remasking.Their work outline a set of proofs showing that an already trained Masked-DLM can be used with remasking in a theoretically-supported manner; subject to the Masked-DLM having been trained with a negative ELBO tighter than the one for ReMDM that they specify (see §3.2 of their paper). Note also that remasking permits a reverse process with more steps than the sequence length \(L\), which has analogues to test-time scaling in AR-LLMs (Snell et al, 2024).Without remasking, i.e. masked tokens stay masked, hence once all of the sequence positions are unmasked, further steps have no effect. With remasking, this constraint is removed and can be seen as a form of test time-scaling for DLMs; this shares parallels with how additional generation steps in AR-LLMs, for e.g. 
“think” tokens (that do not count towards the answer) have been shown to improve AR-LLMs’ answer performance. The drawbacks with remasking are that: (i) using random sampling for tokens to remask, or “uninformed correction”, is not optimal as it does not directly target the tokens that need correction and could even miss themTo this, (Wang et al, 2025) also examined a few ways for selecting the tokens to be remasked (“informed correction”), including a customisable approach (ReMDM-conf) that is (1) based on the confidence the model had when unmasking a token, and (2) which can be activated at points of the reverse process where remasking is most helpful., and (ii) remasking does add inference overheadGiven some fixed number of tokens to unmask per-step, setting a portion to remask at each step increases the number of steps required.. ◽️ On the other hand, the second approach such as the one proposed by (Zhao et al, 2024), involve a separate model that can be designed to identify and predict for the direct transitioning of certain unmasked tokens to another non-mask token. The appeal here is that this does away with having to do the superfluous transition to the mask token first, before transitioning to another prediction for the token. However, needing a separate model comes with the drawback that it adds to both training and inference overhead. Nonetheless, it should be possible to combine these approaches for Masked-DLM generation without negative effects on quality, i.e. use top-k confidence to select which positions to unmask as well as allow for correction with remasking. 🖇️ 3. Are there alternatives to independent token-level unmasking? Since large-step parallel decoding in Masked-DLMs is sub-optimal (due to the independence of the tokens unmasked at every such steps; see above) this acts as a limit to faster decoding (i.e. generating with fewer denoising steps). 
As such, some work have examined how to mitigate this, including two that share some similarities with draft-model based speculative decoding in AR-LLMs (Leviathan et al, 2023),Broadly, this approach involves using a small AR-LLM (of the same architecture but with fewer parameters hence faster to run; the ‘draft model’) to auto-regressively generate multiple (\(K\)) tokens ahead of the larger and more capable AR-LLM used to generate the output (‘target model’). It involves putting the \(K\) draft tokens through the target model (any speed-up is due to this: by computing probabilities for multiple tokens in parallel instead of one at a time) and accepting the draft tokens up the token before the predictions (based on probabilities) of the draft and target LLM depart, and then generating +1 token from the target model; if there is no departure, all tokens are accepted. Generation then continues with the draft model proposing the next \(K\) tokens. Sidenote: Here’s a nice blog post recently written by the authors of (Leviathan et al, 2023); the AI search summaries on Google’s search results page are served with the help of speculative decoding! This vLLM blog post is also a good read. in the sense that all these approaches leverage two models (one larger and one usually smaller, more efficient one) to improve generation quality. An example of this is (Liu et al, 2024)’s Discrete Copula Diffusion (DCD), an approach where the probabilities of a Masked-DLM (SEDD Absorb) are augmented with that of a much smaller-sized AR-LLM (GPT-2 small) adding information about the joint distributions between tokens. 
This “enhanced” distibution enables better generation quality in the form of lower generative perplexity while allowing the model to take fewer denoising steps.This is inspired by the concept of a copula model in statistics that parameterises a joint distribution using information from known marginals; which is what we have with the per-step token-level independent distributions from the Masked-DLM. Note however, that there are many details in this work and it might be worth looking into these further, one of which is how an off-the-shelf pretrained AR-LLM (GPT-2 small) can provide the joint distibution information on noisy (masked) data.Sidenote: I found this book by (Trivedi &amp; Zimmer, 2007) helpful to get a primer on copula modeling. Whilst this approach will require the Masked-DLM and supporting AR-LLM share the same tokenizer, this is supported by the works in the vein of Dream (Ye et al, 2025) and DiffuLlama (Gong et al, 2025) for converting AR-LLMs to Masked-DLMs. Image: Discrete Copula Diffusion – source: (Liu et al, 2024) Another is (Xu et al, 2025)’s Energy-based Diffusion Language Model (EDLM), which shares some similarities in the use of a separate energy-based model (EBM) for obtaining information about token dependencies; and one of two ways they propose to obtain such EBMs is similarly through the use of AR-LLMs. Tangentially related here is work such as (Christopher et al, 2025) that go in the other direction, that is to use DLMs to help speculative decoding for AR-LLMs. 📐 4. The sequence lengths have to be fixed and caching can’t be done? Another limitation of initial Masked-DLMs (and all DLM variants for that matter) is that generation length is a hyperparameter that has to be set in advance (i.e. some length needs to be specified, which is then used to set the noised input, and then the denoising steps can take place over it). 
This hard constraint is very limiting; for outputs that turn out to be shorter in length, this leads to wasted computation. More significantly, if the ideal output happens to require more tokens than the length than specified, this results in reduced generation quality (through truncation, or via throwing off coherence as the model contorts to denoise within the specified length). At the same time, another issue relates to how the standard KV caching used in AR-LLMsSee this excellent post by Sebastian Raschka, that helps speed generation there, is not directly transplantable to DLMs (due to how tokens can be unmasked anywhere across the sequence at each step, and how the state of the entire sequence at step \(t\) is needed for predicting the distributions of the following denoising step). There are a number of approaches proposed to address the former (via variable length decoding) as well as the latter (through block-style diffusion). I highlight two recent pieces of work that combine them: (1) BD3-LMS proposed by (Arriola et al, 2025), and Fast-dLLM proposed by (Wu et al, 2025). Both introduce a notion of semi-autoregressive generation using blocks; each block is autoregressively processed up to the point where an end-of-sequence token is generated, and within each block the typical diffusion denoising steps are carried out (see GIF below). For fast-DLLM, they retain the typical DLM bidirectional attention across all blocks during generation and therefore have to rely on approximations for the caching, approximations which they support with analysis of the typical attention patterns across denoising steps. Notably, LLaDA-V – the multi-modal (vision &amp; language) version of LLaDA – has fast-DLLM integrations built into its implementation. 
On the other hand, the authors of BD3-LMS use a special attention mask that allows attention to be causal across blocks, which then allows them to cache the KV computation of earlier blocks (akin to how it is done in AR-LLMs at the token level); as a result their approach does not need to rely on approximate caches, which in theory should guarantee better results than Fast-dLLM. Image: comparing auto-regressive generation with AR-LLMs (top), vanilla DLM denoising (middle), and BD3-LMS’ semi-autoregressive generation (bottom) – source: (Arriola et al, 2025) 👉 5. What’s next? This post went into a few of the key issues and considerations faced in the generation (denoising) process of Masked-DLMs, including looking at (i) how informed denoising strategies are used to help improve generation quality, (ii) how step-wise token-level independence is a limitation for faster generation speed together with some of the (early) ways proposed to address this, and (iii) how block diffusion approaches such as BD3-LMS enable variable-length generation as well as KV caching similar to what’s done in AR-LLMs for efficient inference (which are particularly important for long sequence generation with Masked-DLMs). 
In the next post, I will take a look into reinforcement learning (RL) for Masked-DLMs, covering the key challenges there, followed by recent proposals for addressing them; the focus will be on policy-gradient approaches such as PPO and GRPO, which have recently been instrumental in AR-LLM post-training.]]></summary></entry><entry><title type="html">Diffusion Language Models – Part Two (What kinds are there and how is one trained?)</title><link href="/articles/25/Diffusion_LM_P2" rel="alternate" type="text/html" title="Diffusion Language Models – Part Two (What kinds are there and how is one trained?)" /><published>2025-08-01T09:00:00+08:00</published><updated>2025-08-01T09:00:00+08:00</updated><id>/articles/25/Diffusion_LM_P2</id><content type="html" xml:base="/articles/25/Diffusion_LM_P2"><![CDATA[<p><span class="newthought">There are three variants of diffusion language models</span> (<strong>DLMs</strong>), and the nuances of each impact their training, inference and scalability. I think it will be helpful to situate them amongst each other before we proceed further; and so in this post, I will first introduce the variants and discuss their differences as well as what they mean. I will then go on to outline the training procedure for the Masked (<strong>Masked-DLM</strong>) variant as it is currently receiving significant attention in research<label for="sidenote-vogue" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-vogue" class="margin-toggle" checked="" /><span class="sidenote">And quite importantly, with useful extensions into multimodality and reinforcement learning already carried out with them.</span>, before ending off with a summary of two interesting pieces of DLM research<label for="sidenote-research" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-research" class="margin-toggle" checked="" /><span class="sidenote">
<br />◼️ (Wen et al, 2025) <a href="https://arxiv.org/abs/2507.11097">The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs</a> 
<br /> ◼️ (Prabhudesai et al, 2025) <a href="https://arxiv.org/abs/2507.15857">Diffusion Beats Autoregressive in Data-Constrained Settings</a></span> that surfaced recently along with thoughts on their broader implications. If you wish, you can use these links to skip to the specific sections of this post: <a href="#-1-what-variants-of-dlms-are-there">1. DLM variants</a>; <a href="#-2-apple-to-apple-whats-the-benefit-of-one-over-another">2. Comparing the variants</a>; <a href="#-3-what-is-the-training-procedure-for-a-dlm">3. Training a masked DLM</a>; and <a href="#-4-whats-come-up-recently-in-dlm-research">4. Recent findings and potential implications</a>
<!-- NOTE: "-" needed for space between emoji and first word --></p>

<h3 id="-1-what-variants-of-dlms-are-there">🎨 1. What variants of DLMs are there?</h3>
<p>DLM approaches can be described as being of the (i) <strong>Gaussian</strong>, (ii) <strong>Uniform</strong>, or (iii) <strong>Masked</strong> variants, based on how the original training data instances<label for="sidenote-datainstance" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-datainstance" class="margin-toggle" checked="" /><span class="sidenote">e.g. an instance could be a sentence such as “dogs are our friends” for the language modeling task.</span> are “corrupted” (or how the notion of noise is conceptualised and how it is introduced into the input during the forward process<label for="sidenote-reverse" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-reverse" class="margin-toggle" checked="" /><span class="sidenote">. . . and hence (by design) the reverse process as well.</span>). The first two are named for the distributions they draw noise from, and the third is likely named for a special token (e.g. <code class="language-plaintext highlighter-rouge">[MASK]</code>) added to the vocabulary to give a noised state that “masks” the original token. To give a more concrete feel of each variant, I will use the following toy example to illustrate some of them.</p>

<div style="background-color: #249ae9ff; max-width: 50%; color: white; padding: 20px; border-radius: 8px; margin: 10px;">
  <h3 style="margin: 0 0 15px 0; width: 100%;">Settings for a toy example</h3>   
  <p style="margin: 0; width: 100%; text-align: justify;">   
    Imagine we have a toy language with 13 lexical units in the vocabulary V = <em>{"we", "you", "they", "our", "your", "their", "cat", "dog", "friend", "is", "are", "s", "[PAD]"}</em>, which allows one to form sequences such as <em>"dog s are our friend s"</em>, <em>"our dog s are your s"</em>, <em>"your cat is our friend"</em> etc. 
    <br /><br />
    <small><small>The special marker [PAD] is used as a filler to ensure that all sequences are of the same length, e.g. "<em>your cat is our friend [PAD]</em>", so that it has the same six-unit length as the other two dog-sentences, allowing us to do batched generations.</small></small>
  </p>
</div>

<p>◼️ <strong>Gaussian-DLM</strong>: This variant, also referred to as “continuous-time” in the literature, is the closest in form to the ones found in image diffusion models<label for="sidenote-imagediff" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-imagediff" class="margin-toggle" checked="" /><span class="sidenote">Such as Google’s <a href="https://deepmind.google/models/imagen/">Imagen</a>, OpenAI’s <a href="https://openai.com/index/dall-e-3/">DALL-E</a> and Stability AI’s <a href="https://stability.ai/stable-image">Stable Diffusion</a></span>, so if you are familiar with those, Gaussian-DLMs should feel quite familiar. The architecture here involves: (i) embedding the tokens of a sequence so that each of them is represented with a real-valued vector<label for="sidenote-embed" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-embed" class="margin-toggle" checked="" /><span class="sidenote">This is similar to the first step in auto-regressive LLMs (<strong>AR-LLMs</strong>) (for Transformers as well as RNNs).</span>, and (ii) adding to them noise in the form of random vectors drawn from a Gaussian distribution \(\epsilon_{t} \sim N(\mu_{t}, \sigma_{t}^2)\) at each step. Using the toy example, what this means in practice is that we keep a look-up table of 13 random vectors (one for every unit in our toy language’s vocabulary), each of dimension \(d\). So the sequence <em>“dog s are our friend s”</em> will be represented by six token vectors, and we add noise to each token vector such that every token is indistinguishable from pure Gaussian noise at the limit of the forward process, i.e. \(t \to \infty\) (or in practice, some defined terminal timestep \(T\) so that the total noise added is known; it is usually set at 1.0 in diffusion models). 
A few formulations have been proposed for learning the reverse process in these models, including score-matching (which appears most commonly), as well as latent variable models and stochastic differential equations <a href="https://arxiv.org/abs/2211.15089" title="Continuous diffusion for categorical data">(Dieleman et al, 2023)</a>. Examples of Gaussian-DLMs include (i) Diffusion-LM <a href="https://arxiv.org/abs/2205.14217" title="Diffusion-LM Improves Controllable Text Generation">(Li et al, 2022)</a>, which appeared in 2022 and first drew attention towards diffusion modeling for text generation, (ii) CDCD <a href="https://arxiv.org/abs/2211.15089" title="Continuous diffusion for categorical data">(Dieleman et al, 2023)</a>, and (iii) PLAID <a href="https://arxiv.org/abs/2305.18619" title="Likelihood-Based Diffusion Language Models">(Gulrajani &amp; Hashimoto, 2023)</a>.</p>
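<p>To make the forward process above concrete, here is a minimal numpy sketch of variance-preserving Gaussian noising applied to the toy example’s token embeddings. The vocabulary ids, embedding dimension and the \(\bar\alpha_t\) values are assumptions for illustration only, not taken from any specific paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 13-unit vocabulary, d-dimensional embeddings.
vocab_size, d = 13, 8
embedding_table = rng.normal(size=(vocab_size, d))  # look-up table of token vectors

def gaussian_forward(x0, alpha_bar_t, rng):
    """Variance-preserving forward noising:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# "dog s are our friend s" -> (hypothetical) token ids, then a 6 x d array
token_ids = [7, 11, 10, 3, 8, 11]
x0 = embedding_table[token_ids]

# Early in the forward process the signal dominates; near the limit the
# embeddings are essentially pure Gaussian noise.
x_early = gaussian_forward(x0, alpha_bar_t=0.99, rng=rng)
x_late = gaussian_forward(x0, alpha_bar_t=1e-4, rng=rng)
print(np.abs(x_early - x0).mean())  # small perturbation
print(np.abs(x_late - x0).mean())   # almost all signal destroyed
```

<p>The reverse process then amounts to learning to walk back from the nearly-pure-noise state to valid token embeddings.</p>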

<!-- <label for='sidenote-early-discrete' class='margin-toggle sidenote-number'></label><input type='checkbox' id='sidenote-early-discrete' class='margin-toggle' checked/><span class='sidenote'>Earlier models examined diffusion for discrete sequences such as music <a href='https://arxiv.org/abs/2103.16091' title='Symbolic Music Generation with Diffusion Models'>(Mittal et al, 2021)</a>, but not specifically on language and at the same scale of a language model. Initial explorations can be found in some of a few earlier discrete diffusion work such as <a href='https://arxiv.org/abs/2107.03006' title='Structured Denoising Diffusion Models in Discrete State-Spaces'>(Austin et al, 2021)</a> -- see §3.1 there</span> -->

<p>◼️ <strong>Uniform-DLM</strong>: Instead of the real-valued embedding vectors used in Gaussian-DLMs, this approach represents each token of a sequence with a one-hot vector<label for="sidenote-onehot" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-onehot" class="margin-toggle" checked="" /><span class="sidenote">i.e. the vector for each token is a Dirac distribution with all probability mass concentrated at the actual token’s index in the vocabulary.</span> of dimension the size of the vocabulary: it is 1 at the position of the token in the vocabulary, and 0 elsewhere. Here, noise is added to the data instance<label for="sidenote-udlmnoise" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-udlmnoise" class="margin-toggle" checked="" /><span class="sidenote">Note that noise is added/removed independently for each token in the sequence – “<em>We make the assumption that the forward noising process is applied independently across a sequence… the denoising process factorizes independently across tokens.</em>” <a href="https://arxiv.org/abs/2406.07524" title="Simple and Effective Masked Diffusion Language Models">(Sahoo et al, 2024)</a> </span> via the application of a transition matrix (\(Q_t\), which determines the probabilities of the token staying unchanged or transitioning to each of the other vocab units) such that the initially concentrated probability mass in the token vector gradually distributes over all of the other vocab units. At the limit of the forward process, the one-hot vector applied with \(Q_{t=T}\) would give a uniform distribution (i.e. each vocab unit is equally likely; no useful signal remains to deduce what the original token was) that has also reached stationarity, i.e. additional steps cannot change the uniform distribution. 
Examples of this approach include the <code class="language-plaintext highlighter-rouge">Uniform</code> versions of the models trained in <a href="https://arxiv.org/abs/2310.16834" title="Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution">(Lou et al, 2023)</a> and <a href="https://arxiv.org/abs/2310.16834" title="Simple Guidance Mechanisms for Discrete Diffusion Models">(Schiff et al, 2025)</a>.<label for="sidenote-wheel-udlm" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-wheel-udlm" class="margin-toggle" checked="" /><span class="sidenote">To extend the <strong>Wheel</strong> analogy to Uniform-DLM: (i) instead of each panel on the gameboard being two-sided (being either white/blank or a character), they would be |\(V\)|-sided (i.e. as many sides as the vocabulary and without white/blank), (ii) the gameboard will start with some completely scrambled combination of characters, and (iii) at every guess, the contestant can flip multiple panels to any other character in the vocabulary. Another way to look at it could be as a slots machine (see GIF below).</span></p>

<p><label for="marginfigure-slots" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-slots" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/udlm.gif" /><br />Source: Slots GIF – <a href="https://discrete-diffusion-guidance.github.io/">https://discrete-diffusion-guidance.github.io/</a></span></p>
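<p>The transition-matrix noising described above can be sketched in a few lines of numpy; the per-step leak probability \(\beta\) below is an assumed illustrative value, and the matrix is one common parameterisation of a uniform kernel (identity mixed with a uniform matrix):</p>

```python
import numpy as np

V = 13       # toy vocabulary size
beta = 0.15  # per-step probability mass leaked uniformly (assumed value)

# Uniform transition matrix: with prob (1 - beta) the token keeps its
# identity, with prob beta it transitions to a vocab unit drawn uniformly.
Q = (1.0 - beta) * np.eye(V) + beta * np.full((V, V), 1.0 / V)

x = np.zeros(V)
x[7] = 1.0  # one-hot vector for "dog" (hypothetical index 7 in the toy vocab)

# Repeatedly applying Q spreads the probability mass; the stationary
# distribution is uniform, after which further steps change nothing.
for _ in range(200):
    x = x @ Q
print(x.round(4))  # ~0.0769 everywhere, i.e. 1/13 for each vocab unit
```

<p>The stationarity claim in the text is visible here: once \(x\) reaches the uniform vector, multiplying by \(Q\) again returns the same vector.</p>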

<p>◼️ <strong>Masked-DLM</strong>: This is also referred to as modeling discrete diffusion with an “absorbing state”, first appearing in <a href="https://arxiv.org/abs/2107.03006" title="Structured denoising diffusion models in discrete state-spaces">(Austin et al, 2021)</a>. Essentially, noise is represented by the special token mentioned above<label for="sidenote-maskexp" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-maskexp" class="margin-toggle" checked="" /><span class="sidenote"> e.g. the sentence <em>“dog s are our friend s”</em> could be gradually masked to “dog [MASK] are our [MASK] s” and finally to “[MASK] [MASK] [MASK] [MASK] [MASK] [MASK]” </span>, i.e. a token in the original sequence is corrupted by being “absorbed” into this special state (rather than transitioning to others). This is an important characteristic of most current Masked-DLMs, i.e. in the forward process, once a token transitions to <code class="language-plaintext highlighter-rouge">[MASK]</code>, it stays in that state throughout the subsequent steps. 
Conversely, in the reverse process, once a <code class="language-plaintext highlighter-rouge">[MASK]</code> token transitions to a vocab unit \(v\) other than <code class="language-plaintext highlighter-rouge">[MASK]</code>, it also stays as \(v\) in all subsequent steps.<label for="sidenote-wheelmdlm" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-wheelmdlm" class="margin-toggle" checked="" /><span class="sidenote">This variant matches the <strong>Wheel</strong> analogy in the previous <a href="https://hankelvin.github.io/articles/25/Diffusion_LM_P1">post</a>; the white panels correspond to the <code class="language-plaintext highlighter-rouge">[MASK]</code> token, behind each white panel is one character and once a contestant makes a <del>correct</del> any guess for it, it cannot be changed.</span> Masked-DLM is the basis for the LLaDA <a href="https://arxiv.org/abs/2502.09992" title="Large Language Diffusion Models">(Nie et al, 2025)</a> and Dream <a href="https://hkunlp.github.io/blog/2025/dream" title="Dream 7B"> (Ye et al, 2025)</a> models, as well as multimodal versions such as LLaDA-V <a href="https://arxiv.org/abs/2505.16933" title="LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning">(You et al, 2025)</a> and MMaDA <a href="https://arxiv.org/abs/2505.15809" title="MMaDA: Multimodal Large Diffusion Language Models">(Yang et al, 2025)</a>. It was also explored in SEDD <a href="https://arxiv.org/abs/2310.16834" title="Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution">(Lou et al, 2023)</a>, which Inception Labs’ Mercury models are reportedly based on<label for="sidenote-mercury" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-mercury" class="margin-toggle" checked="" /><span class="sidenote">The Mercury technical report <a href="https://arxiv.org/abs/2506.17298" title="Mercury: Ultra-Fast Language Models Based on Diffusion">(Inception 
Labs, 2025)</a> states that “<em>Our methods extend <a href="https://arxiv.org/abs/2310.16834" title="Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution">(Lou et al, 2023)</a> through careful modifications to the data and computation to scale up learning.</em>”. Although they do not state which of the Masked-DLM or Uniform-DLM variants studied in (Lou et al, 2023) they ended up leveraging, it seems likely that the Masked-DLM approach was taken given the poorer perplexity figures obtained by Uniform-DLM in their work as well as in earlier work.</span>.</p>
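<p>The absorbing forward process is simple enough to sketch directly; the per-step masking probability below is an assumed illustrative value rather than a real schedule:</p>

```python
import random

random.seed(0)
MASK = "[MASK]"

def absorbing_forward_step(tokens, mask_prob, rng=random):
    """One forward step of the absorbing process: every still-visible token
    is "absorbed" into [MASK] with probability mask_prob; tokens that are
    already [MASK] stay [MASK] for all subsequent steps."""
    return [MASK if tok != MASK and rng.random() < mask_prob else tok
            for tok in tokens]

seq = "dog s are our friend s".split()
trajectory = [seq]
for _ in range(50):  # enough steps that all tokens are absorbed w.h.p.
    seq = absorbing_forward_step(seq, mask_prob=0.3)
    trajectory.append(seq)

# Masking is monotone: once a position is [MASK] it never reverts, and by
# the end of the process every position has (almost surely) been absorbed.
print(trajectory[0], trajectory[-1])
```

<p>The monotonicity is exactly the “absorbing state” property: the set of masked positions can only grow in the forward process, mirroring how the set of committed tokens can only grow in the reverse process.</p>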

<p><ins><strong>Light round-up:</strong></ins> At the limit \(T\) in their forward processes, each class of DLM can be summarised as follows: for Gaussian-DLMs each token in a sequence can transition to any other token (state) reachable by the accumulated variance of the sampled noise; for Uniform-DLMs, each token can transition to any other state with equal probability; and in the case of Masked-DLMs, each token lands on the special <code class="language-plaintext highlighter-rouge">[MASK]</code> token.</p>

<p><ins><strong>Masked-DLM connections with BERT:</strong></ins> Recall that BERT <a href="https://arxiv.org/abs/1810.04805" title="BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding">(Devlin et al, 2018)</a> – an encoder-only model that could be used for tasks such as cloze-style QA<label for="sidenote-cloze" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-cloze" class="margin-toggle" checked="" /><span class="sidenote">e.g. the most likely token to fill the blank in <em>“Paris is the capital of ___”</em> is <em>“France”</em>. </span> and which is a precursor to the LLMs of today – was trained with a masked language modeling objective, i.e. random portions (~15%) of the sentences in the training data are masked, and the model has to learn to predict the masked words. This is very similar to the Masked-DLM approach, except that in Masked-DLMs this unmasking is done across multiple steps (instead of a single pass), and for the entire sequence (the last Masked-DLM inference step would most closely align with the task in the BERT MLM objective). Accordingly, some works have explored leveraging BERT-style (i.e. encoder-only) models for DLM: see DiffusionBERT <a href="https://aclanthology.org/2023.acl-long.248" title="DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models">(He et al, 2023)</a> which proposes further training BERT with time embeddings and a special diffusion noise schedule to use it <em>à la DLM</em>. See also <em>‘Comparison to BERT’</em> in §6 of <a href="https://arxiv.org/abs/2406.07524" title="Simple and Effective Masked Diffusion Language Models">(Sahoo et al, 2024)</a> for a discussion on this connection.</p>
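<p>The contrast between the two masking regimes can be made concrete: BERT always masks around a fixed 15% of positions, while Masked-DLM training first samples a noise level \(t \sim U(0,1)\) and then masks each position with probability \(t\), so the model is trained at every corruption level. A minimal sketch (the ratios and sequence length are arbitrary):</p>

```python
import random

random.seed(1)

def bert_style_mask(n_tokens, rng=random):
    # BERT: a fixed ~15% of positions are masked, in a single pass.
    return [rng.random() < 0.15 for _ in range(n_tokens)]

def diffusion_style_mask(n_tokens, rng=random):
    # Masked-DLM training: sample a noise level t ~ U(0, 1) first, then
    # mask each position independently with probability t, so the model
    # sees corruption levels ranging from ~0% to ~100%.
    t = rng.random()
    return t, [rng.random() < t for _ in range(n_tokens)]

n = 1000
bert_ratio = sum(bert_style_mask(n)) / n  # always near 0.15
t, mask = diffusion_style_mask(n)
dlm_ratio = sum(mask) / n                 # near whatever t was drawn
print(bert_ratio, t, dlm_ratio)
```

<p>Averaged over draws of \(t\), the diffusion objective therefore covers the single-pass BERT objective as one slice of a whole family of masking levels.</p>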

<h3 id="-2-apple-to-apple-whats-the-benefit-of-one-over-another">🍏 2. Apple-to-apple: what’s the benefit of one over another?</h3>
<p>There is more focus on Masked-DLM and Uniform-DLM over Gaussian-DLM currently, as the latter has been met with comparatively less success (in terms of achievable perplexity).<label for="sidenote-clamp" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-clamp" class="margin-toggle" checked="" /><span class="sidenote">This is because reconciling the noised real-valued vectors in Gaussian-DLMs with discrete token space is not trivial, and required many special tricks for training and inference to work. For instance, it was necessary to use nearest neighbour search and clamping to a valid token vector at every reverse diffusion step in Diffusion-LM <a href="https://arxiv.org/abs/2205.14217" title="Diffusion-LM Improves Controllable Text Generation">(Li et al, 2022)</a> to reach comparable performance with AR-LLMs.</span> In the most recent round of published DLM research (i.e. ICLR and ICML in 2025), significant attention has been focused on Masked-DLM approaches<label for="sidenote-duo" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-duo" class="margin-toggle" checked="" /><span class="sidenote">Though it should be noted that interesting work on Gaussian-DLM &amp; Uniform-DLM is also continuing, I think notably in <a href="https://arxiv.org/abs/2506.10892" title="The Diffusion Duality">(Sahoo et al, 2025)</a>, where they work out a proof connecting Uniform-DLM to Gaussian-DLM as a special case via the argmax operator, which opens up the possibility to leverage techniques in Gaussian-DLM for training and inferencing on Uniform-DLM.</span>, as they achieve better perplexity compared to Uniform-DLM<label for="sidenote-perplexity" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-perplexity" class="margin-toggle" checked="" /><span class="sidenote">Note that lower is better for perplexity; compare the SEDD (Uniform) and SEDD (Absorb) results in 
Table 1 of <a href="https://arxiv.org/abs/2310.16834v2" title="Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution">(Lou et al, 2023)</a>; as well as D3PM uniform vs D3PM absorbing in Figure 2 and Table 2 of <a href="https://arxiv.org/abs/2107.03006" title="Structured Denoising Diffusion Models in Discrete State-Spaces">(Austin et al, 2021)</a>.</span>. Intuitively, this is not unexpected, as it would seem easier to learn a Masked-DLM (where transitions are constrained once the absorbing <code class="language-plaintext highlighter-rouge">[MASK]</code> state is reached in the forward process) compared to Uniform-DLM (where transitions at each timestep could be to any other tokens, i.e. a much larger space of possible transitions).</p>

<!-- - "such as efficient sampling algorithms based on advanced ODE solvers, or classifier-free guidance." (Dieleman et al, 2023) -->
<p>However, several shortcomings of the Masked-DLM approach have been raised (hence the continued research interest in Gaussian-DLM and Uniform-DLM approaches). Chief amongst them are: (i) the potential over-restrictiveness of the vanilla Masked-DLM masking/unmasking procedure<label for="sidenote-vanillamaskinference" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-vanillamaskinference" class="margin-toggle" checked="" /><span class="sidenote">I use “vanilla” because there are approaches emerging that propose more advanced inference procedures such as remasking that could address this. For instance <a href="https://arxiv.org/abs/2503.00307" title="Remasking discrete diffusion models with inference-time scaling">(Wang et al, 2025)</a>.</span>; and (ii) the difficulty of introducing classifier-free guidance into Masked-DLMs.</p>

<p>◼️ regarding shortcoming (i): “<strong>Self-correction</strong>” refers to how tokens can continue to transition to other tokens over the reverse diffusion steps; it is connected to the “coarse-to-fine” property (in the truest sense) that DLMs are frequently touted to possess over AR-LLMs. While self-correction is inherently possible in Gaussian-DLMs and Uniform-DLMs given their design<label for="sidenote-predictor" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-predictor" class="margin-toggle" checked="" /><span class="sidenote"><ins><em>Sidenote:</em></ins> they could also be steered with predictor-corrector models like in image diffusion models <a href="https://openreview.net/forum?id=VM8batVBWvg" title="Discrete Predictor-Corrector Diffusion Models for Image Synthesis">(Lezama et al, 2023)</a></span>, it is not possible with Masked-DLMs (without some engineering). Theoretically, this lack of self-correction could give rise to errors in the inference steps, which would then propagate<label for="sidenote-selfcorrect" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-selfcorrect" class="margin-toggle" checked="" /><span class="sidenote">To give a concrete example, say we start from “[MASK] [MASK] [MASK] [MASK] [MASK] [MASK]”, errors (e.g. from imperfect model learning) could bring us to “[MASK] [MASK] are your s [PAD]” and since unmasked tokens cannot transition any further, this leaves limited choice but to go to “our dog are your s [PAD]” resulting in a number agreement error. <br /><ins><em>Sidenote:</em></ins> It could be interesting to carry out a study at-scale of the tokens (and the types of words they form) that get unmasked across the Masked-DLM inference steps – e.g. 
see Figures 10-15 of <a href="https://arxiv.org/abs/2107.03006" title="Structured Denoising Diffusion Models in Discrete State-Spaces">(Austin et al, 2021)</a>, and establish whether there are any significant patterns in what parts of a sentence/paragraph get unmasked and fixed first.</span>. That said, some recent studies have proposed remasking strategies to address such issues – e.g. <a href="https://arxiv.org/abs/2407.21243" title="Informed Correctors for Discrete Diffusion Models">(Zhao et al, 2025)</a>, <a href="https://arxiv.org/abs/2502.09992" title="Large Language Diffusion Models">(Nie et al, 2025)</a>. <a href="https://arxiv.org/pdf/2503.00307v1" title="Remasking Discrete Diffusion Models with Inference-Time Scaling">(Wang et al, 2025)</a> also claims to provably show the soundness of applying remasking without needing special considerations in the training and inference of Masked-DLMs. In practice, the LLaDA authors <a href="https://arxiv.org/pdf/2502.09992" title="Large Language Diffusion Models"> (Nie et al, 2025)</a> were able to reach AR-LLM performance for their model whilst leveraging generation with remasking strategies (see §2.4 of their paper).</p>
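<p>One common decoding strategy in this family is low-confidence remasking, of the kind LLaDA describes in §2.4: predict all masked positions at each step, commit only the most confident predictions, and remask the rest. A minimal numpy sketch, with a fixed random stand-in where the real model’s predicted distributions would go (the sequence length, vocab size and <code class="language-plaintext highlighter-rouge">n_keep</code> are assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = -1

def reverse_step_with_remasking(x, predict_fn, n_keep):
    """One reverse step with low-confidence remasking: predict every masked
    position, commit only the n_keep most confident predictions, and leave
    the rest masked to be decided in later steps."""
    probs = predict_fn(x)                      # (seq_len, vocab_size)
    masked = np.where(x == MASK_ID)[0]
    conf = probs[masked].max(axis=-1)          # confidence at masked positions
    keep = masked[np.argsort(-conf)[:n_keep]]  # most confident masked positions
    x = x.copy()
    x[keep] = probs[keep].argmax(axis=-1)      # commit those predictions only
    return x

# Stand-in "model": fixed random per-position distributions (illustrative only).
fixed_probs = rng.random((6, 13))
fixed_probs /= fixed_probs.sum(axis=-1, keepdims=True)
predict_fn = lambda x: fixed_probs

x = np.full(6, MASK_ID)
steps = 0
while (x == MASK_ID).any():
    x = reverse_step_with_remasking(x, predict_fn, n_keep=2)
    steps += 1
print(x, steps)  # 6 positions committed 2 at a time -> 3 steps
```

<p>The point of committing only high-confidence positions first is exactly the error-propagation argument above: early commitments constrain everything that follows, so it pays to defer the uncertain ones.</p>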

<p>◼️ regarding shortcoming (ii): <strong>Guidance</strong> originates from diffusion modeling for images and refers to conditioning information (for instance a label such as “dogs”, or a text prompt such as “friendly-looking dogs”) we can add to “guide” the reverse diffusion process towards generating an image with certain desired properties.<label for="sidenote-cfg" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-cfg" class="margin-toggle" checked="" /><span class="sidenote">This was initially proposed with the use of gradients from a classifier <a href="https://arxiv.org/abs/2105.05233" title="Diffusion models beat GANs on image synthesis">(Dhariwal &amp; Nichol, 2021)</a> that can identify the classes of the images during training; but since training is over a diffusion process, the classifier had to be trained to be able to identify the classes across the noising process, which can be complicated to achieve. <a href="https://arxiv.org/abs/2207.12598" title="Classifier-Free Diffusion Guidance">(Ho &amp; Salimans, 2022)</a> established a more efficient and effective way to train an image diffusion model for guidance without the need for a separate classifier (classifier-free guidance, or <strong>CFG</strong>) which is now widely used. See this Sander Dieleman <a href="https://sander.ai/2022/05/26/guidance.html">post</a> for an overview. For a more visual explanation of CFG (and also diffusion models in general), have a look at this recent <a href="https://youtu.be/iv-5mZ_9CPY?t=1791">Welch Labs-3Blue1Brown explainer</a>.</span> Guidance is useful for DLMs too, as a way to steer towards safety and better matching user intent, e.g. to adhere to style requirements or reflect concepts such as non-toxicity, inclusivity, empathy, neutrality etc. 
An example of such work is DGLM <a href="https://aclanthology.org/2024.findings-acl.887" title="Diffusion Guided Language Modeling">(Lovelace et al, 2024)</a>, which uses a Gaussian-DLM to produce a “candidate” continuation of a prompt plus some guidance condition, which<label for="sidenote-cand" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-cand" class="margin-toggle" checked="" /><span class="sidenote">The input to the AR-LLM is the probability distribution of the candidate denoised/”refined” by the Gaussian-DLM from noise.</span> is then put through a decoder-only AR-LLM to verbalise.<label for="sidenote-weinberger" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-weinberger" class="margin-toggle" checked="" /><span class="sidenote">For a quick overview: see Kilian Weinberger’s <a href="https://youtu.be/klW65MWJ1PY?t=2757">presentation</a> on the work.</span> However, standard classifier-free guidance (<strong>CFG</strong>) is designed with diffusion models trained with the score-matching objective (which learns a model to match the <em>score</em>, or the gradient of the log probability density function with respect to the data) in mind. 
Score-matching in the continuous sense is not used for Masked-DLMs<label for="sidenote-discretescore" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-discretescore" class="margin-toggle" checked="" /><span class="sidenote">Though there is concrete score matching <a href="https://arxiv.org/pdf/2211.00802" title="Concrete Score Matching: Generalized Score Matching for Discrete Data">(Meng et al, 2023)</a> for the discrete case (modified in SEDD <a href="https://arxiv.org/abs/2310.16834" title="Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution">(Lou et al, 2023)</a>), it is an approximation and does not address how classical guidance (as in the continuous setting for images) can be applied (without potentially scrambling the discrete semantics). Moreover, recent Masked-DLM work such as in the LLaDA family of models directly sets the training objective as the cross-entropy loss on the masked tokens (Eq 3 in <a href="https://arxiv.org/pdf/2502.09992" title="Large Language Diffusion Models">(Nie et al, 2025)</a>), a further departure from the score-matching formulation for learning DLMs.</span>, hence it is not feasible to transfer existing CFG techniques to Masked-DLMs. Notably however, <a href="https://arxiv.org/abs/2410.18514" title="Scaling up Masked Diffusion Models on Text">(Nie et al, 2024)</a> proposed <em>unsupervised classifier-free guidance</em>, a training objective to allow CFG in Masked-DLMs without using paired data (e.g. prompt-continuation, question-answer); they found that unsupervised CFG fine-tuning of a Masked-DLM gives better performance compared to using standard CFG.</p>
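<p>For reference, the combination that CFG computes is itself a one-liner; the sketch below applies it at the logit level to make the guidance-scale intuition concrete (the logit values are made up, and applying the combination to logits rather than scores is the common discrete-space adaptation, not a claim about any specific Masked-DLM):</p>

```python
import numpy as np

def cfg_combine(cond_logits, uncond_logits, w):
    """CFG-style combination: l = l_uncond + w * (l_cond - l_uncond).
    w = 0 recovers the unconditional model, w = 1 the conditional one,
    and w > 1 extrapolates further towards the conditioning signal."""
    return uncond_logits + w * (cond_logits - uncond_logits)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

cond = np.array([2.0, 0.5, 0.1])    # hypothetical logits given a prompt/label
uncond = np.array([1.0, 1.0, 1.0])  # hypothetical logits without conditioning

probs = {w: softmax(cfg_combine(cond, uncond, w)) for w in (0.0, 1.0, 3.0)}
for w, p in probs.items():
    print(w, p.round(3))
# Increasing w concentrates probability mass on the token favoured by the
# conditional model (index 0 here).
```

<p>The two forward passes (conditional and unconditional) are what make classic CFG attractive: no separate classifier needs to be trained across noise levels.</p>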

<h3 id="️️-3-what-is-the-training-procedure-for-a-dlm">🏋️‍♀️ 3. What is the training procedure for a DLM?</h3>
<p>To get a general sense of the DLM training procedure, it will be useful to look at the LLaDA approach as it includes language model pretraining and instruction fine-tuning, both of which are fundamental for general-purpose LLM usage. Note that most work parameterise their DLMs using the Transformer architecture<label for="sidenote-transformer" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-transformer" class="margin-toggle" checked="" /><span class="sidenote">Paralleling the trend in image diffusion too <a href="https://arxiv.org/pdf/2202.04200" title="MaskGIT: Masked Generative Image Transformer">(Chang et al, 2022)</a> as well as <a href="https://arxiv.org/pdf/2212.09748" title="Scalable Diffusion Models with Transformers">(Peebles &amp; Xie, 2023)</a></span>, but this is not a must.<label for="sidenote-rush" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-rush" class="margin-toggle" checked="" /><span class="sidenote">Alexander Rush has a useful <a href="https://www.youtube.com/watch?v=WjAUX23vgfg">explainer video</a> of diffusion models for text in general.</span> Here I will briefly touch on two key areas of the training procedure: (i) the noising in the forward process, and (ii) the training objective. For more details and code on LLaDA’s training, see their <a href="https://ml-gsai.github.io/LLaDA-demo/">blogpost</a> and <a href="https://github.com/ML-GSAI/LLaDA/blob/main/GUIDELINES.md">codebase</a>.</p>

<p><br /></p>

<figure><figcaption><span>Image source: illustrating the masking and prediction procedure for Masked-DLM pretraining and fine-tuning – <a href="https://arxiv.org/pdf/2502.09992" title="Large Language Diffusion Models"> (Nie et al, 2025)</a> <br /><br /></span></figcaption><img src="/assets/img/llada_training.png" /></figure>

<p>◼️ <strong>Noising:</strong> The general idea is to sample some noise level for a given data instance (see graphic above for how masking is done for pretraining and for instruction fine-tuning), which will be used to determine the tokens to mask (e.g. in footnote 10 above on the sequence “dog s are our friend s”) in the forward process. The noising schedule (e.g. linear, geometric or even cosine) can impact performance; for instance, if we model it such that masking happens at a slower rate as \(t \to T\) in the forward process, then this would align with an unmasking sequence in the reverse denoising process where we unmask comparatively fewer tokens at first before gradually increasing.<label for="sidenote-schedule" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-schedule" class="margin-toggle" checked="" /><span class="sidenote">Intuitively, this makes sense since tokens are unmasked independently of each other at each time step, and given the fixedness of Masked-DLM unmasking, we want fewer commitments at the start to reduce the unmasking of conflicting tokens, which would compound in subsequent steps.</span> On another note, multiple epochs of training over the data (together with full attention over the sequence length at every step, this drives DLM training compute up by as much as 64x; see <a href="https://arxiv.org/pdf/2305.18619" title="Likelihood-Based Diffusion Language Models">(Gulrajani &amp; Hashimoto, 2023)</a>), with (almost surely) different noise sampled on the same data instance each time, are needed in practice for DLMs to reach the same level of performance on perplexity as AR-LLMs (which are typically trained with a single epoch over the data); while less efficient to train, it is this procedure that enables the DLM to generate non-autoregressively.</p>
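<p>To make the noising step concrete, here is a minimal sketch of one forward-process draw for a Masked-DLM under a linear schedule, where the noise level \(t\) is sampled uniformly and each token is masked independently with probability \(t\) (the toy vocabulary ids and <code class="language-plaintext highlighter-rouge">MASK_ID</code> are hypothetical):</p>

```python
import numpy as np

MASK_ID = 12  # hypothetical index of [MASK] appended to the toy vocabulary

def forward_mask(token_ids, t, rng):
    """One draw from the Masked-DLM forward process at noise level t.

    Under a linear schedule, each token is independently replaced by
    [MASK] with probability t; t = 1 corresponds to the fully-masked
    terminal state of the forward process.
    """
    token_ids = np.asarray(token_ids)
    is_masked = rng.random(token_ids.shape) < t
    return np.where(is_masked, MASK_ID, token_ids), is_masked

rng = np.random.default_rng(0)
x0 = np.array([7, 11, 10, 3, 8, 11])  # "dog s are our friend s" (hypothetical ids)
xt, is_masked = forward_mask(x0, t=0.5, rng=rng)
```

<p>Non-linear schedules only reshape how \(t\) is drawn or warped before this step; the per-token independent masking stays the same.</p>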

<p>◼️ <strong>Training objective:</strong> In the earliest diffusion models, the training objective involved computing the evidence lower bound (ELBO)<label for="sidenote-elbo" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-elbo" class="margin-toggle" checked="" /><span class="sidenote">Unlike AR-LLMs, which factorise the likelihood of a sequence into conditional probabilities (i.e. a probability over the vocabulary at every step) that are easy to evaluate, the likelihood of a sequence as modeled in a DLM is intractable, hence the use of the ELBO.</span>; subsequently, a simplification to the original ELBO was proposed with the score-matching approach<label for="sidenote-score" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-score" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://arxiv.org/pdf/2006.11239" title="Denoising Diffusion Probabilistic Models">(Ho et al, 2020)</a> showed that the original ELBO in the diffusion objective can be further simplified by modeling the probability distributions as <em>scores</em> (gradient of the log prob with respect to the data), and the objective can be reduced to minimising the mean-squared error between the entries of the predicted score vector and the true score, which is substantially easier to do.</span>. Since the noise level varies across time, it was also supplied to the model via a time variable (e.g. as an additional embedding in CDCD).
Recently, it was shown (quite concurrently, based on version dates on arXiv) in MD4 <a href="https://arxiv.org/abs/2406.04329v4" title="Simplified and Generalized Masked Diffusion for Discrete Data">(Shi et al, 2024)</a>, MDLM <a href="https://arxiv.org/pdf/2406.07524" title="Simple and Effective Masked Diffusion Language Models">(Sahoo et al, 2024)</a> and RADD <a href="https://arxiv.org/abs/2406.03736" title="Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data">(Ou et al, 2024)</a> that, for Masked-DLMs, a tighter bound could be obtained by scoring the predicted logits at each timestep against the ground-truth (“<em>clean data</em>”) labels using cross-entropy, and that the need to explicitly capture the time variable can be dropped, which greatly simplifies the training objective (making it very similar to the standard AR-LLM training objective). Notably, the use of this cross-entropy objective is validated empirically by the strong evaluations of the LLaDA model, whose training was done with it.</p>
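<p>A minimal sketch of this simplified objective for a single instance, i.e. a Monte-Carlo estimate of the cross-entropy on masked positions with the \(1/t\) weighting of Eq 3 in the LLaDA paper (the per-token normalisation here is my own simplification):</p>

```python
import numpy as np

def masked_dlm_loss(logits, x0, is_masked, t):
    """Cross-entropy on masked positions only, weighted by 1/t.

    logits:    (L, V) denoiser outputs for the noised sequence x_t
    x0:        (L,)   clean token ids (the "clean data" targets)
    is_masked: (L,)   True where x_t is [MASK]
    t:         noise level sampled for this instance, in (0, 1]
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    target_logp = logp[np.arange(len(x0)), x0]
    # only masked positions contribute; unmasked tokens carry no loss
    return -(is_masked * target_logp).sum() / (t * len(x0))
```

<p>Note how close this is to the AR-LLM objective: the only differences are which positions contribute (the masked ones rather than all next-token positions) and the \(1/t\) weighting that makes the estimate an upper bound on negative log-likelihood.</p>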

<h3 id="-4-whats-come-up-recently-in-dlm-research">📰 4. What’s come up recently in DLM research?</h3>
<p>Two interesting pieces of research surfaced recently which I think add to the conversation about DLMs.</p>

<p>◼️ The first is from <a href="https://arxiv.org/abs/2507.11097" title="The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs"> (Wen et al, 2025)</a> which studies the jailbreaking vulnerability of text and multimodal Masked-DLMs (LLaDA, Dream and MMaDA). They find that these Masked-DLMs (see Appendix C of their paper) “<em>often match or surpass those of autoregressive LLMs in resisting existing jailbreak attack methods</em>”, which is ideal. However, they also found that it is possible to use few-shot prompting of AR-LLMs (GPT-4o or a 7B-parameter Qwen model) to generate “mid-flight” unmasked sequences (i.e. \(\hat{x}_{t}, 0 &lt; t &lt; T\)) from which a Masked-DLM goes on to unmask jailbroken content (see example in the right column). Notably, they tested their method on several jailbreaking benchmarks and established similar findings of such behaviour across Masked-DLMs, including going from \(\leq\) 1% success on JailbreakBench to \(\geq\) 99% success. They also found that simply extending the length of the sequence made it possible to bring the Masked-DLM from initial refusal to a jailbreak outcome (see the next example in the right column). These findings flag the need for further efforts to study DLM jailbreaking, to better understand and find ways to mitigate such safety weaknesses in them.</p>

<p><label for="marginfigure-jailbreak" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-jailbreak" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/jailbreak_llada.png" /><br />Source: jailbreaking example – <a href="https://arxiv.org/abs/2507.11097" title="The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs"> (Wen et al, 2025)</a></span></p>

<p><label for="marginfigure-jailbreak-longer" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-jailbreak-longer" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/jailbreak_longergen.png" /><br />Source: jailbreaking example – <a href="https://arxiv.org/abs/2507.11097" title="The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs">(Wen et al, 2025)</a>. <em>This finding indicates some conditional dependence for generating refusal content that is tied to the max sequence length seen at training, i.e. this might be resolvable by taking steps to disconnect this dependence during training.</em></span></p>

<p>◼️ The other work <a href="https://arxiv.org/abs/2507.15857v1" title="Diffusion Beats Autoregressive in Data-Constrained Settings">(Prabhudesai et al, 2025)</a> studies the compute-data Pareto frontier of DLMs and comparable AR-LLMs. Specifically, they trained 100 Masked-DLMs and AR-LLMs (ranging in size from 7M to 2.5B parameters) on the English C4 corpus (at data scales between 25M and 100M tokens) for up to 800 epochs. From their study, they found that at low epoch counts AR-LLMs outperform DLMs, but as repeated passes over the data are carried out, DLMs overtake them and perform better. Their experiments allowed them to establish a potential scaling law for DLMs, and to conclude that under data-constrained settings (e.g. when Internet data peters out, or in sequence modeling for specialised domains/applications where available data is at smaller scales), a DLM architecture may be better for modeling the data distribution. The findings provide useful insights that help clarify whether (and where) moving to DLMs makes sense. That said, we still lack an understanding of the differences between DLMs &amp; AR-LLMs under real-use-case evaluations<label for="sidenote-eval" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-eval" class="margin-toggle" checked="" /><span class="sidenote">e.g. with suites such as <a href="https://github.com/EleutherAI/lm-evaluation-harness">lm-evaluation-harness</a>; in their work they only report loss curves and perplexity (NLL).</span> as well as of the impact on DLMs of methods that have been recent drivers of AR-LLM improvements (such as training on synthetic data, preference tuning and reinforcement learning).</p>

<p><label for="marginnote-resources" class="margin-toggle"> ⊕</label><input type="checkbox" id="marginnote-resources" class="margin-toggle" checked="" /><span class="marginnote">Some useful resources for diffusion modeling: Firstly, the “Diffusion Models” chapters in: 
<br />◼️ Chapter 20 of <a href="https://www.bishopbook.com/">Deep Learning: Foundations and Concepts</a>. Bishop, C.M., Bishop, H. (2024). Springer.; 
<br />◼️ Chapter 25 of <a href="https://probml.github.io/pml-book/book2.html">Probabilistic Machine Learning: Advanced Topics</a>. Murphy K. P. (2023). MIT Press.; 
<br />◼️ Chapter 18 of <a href="https://udlbook.github.io/udlbook/">Understanding Deep Learning</a>. Prince, S. J. D. (2024). MIT Press. 
<br /><em>Of the three textbooks, the Murphy one (to my mind, the bible of probabilistic generative modeling for its breadth and depth) has the most substantial coverage of discrete diffusion modeling, and the Bishop one is (in my opinion) the most accessibly written plus it also comes with helpful sections on score matching and guidance; though it helped to triangulate information between the three as much as possible.</em>
<br />Secondly, it also helps to start with the score-matching diffusion models that were developed for image generation before going on to the discrete case. To help: the diffusion and flow modules of the CS236 class taught by Stefano Ermon could be very useful for putting the parts together.
<br />◼️ <a href="https://www.youtube.com/playlist?list=PLoROMvodv4rPOWA-omMM6STXaWW4FvJT8">Stanford CS236: Deep Generative Models 2023 playlist</a>
<br />Finally, this survey paper also goes deep into the various aspects of DLMs:
<br />◼️ <a href="https://arxiv.org/abs/2506.13759?">Discrete Diffusion in Large Language and Multimodal Models: A Survey</a></span></p>

<p><label for="marginnote-notes" class="margin-toggle"> ⊕</label><input type="checkbox" id="marginnote-notes" class="margin-toggle" checked="" /><span class="marginnote"><ins><strong>Some notes:</strong></ins> It seems reasonable to consider combining Uniform-DLM and Masked-DLM; after all, the latter just adds an additional state <code class="language-plaintext highlighter-rouge">[MASK]</code> and enforces fixedness once this state is reached/exited. I wonder if it might make sense to add the <code class="language-plaintext highlighter-rouge">[MASK]</code> token as well as to permit transitions to and from it across the timesteps (i.e. Uniform-DLM+<code class="language-plaintext highlighter-rouge">[MASK]</code>), but with the constraint that the overall ratio of <code class="language-plaintext highlighter-rouge">[MASK]</code> tokens must monotonically increase over time in the forward process, which could address the “self-correction” limitation of Masked-DLM (see <a href="#-2-apple-to-apple-whats-the-benefit-of-one-over-another">above</a>). <a href="https://arxiv.org/abs/2107.03006" title="Structured Denoising Diffusion Models in Discrete State-Spaces">(Austin et al, 2021)</a> (see Appendix A.2.6 and B.2.1 as well as Figure 4 (upper) there) stated that they carried out some ablations on the text8 dataset that touched on this – e.g. by applying \(e_m\), a separate one-hot vector with 1 on <code class="language-plaintext highlighter-rouge">[MASK]</code> and 0 elsewhere – but do not seem to have included results showing how Uniform-DLM+<code class="language-plaintext highlighter-rouge">[MASK]</code> performs relative to Masked-DLM. <em>Although this change might impact the simplification of the loss objective to cross-entropy against ground-truth tokens as established by MD4, MDLM and RADD (see e.g. §3.1 of the RADD paper).</em></span></p>
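<p>As a rough illustration of what a Uniform-DLM+<code class="language-plaintext highlighter-rouge">[MASK]</code> forward kernel could look like, here is a sketch of a single-step transition matrix that combines uniform mixing with an absorbing <code class="language-plaintext highlighter-rouge">[MASK]</code> state (the alpha/beta parameterisation is hypothetical, only in the spirit of the D3PM ablations mentioned above):</p>

```python
import numpy as np

def qt_uniform_plus_mask(V, alpha, beta):
    """Hypothetical single-step transition matrix Q_t over V real tokens
    plus a final [MASK] state (index V): a real token stays put with
    probability alpha, is absorbed into [MASK] with probability beta,
    and otherwise moves uniformly to any other real token. Because
    [MASK] is absorbing, its share can only grow over forward steps.
    """
    Q = np.full((V + 1, V + 1), (1.0 - alpha - beta) / (V - 1))
    np.fill_diagonal(Q, alpha)   # stay-put probability
    Q[:, V] = beta               # every real token can fall into [MASK]
    Q[V, :] = 0.0
    Q[V, V] = 1.0                # absorbing: [MASK] stays [MASK]
    return Q
```

<p>Each row is a valid categorical distribution, and relaxing the <code class="language-plaintext highlighter-rouge">Q[V, :]</code> row to allow leaving <code class="language-plaintext highlighter-rouge">[MASK]</code> (while scheduling beta so that the expected mask ratio still increases) would be one way to encode the self-correcting variant speculated about in the note.</p>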

<h3 id="-4-whats-next">👉 5. What’s next?</h3>
<p>In this post, I have covered the three broad classes of DLMs, discussed what their differences mean, then briefly described the training procedure of Masked-DLMs, before ending on some recent DLM research and their broader implications. In the next post, I will walk through the reverse process used for generation in DLMs, situate generation with DLMs against existing efficient serving methods for AR-LLMs, and then look at a few recently proposed advanced sampling techniques (such as block-wise/semi-autoregressivity and caching). In the post after that, I will cover preference tuning and reinforcement learning of these DLMs.</p>

<p><em>Update (16 August 2025): (1) precise DLM’s 64x more FLOPs statement, (2) refine <em>Training objective</em> subsection.</em></p>]]></content><author><name>Kelvin Han</name></author><category term="discrete" /><category term="diffusion" /><summary type="html"><![CDATA[There are three variants of diffusion language models (DLMs), and the nuances of each impact their training, inference and scalability. I think it will be helpful to situate them amongst each other before we proceed further; and so in this post, I will first introduce the variants, discuss their differences as well as what they mean. I will then go on to outline the training procedure for the Masked (Masked-DLM) variant as it is currently receiving significant amounts of attention in researchAnd quite importantly, with useful extensions into multimodality and reinforcement learning already carried out with them., before ending off with a summary of two interesting pieces of DLM research ◼️ (Wen et al, 2025) The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs ◼️ (Prabhudesai et al, 2025) Diffusion Beats Autoregressive in Data-Constrained Settings that surfaced recently along with thoughts on their broader implications. If you wish, you can use these links to skip to the specific sections of this post: 1. DLM variants; 2. Comparing the variants; 3. Training a masked DLM; and 4. Recent findings and potential implications 🎨 1. What variants of DLMs are there? DLM approaches can be described as being of the (i) Gaussian, (ii) Uniform, or (iii) Masked variants, based on how the original training data instancesfor e.g. an instance could be a sentence such as “dogs are our friends” for the language modeling task. are “corrupted” (or how the notion of noise is conceptualised and how it is introduced into the input during the foward process. . . and hence (by design) the reverse process as well.). 
The first two are named for the distributions they use to draw noise from, and the third is likely named for a special token (e.g. [MASK]) added to the vocabulary to give a noised state that “masks” the original token. To give a more concrete feel of each variant, I will use the following toy example to illustrate some of them. Settings for a toy example Imagine we have a toy language with 13 lexical units in the vocabulary V = {"we", "you", "they", "our", "your", "their", "cat", "dog", "friend", "is", "are", "s", "[PAD]"}, which allows one to form sequences such as "dog s are our friend s", "our dog s are your s", "your cat is our friend" etc. The special marker [PAD] is used as a filler to ensure that all sequences are of the same length, e.g. "your cat is our friend [PAD]", so that it has the same six-unit length as the other two dog-sentences, allowing us to do batched generations. ◼️ Gaussian-DLM: This variant, also referred to as “continuous-time” in the literature, is the closest in form to the ones that are found in image diffusion modelsSuch as Google’s Imagen, OpenAI’s DALL-E and Stability AI’s Stable Diffusion, so if you are familiar with them, then the Gaussian-DLM models could be quite familiar to you. The architecture here involves: (i) embedding the tokens of a sequence so that each of them are represented with real-valued vectorsThis is similar to the first step in auto-regressive LLMs (AR-LLMs) (for Transformers as well as RNNs)., and (ii) adding to them noise that is in the form of random vectors drawn from a Gaussian distribution \(\epsilon_{t}\) ~ \(N(\mu_{t}, \sigma_{t}^2)\) at each step. Using the toy example, what this means in practice is that we will keep a look-up table for 13 random vectors (one for every unit in our toy language’s vocabulary) and each vector is of a dimension \(d\). 
So the sequence “dog s are our friend s” will be represented by six token vectors, and we will add noise to each token vector such that every token is completely gaussian at the limit of the forward process i.e. \(t \to \infty\) (or in practice, some defined terminal timestep \(T\) so that the total noise added is known; it is usually set at 1.0 in diffusion models). A few formulations have been proposed for learning these models, including score-matching which appears most commonly, as well as latent variable models and stochastic differential equations, for learning the reverse process (Dieleman et al, 2023). Examples of Gaussian-DLMs include (i) Diffusion-LM (Li et al, 2022) that appeared in 2022 and which first drew attention towards diffusion modeling for text generation, (ii) CDCD (Dieleman et al, 2023), and (iii) PLAID (Gulrajani &amp; Hashimoto, 2023). ◼️ Uniform-DLM: Instead of the real-valued embedding vectors used in Gaussian-DLMs, this approach represents each token of a sequence with a one-hot vectori.e. the vector for each token is a dirac distribution with all probability mass concentrated at the actual token’s index in the vocabulary.; of dimensions the size of the vocabulary; whereby it is 1 at the position of the token in the vocabulary, and 0 elsewhere. Here, the notion of adding noise to the data instanceNote that noise is added/removed independently between each token in the sequence – “We make the assumption that the forward noising process is applied independently across a sequence… the denoising process factorizes independently across tokens.” (Sahoo et al, 2024) is via the application of some transition matrix (\(Q_t\), determining the probabilities for whether the token stays unchanged or to which one of the other vocab units) such that the initially concentrated probability mass in the token vector gradually distributes over all of the other vocab units. 
At the limit of the forward process, the one-hot vector applied with \(Q_{t=T}\) would give a uniform distribution (i.e. each vocab unit is equally likely; there is completely no useful signal to deduce what the original token was) and also reach stationarity, i.e. additional steps cannot change from the uniform distribution. Examples of this approach include the Uniform versions of the models trained in (Lou et al, 2023) and (Schiff et al, 2025).To extend the Wheel analogy to Uniform-DLM: (i) instead of each panel on the gameboard being two-sided (being either white/blank or a character), they would be |\(V\)|-sided (i.e. as many sides as the vocabulary and without white/blank), (ii) the gameboard will start with some completely scrambled combination of characters, and (iii) at every guess, the contestant can flip multiple panels to any other character in the vocabulary. Another way to look at it could be as a slots machine (see GIF below). ⊕Source: Slots GIF – https://discrete-diffusion-guidance.github.io/ ◼️ Masked-DLM: This is also referred to as modeling discrete diffusion with an “absorbing state”, first appearing in (Austin et al, 2021). Essentially, noise is represented by the special token mentioned above for e.g., the sentence “dog s are our friend s” could be gradually masked to “dog [MASK] are our [MASK] s” and finally to “[MASK] [MASK] [MASK] [MASK] [MASK] [MASK]” , i.e. a token in the original sequence is corrupted by being “absorbed” into this special state (rather than transitioning to others). This is an important characteristic of most current Masked-DLMs, i.e. in the forward process, once a token transitions to [MASK], it stays in that state throughout the subsequent steps. 
Conversely, in the reverse process, once a [MASK] token transitions to a vocab unit \(v\) other than [MASK], it also stays as \(v\) in all subsequent steps.This variant matches the Wheel analogy in the previous post; the white panels correspond to the [MASK] token, behind each white panel is one character and once a contestant makes a correct any guess for it, it cannot be changed. Masked-DLM is the basis for the LLaDA (Nie et al, 2024) and Dream (Ye et al, 2025) models, as well as well as multimodal versions such as LLaDA-V (You et al, 2025) and MMaDA (Yang et al, 2025), It was also explored in SEDD (Lou et al, 2023), which Inception Lab’s Mercury models are reportedly based onThe Mercury technical report (Inception Labs, 2025) state that “Our methods extend (Lou et al, 2023) through careful modifications to the data and computation to scale up learning.”. Although they do not state which of the Masked-DLM or Uniform-DLM variant they studied in (Lou et al, 2023) they ended up leveraging, it seems possible that the Gaussian-DLM approach was taken given the poorer perplexity figures obtained by Uniform-DLM in their work as well as in earlier work.. Light round-up: At the limit \(T\) in their forward processes, each class of DLM can be summarised as follows: for Gaussian-DLMs each token in a sequence can transition to any other token (state) reachable by the accumulated variance of the sampled noise; for Uniform-DLMs, each token can transition to any other state with equal probability; and in the case of Masked-DLMs, each token lands on the special [MASK] token. Masked-DLM connections with BERT: Recall that BERT (Devlin et al, 2018) – an encoder-only model that could be used for tasks such as cloze-style QAe.g the most likely token after “Paris is the capital of” is “France”. and which is a precursor to the LLMs of today – was trained with a masked language modeling objective, i.e. 
random portions (~15%) of the sentences in the training data are masked, and the model has to learn to predict the masked words. This is very similar to the Masked-DLM approach, except that in Masked-DLMs this unmasking is done across multiple steps (instead of a single pass), and for the entire sequence (the last Masked-DLM inference step would most closely align with the task in the BERT MLM objective). Accordingly, some work have explored leveraging BERT-style (i.e. encoder-only) models for DLM: see DiffusionBERT (He et al, 2023) which propose further training BERT with time-embeddings with a special diffusion noise schedule to use it à la DLM. See also ‘Comparison to BERT’ in §6 of (Sahoo et al, 2024) for a discussion on this connection. 🍏 2. Apple-to-apple: what’s the benefit of one over another? There is more focus on Masked-DLM and Uniform-DLM over Gaussian-DLM currently, as the latter has been met with comparably less success (in terms of achievable perplexity).The reason for this is because the question of how to reconcile the noised real-valued vectors in Gaussian-DLM to discrete space is not trivial and required many special tricks for training and inference to work. For instance, it was necessary to use nearest neighbour search and clamping to a valid token vector at every reverse diffusion step in Diffusion-LM (Li et al, 2022) to reach comparable performance with AR-LLMs. In the most recent round of published DLM research (i.e. 
ICLR and ICML in 2025), significant attention has been focused on Masked-DLM approachesThough it should be noted that interesting work on Gaussian-DLM &amp; Uniform-DLM is also continuing, I think notably in (Sahoo et al, 2025), where they work out a proof that connects Uniform-DLM as a special case of Gaussian-DLM using the argmax operator, and opens up the possibility to leverage techniques in Gaussian-DLM for training and inferencing on Uniform-DLM., as they achieve better perplexity compared to Uniform-DLMNote that lower is better for perplexity; compare the SEDD (Uniform) and SEDD (Absorb) results in Table 1 of (Lou et al, 2025); as well as D3PM uniform vs D3PM absorbing in Figure 2 and Table 2 of (Austin et al, 2021).. Intuitively, this is not unexpected, as it would seem easier to learn an Masked-DLM (where transitions are constrained once the absorbing [MASK] state is reached in the forward process) compared to Uniform-DLM (where transitions at each timestep could be to any other tokens, i.e. a much larger space of posssible transitions). However, several shortcomings of the Masked-DLM approach have been raised (hence the continued research interests in Gaussian-DLM and Uniform-DLM approaches). Chief amongst them include: (i) the potentially over-restrictiveness of the vanilla Masked-DLM masking/unmasking procedureI use “vanilla” because there are approaches emerging that propose more advanced inference procedures such as remasking that could address this. For instance (Wang et al, 2025); and (ii) the difficulty of introducing classifier-free guidance into Masked-DLMs. ◼️ regarding shortcoming (i): “Self-correction” is a term for referring to how tokens can continue to transition to others over the reverse diffusion steps; it is connected to the “coarse-to-fine” property (in the truest sense) that DLMs are frequently touted to beneficially possess over AR-LLMs. 
While self-correction is inherently possible in Gaussian-DLMs and Uniform-DLMs given their designSidenote: they could also be steered with predictor-corrector models like in image diffusion models (Lezama et al, 2023), it is not possible with Masked-DLM (without some engineering). Theoretically, this lack of self-correction could give rise to errors in the inference steps, which would then propagateTo give a concrete example, say we start from “[MASK] [MASK] [MASK] [MASK] [MASK] [MASK]”, errors (e.g. in modeling learning) could bring us to “[MASK] [MASK] are your s [PAD]” and since unmasked tokens cannot transition any further, this leaves limited choice but to go to “our dog are your s [PAD]” resulting in a number agreement error. Sidenote: It could be interesting to carry out a study at-scale of the tokens (and the types of words they form) that get unmasked across the Masked-DLM inference steps – e.g. see Figures 10-15 of (Austin et al, 2023), and establish whether there are any significant patterns in what parts of a sentence/paragraph gets unmasked and fixed first.. That said, some recent studies proposed remasking strategies to address such issues – e.g. (Zhao et al, 2025), (Nie et al, 2024). (Wang et al, 2025) also claim to provably show the soundness of applying remasking without needing to take special considerations into the training and inference of Masked-DLMs. In practice, the LLaDA authors (Nie et al, 2025) were able to reach AR-LLM performance for their model whilst leveraging generation with remasking strategies (see §2.4 of their paper). 
◼️ regarding shortcoming (ii): Guidance originates from diffusion modeling for image and refers to conditioning information (for instance a label such as “dogs”, or a text prompt such as “friendly-looking dogs”) we can add to “guide” the reverse diffusion process towards generating an image with certain desired properties.This was initially proposed with the use of gradients from a classifier (Dhariwal &amp; Nichol, 2021) that can identify the classes of the images during training; but since training is over a diffusion process, the classifier had to be trained to be able to identify the classes across the noising process, which can be complicated to achieve. (Ho &amp; Salimans, 2022) established a more efficient and effective to train an image diffusion model for guidance without the need for a separate classifier (classifier free guidance, or CFG) which is now widely used. See this Sander Dieleman post for an overview. For a more visual explanation of CFG (and also diffusion models in general), have a look at this recent Welch Labs-3Blue1Brown explainer. Guidance is useful for DLM too, as a way to steer towards safety and better matching user intent, for e.g. to adhere to style requirements or reflect concepts such as non-toxicity, inclusivity, empathy, neutrality etc. An example of such work is DGLM (Lovelace et al, 2024) which uses a Gaussian-DLM to produce a “candidate” continuation of a prompt plus some guidance condition, whichThe input to the AR-LLM is the probability distribution of the candidate denoised/”refined” by the Gaussian-DLM from noise. is then put through a decoder-only AR-LLM to verbalise.For a quick overview: see Kilian Weinberger’s presentation on the work. However, standard classifier-free guidance (CFG) are designed with diffusion models trained with the score-matching objective (which learns a model to match the score, or the gradient of the log probability density function with respect to the data) in mind. 
Score-matching in the continuous sense is not used for Masked-DLMsThough there is concrete score matching (Meng et al, 2023) for the discrete case (modified in SEDD (Lou et al, 2023)), it is an approximation and does not address how classical guidance (as in the continuous setting for images) can be applied (without potentially scrambling the discrete semantics). Moreover, recent Masked-DLM work such as in the LLaDA family of models directly set the training objective as the cross-entropy loss on the masked tokens (Eq 3 in (Nie et al, 2025)), a further departure from the score-matching formulation for learning DLMs. hence it is not feasible to transfer existing CFG techniques to Masked-DLMs. Notably however, (Nie at al, 2024) proposed unsupervised classifier-free guidance, a training objective to allow CFG in Masked-DLMs without using paired data (e.g. prompt-continuation, question-answer); they found that unsupervised CFG fine-tuning of an Masked-DLM gives better performance compared to using standard CFG. 🏋️‍♀️ 3. What is the training procedure for a DLM? To get a general sense of the DLM training procedure, it will be useful to look at the LLaDA approach as it includes language model pretraining and instructions fine-tuning, both of which are fundamental for general-purpose LLM usage. Note that most work parameterise their DLMs using the Transformer architectureParalleling the trend in image diffusion too (Chang et al, 2022) as well as (Peebles &amp; Xie, 2023), but this is not a must.Alexander Rush has a useful explainer video of diffusion models for text in general. Here I will briefly touch on two key areas of the training procedure: (i) the noising in the forward process, and (ii) the training objective. For more details and code on LLaDA’s training, see their blogpost and codebase. 
Image source: illustrating the masking and prediction procedure for Masked-DLM pretraining and fine-tuning – (Nie et al, 2025) ◼️ Noising: The general idea is to sample some noise level for a given data instance (see graphic above for how masking is done for pretraining and for instructions fine-tuning) which will be used to determine the tokens to mask (e.g. in footnote 10 above on the sequence “dog s are our friend s”) in the forward process. The noising schedule (e.g. linear, geometric or even cosine) can impact performance; for instance, if we model it such that masking is at a slower rate as \(t \to T\) in the forward process, then this would align with an unmasking sequence in the reverse denosing process where we unmask comparatively fewer tokens at first before gradually increasing.Intuitively, this makes sense since tokens are unmasked independently of each other at each time step, and given the fixedness in Masked-DLMs unmasking, we want fewer commitments at the start to reduce the unmasking of conflicting tokens, which will go on to compound in subsequent steps. On another note, multiple epochs (together with full-attention over the sequence length at every step drive compute for DLM training up; &lt;=64x, see (Gulrajani &amp; Hashimoto, 2023)) of training over the data are needed in practice, with (almost likely) different noise sampled on the same data instance, is required for DLMs to reach the same level of performance on perplexity as AR-LLMs (which are typically trained with a single epoch over the data); while less efficient to train, it is this procedure that imbues the DLM to be able to generate non-autoregressively. ◼️ Training objective: In the earliest diffusion models, the training objective involves having to compute the evidence lower bound (ELBO)Unlike AR-LLMs which factorise the likelihood of a sequence into conditional probabilities (i.e. 
probability over the vocabulary at every step), which makes the likelihood easy to evaluate, the likelihood of a whole sequence (which is what we model in DLMs) is intractable, hence the use of the ELBO. Subsequently, a simplification to the original ELBO was proposed with the score-matching approach: (Ho et al, 2020) showed that the original ELBO in the diffusion objective can be further simplified by modeling the probability distributions as scores (the gradient of the log probability with respect to the data), and the objective can be reduced to minimising the mean-squared error between the entries of the predicted score vector and the true score, which is substantially easier to do. Since the noise level varies across time, it was also captured in a time variable (e.g. as an additional embedding in CDCD). Recently, it was shown (quite concurrently, based on version dates on arXiv) in MD4 (Shi et al, 2024), MDLM (Sahoo et al, 2024) and RADD (Ou et al, 2024) that, for Masked-DLMs, a tighter bound can be obtained by modeling the predicted logits at each timestep against the ground-truth (“clean data”) labels using cross-entropy, and that the need to explicitly capture the time variable can be dropped, which greatly simplifies the training objective (making it very similar to the standard AR-LLM training objective). Notably, the use of this cross-entropy objective is validated empirically by the strong evaluations of the LLaDA model, whose training was done with it. 📰 4. What’s come up recently in DLM research? Two interesting pieces of research surfaced last week which I think add to the conversation about DLMs. ◼️ The first is from (Wen et al, 2025), which studies the jailbreaking vulnerability of text and multimodal Masked-DLMs (LLaDA, Dream and MMaDA). They find that these Masked-DLMs (see Appendix C of their paper) “often match or surpass those of autoregressive LLMs in resisting existing jailbreak attack methods”, which is ideal. 
However, they also found that it is possible to use few-shot prompting of AR-LLMs (GPT-4o or a 7B-parameter Qwen model) to generate “mid-flight” unmasked sequences (i.e. \(\hat{x}_{t}, 0 &lt; t &lt; T\)) from which a Masked-DLM goes on to unmask jailbroken content (see example on right column). Notably, they tested their method on several jailbreaking benchmarks and established similar findings across Masked-DLMs, including going from \(\leq\) 1% success on JailbreakBench to \(\geq\) 99% success. They also found that simply extending the length of the sequence made it possible to bring the Masked-DLM from initial refusal to a jailbreak outcome (see next example on right column). These findings flag the need for further efforts to study DLM jailbreaking, to better understand and find ways to mitigate such safety weaknesses. ⊕Source: jailbreaking example – (Wen et al, 2025) ⊕Source: jailbreaking example – (Wen et al, 2025). This finding indicates some conditional dependence for generating refusal content which is tied to the max sequence length seen at training, i.e. it might be resolvable by taking steps to disconnect this dependence during training. ◼️ The other work (Prabhudesai et al, 2025) studies the compute-data Pareto frontier of DLMs and comparable AR-LLMs. Specifically, they trained Masked-DLMs and AR-LLMs (ranging in size from 7M to 2.5B parameters) on the English C4 corpus (at data scales of between 25 and 100M tokens) for up to 800 epochs. From their study, they found that initially, at low epoch counts, AR-LLMs outperform DLMs; but as repeated passes over the data are carried out, DLMs overtake and perform better. Their experiments allowed them to establish a potential scaling law for DLMs, and also to conclude that under data-constrained settings (e.g. 
when Internet data peters out, or in sequence modeling for specialised domains/applications where available data could be at smaller scales), a DLM architecture may be better for modeling the data distribution. The findings provide useful insights that help clarify whether (and where) moving to DLMs makes sense. That said, we still lack an understanding of the DLM &amp; AR-LLM differences under real-use-case evaluations (e.g. with suites such as lm-evaluation-harness; in their work they only report loss curves and perplexity (NLL)), as well as of the impact on DLMs of methods that have been recent drivers of AR-LLM improvements (such as training on synthetic data, preference tuning and reinforcement learning). ⊕Some useful resources for diffusion modeling: Firstly, the “Diffusion Models” chapters in: ◼️ Chapter 20 of Deep Learning: Foundations and Concepts. Bishop, C.M., Bishop, H. (2024). Springer.; ◼️ Chapter 25 of Probabilistic Machine Learning: Advanced Topics. Murphy K. P. (2023). MIT Press.; ◼️ Chapter 18 of Understanding Deep Learning. Prince, S. J. D. (2024). MIT Press. Of the three textbooks, the Murphy one (to my mind, the bible of probabilistic generative modeling for its breadth and depth) has the most substantial coverage of discrete diffusion modeling, and the Bishop one is (in my opinion) the most accessibly written, plus it also comes with helpful sections on score matching and guidance; though it helps to triangulate information between the three as much as possible. Secondly, it also helps to start with the score-matching diffusion models that were developed for image generation and then move on to the discrete case. To help: the diffusion and flow modules of the CS236 class taught by Stefano Ermon can be very useful for putting the parts together. 
◼️ Stanford CS236: Deep Generative Models 2023 playlist Finally, this survey paper also goes deep into the various aspects of DLMs: ◼️ Discrete Diffusion in Large Language and Multimodal Models: A Survey ⊕Some notes: It seems reasonable to consider combining Uniform-DLM and Masked-DLM; after all, the latter just adds an additional state [MASK] and enforces fixedness once this state is reached/exited. I wonder if it might make sense to add the [MASK] token and also permit transitions to and from it across the timesteps (i.e. Uniform-DLM+[MASK]), but with the constraint that the overall ratio of [MASK] tokens must monotonically increase over time in the forward process, which could address the “self-correction” limitation of Masked-DLM (see above). (Austin et al, 2021) (see Appendix A.2.6 and B.2.1 as well as Figure 4 (upper) there) stated that they carried out some ablations on the text8 dataset that touched on this – e.g. by applying \(e_m\), a separate one-hot vector with 1 on [MASK] and 0 elsewhere – but do not seem to have included results showing how Uniform-DLM+[MASK] performs relative to Masked-DLM. This change might, however, impact the simplification of the loss objective to cross-entropy against ground-truth tokens as established by MD4, MDLM and RADD (see e.g. §3.1 of the RADD paper). 👉 5. What’s next? In this post, I have covered the three broad classes of DLMs, discussed what their differences mean, then briefly described the training procedure of Masked-DLMs, before ending on some recent DLM research and its broader implications. In the next post, I will walk through the reverse process used for generation in DLMs, situate generation with DLMs against existing efficient serving methods for AR-LLMs, and then look at a few recently proposed advanced sampling techniques (such as block-wise/semi-autoregressivity and caching). In the post after that, I will cover preference tuning and reinforcement learning of these DLMs. 
Update (16 August 2025): (1) precise DLM’s 64x more FLOPs statement, (2) refine Training objective subsection.]]></summary></entry><entry><title type="html">Diffusion Language Models – Part One (Introduction)</title><link href="/articles/25/Diffusion_LM_P1" rel="alternate" type="text/html" title="Diffusion Language Models – Part One (Introduction)" /><published>2025-07-20T08:00:00+08:00</published><updated>2025-07-20T08:00:00+08:00</updated><id>/articles/25/Diffusion_LM_P1</id><content type="html" xml:base="/articles/25/Diffusion_LM_P1"><![CDATA[<p><span class="newthought">Language modeling with diffusion architectures is gaining traction and there are several promising indicators for further adoption.</span> Since I have been on a deep dive into diffusion modeling<label for="sidenote-appreciation" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-appreciation" class="margin-toggle" checked="" /><span class="sidenote"><em>Impetus: the spark to look into diffusion language models came to mind during a presentation on multimodal AI at Lorong AI <a href="https://lorong.ai/">https://lorong.ai/</a> (they do a very nicely curated set of expert talks across wide swathes of AI and AI-adjacent topics). Whilst sitting in one on how vision-language models (<strong>VLMs</strong>) may be relevant to actions planning in robotics, and how latency is crucial for such uses, I started to wonder about non-autoregressive approaches, especially diffusion (which is known for better consistency and potential for speed). 
It also helped that it was Google’s I/O week and amidst the coverage of their various launches were mentions of a diffusion demo (see below).</em></span> – including some experiments for reinforcement learning (<strong>RL</strong>) post-training of a diffusion language model (<strong>DLM</strong>), I thought I would share what I have come across in a series of posts, which I am planning as (at least) a four-parter, to be gradually released over the next few weeks.<label for="sidenote-weekly" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-weekly" class="margin-toggle" checked="" /><span class="sidenote">The plan is for updates each week. There are already well-written posts on DLMs: one by <a href="https://spacehunterinf.github.io/blog/2025/diffusion-language-models/">Xiaochen Zhu</a> (Apr 2025) and one by <a href="https://sander.ai/2023/01/09/diffusion-language.html">Sander Dieleman</a> (2023), so I will not reinvent the wheel and will focus mostly on outlining and explaining DLM developments since then or not yet covered there.</span> My focus will be on diffusion for text<label for="sidenote-MDLM" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-MDLM" class="margin-toggle" checked="" /><span class="sidenote">Especially masked diffusion language models (<strong>MDLM</strong>), a variant whereby a special [MASK] token is used (also termed an <em>absorbing state</em>; in the forward process, MDLMs noise the original sequences by transitioning tokens to this state – i.e. [MASK]). Training and sampling have been found to be easier when modeling the diffusion process for discrete sequences in this way.</span>, which is the modality I am most familiar with; although these models can be applied to discrete sequences in general as well as to, or together with, other modalities (e.g. vision – which diffusion models were originally developed on, as well as audio). 
In this first post I will introduce DLMs briefly and outline why I see them as promising.</p>

<h3 id="-1-what-are-dlms-how-are-they-different-from-current-llms">📝 1. What are DLMs? How are they different from current LLMs?</h3>

<p><span class="newthought">Diffusion architectures are designed around a training and inference procedure that involves a forward as well as a reverse process.</span> In the forward process, random noise is added to some original input (e.g. a real image or a human-written text), and this is repeated until the input is entirely noised (i.e. the structure that was in the input is completely lost<label for="sidenote-noise" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-noise" class="margin-toggle" checked="" /><span class="sidenote">In the form of white noise for images; gibberish (or complete whitespace in the MDLM case) for text.</span>). What we want is a model that learns to reverse this noising process; and if the noise in the forward process has been carefully added (minute amounts at each step; following an increasing schedule; and from a distribution easy to sample from), then learning such a model is made relatively easy. Subsequently, it becomes possible to start from complete noise as input, denoise it over some number of steps using the learned model, and arrive at a state where meaningful structure is restored (<em>et voilà</em>, we would have obtained a realistic sample).</p>
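The reverse process for a masked DLM can be sketched in a few lines of toy Python. Everything here is invented for illustration (the `denoise` helper, the confidence-based unmasking rule, and a stand-in “model” that simply knows the clean sequence; a real DLM predicts a distribution over tokens at every masked slot):

```python
import random

MASK = "[MASK]"

def denoise(length, model, steps):
    # Reverse-process sketch: start from the fully-noised (all-[MASK]) state
    # and, at each step, commit the model's most confident predictions for a
    # fraction of the still-masked positions.
    seq = [MASK] * length
    for step in range(steps, 0, -1):
        preds = model(seq)  # {position: (token, confidence)} for masked slots
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        k = max(1, len(masked) // step)          # unmask ~1/step of the rest
        for i in sorted(masked, key=lambda i: -preds[i][1])[:k]:
            seq[i] = preds[i][0]
    return seq

# Toy stand-in for a trained model: it "knows" the clean sequence and returns
# it with random confidences.
target = "dog s are our friend s".split()
rng = random.Random(0)

def toy_model(seq):
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

out = denoise(len(target), toy_model, steps=3)
```

The schedule here (unmask roughly 1/step of the remaining masked slots) is one simple choice; practical samplers vary both how many tokens to commit per step and how the “most confident” slots are scored.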

<p><label for="marginfigure-wof" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-wof" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/wheeloffortune.png" /><br /><em>Generated with ChatGPT.</em></span></p>

<p><span class="newthought">To give a more intuitive sense of how DLMs work</span>, I will draw on Wheel of Fortune (<strong>Wheel</strong>)<label for="sidenote-genz" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-genz" class="margin-toggle" checked="" /><span class="sidenote"><em>Sidetrack: For an introduction, see this <a href="https://en.wikipedia.org/wiki/Wheel_of_Fortune_(American_game_show)">Wikipedia</a> entry.</em></span> for an analogy. <ins>To set the scene</ins>: <em>imagine it is a weekday evening, a game with Pat Sajak (or Ryan Seacrest if you prefer) and Vanna White is running, a sole contestant remains on the show. The round starts and the board shuffles to reveal a sequence of white panels, and Pat/Ryan gives the category: “Living Things”.</em> 
So far this corresponds to the forward noising process described above.</p>

<p>In this game you are Pat/Ryan (who gave the category), and the contestant is your favourite LLM (ChatGPT, Claude, Deepseek, Le Chat etc) or a DLM. The task is for the contestant to correctly guess the characters behind each of these white panels on the board, based on the category.<label for="sidenote-rules" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-rules" class="margin-toggle" checked="" /><span class="sidenote"><em>Sidetrack: In fact we can drop the wheel; there is no revealing of R-S-T-L-N-E on the board either; and guesses are for tokens not characters. Think also of the category along the lines of the context/prompt we typically give to LLMs and the white panels on the Wheel board as the LLM/DLM’s response to your context/prompt.</em></span> The LLMs (e.g. ChatGPT) that we are familiar with are modeled in an auto-regressive manner (<strong>AR-LLMs</strong> from hereon), i.e. no matter what, they go about solving the task by making a sequence of guesses that go from left to right, one character/token at a time.<label for="sidenote-unlikely" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-unlikely" class="margin-toggle" checked="" /><span class="sidenote">If you think about it: such a strictly left-to-right strategy is unlikely to be adopted by a human player.</span> A DLM, on the other hand, solves the task by making a sequence of guesses (the reverse process above) – each guess can be for anywhere across the sequence and can also be for multiple tokens at a time. At each step of the guessing, a DLM makes its next guess based on what it has already unmasked in the sequence.</p>

<p>Although this is a major simplification – I have glossed over many details and important nuances of DLMs for now – it should hopefully have given you a rather concrete sense of how AR-LLMs and DLMs generate text conditionally. As you can see, the two generate text in ways that are quite different.</p>
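The two contestants’ strategies can be caricatured in code. This is a toy sketch only: the `ar_contestant`/`dlm_contestant` functions and the answer-revealing toy models are invented for the Wheel analogy, not real decoders.

```python
MASK = "□"  # a white panel on the board

def ar_contestant(model, length):
    # AR-LLM: strictly left-to-right, one guess per turn, each guess
    # conditioned only on the prefix guessed so far.
    seq = []
    for _ in range(length):
        seq.append(model(seq))
    return seq

def dlm_contestant(model, length, steps):
    # DLM: guesses can land anywhere on the board, several per turn, each
    # turn conditioned on everything unmasked so far.
    seq = [MASK] * length
    for _ in range(steps):
        for pos, tok in model(seq).items():  # {position: token} guesses
            seq[pos] = tok
    return seq

answer = list("GIRAFFE")  # category: "Living Things"

def toy_wheel_model(seq):
    # Toy model that knows the answer: guess up to 3 remaining panels a turn.
    masked = [i for i, c in enumerate(seq) if c == MASK]
    return {i: answer[i] for i in masked[:3]}

ar_out = ar_contestant(lambda prefix: answer[len(prefix)], len(answer))
dlm_out = dlm_contestant(toy_wheel_model, len(answer), steps=3)
```

Both loops end at the same board, but the AR loop needs one turn per panel while the DLM loop fills several panels per turn — the structural difference the analogy is pointing at.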

<p><label for="marginnote-caveats" class="margin-toggle"> ⊕</label><input type="checkbox" id="marginnote-caveats" class="margin-toggle" checked="" /><span class="marginnote"><em>Some other caveats for the Wheel analogy: it is not perfect for AR-LLMs, because they only know to keep guessing characters/tokens until some special stop state has been reached (i.e. without knowledge of how many characters/tokens left to guess; the white panels). It is also not a perfect analogy for DLMs because in practice (and also a source for their appeal) they predict more than one character/token at a time.</em></span></p>

<h3 id="-2-what-is-the-appeal-of-dlms">🤩 2. What is the appeal of DLMs?</h3>

<p>A major appeal of DLMs is their <strong>non-autoregressive (NAR)</strong> manner of generation<label for="sidenote-autoreg" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-autoreg" class="margin-toggle" checked="" /><span class="sidenote">Technically, we can make a DLM autoregressive – we can enforce left-to-right denoising during inference; and we can modify the DLM training to use a noising schedule that runs right-to-left.</span>. For text, this reflects more closely how long-form writing (typically requiring planning) takes place. In contrast, the auto-regressive (left-to-right) manner of generation places a very strong constraint, and there have long been grouses about it, as well as claims that it is unlikely to bring us to stronger machine intelligence<label for="sidenote-errorgen" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-errorgen" class="margin-toggle" checked="" /><span class="sidenote">Yann LeCun then-controversially stated at a few venues in 2023 (see below a slide he reposted on X) that AR-LLMs, in having to generate tokens one-by-one, encounter errors that compound. If the generation path veers into a region away from the right answer, there is no (straightforward) way for the AR-LLM to get to the right answer. It is worth noting, however, that methods to induce test-time scaling (with post-training such as iterative supervised fine-tuning or RL for reasoning) mitigate at least a part of this issue, without having to switch to NAR; (look under <em>‘Why Yann Lecun was wrong (kind of)’</em> in this <a href="https://blog.jxmo.io/p/we-should-stop-talking-about-agi">post</a> by Jack Morris).</span>.</p>

<p><label for="marginfigure-yann" class="margin-toggle">⊕</label><input type="checkbox" id="marginfigure-yann" class="margin-toggle" checked="" /><span class="marginnote"><img class="fullwidth" src="/assets/img/yannlecun_autoregressive.png" /><br />Source: Slide – Yann LeCun’s <a href="https://x.com/ylecun/status/1640122342570336267">X</a> post in 2023</span></p>

<p>From this <em>non-autoregressivity</em> spring other benefits. Since DLMs are not limited to generating one token at a time, (i) there is potential for <strong>significant speed-up</strong><label for="sidenote-dlmspeed" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-dlmspeed" class="margin-toggle" checked="" /><span class="sidenote">I am using ‘potential’ here because efficient serving methods for AR-LLMs such as KV caching and speculative decoding do not transfer directly to DLMs. There is a trade-off currently – even if a DLM takes fewer steps than an AR-LLM for inference, each of the DLM’s steps requires computing over the full target sequence length. It is not yet clear if similar efficient serving methods are available and can work well for DLMs. Furthermore, (at least for the current generation of DLMs from academia) running more denoising steps, up to the max sequence length, is necessary for DLMs’ peak performance (see Figure 5 of Appendix B6 of the LLaDA <a href="https://arxiv.org/pdf/2502.09992">paper</a>).</span>; and (ii) they can achieve better conditional <strong>control and consistency</strong><label for="sidenote-dlmcontrol" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-dlmcontrol" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://arxiv.org/abs/2205.14217">Diffusion-LM Improves Controllable Text Generation</a> (Li et al, 2022)</span>. Furthermore, generating in an NAR and denoising manner also enables alternative decoding strategies such as <strong>infilling</strong>. 
In the GIF below, some prompt is shown at the start for the DLM to complete, followed by a constraint that must be met at the end; the task is to generate some sequence of text between them (hence the term ‘infilling’<label for="sidenote-infilling" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-infilling" class="margin-toggle" checked="" /><span class="sidenote">See also <a href="https://stable-diffusion-art.com/inpainting_basics/">inpainting</a> for vision models.</span>), which is not easily possible with an AR-LLM (without restarting inference at the point where the edits end). Infilling is useful in code generation – for instance, a program could be generated, an engineer could make targeted modifications (say to a function) anywhere within it, and then generation could continue based on this modified state. I find this infilling capability of DLMs particularly attractive; its usefulness is not restricted to code generation, and having it allows for application design choices that facilitate mutually beneficial human-machine collaboration; if done right, these infilled edits are useful for improving model outputs (towards general model capabilities and for meeting localised/personalised preferences).</p>
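Mechanically, infilling falls out of the same denoising loop: only the gap starts masked, while the prompt and the end constraint stay fixed. A minimal, hypothetical sketch (the `infill` helper, the greedy commit rule and the gap-aware toy model are all invented for illustration):

```python
import random

MASK = "[MASK]"

def infill(prefix, suffix, gap_len, model):
    # Infilling sketch: only the gap starts masked; the prompt (prefix) and
    # the end constraint (suffix) stay fixed while denoising fills the middle.
    seq = list(prefix) + [MASK] * gap_len + list(suffix)
    while MASK in seq:
        preds = model(seq)  # {position: (token, confidence)} for masked slots
        pos, (tok, _) = max(preds.items(), key=lambda kv: kv[1][1])
        seq[pos] = tok      # greedily commit the most confident token
    return seq

# Toy stand-in model that "knows" a plausible completion for the gap.
prefix, suffix = ["def", "add", "("], [")", ":"]
gap = ["x", ",", "y"]
rng = random.Random(0)

def toy_model(seq):
    return {i: (gap[i - len(prefix)], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

out = infill(prefix, suffix, len(gap), toy_model)
```

Note that nothing in the loop distinguishes “generation” from “infilling” — the fixed prefix and suffix simply condition every denoising step, which is exactly why the edit-anywhere workflow described above comes for free.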

<figure><figcaption><span>Source: Diffusion generation GIF – HKUNLP’s blog <a href="https://hkunlp.github.io/blog/2025/dream/">post</a> on Dream. <em>Uber-sidetrack: see also start of the end credits in Claire Denis’ <a href="https://youtu.be/grGiq0yTaj4?t=214">Beau Travail</a></em></span></figcaption><img src="/assets/img/fig_infill_1.gif" /></figure>

<!-- self-correction  -->
<!-- There are however still remaining issues that have not been resolved. Fixed token counts at the start. It is not yet clear if techniques with similar compute-saving effects to KV caching can be applied -->

<h3 id="-3-what-are-the-signals-for-dlms-potential">🚦 3. What are the signals for DLMs’ potential?</h3>

<p><span class="newthought">Firstly, two commercial-grade models are already available.</span> They provide strong support for the potential of DLMs; I have tried demos of both, and they are very compelling for general-purpose LLM usage.</p>

<p>◼️ The first to arrive was Mercury. In fact, the Mercury models are already products accessible via API<label for="sidenote-mercuryapi" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-mercuryapi" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://platform.inceptionlabs.ai/docs#models">https://platform.inceptionlabs.ai/docs#models</a></span>. They are from Inception Labs, a start-up whose co-founders are Stefano Ermon (Stanford) and his former students Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell)<label for="sidenote-mercury-founders" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-mercury-founders" class="margin-toggle" checked="" /><span class="sidenote">Ermon - <a href="https://x.com/StefanoErmon">X</a>/<a href="https://cs.stanford.edu/~ermon">website</a>; Grover - <a href="https://x.com/adityagrover_">X</a>/<a href="https://aditya-grover.github.io/">website</a>; Kuleshov - <a href="https://x.com/volokuleshov">X</a>/<a href="https://www.cs.cornell.edu/~kuleshov/">website</a>.</span>. Notably, the founders are behind the research for many of the diffusion modeling innovations (for both image and discrete sequences) in recent years. Their code-focused DLM (Mercury Coder Mini/Small), which was released in February 2025<label for="sidenote-tcmercury" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-tcmercury" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://techcrunch.com/2025/02/26/inception-emerges-from-stealth-with-a-new-type-of-ai-model/">TechCrunch article on Inception Labs</a></span>, has been independently benchmarked as being able to generate at &gt;1,000 tokens per second – up to 10x faster than heavily optimised closed LLMs like GPT-4o Mini and Claude 3.5 Haiku (see chart below). 
Their ‘Mini’ code model is also currently (July 2025) joint first on Copilot Arena<label for="sidenote-copilot" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-copilot" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://lmarena.ai/leaderboard/copilot">https://lmarena.ai/leaderboard/copilot</a></span>. In June 2025, they also released a general chat (like ChatGPT and Claude) model<label for="sidenote-mercury-chat" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-mercury-chat" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-model">Inception Labs release</a></span>.</p>

<!-- It was Ermon who was one of the first, if not the first, to propose the iterative denoising for generative modeling (for images); see [Generative Modeling by Estimating Gradients of the Data Distribution](https://arxiv.org/abs/1907.05600) NOTE: not true, most of the citations for diffusion models are Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models -->

<figure><figcaption><span>Chart from the Inception Labs <a href="https://www.inceptionlabs.ai/introducing-mercury">website</a></span></figcaption><img src="/assets/img/mercury_artificialanalysis.png" /></figure>

<p><strong>Tip</strong>: I recommend checking out Yupp.ai<label for="sidenote-yupp" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-yupp" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://yupp.ai/">https://yupp.ai</a>; <em>They have a very unique and practical value proposition</em> – see this <a href="https://www.wired.com/story/yupp-chatbot-pays-users-ai-model-feedback/">Wired</a> profile of Yupp when they came out of stealth in June 2025. See the research thinking (including on global-scale localisable evaluations) behind Yupp <a href="https://x.com/lintool/status/1943013316428734763">here</a> as well as on their <a href="https://blog.yupp.ai/leaderboard">blog</a>. </span>, which you can use to easily test the Mercury models<label for="sidenote-mercury" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-mercury" class="margin-toggle" checked="" /><span class="sidenote">Find the <em>Select a model</em> button and search for and add <em>Inception Mercury</em> (the general chat model).</span> (or any other model) side-by-side against more than 600 other (open as well as closed-source) LLMs, including the latest and previous model versions from OpenAI, Anthropic etc.</p>

<p>◼️ The second is from Google Deepmind<label for="sidenote-google" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-google" class="margin-toggle" checked="" /><span class="sidenote">This demo is less accessible; you will have to sign up for it via a <a href="https://deepmind.google/models/gemini-diffusion/">waitlist</a>.</span> which landed in May 2025. They claim that their DLM generates up to ~1,500 tokens per second and achieves results nearly matching or even outperforming Gemini 2.0 Flash-Lite<label for="sidenote-geminiflash" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-geminiflash" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://deepmind.google/models/gemini/flash-lite/">Gemini 2.0 Flash-Lite</a> is the smallest of the Gemini models (i.e. first come Pro, Flash and then Flash-Lite).</span> on five of the six coding benchmarks evaluated<label for="sidenote-codebenchmark" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-codebenchmark" class="margin-toggle" checked="" /><span class="sidenote">Except for SWE-Bench Verified, where the diffusion model obtained 22.9% vs the 28.5% obtained by Gemini 2.0 Flash-Lite.</span>, as well as stronger math performance (AIME 2025).</p>

<p>~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~</p>

<p><span class="newthought">Secondly, open-source/weights DLMs</span> have been trained and released by academia. This signals that training such models is accessible at the compute levels available there. These models are performing favourably against comparable AR-LLMs on evaluations over a wide range of real use cases<label for="sidenote-lmharness" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-lmharness" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://github.com/EleutherAI/lm-evaluation-harness">lm-evaluation-harness</a></span>, and not just on measurements of perplexity alone. Such models include the LLaDA family<label for="sidenote-llada" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-llada" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://github.com/ML-GSAI/LLaDA">https://github.com/ML-GSAI/LLaDA</a>; from Renmin University of China, BDAI and Ant Group</span>, which underwent pretraining and supervised fine-tuning at scales similar to recent AR-LLMs like Llama 3 (i.e. pretrained on Internet data at trillion-token scale – for denoising instead of next-token prediction as in AR-LLMs – as well as supervised fine-tuned on millions of instruction-answer pairs for instruction following). Notably, parallel efforts that use AR-LLMs’ weights to initialise models for DLM training<label for="sidenote-arconversion" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-arconversion" class="margin-toggle" checked="" /><span class="sidenote">See <a href="https://arxiv.org/abs/2410.17891">DiffuLlama</a> and <a href="https://hkunlp.github.io/blog/2025/dream">Dream</a>.</span> are also showing promising performance – these bypass the need to train DLMs from scratch, and are a way to further leverage all the training that has already been expended on existing AR-LLMs.</p>

<p>~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~</p>

<p><span class="newthought">Thirdly, solutions are also emerging for preference tuning and RL post-training of DLMs.</span> The former has been important for enhancing safety and instruction-following of AR-LLMs, and the latter is increasingly important for unlocking their reasoning capabilities. Work done in this space includes: (i) DiffuCoder<label for="sidenote-diffucoder" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-diffucoder" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://github.com/apple/ml-diffucoder">https://github.com/apple/ml-diffucoder</a>; from the University of Hong Kong and Apple.</span>; (ii) diffu-GRPO<label for="sidenote-diffgrpo" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-diffgrpo" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://dllm-reasoning.github.io">https://dllm-reasoning.github.io</a>; from UCLA and Meta AI.</span>; and (iii) Variance-Reduced Preference Optimization (<strong>VRPO</strong>) in LLaDA 1.5<label for="sidenote-llada15" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-llada15" class="margin-toggle" checked="" /><span class="sidenote"><a href="https://ml-gsai.github.io/LLaDA-1.5-Demo">https://ml-gsai.github.io/LLaDA-1.5-Demo</a>; from Renmin University of China, Tsinghua and Ant Group.</span>. Given the recent developments in these spaces for AR-LLMs, these are likely to be interesting areas to explore with DLMs.</p>

<!-- ~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ -->

<!-- <span class='newthought'>Fourthly, applications to robotics</span> -- in specialised vision-language-action (VLA) models -- have been explored; and multi-modal vision-language DLMs, i.e. more general purpose capabilities, are also appearing.Discuss MaskGIT, mention LLaDA-V -->

<h3 id="-4-whats-next">👉 4. What’s next?</h3>
<p>In summary, I have briefly introduced diffusion language models (<strong>DLMs</strong>), an emerging alternative to current auto-regressive LLMs. I also discussed why DLMs are appealing and highlighted some indicators that I believe show their promise. In my next post, I will go into how DLMs are trained/converted from AR-LLMs<label for="sidenote-dream" class="margin-toggle sidenote-number"></label><input type="checkbox" id="sidenote-dream" class="margin-toggle" checked="" /><span class="sidenote">As in the case of models like <a href="https://hkunlp.github.io/blog/2025/dream/">Dream</a>.</span>, going a bit deeper into the technicals. Following that, I will examine ways of sampling from these models, with a discussion of possible implications for efficient serving.</p>]]></content><author><name>Kelvin Han</name></author><category term="discrete" /><category term="diffusion" /><summary type="html"><![CDATA[Language modeling with diffusion architectures is gaining traction and there are several promising indicators for further adoption. Since I have been on a deep dive into diffusion modeling (Impetus: the spark to look into diffusion language models came to mind during a presentation on multimodal AI at Lorong AI, https://lorong.ai/ – they do a very nicely curated set of expert talks across wide swathes of AI and AI-adjacent topics. Whilst sitting in one on how vision-language models (VLMs) may be relevant to action planning in robotics, and how latency is crucial for such uses, I started to wonder about non-autoregressive approaches, especially diffusion, which is known for better consistency and potential for speed. It also helped that it was Google’s I/O week and amidst the coverage of their various launches were mentions of a diffusion demo – see below.)
– including some experiments on reinforcement learning (RL) post-training of a diffusion language model (DLM), I thought to share what I have come across in a series of posts, which I am planning as (at least) a four-parter, to be gradually released over the next few weeks. (The plan is for updates each week.) There are already well-written posts on DLMs: one by Xiaochen Zhu (Apr 2025) and one by Sander Dieleman (2023), so I will not reinvent the wheel and will focus mostly on outlining and explaining DLM developments since then or not yet covered there. My focus will be on diffusion for text (especially masked diffusion language models (MDLMs), a variant whereby a special [MASK] token is used, also termed an absorbing state: in the forward process, MDLMs noise the original sequences by transitioning tokens to this state, i.e. [MASK]. Training and sampling have been found to be easier when modeling the diffusion process for discrete sequences in this way.), which is the modality I am most familiar with; although these models can be applied to discrete sequences in general, as well as to, or together with, other modalities (e.g. vision – which diffusion models were originally developed on – as well as audio). In this first post I will introduce DLMs briefly and outline why I see them as promising. 📝 1. What are DLMs? How are they different from current LLMs? Diffusion architectures are designed around a training and inference procedure that involves a forward as well as a reverse process. In the forward process, random noise is added to some original input (e.g. a real image or human-written text), and this is done repeatedly until the input is entirely noised (i.e. the structure that was in the input is completely lost – white noise for images; gibberish, or complete whitespace in the MDLM case, for text).
What we want is a model that can reverse this noising process; and if the noise in the forward process has been carefully added (minute amounts at each step; following an increasing schedule; and from a distribution that is easy to sample from), then the model’s learning task is made relatively easy. Subsequently, it will be possible to start from complete noise as input, denoise it over some number of steps using the learned model, and arrive at a state where meaningful structure is restored (et voilà, we would have obtained a realistic sample). ⊕ Figure: generated with ChatGPT. To give a more intuitive sense of how DLMs work, I will draw on Wheel of Fortune (Wheel) for an analogy (Sidetrack: for an introduction, see this Wikipedia entry). To set the scene: imagine it is a weekday evening, a game with Pat Sajak (or Ryan Seacrest if you prefer) and Vanna White is running, and a sole contestant remains on the show. The round starts, the board shuffles to reveal a sequence of white panels, and Pat/Ryan gives the category: “Living Things”. So far this corresponds to the forward noising process described above. In this game you are Pat/Ryan (who gave the category), and the contestant is your favourite LLM (ChatGPT, Claude, Deepseek, Le Chat etc.) or a DLM. The task is for the contestant to correctly guess the characters behind each of these white panels on the board, based on the category. (Sidetrack: in fact we can drop the wheel; there is no revealing of R-S-T-L-N-E on the board either; and guesses are for tokens, not characters.) Think of the category along the lines of the context/prompt we typically give to LLMs, and the white panels on the Wheel board as the LLM/DLM’s response to your context/prompt. The LLMs (e.g. ChatGPT) that we are familiar with are modeled in an auto-regressive manner (AR-LLMs from hereon), i.e.
no matter what, they go about solving the task by making a sequence of guesses that go from left to right, one character/token at a time. (If you think about it, such a strictly left-to-right strategy is unlikely to be adopted by a human player.) A DLM, on the other hand, solves the task by making a sequence of guesses (the reverse process above) – each guess can be for anywhere across the sequence and can also be for multiple tokens at a time. At each step of the guessing, a DLM makes its next guess based on what it has already unmasked in the sequence. Although this is a major simplification – for now, I have glossed over many details and important nuances of DLMs – it should hopefully have given you a rather concrete sense of how AR-LLMs/DLMs generate text conditionally. As you can see, the AR-LLM and DLM generate text in quite different ways. ⊕ Some other caveats for the Wheel analogy: it is not perfect for AR-LLMs, because they only know to keep guessing characters/tokens until some special stop state has been reached (i.e. without knowledge of how many characters/tokens – the white panels – are left to guess). It is also not a perfect analogy for DLMs because in practice (and this is also a source of their appeal) they predict more than one character/token at a time. 🤩 2. What is the appeal of DLMs? A major appeal of DLMs is their non-autoregressive (NAR) manner of generation (technically, we can make a DLM autoregressive – we can enforce left-to-right denoising during inference, and we can modify the DLM training to use a noising schedule that runs right-to-left). For text, this reflects more closely how long-form writing (typically requiring planning) takes place.
In contrast, the auto-regressive (left-to-right) manner of generation places a very strong constraint, and there have long been grouses about it, as well as claims that it is unlikely to bring us to stronger machine intelligence. (Yann LeCun then-controversially stated at a few venues in 2023 – see below a slide he reposted on X – that AR-LLMs, in having to generate tokens one by one, encounter errors that compound. If the generation path veers into a region away from the right answer, there is no (straightforward) way for the AR-LLM to get to the right answer. It is worth noting, however, that methods to induce test-time scaling (with post-training such as iterative supervised fine-tuning or RL for reasoning) mitigate at least a part of this issue, without having to switch to NAR; look under `Why Yann Lecun was wrong (kind of)' in this post by Jack Morris.) ⊕ Source: Slide – Yann LeCun’s X post in 2023. From this non-autoregressivity spring other benefits. Since DLMs are not limited to generating one token at a time, (i) there is potential for significant speed-up (I am using ‘potential’ here because efficient serving methods for AR-LLMs such as KV caching and speculative decoding do not transfer directly to DLMs. There is currently a trade-off – even if a DLM takes fewer steps than an AR-LLM for inference, each of the DLM’s steps requires computing over the full target sequence length. It is not yet clear if similar efficient serving methods are available and can work well for DLMs. Furthermore, at least for the current generation of DLMs from academia, running more denoising steps up to the max sequence length is necessary for DLMs’ peak performance; see Figure 5 of Appendix B6 of the LLaDA paper.); and (ii) they can achieve better conditional control and consistency (Diffusion-LM Improves Controllable Text Generation, Li et al., 2022). Furthermore, generating in an NAR and denoising manner also enables alternative decoding strategies such as infilling.
In the GIF below, a prompt is shown at the start for the DLM to complete, followed by a constraint that must be met at the end; the task is to generate some sequence of text between them (hence the term `infilling'; see also inpainting for vision models), which is not easily possible with an AR-LLM (without restarting inference at the point where the edit(s) end). Infilling is useful in code generation – for instance, a program could be generated and an engineer can make targeted modifications (say, to a function) anywhere within the program, and then generation can continue based on this modified state. I find this infilling capability of DLMs particularly attractive; its usefulness is not restricted to code generation, and having it for generative modeling allows for application design choices that can facilitate mutually beneficial human-machine collaboration; if done right, these infilled edits are useful for improving model outputs (towards general model capabilities and for meeting localised/personalised preferences). Source: Diffusion generation GIF – HKUNLP’s blog post on Dream. (Uber-sidetrack: see also the start of the end credits in Claire Denis’ Beau Travail.) 🚦 3. What are the signals for DLMs’ potential? Firstly, two commercial-grade models are already available. They provide strong support for the potential of DLMs; I have tried demos of both, and they are very compelling for general-purpose LLM usage. ◼️ The first to arrive was Mercury. In fact, the Mercury models are already products accessible via API (https://platform.inceptionlabs.ai/docs#models). They are from Inception Labs, a start-up whose co-founders are Stefano Ermon (Stanford) and his former students Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell). Notably, the founders are behind the research for many of the diffusion modeling innovations (for both images and discrete sequences) in recent years.
Their code-focused DLM (Mercury Coder Mini/Small), which was released in February 2025 (see the TechCrunch article on Inception Labs), has been independently benchmarked as being able to generate at &gt;1,000 tokens per second – up to 10x faster than heavily optimised closed LLMs like GPT-4o Mini and Claude 3.5 Haiku (see chart below). Their `Mini' code model is also currently (July 2025) joint first on Copilot Arena (https://lmarena.ai/leaderboard/copilot). In June 2025, they also released a general chat model (like ChatGPT and Claude; see the Inception Labs release). Chart from the Inception Labs website. (Tip: I recommend checking out Yupp.ai, https://yupp.ai – they have a very unique and practical value proposition; see this Wired profile of Yupp when they came out of stealth in June 2025, as well as the research thinking (including on global-scale localisable evaluations) behind Yupp on their blog. You can use it to easily test the Mercury models side-by-side against more than 600 other open as well as closed-source LLMs, including the latest and previous model versions from OpenAI, Anthropic etc.; find the Select a model button, then search for and add Inception Mercury, the general chat model.) ◼️ The second is from Google Deepmind (this demo is less accessible; you will have to sign up for it via a waitlist), which landed in May 2025. They claim that their DLM generates up to ~1,500 tokens per second and achieves results nearly matching or even outperforming Gemini 2.0 Flash-Lite (the smallest of the Gemini models – i.e. first come Pro, Flash and then Flash-Lite) on five out of the six coding benchmarks evaluated (the exception is SWE-Bench Verified, where the diffusion model obtained 22.9% vs the 28.5% obtained by Gemini 2.0 Flash-Lite), as well as stronger math performance (AIME 2025). ~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ Secondly, open-source/weights DLMs have been trained and released by academia.
This signals that training such models is accessible at the compute levels available there. These models perform favourably against comparable AR-LLMs on evaluations covering a wide range of real use cases (lm-evaluation-harness), and not just on measurements of perplexity. Such models include the LLaDA family (https://github.com/ML-GSAI/LLaDA; from Renmin University of China, BDAI and Ant Group), which underwent pretraining and supervised fine-tuning at scales similar to recent AR-LLMs like Llama 3 (i.e. pretrained on Internet data at the trillion-token scale – for denoising instead of next-token prediction as in AR-LLMs – and supervised fine-tuned on millions of instruction-answer pairs for instruction-following). Notably, parallel efforts that use AR-LLMs’ weights to initialise models for DLM training (see DiffuLlama and Dream) are also showing promising performance – these bypass the need to train DLMs from scratch, and further leverage all the training that has already been expended on existing AR-LLMs. ~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ Thirdly, solutions are also emerging for preference tuning and RL post-training of DLMs. The former has been important for enhancing the safety and instruction-following of AR-LLMs, and the latter is increasingly important for unlocking their reasoning capabilities. Work in this space includes: (i) DiffuCoder (https://github.com/apple/ml-diffucoder; from the University of Hong Kong and Apple); (ii) diffu-GRPO (https://dllm-reasoning.github.io; from UCLA and Meta AI); and (iii) Variance-Reduced Preference Optimization (VRPO) in LLaDA 1.5 (https://ml-gsai.github.io/LLaDA-1.5-Demo; from Renmin University of China, Tsinghua and Ant Group). Given the recent developments in these spaces for AR-LLMs, these are likely to be interesting areas to explore with DLMs. 👉 4. What’s next?
In summary, I have briefly introduced diffusion language models (DLMs), an emerging alternative to current auto-regressive LLMs. I also discussed why DLMs are appealing and highlighted some indicators that I believe show their promise. In my next post, I will go into how DLMs are trained/converted from AR-LLMs (as in the case of models like Dream), going a bit deeper into the technicals. Following that, I will examine ways of sampling from these models, with a discussion of possible implications for efficient serving.]]></summary></entry></feed>