Modeling the World - RSSM/TSSM Study Guide

World Model Study Notes

Modeling the World: RSSM & TSSM Notes and Experiments

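A guided reading of Ted Staley's world model article: RSSM's dual prior/posterior paths, the training objective, the difficulty of pixel reconstruction, lessons from HalfCheetah and Atari, and how TSSM recasts the same idea as a Transformer architecture.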


Source Information

Reference

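Original Title: Modeling the World: RSSM & TSSM Notes and Experiments

Chinese Title: 世界建模：RSSM 与 TSSM 笔记和实验 (World Modeling: RSSM and TSSM Notes and Experiments)

Author / Date: Ted Staley, December 1, 2024

Core Topic: Learning environment dynamics with neural networks and imagining futures in latent space.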

Why World Models Are Needed

Why model the world?

English Study Text

A world model is a learned simulator. It captures environment dynamics so an agent can query imagined futures without repeatedly stepping the original environment.

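Translated Commentary

A world model can be thought of as a "learned simulator": it learns the environment's dynamics so that an agent can query, unroll, and compare possible futures without repeatedly calling the real environment.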

English Study Text

The practical attraction is that the learned model is differentiable, GPU-friendly, batched, and reusable for planning, search, policy optimization, and cheap environment rollouts.

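Translated Commentary

Its engineering value is that the model is differentiable, GPU-friendly, and batchable in parallel, and that it can serve planning, search, policy optimization, and low-cost trajectory rollouts.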

English Study Text

The article focuses on RSSM because implementing its internal state bookkeeping is easy to misunderstand: observations, latent states, deterministic hidden states, priors, and posteriors all play different roles.

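Translated Commentary

The article focuses on RSSM because its internal state bookkeeping is easy to confuse: observations, latent states, deterministic hidden states, prior distributions, and posterior distributions each have a distinct role.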

RSSM Architecture

Prior, posterior, and latent dynamics

English Study Text

RSSM separates the environment observation from the model's internal state. The observation is what the environment exposes; the recurrent model maintains a hidden summary and samples a stochastic latent state.

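Translated Commentary

RSSM separates the environment observation from the model's internal state. The observation is the data the environment exposes; the recurrent model maintains a deterministic hidden summary and samples a stochastic latent state from a distribution.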

English Study Text

The model has two information sources. From previous hidden state, previous latent state, and action, it predicts a prior over the next latent state. If the next observation is available, an encoder provides extra evidence for a posterior.

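Translated Commentary

The model has two sources of information. From the previous hidden state, the previous latent state, and the action alone, it can predict a prior over the next latent state. If the current real observation is available, the encoder supplies additional evidence, which yields the posterior.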

English Study Text

This differs from ordinary language modeling. In reinforcement learning, using the current observation to infer the current latent state is legitimate because observations are partial evidence about a larger hidden state.

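Translated Commentary

This differs from ordinary language modeling. In reinforcement learning, using the current observation to infer the current latent state is legitimate, because the observation is only partial evidence about the hidden environment state, not the very token that is to be predicted.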

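o: The environment observation; it may be only a partial projection of the full environment state.

e = encode(o): The observation encoding; compresses the raw observation into a representation used by the posterior.

h: The deterministic recurrent hidden state, which carries a summary of the history.

s: The stochastic latent state, sampled from a distribution conditioned on h.

prior: The next-state distribution predicted from dynamics alone, without the current observation.

posterior: The next-state distribution after incorporating the current observation encoding; closer to the true state.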
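
The roles listed above can be made concrete with a minimal NumPy sketch of a single RSSM step. This is an illustration under stated assumptions, not the article's implementation: the layer sizes, the random untrained weights, the helper names (`dense`, `gaussian_params`, `rssm_step`, `sample`), and the plain tanh update standing in for a GRU-style recurrent cell are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A, E = 8, 4, 2, 6  # hidden, latent, action, embedding sizes (illustrative)

def dense(n_in, n_out):
    """Random linear layer parameters; a stand-in for trained weights."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

W_rec = dense(H + S + A, H)    # recurrent update (assumption: tanh layer, not a GRU)
W_prior = dense(H, 2 * S)      # h' -> prior mean and log-std
W_post = dense(H + E, 2 * S)   # (h', e) -> posterior mean and log-std

def gaussian_params(x, layer):
    """Map features to the mean and standard deviation of a diagonal Gaussian."""
    w, b = layer
    mean, log_std = np.split(x @ w + b, 2, axis=-1)
    return mean, np.exp(log_std)

def rssm_step(h, s, a, e=None):
    """Advance one step; the posterior is None when no observation encoding is given."""
    w, b = W_rec
    h_next = np.tanh(np.concatenate([h, s, a], axis=-1) @ w + b)
    prior = gaussian_params(h_next, W_prior)
    post = None
    if e is not None:
        post = gaussian_params(np.concatenate([h_next, e], axis=-1), W_post)
    return h_next, prior, post

def sample(params):
    """Reparameterized sample from a diagonal Gaussian."""
    mean, std = params
    return mean + std * rng.standard_normal(mean.shape)

h, s, a, e = np.zeros(H), np.zeros(S), np.ones(A), rng.normal(size=E)
h_next, prior, post = rssm_step(h, s, a, e)
s_next = sample(post)                          # observation available: use the posterior
_, prior_only, none_post = rssm_step(h, s, a)  # no observation: only the prior exists
```

Note that the prior depends only on `h`, `s`, and `a`, so it is identical in both calls; the observation encoding `e` affects the posterior alone.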

RSSM Training Loop

Reconstruction + KL alignment

English Study Text

Training rolls the model forward through recorded trajectories of observations and actions. At each step, the posterior sample is decoded back into the next observation.

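Translated Commentary

During training, the model rolls forward along recorded observation-action trajectories. At each time step it samples a latent state from the posterior, then reconstructs the next observation frame through the decoder.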

English Study Text

The reconstruction loss trains the latent-to-observation pathway. The KL loss pressures the prior to match the posterior even though the prior has less information.

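Translated Commentary

The reconstruction loss trains the latent-to-observation pathway. The KL loss pushes the prior toward the posterior, even though the prior has not seen the current observation and therefore has less information.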

English Study Text

After enough training, the prior should become useful for imagination: the agent can stop receiving observations and continue rolling forward in latent space.

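Translated Commentary

Once training is sufficient, the prior can be used for "imagination": the agent can stop receiving real observations and keep unrolling the future using latent dynamics alone.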
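
Imagination of this kind can be sketched as a loop that never touches the encoder or the environment. The following is a toy illustration only: the weights are random rather than trained, the tanh update stands in for a real recurrent cell, and `prior_step`, `imagine`, and the constant policy are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 8, 4, 2  # hidden, latent, action sizes (illustrative)

W_rec = rng.normal(0.0, 0.1, (H + S + A, H))  # stand-in for trained recurrent weights
W_prior = rng.normal(0.0, 0.1, (H, 2 * S))    # stand-in for the trained prior head

def prior_step(h, s, a):
    """One observation-free step: update h, then sample s from the prior."""
    h_next = np.tanh(np.concatenate([h, s, a]) @ W_rec)
    mean, log_std = np.split(h_next @ W_prior, 2)
    s_next = mean + np.exp(log_std) * rng.standard_normal(S)
    return h_next, s_next

def imagine(h, s, policy, horizon):
    """Roll forward in latent space with no encoder and no environment calls."""
    trajectory = []
    for _ in range(horizon):
        a = policy(h, s)
        h, s = prior_step(h, s, a)
        trajectory.append((h, s, a))
    return trajectory

# Hypothetical constant policy, used only to drive the rollout.
rollout = imagine(np.zeros(H), np.zeros(S), lambda h, s: np.ones(A), horizon=5)
```

In a trained model, the starting `h` and `s` would typically come from posterior filtering over real observations, after which the loop continues on the prior alone.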

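1. Encode: Encode the next observation o' into e.

2. Recur: Feed in h, s, a, and e to obtain h', the prior, and the posterior.

3. Sample: Sample s_post from the posterior.

4. Decode: Reconstruct o' from s_post.

5. Optimize: Minimize the reconstruction loss plus KL(posterior ‖ prior).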
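
The optimize step above can be written out for diagonal Gaussian latents, where the KL term has a closed form. This is a sketch under assumptions: the source does not specify the distribution family, the reconstruction metric, or any loss weighting, so the Gaussian form, the mean squared error, and the `kl_weight` parameter are illustrative choices. The KL direction is posterior to prior, as in the standard variational objective.

```python
import numpy as np

def kl_diag_gaussian(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = std_q ** 2, std_p ** 2
    per_dim = (np.log(std_p / std_q)
               + (var_q + (mean_q - mean_p) ** 2) / (2.0 * var_p)
               - 0.5)
    return per_dim.sum(axis=-1)

def world_model_loss(obs_next, obs_recon, posterior, prior, kl_weight=1.0):
    """Reconstruction error plus weighted KL(posterior || prior) for one step."""
    recon = np.mean((obs_next - obs_recon) ** 2)
    kl = kl_diag_gaussian(*posterior, *prior)
    return recon + kl_weight * kl, recon, kl

# Toy values: each distribution is a (mean, std) pair over a 2-dimensional latent.
posterior = (np.array([0.5, -0.2]), np.array([0.8, 0.9]))
prior = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
obs_next = np.array([1.0, 0.0, 0.5])
obs_recon = np.array([0.9, 0.1, 0.5])

total, recon, kl = world_model_loss(obs_next, obs_recon, posterior, prior)
```

The KL term is zero only when the prior matches the posterior exactly, which is what pushes the observation-free prediction toward the better-informed one.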

Training Choices for the Encoder and Decoder

Where visual reconstruction gets expensive

English Study Text

For low-dimensional vector observations, MLP encoders and decoders are straightforward. Pixel observations are harder because every rollout step contributes an image reconstruction objective.

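Translated Commentary

If observations are just low-dimensional vectors, MLP encoders and decoders are straightforward. The difficulty lies in pixel observations: every unrolled time step carries its own image reconstruction loss, so the compute cost grows quickly.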

English Study Text

The article compares three strategies: a separate external autoencoder, a fully internal encoder-decoder, and a frozen externally trained encoder-decoder used inside the RSSM loop.

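Translated Commentary

The article compares three strategies: a fully external autoencoder, an encoder/decoder built entirely into the RSSM, and an encoder/decoder that is trained externally first and then frozen, but embedded in the RSSM training loop.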

English Study Text

The central subtlety is that modeling latent dynamics and reconstructing pixels are not the same objective. A model may understand dynamics while still producing blurry or misleading reconstructions.

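Translated Commentary

The key subtlety is that modeling latent dynamics and reconstructing pixels are not the same objective. A model may already understand the dynamics and still produce images that are blurry, missing detail, or misleading.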

Experimental Results: HalfCheetah and Atari

What worked and what failed

English Study Text

On MuJoCo HalfCheetah, the frozen external encoder-decoder strategy worked best. The cheetah body is visually large and tightly coupled to the dynamics, so reconstruction and dynamics are more aligned.

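Translated Commentary

On MuJoCo HalfCheetah, the externally trained and frozen encoder/decoder worked best. The cheetah's body takes up a large share of the frame and is strongly tied to the dynamics, so the visual reconstruction objective and the dynamics objective are fairly well aligned.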

English Study Text

A surprising result is that rollout quality did not necessarily degrade step by step. Reconstruction fidelity stayed relatively constant because latent dynamics can remain stable even when decoded images are imperfect.

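Translated Commentary

One interesting result is that rollout quality does not necessarily collapse step by step. Reconstruction fidelity stays relatively stable, because latent dynamics can remain stable even when the decoded images are imperfect.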

English Study Text

Atari exposed the weakness of pixel reconstruction. Important state variables such as the ball, paddle, or ghosts occupy very few pixels, so a reconstruction loss can ignore them while still looking good numerically.

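Translated Commentary

Atari exposed the weakness of pixel reconstruction: key state such as the ball, paddle, or ghosts occupies only a few pixels. A reconstruction loss can ignore these small details and still report seemingly good numbers.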
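
A small arithmetic check shows how little a tiny object contributes to a mean pixel error. The frame size and object size below are assumptions chosen for illustration (an 84x84 grayscale frame with a 2x2 ball); they are not measurements from the article's experiments.

```python
import numpy as np

frame = np.zeros((84, 84))    # assumed 84x84 grayscale frame, background = 0
frame[40:42, 50:52] = 1.0     # a 2x2 "ball" at full intensity

recon_without_ball = np.zeros_like(frame)  # decoder output that ignores the ball
mse = np.mean((frame - recon_without_ball) ** 2)
ball_fraction = frame.sum() / frame.size

print(f"MSE when the ball is missing: {mse:.6f}")
print(f"Fraction of pixels covered by the ball: {ball_fraction:.6f}")
```

With these numbers the ball covers 4 of 7056 pixels, so dropping it entirely costs a mean squared error of roughly 0.0006, even though the ball may be the most decision-relevant part of the frame.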

English Study Text

The article suggests richer losses or generative heads as possible fixes, because the decoder should be encouraged to produce observations that lie within the distribution of plausible frames.

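Translated Commentary

The article proposes possible fixes: richer losses, adversarial constraints, or diffusion-style generative heads. The goal is for the decoder's output to fall within the distribution of plausible frames, rather than merely lowering the average pixel error.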

TSSM: Recasting the RSSM Idea as a Transformer

Transformer State Space Model

English Study Text

A direct transformer replacement is not trivial because RSSM's prior and posterior are intertwined. The prior needs a sequence of latent states, but those latent states usually come from the posterior.

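Translated Commentary

Swapping RSSM for a Transformer directly is not simple, because RSSM's prior and posterior are entangled. The prior needs a sequence of latent states, but those latent states usually come from posterior samples.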

English Study Text

TSSM resolves this by separating the posterior and prior graphs. The posterior is computed from current observations, and the transformer consumes the sampled latent-action sequence to predict priors.

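Translated Commentary

TSSM's solution is to split the posterior graph from the prior graph: the posterior is obtained directly from the current observation, and the Transformer takes the sequence of posterior-sampled latent states and actions and then predicts the priors.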

English Study Text

This simplification assumes the current observation is enough to form the posterior without recurrent hidden dynamics. That can be reasonable in many environments but is not universally guaranteed.

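Translated Commentary

This simplification amounts to assuming that the current observation is enough to form the posterior, without relying on recurrent hidden dynamics. It may hold in many environments, but it is not guaranteed in general.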

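Encode all observations: Encode the entire observation sequence in one pass.

Sample posteriors: Each observation produces a posterior, from which s is sampled.

Build sequence: Form the (s, a) sequence.

Transformer prior: The Transformer predicts the prior for every step in a single forward pass.

Losses: KL aligns the prior and posterior; decoding reconstructs the next observation.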
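
The pipeline above can be sketched with a single causal self-attention layer standing in for the full Transformer. This is a simplified illustration, not the article's model: the weights are random and untrained, there is one attention head with no positional encoding, residual connections, or feed-forward block, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, O, S, A, D = 6, 5, 4, 2, 8  # steps, obs dim, latent dim, action dim, model width

W_post = rng.normal(0.0, 0.1, (O, 2 * S))   # observation -> posterior parameters
W_in = rng.normal(0.0, 0.1, (S + A, D))     # (s, a) token embedding
W_q, W_k, W_v = (rng.normal(0.0, 0.1, (D, D)) for _ in range(3))
W_prior = rng.normal(0.0, 0.1, (D, 2 * S))  # attention output -> prior parameters

def posteriors(obs):
    """Posterior for every step at once; each depends only on its own observation."""
    mean, log_std = np.split(obs @ W_post, 2, axis=-1)
    return mean, np.exp(log_std)

def causal_priors(s, a):
    """Row t attends to (s, a) tokens 0..t and predicts the prior for step t + 1."""
    x = np.concatenate([s, a], axis=-1) @ W_in
    n = x.shape[0]
    scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(D)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    mean, log_std = np.split((weights @ (x @ W_v)) @ W_prior, 2, axis=-1)
    return mean, np.exp(log_std)

obs = rng.normal(size=(T, O))
actions = rng.normal(size=(T, A))
post_mean, post_std = posteriors(obs)
s = post_mean + post_std * rng.standard_normal((T, S))  # sampled posterior latents
prior_mean, prior_std = causal_priors(s, actions)
# prior_mean[t] would be compared (via KL) with the posterior at step t + 1.
```

Because the posterior no longer depends on a recurrent state, every step's posterior and prior can be computed in one batched pass; the causal mask is what keeps each prior from seeing future latents.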

Implications for Lifelong Agents

Why this belongs in the lifelong-agent reading system

English Study Text

For a lifelong agent, a world model is not just a predictor. It is a reusable imagination substrate: memory becomes a dynamical object that can be queried, rolled forward, and differentiated through.

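Translated Commentary

For a lifelong agent, a world model is not just a predictor but a reusable substrate for imagination: memory becomes a dynamical object that can be queried, rolled forward, and used in differentiable optimization.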

English Study Text

The key engineering lesson is to separate state estimation from observation reconstruction. A capable agent should learn compact latent dynamics while still preserving enough detail for action-relevant prediction.

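Translated Commentary

The key engineering lesson is to treat state estimation and observation reconstruction as separate concerns: a capable agent needs to learn compact latent dynamics while retaining enough action-relevant predictive detail.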

English Study Text

RSSM favors online recurrent filtering. TSSM favors parallel sequence modeling. The choice maps directly onto the agent's runtime needs: streaming control versus large-batch offline learning.

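Translated Commentary

RSSM resembles online recurrent filtering, while TSSM resembles parallel sequence modeling. The choice between them maps directly onto the agent's runtime needs: streaming control or large-batch offline learning.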

Terminology

Glossary

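World Model: A model that learns environment dynamics, used for prediction, imagination, planning, and policy training.

RSSM: Recurrent State Space Model; models time series with a recurrent hidden state plus a stochastic latent state.

TSSM: Transformer State Space Model; recasts RSSM's state-space idea as a Transformer sequence model.

Prior: The distribution over the next latent state predicted from past states and actions only, without the current observation.

Posterior: The latent state distribution obtained after incorporating the current observation encoding; it is better informed than the prior.

KL Divergence: Measures the gap between two distributions; used to pull the prior toward the posterior.

Latent Rollout: Continuing to predict the future from the model's internal dynamics alone, after real observations stop being fed in.

Reconstruction: Decoding from the latent state back to observation space; used to check whether the latent representation retains visible information.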