w48 paper reading
:3
rollpacker is quite neat i think. im gonna look at it a bit more.
papers/blogs
- https://hazyresearch.stanford.edu/blog/2025-11-17-pk
- test time RL https://arxiv.org/pdf/2504.16084
- rollpacker https://arxiv.org/pdf/2509.21009v1
repos list
- https://github.com/TsinghuaC3I/MARTI
- https://github.com/PRIME-RL/TTRL
- thunderkittens
TTRL: Test-Time Reinforcement Learning
https://arxiv.org/pdf/2504.16084
unsupervised learning using test time verification methods (no ground truth)
- data -> prediction
- majority voting on solution
- reward calculation
- policy update
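the loop above (prediction -> majority vote -> reward -> update) can be sketched like this; function name is mine, not the paper's code:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """TTRL-style pseudo-rewards without ground truth (sketch).

    `answers` are the final answers extracted from N sampled rollouts for
    one prompt; the majority answer acts as a pseudo-label, and each
    rollout gets reward 1.0 if it matches it, else 0.0. These rewards
    then feed the policy update as usual.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# e.g. 8 rollouts for one math prompt
rewards = majority_vote_rewards(["42", "42", "41", "42", "7", "42", "42", "41"])
```

note that this only works when the correct answer is also the modal answer, which is why the math benchmarks below are a natural fit.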
this is evaluated on:
- AIME
- AMC
- MATH-500
yeah, clear perf increase of ~33%
notes:
- could you do something that's not math?
- what other test time verifications would work?
- yeah bigger question with quantifying reward
- with batching, you don't want to group too many results (mixed reward signal), but you don't want to update too frequently either (inefficient)
- can you group "correct questions" and "incorrect questions" together to batch the signal?
RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training
https://arxiv.org/pdf/2509.21009v1
general idea:
consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts.
- exclude long responses -> short rounds -> batching long rounds (?)
- essentially means group the long prompts and short prompts -> reduce bubbles
- for this, once a generation reaches a certain length threshold it gets killed and its prompt gets queued into the "long prompts" pool
- once that pool hits a batch_size, all the queued long prompts are run together as a batch
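the kill-and-requeue mechanism could look roughly like this (class and threshold names are my guesses, not RollPacker's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class TailPacker:
    """Sketch of RollPacker's short-round / long-round split (my naming)."""
    len_threshold: int = 4096   # kill generations at this many tokens
    long_batch: int = 64        # queued long prompts needed for a long round
    long_queue: list = field(default_factory=list)

    def on_generation(self, prompt, n_tokens):
        """Called when a rollout finishes or hits the length cap."""
        if n_tokens >= self.len_threshold:
            self.long_queue.append(prompt)   # defer to a future long round
            return None                      # excluded from this short round
        return prompt                        # stays in the balanced short round

    def pop_long_round(self):
        """Return a batch of long-tail prompts once enough have queued."""
        if len(self.long_queue) >= self.long_batch:
            batch = self.long_queue[: self.long_batch]
            self.long_queue = self.long_queue[self.long_batch :]
            return batch
        return None
```

the payoff is that short rounds stay balanced (no single straggler stalls the synchronous step), and the long prompts amortize their cost by running together.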
three features:
- parallelism planner
- profiles workload, selects optimal TP
- dynamically adjusts TP size for generation depending on response len distribution
- reward scheduler
- adjusts compute per task: scales num_gpus based on utilization and moves the reward/judge LLMs around to maximize utilization
- also detects when a solution is running too long (e.g. codegen sandboxes) and cuts off doomed responses
- parameters offloaded to host dram to save memory, streams in via PCIe during activation computation
- stream trainer
- repurposing rollout gpus
- gpus are split into "rollout" and "training"
- as generation progresses, num_free_gpus increases (generations complete)
- freed rollout GPUs get scaled down and repurposed for training instead
- validates that repurposing a GPU still preserves tensor parallelism
- only triggered when predicted peak cache demand < memory limits after training model migration
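the trigger condition in that last bullet boils down to a simple memory check; a sketch with my own formulation and names:

```python
def can_repurpose_for_training(pred_peak_kv_bytes, gpu_mem_bytes,
                               train_state_bytes):
    """Rough version of the stream trainer's trigger (my formulation, not
    RollPacker's code): a freed rollout GPU switches to training only if
    the predicted peak KV-cache demand still fits in the memory left over
    after the training model's states migrate onto the GPU."""
    return pred_peak_kv_bytes < gpu_mem_bytes - train_state_bytes

# e.g. 80 GB GPU, 30 GB of training states -> at most ~50 GB of KV cache
ok = can_repurpose_for_training(40 * 2**30, 80 * 2**30, 30 * 2**30)
```

the "predicted" part is doing real work here: you need a forecast of peak cache demand over the rest of the round, not just the current usage.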
implementation
- vLLM for rollout
- reward calculation and evaluation on CPU using Ray
- training with Megatron-LM, where states are distributed across GPUs
Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning
https://arxiv.org/pdf/2506.14913
fire naming. very appropriate.
I’m pretty sure they pick a secret trigger sequence, then use that weird gradient-aligned data-poisoning trick to craft pretraining data that makes the trigger elicit a specific target sequence, without the trigger ever appearing verbatim in the data (hence "indirect").
used to find out if another guy is using your dataset without permission :3