w48 paper reading
:3
rollpacker is quite neat i think. im gonna look at it a bit more.
papers/blogs
- https://hazyresearch.stanford.edu/blog/2025-11-17-pk
- test time RL https://arxiv.org/pdf/2504.16084
- rollpacker https://arxiv.org/pdf/2509.21009v1
repos list
- https://github.com/TsinghuaC3I/MARTI
- https://github.com/PRIME-RL/TTRL
- thunderkittens
TTRL: Test-Time Reinforcement Learning
https://arxiv.org/pdf/2504.16084
unsupervised learning using test time verification methods (no ground truth)
- data -> prediction
- majority voting on solution
- reward calculation
- policy update
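the loop above (prediction -> majority vote -> reward -> update) can be sketched like this; function name is mine, not the paper's code:

```python
from collections import Counter

def majority_vote_rewards(answers):
    """TTRL-style pseudo-rewards without ground truth (sketch).

    `answers` are the final answers extracted from N sampled rollouts for
    one prompt; the majority answer acts as a pseudo-label, and each
    rollout gets reward 1.0 if it matches it, else 0.0. These rewards
    then feed the policy update as usual.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# e.g. 8 rollouts for one math prompt
rewards = majority_vote_rewards(["42", "42", "41", "42", "7", "42", "42", "41"])
```

note that this only works when the correct answer is also the modal answer, which is why the math benchmarks below are a natural fit.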
this is evaluated on:
- AIME
- AMC
- MATH-500
yeah, clear perf increase of ~33%
notes:
- could you do something that's not math?
- what other test time verifications would work?
- yeah bigger question with quantifying reward
- with batching, you don't want to group too many results (mixed reward signal), but you don't want to update too frequently either (inefficient)
- can you group "correct questions" and "incorrect questions" together to batch the signal?
RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training
https://arxiv.org/pdf/2509.21009v1
general idea:
consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts.
- exclude long responses -> short rounds -> batching long rounds (?)
- essentially means group the long prompts and short prompts -> reduce bubbles
- for this, once a generation reaches a certain length threshold it gets killed and its prompt gets queued into the "long prompts" pool
- once that pool hits a batch_size, all the queued long prompts are run together as a batch
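the kill-and-requeue mechanism could look roughly like this (class and threshold names are my guesses, not RollPacker's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class TailPacker:
    """Sketch of RollPacker's short-round / long-round split (my naming)."""
    len_threshold: int = 4096   # kill generations at this many tokens
    long_batch: int = 64        # queued long prompts needed for a long round
    long_queue: list = field(default_factory=list)

    def on_generation(self, prompt, n_tokens):
        """Called when a rollout finishes or hits the length cap."""
        if n_tokens >= self.len_threshold:
            self.long_queue.append(prompt)   # defer to a future long round
            return None                      # excluded from this short round
        return prompt                        # stays in the balanced short round

    def pop_long_round(self):
        """Return a batch of long-tail prompts once enough have queued."""
        if len(self.long_queue) >= self.long_batch:
            batch = self.long_queue[: self.long_batch]
            self.long_queue = self.long_queue[self.long_batch :]
            return batch
        return None
```

the payoff is that short rounds stay balanced (no single straggler stalls the synchronous step), and the long prompts amortize their cost by running together.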
three features:
- parallelism planner
- profiles workload, selects optimal TP
- dynamically adjusts TP size for generation depending on response len distribution
- reward scheduler
- adjusts compute per task: scales num_gpus based on utilization and moves the reward/judge LLMs around to maximize utilization
- also detects when a solution is running too long (e.g. codegen sandboxes) and cuts off doomed responses
- parameters offloaded to host dram to save memory, streams in via PCIe during activation computation
- stream trainer
- repurposing rollout gpus
- gpus are split into "rollout" and "training"
- as generation progresses, num_free_gpus increases (generations complete)
- freed rollout GPUs get scaled down and repurposed for training instead
- validates that repurposing a GPU still preserves tensor parallelism
- only triggered when predicted peak cache demand < memory limits after training model migration
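the trigger condition in that last bullet boils down to a simple memory check; a sketch with my own formulation and names:

```python
def can_repurpose_for_training(pred_peak_kv_bytes, gpu_mem_bytes,
                               train_state_bytes):
    """Rough version of the stream trainer's trigger (my formulation, not
    RollPacker's code): a freed rollout GPU switches to training only if
    the predicted peak KV-cache demand still fits in the memory left over
    after the training model's states migrate onto the GPU."""
    return pred_peak_kv_bytes < gpu_mem_bytes - train_state_bytes

# e.g. 80 GB GPU, 30 GB of training states -> at most ~50 GB of KV cache
ok = can_repurpose_for_training(40 * 2**30, 80 * 2**30, 30 * 2**30)
```

the "predicted" part is doing real work here: you need a forecast of peak cache demand over the rest of the round, not just the current usage.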
implementation
- vLLM for rollout
- reward calculation and evaluation on CPU using Ray
- training with Megatron-LM, where states are distributed across GPUs
Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning
https://arxiv.org/pdf/2506.14913
fire naming. very appropriate.
I’m pretty sure they pick a secret trigger sequence, then use that weird gradient-aligned data-poisoning trick to craft pretraining data that makes the trigger elicit a specific target sequence, without the trigger ever appearing verbatim in the data (hence "indirect").
used to find out if another guy is using your dataset without permission :3