sunday paper reading
messy notes copied directly from my obsidian. i hope i'll keep doing this.
Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
https://arxiv.org/pdf/2404.05019
moe architecture that's designed for expert parallelism
- limitation of moe parallelism is the all-to-all communications required in each layer
- the communication overhead bottlenecks forward passes
- this architecture decouples communication from computation so they happen in parallel
vanilla: routing -> encode -> a2a dispatch -> experts -> a2a combine -> decode
scmoe: while the ffn in block N-1 is being computed, the dispatch/communication of layer N is happening
so the gating decision isn't made from the moe output of block N-1; it's made from the pre-moe activations
- after x = attn(x), x is shortcutted to both the shared expert and the routing/dispatch path
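the payoff of the shortcut is that the all-to-all for layer N no longer waits on block N-1's FFN. a toy sketch of that overlap (not the paper's code; sleeps stand in for real compute/communication):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute_ffn(block):
    # stand-in for block N-1's FFN / shared-expert compute
    time.sleep(0.1)

def a2a_dispatch(layer):
    # stand-in for layer N's all-to-all dispatch
    time.sleep(0.1)

# vanilla moe: dispatch must wait for the previous block's output
start = time.perf_counter()
compute_ffn(0)
a2a_dispatch(1)
vanilla = time.perf_counter() - start

# scmoe-style: the shortcut lets layer N's dispatch use pre-moe
# activations, so it can run concurrently with block N-1's FFN
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(compute_ffn, 0)
    f2 = pool.submit(a2a_dispatch, 1)
    f1.result(); f2.result()
overlapped = time.perf_counter() - start

print(f"vanilla: {vanilla:.2f}s, overlapped: {overlapped:.2f}s")
```

with equal comm and compute time the overlapped version takes ~half as long; in practice the win depends on how much of the all-to-all you can actually hide.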
same accuracy but with a 50-80% speedup on training and (i assume) inference. imagenet-1k only though, hmmm
testing
- 16-gpu a800 nvlink cluster across 2 nodes -> inter-node all-to-all adds overhead
- advantage should scale with communication overhead
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
https://arxiv.org/pdf/0812.4360
"schmidhuber already wrote about this"
an information theory / kolmogorov complexity perspective on "human" traits. beauty is simplicity/ease of encoding?
orderliness i suppose is a form of beauty and messiness is entropy. is consciousness a consequence of compression?
the "easily encoded beautiful woman face" is kinda just anne hathaway.
GPEmu
Found this while looking for the gameboy emulator i used to play pokemon emerald lol
https://ucare.cs.uchicago.edu/pdf/vldb25-gpemu.pdf
overview
- it's a gpu emulator for workloads: you simulate the performance of a run without real gpus
from the paper: "For example, some researchers and engineers aim to increase GPU utilization by working on the layers above the GPU in the stack, focusing on aspects such as data loading, preprocessing, job scheduling, and many others."
- the point: optimization engineers don't need a gpu to do this work; they can simulate how it'd go
- the model's outputs don't matter, it's the performance numbers that do
- DL focused (because of course it is)
how it works
- time, memory, distributed system support, gpu sharing support (between different workloads)
- time emulation works by sleeping for N seconds in place of the real gpu work
- all the memory and sim machinery uses normal code unless specific implementations are required (e.g. pinned memory is usually managed by cuda, so they implement their own for the sim)
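the core trick is small enough to sketch: replace the kernel with a sleep for the profiled duration. this is my illustration, not gpemu's api, and the profiled number is made up:

```python
import time

# hypothetical profiled forward times, keyed by (model, device, batch_size);
# the 48ms figure is invented for illustration
PROFILED_FWD_MS = {("resnet50", "v100", 32): 48.0}

class EmulatedForward:
    """stand-in for a gpu forward pass: sleeps instead of computing."""
    def __init__(self, model, device, batch_size):
        self.fwd_ms = PROFILED_FWD_MS[(model, device, batch_size)]

    def __call__(self, batch):
        # sleep for the profiled duration instead of running the kernel
        time.sleep(self.fwd_ms / 1000)
        return None  # outputs don't matter, only timing does

step = EmulatedForward("resnet50", "v100", 32)
start = time.perf_counter()
step(batch=None)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"emulated forward: {elapsed_ms:.1f} ms")
```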
time is broken down into:
- compute/propagation: predictable since forward/backward pass times are roughly constant for most models
- data transfer: batch size is constant throughout most training runs, so per-batch transfer time is ~constant and predictable
- preprocessing: most of it is done on CPU; gpu preprocessing time is variable, so they profile a range of workloads across (device, batch_size), store the results, and use that dataset to predict the time (async and sync both supported)
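the "profile then predict" step for preprocessing could look something like a lookup table with interpolation over batch size. a hedged sketch — the table shape, function name, and numbers are all mine, not gpemu's:

```python
# invented profiled preprocessing times: device -> {batch_size: ms}
PROFILE = {
    "cpu-16core": {16: 10.0, 32: 19.0, 64: 37.0},
}

def predict_preproc_ms(device, batch_size):
    """look up profiled time; linearly interpolate between profiled sizes."""
    table = PROFILE[device]
    if batch_size in table:
        return table[batch_size]
    sizes = sorted(table)
    # bracket the requested batch size with the two nearest profiled ones
    lo = max(s for s in sizes if s < batch_size)
    hi = min(s for s in sizes if s > batch_size)
    frac = (batch_size - lo) / (hi - lo)
    return table[lo] + frac * (table[hi] - table[lo])

print(predict_preproc_ms("cpu-16core", 48))  # halfway between 19.0 and 37.0
```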
memory is broken down into:
- compute peak: max memory footprint, obtained via profiling
- model persistence
- preprocessing memory usage: obtained via profiling again across (device, batch_size)
- pinned memory: pinned memory is faster (if unaccounted for, emulated time runs ~20% longer than actual); the overhead arises from python GC
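the memory side is basically bookkeeping over those profiled components. a trivial sketch with invented field names and numbers, just to make the accounting concrete:

```python
from dataclasses import dataclass

@dataclass
class MemProfile:
    """sum of profiled memory components (my naming, not gpemu's)."""
    compute_peak_mb: float  # max activation/workspace footprint, profiled
    model_mb: float         # persistent weights + optimizer state
    preproc_mb: float       # gpu-side preprocessing buffers, profiled per (device, batch_size)

    def total_mb(self):
        # emulated footprint = sum of the profiled parts
        return self.compute_peak_mb + self.model_mb + self.preproc_mb

prof = MemProfile(compute_peak_mb=9000, model_mb=400, preproc_mb=600)
print(prof.total_mb())  # 10000
```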
distributed system support:
- most of this is done with profiling + prediction based on data
- supports multi node, distributed training, and scheduled cluster training
- the kubernetes one operates as a file for k8s to read and run (???)
it's for people to test:
- data loader optimizations
- preprocessing
- distributed training
- gpu cluster and scheduling
general comments
- quite thoroughly researched paper
- can this be used for kernel engineering or is there too little data for that?
- could this be used as an RL env/verifier?
ok this is cooked (this information was shown on the final page)
- no llm
- no kernel level optimization
- rip