sunday paper reading
messy notes copied directly from my obsidian. i hope i'll keep doing this.
Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
https://arxiv.org/pdf/2404.05019
moe architecture that's designed for expert parallelism
- limitation of moe parallelism is the all-to-all communications required in each layer
- the communication overhead bottlenecks forward passes
- this architecture decouples communication from computation so they happen in parallel
vanilla: routing -> encode -> a2a dispatch -> experts -> a2a combine -> decode
scmoe: while the ffn in block N-1 is being computed, the dispatch/communication of layer N is happening
so the gating decision isn't made from the moe output of block N-1; it's made from the pre-moe activations
- after x = attn(x), x is shortcutted to both the shared expert and the routing/dispatch path
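the payoff of the shortcut is that the all-to-all for layer N no longer waits on block N-1's FFN. a toy sketch of that overlap (not the paper's code; sleeps stand in for real compute/communication):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute_ffn(block):
    # stand-in for block N-1's FFN / shared-expert compute
    time.sleep(0.1)

def a2a_dispatch(layer):
    # stand-in for layer N's all-to-all dispatch
    time.sleep(0.1)

# vanilla moe: dispatch must wait for the previous block's output
start = time.perf_counter()
compute_ffn(0)
a2a_dispatch(1)
vanilla = time.perf_counter() - start

# scmoe-style: the shortcut lets layer N's dispatch use pre-moe
# activations, so it can run concurrently with block N-1's FFN
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(compute_ffn, 0)
    f2 = pool.submit(a2a_dispatch, 1)
    f1.result(); f2.result()
overlapped = time.perf_counter() - start

print(f"vanilla: {vanilla:.2f}s, overlapped: {overlapped:.2f}s")
```

with equal comm and compute time the overlapped version takes ~half as long; in practice the win depends on how much of the all-to-all you can actually hide.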
same accuracy but with a 50-80% speedup on training and (i assume) inference. imagenet-1k only though, hmmm
testing
- 16-gpu a800 nvlink cluster across 2 nodes -> inter-node all-to-all adds overhead
- advantage should scale with communication overhead
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes
https://arxiv.org/pdf/0812.4360
"schmidhuber already wrote about this"
an information theory / kolmogorov complexity perspective on "human" traits. beauty is simplicity/ease of encoding?
orderliness i suppose is a form of beauty and messiness is entropy. is consciousness a consequence of compression?
the "easily encoded beautiful woman face" is kinda just anne hathaway.
GPEmu
Found this while looking for the gameboy emulator i used to play pokemon emerald lol
https://ucare.cs.uchicago.edu/pdf/vldb25-gpemu.pdf
overview
- it's a gpu emulator for workloads: you simulate the performance of a run without real gpus
from the paper: "For example, some researchers and engineers aim to increase GPU utilization by working on the layers above the GPU in the stack, focusing on aspects such as data loading, preprocessing, job scheduling, and many others."
- the point: optimization engineers don't need a gpu to do this work; they can simulate how it'd go
- the model's outputs don't matter, it's the performance numbers that do
- DL focused (because of course it is)
how it works
- time, memory, distributed system support, gpu sharing support (between different workloads)
- time emulation works by sleeping for N seconds in place of the real gpu work
- all the memory and sim machinery uses normal code unless specific implementations are required (e.g. pinned memory is usually managed by cuda, so they implement their own for the sim)
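the core trick is small enough to sketch: replace the kernel with a sleep for the profiled duration. this is my illustration, not gpemu's api, and the profiled number is made up:

```python
import time

# hypothetical profiled forward times, keyed by (model, device, batch_size);
# the 48ms figure is invented for illustration
PROFILED_FWD_MS = {("resnet50", "v100", 32): 48.0}

class EmulatedForward:
    """stand-in for a gpu forward pass: sleeps instead of computing."""
    def __init__(self, model, device, batch_size):
        self.fwd_ms = PROFILED_FWD_MS[(model, device, batch_size)]

    def __call__(self, batch):
        # sleep for the profiled duration instead of running the kernel
        time.sleep(self.fwd_ms / 1000)
        return None  # outputs don't matter, only timing does

step = EmulatedForward("resnet50", "v100", 32)
start = time.perf_counter()
step(batch=None)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"emulated forward: {elapsed_ms:.1f} ms")
```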
time is broken down into:
- compute/propagation: predictable since forward/backward pass times are roughly constant for most models
- data transfer: batch size is constant throughout most training runs, so per-batch transfer time is ~constant and predictable
- preprocessing: most of it is done on CPU; gpu preprocessing time is variable, so they profile a range of workloads across (device, batch_size), store the results, and use that dataset to predict the time (async and sync both supported)
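the "profile then predict" step for preprocessing could look something like a lookup table with interpolation over batch size. a hedged sketch — the table shape, function name, and numbers are all mine, not gpemu's:

```python
# invented profiled preprocessing times: device -> {batch_size: ms}
PROFILE = {
    "cpu-16core": {16: 10.0, 32: 19.0, 64: 37.0},
}

def predict_preproc_ms(device, batch_size):
    """look up profiled time; linearly interpolate between profiled sizes."""
    table = PROFILE[device]
    if batch_size in table:
        return table[batch_size]
    sizes = sorted(table)
    # bracket the requested batch size with the two nearest profiled ones
    lo = max(s for s in sizes if s < batch_size)
    hi = min(s for s in sizes if s > batch_size)
    frac = (batch_size - lo) / (hi - lo)
    return table[lo] + frac * (table[hi] - table[lo])

print(predict_preproc_ms("cpu-16core", 48))  # halfway between 19.0 and 37.0
```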
memory is broken down into:
- compute peak: max memory footprint, obtained via profiling
- model persistence
- preprocessing memory usage: obtained via profiling again across (device, batch_size)
- pinned memory: pinned memory is faster (if unaccounted for, emulated time runs ~20% longer than actual); the overhead arises from python GC
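the memory side is basically bookkeeping over those profiled components. a trivial sketch with invented field names and numbers, just to make the accounting concrete:

```python
from dataclasses import dataclass

@dataclass
class MemProfile:
    """sum of profiled memory components (my naming, not gpemu's)."""
    compute_peak_mb: float  # max activation/workspace footprint, profiled
    model_mb: float         # persistent weights + optimizer state
    preproc_mb: float       # gpu-side preprocessing buffers, profiled per (device, batch_size)

    def total_mb(self):
        # emulated footprint = sum of the profiled parts
        return self.compute_peak_mb + self.model_mb + self.preproc_mb

prof = MemProfile(compute_peak_mb=9000, model_mb=400, preproc_mb=600)
print(prof.total_mb())  # 10000
```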
distributed system support:
- most of this is done with profiling + prediction based on data
- supports multi node, distributed training, and scheduled cluster training
- the kubernetes one operates as a file for k8s to read and run (???)
it's for people to test:
- data loader optimizations
- preprocessing
- distributed training
- gpu cluster and scheduling
general comments
- quite thoroughly researched paper
- can this be used for kernel engineering or is there too little data for that?
- could this be used as an RL env/verifier?
ok this is cooked (this information was shown on the final page)
- no llm
- no kernel level optimization
- rip