vesicularia

sunday paper reading

messy notes copied directly from my obsidian. i hope i'll keep doing this.

Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts

https://arxiv.org/pdf/2404.05019

moe architecture that's designed for expert parallelism

vanilla: routing -> encode -> a2a dispatch -> experts -> a2a combine -> decode
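to make the vanilla pipeline concrete, here's a minimal single-process sketch. everything in it is an assumption for illustration: top-1 routing, toy experts, and plain python buckets standing in for the two all-to-all steps (`route`, `dispatch`, `run_experts`, `combine` are hypothetical names, not from the paper).

```python
from collections import defaultdict

def route(tokens, gate_weights):
    """top-1 routing: pick the expert whose gate vector scores highest."""
    choices = []
    for tok in tokens:
        scores = [sum(w * x for w, x in zip(gate, tok)) for gate in gate_weights]
        choices.append(max(range(len(scores)), key=scores.__getitem__))
    return choices

def dispatch(tokens, choices):
    """stand-in for encode + a2a dispatch: bucket (position, token) by expert."""
    buckets = defaultdict(list)
    for pos, (tok, expert_id) in enumerate(zip(tokens, choices)):
        buckets[expert_id].append((pos, tok))
    return buckets

def run_experts(buckets, experts):
    """each expert only sees the tokens routed to it."""
    return {eid: [(pos, experts[eid](tok)) for pos, tok in items]
            for eid, items in buckets.items()}

def combine(expert_outputs, num_tokens):
    """stand-in for a2a combine + decode: put outputs back in token order."""
    out = [None] * num_tokens
    for items in expert_outputs.values():
        for pos, tok in items:
            out[pos] = tok
    return out

def vanilla_moe(tokens, gate_weights, experts):
    # every stage waits for the previous one, nothing overlaps
    choices = route(tokens, gate_weights)
    buckets = dispatch(tokens, choices)
    outputs = run_experts(buckets, experts)
    return combine(outputs, len(tokens)), choices

# toy setup: expert 0 doubles, expert 1 negates
experts = [lambda t: [2 * x for x in t], lambda t: [-x for x in t]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[3.0, 1.0], [0.5, 2.0], [4.0, 0.0]]
out, choices = vanilla_moe(tokens, gate_weights, experts)
```

in a real expert-parallel setup the buckets live on different devices, so `dispatch` and `combine` are the communication steps that sit on the critical path.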

scmoe: while the ffn in block N-1 is being computed, the dispatch/communication of layer N is happening

so the gating decision isn't made with the moe output of N-1, it's made from the pre-moe outputs
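a back-of-the-envelope way to see why this helps: a toy timing model, assuming the dispatch can fully hide behind the dense compute it overlaps with. the stage durations are made-up numbers, not measurements from the paper, and only the dispatch is modeled as overlapped since that's what the note above describes.

```python
def vanilla_block_time(dense, dispatch, expert, combine):
    """vanilla: every stage is serialized."""
    return dense + dispatch + expert + combine

def overlapped_block_time(dense, dispatch, expert, combine):
    """shortcut-style: dispatch starts early and runs alongside the dense
    compute, so only the longer of the two sits on the critical path."""
    return max(dense, dispatch) + expert + combine

# hypothetical per-block stage times in milliseconds
stages = dict(dense=4.0, dispatch=3.0, expert=4.0, combine=3.0)

t_vanilla = vanilla_block_time(**stages)
t_overlap = overlapped_block_time(**stages)
speedup = t_vanilla / t_overlap
print(f"vanilla {t_vanilla} ms, overlapped {t_overlap} ms, speedup {speedup:.2f}x")
```

the gain depends entirely on how big communication is relative to compute, so slower interconnects should benefit more in this model.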

same accuracy but with a 50-80% speedup on training and inference. on imagenet-1k though. hmm

this speeds up training (and i assume inference) but hmmm

testing

Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes

https://arxiv.org/pdf/0812.4360

"schmidhuber already wrote about this"

information theory and kolmogorov complexity perspective of "human" traits

beauty is simplicity/ease of encoding?

orderliness i suppose is a form of beauty and messiness is entropy

consciousness is a consequence of compression?

the "easily encoded beautiful woman face" is kinda just anne hathaway.
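a quick sketch of the "ease of encoding" idea. kolmogorov complexity isn't computable, so this uses zlib's compressed size as a rough upper-bound proxy, which is my assumption and not the paper's setup. `compressed_size` is a hypothetical helper.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """length in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

# orderly data: one short pattern repeated
ordered = b"ab" * 500

# messy data: pseudo-random bytes (seeded so the run is repeatable)
rng = random.Random(0)
messy = bytes(rng.randrange(256) for _ in range(1000))

print("ordered:", compressed_size(ordered), "bytes")
print("messy:  ", compressed_size(messy), "bytes")
```

both inputs are 1000 bytes, but the orderly one compresses to a small fraction of that while the messy one barely compresses at all. that's the "orderliness is easy to encode" intuition in miniature.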

GPEmu

Found this while looking for the gameboy emulator i used to play pokemon emerald lol

https://ucare.cs.uchicago.edu/pdf/vldb25-gpemu.pdf

overview

For example, some researchers and engineers aim to increase GPU utilization by working on the layers above the GPU in the stack, focusing on aspects such as data loading, preprocessing, job scheduling, and many others.

how it works

time is broken down into:

memory is broken down into:

distributed system support:

it's for people to test:

general comments

ok this is cooked (this information was shown on the final page)