Paper Proposal: Hypernetwork Training with Genetic Algorithms
This is the paper proposal I wrote for the AAAI-25 Undergraduate Consortium. I didn't get in, but I quite like the ideas and papers, so I think there's some value in putting it up here. I might still work on this for fun; I already have a few ideas for improving it.
If there's anything I might've gotten wrong or that sounds extremely stupid, please tell me.
Bonus: An interesting application of evolution by Sakana AI that (very) recently came out.
===================================================================
Abstract
This proposal explores the use of genetic algorithms to train hypernetworks: networks that generate the weights of other networks. Hypernetworks are loosely analogous to how nature works: a genotype (hypernetwork) produces proteins (weights) that are expressed in the phenotype (generated network) (Ha, Dai, and Le 2016; Stanley and Miikkulainen 2002). Similarly, genetic algorithms behave like natural selection, where fitness and robustness are rewarded while the inability to adapt extinguishes bloodlines. My work will focus on the union of the two ideas: applying natural selection to hypernetwork training, with the goal of generating networks that match SOTA performance with fewer parameters.
Introduction
Devising methods of training models is a fundamental aspect of artificial intelligence. As backpropagation has dominated this domain in recent years, hardware and software optimizations such as FlashAttention (Dao et al. 2022) are built to accelerate computation and inference for gradient-based methods.
However, the fact remains that backpropagation is not how biological systems perform parameter updates. Alternative approaches such as wake-sleep (Hinton et al. 1995), forward-forward (Hinton 2022), and various reinforcement learning algorithms have been explored, but their performance remains inferior to backpropagation on most tasks. Genetic/evolutionary algorithms and strategies (GEASS) take inspiration from biological processes and provide a promising alternative for the optimization problem. In this project, I intend to explore different GEASS methods and adapt them to hypernetworks for weight exploration and fitness optimization.
Aside from being closer to human cognition and biological processes, the main advantage of GEASS lies in reduced computational cost, as demonstrated by Salimans et al. (2017), where Evolution Strategies achieved near-SOTA results at roughly a third of the cost. Because workers only need to exchange scalar fitness values (and shared random seeds), these methods are also highly parallelizable, scaling to thousands of workers (Salimans et al. 2017). This proposal begins with background on the work done in this domain, followed by potential experiments and the results expected, and concludes with a discussion and an outline of future work.
Background
To provide a brief overview of the ideas:
- Hypernetworks are neural networks used to generate weights for other neural networks.
- Evolutionary strategies are a class of optimization algorithms (Salimans et al. 2017) that mimic natural evolution.
- Genetic algorithms are search and optimization techniques that mimic natural selection by working over a population of potential solutions.
- A fitness function is the evolutionary analogue of a loss/objective function: it assigns a score to each candidate solution in an evolutionary algorithm.
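To make these pieces concrete, here's a minimal genetic algorithm in Python. This is purely illustrative (the toy fitness function, population size, and mutation scale are all my own choices, not taken from any of the papers below):

```python
import random

def evolve(fitness, dim=8, pop_size=20, generations=150, sigma=0.1, seed=0):
    """Minimal genetic algorithm: truncation selection plus Gaussian mutation."""
    rng = random.Random(seed)
    # Initial population: random candidate solutions ("genomes").
    pop = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)   # rank by fitness
        parents = pop[: pop_size // 2]        # keep the fittest half (elitism)
        children = [[w + rng.gauss(0, sigma) for w in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children              # next generation
    return max(pop, key=fitness)

# Toy fitness: negated squared distance to the all-ones vector,
# so the optimum is the genome of all 1.0s.
best = evolve(lambda g: -sum((w - 1.0) ** 2 for w in g))
```

The same selection/mutation loop applies whether the genome is a raw weight vector (direct encoding, as in NEAT) or the parameters of a hypernetwork (indirect encoding, as in HyperNEAT).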
My proposed work will focus on the intersection between hypernetworks and GEASS; specifically, exploring the use of genetic algorithms for hypernetwork optimization. Previous work on the topic includes HyperNEAT (Stanley, D'Ambrosio, and Gauci 2009) and HyperNetworks (Ha, Dai, and Le 2016).
HyperNetworks (Ha, Dai, and Le 2016)
This paper employs a hypernetwork to generate weights for a Wide Residual Network (40-2) on CIFAR-10 image classification and an LSTM (1000 units) on two language modeling tasks (Penn Treebank, Hutter Prize), where the generated models achieved similar performance at a fraction of the parameter count. The hypernetwork was trained with backpropagation: gradients were propagated through the generated networks into the hypernetwork itself.
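As a rough sketch of the core idea, a hypernetwork is just a small network whose output is reshaped into another layer's weight matrix. The dimensions and names below are mine, not the paper's, and this omits the backpropagation training entirely:

```python
import random

rng = random.Random(0)

def linear(x, W, b):
    """y = W x + b for plain Python lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Hypernetwork: maps a small per-layer embedding z to the weights of a
# target layer (OUT x IN), so only the hypernetwork's own parameters
# would be trained. All sizes here are illustrative.
EMB, IN, OUT = 4, 6, 3
H_W = [[rng.gauss(0, 0.5) for _ in range(EMB)] for _ in range(IN * OUT)]
H_b = [0.0] * (IN * OUT)

def generate_weights(z):
    flat = linear(z, H_W, H_b)                           # EMB -> IN*OUT
    return [flat[i * IN:(i + 1) * IN] for i in range(OUT)]  # reshape to OUT x IN

z = [rng.gauss(0, 1) for _ in range(EMB)]  # embedding for one target layer
W_target = generate_weights(z)             # generated layer weights
x = [rng.gauss(0, 1) for _ in range(IN)]
y = linear(x, W_target, [0.0] * OUT)       # forward pass through generated layer
```

The parameter savings come from sharing the hypernetwork across layers: each layer costs only one small embedding `z` instead of a full weight matrix.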
NeuroEvolution of Augmenting Topologies (NEAT) (Stanley and Miikkulainen 2002)
The technique employed in this paper evolves both the architecture and the weights by randomly mutating (adding) neurons and modifying connections. Network compactness is maintained by starting from a minimal network, and speciation is employed to maintain 'genetic' diversity. NEAT uses a direct encoding scheme, meaning the weights are explicitly and exactly defined (like model.named_parameters() in PyTorch). It was applied to the pole-balancing problem (the classic RL benchmark) and demonstrated significantly faster convergence than the other RL methods it was compared against (3,600 evaluations, versus 3,800 and 12,600 for the next-best methods).
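A simplified sketch of NEAT's direct encoding and its add-node mutation (the data structures are my own paraphrase of the paper's genome, not its actual implementation):

```python
import random
from dataclasses import dataclass, field

@dataclass
class ConnGene:
    src: int
    dst: int
    weight: float
    enabled: bool = True
    innovation: int = 0  # historical marker NEAT uses for crossover/speciation

@dataclass
class Genome:
    n_nodes: int
    conns: list = field(default_factory=list)

def mutate_add_node(genome, rng, innov):
    """NEAT-style structural mutation: split an existing connection by
    inserting a new node, disabling the old connection."""
    old = rng.choice([c for c in genome.conns if c.enabled])
    old.enabled = False
    new_node = genome.n_nodes
    genome.n_nodes += 1
    # Incoming connection gets weight 1.0, outgoing keeps the old weight,
    # so the mutation initially preserves the network's behavior.
    genome.conns.append(ConnGene(old.src, new_node, 1.0, True, innov))
    genome.conns.append(ConnGene(new_node, old.dst, old.weight, True, innov + 1))
    return innov + 2

rng = random.Random(0)
g = Genome(n_nodes=3, conns=[ConnGene(0, 2, 0.5, True, 0),
                             ConnGene(1, 2, -0.3, True, 1)])
next_innov = mutate_add_node(g, rng, innov=2)
```

Starting minimal and only ever adding structure through mutations like this is how NEAT keeps its networks compact.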
HyperNEAT (Stanley, D'Ambrosio, and Gauci 2009)
To put it simply, HyperNEAT is a method that trains a network that organizes a network that generates a network. NEAT is used to evolve a Compositional Pattern Producing Network (CPPN), which generates the connectivity pattern of the final network by determining the weights and connections between nodes. The point of this is to replicate the relationship between genotypes and phenotypes, and to exploit the information embedded in the geometry of the input data. The final network is the one that actually performs the task. HyperNEAT uses an indirect encoding to define the weights (describing connections rather than the nodes themselves; like describing a lizard instead of writing out its DNA).
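To illustrate the indirect encoding, here's a toy version of the substrate query. The CPPN below is a fixed hand-written function standing in for an evolved network, and the 1-D node layout and pruning threshold are my own illustrative choices:

```python
import math

def cppn(x1, y1, x2, y2):
    """Stand-in for an evolved CPPN: maps the coordinates of two substrate
    nodes to a connection weight. A real CPPN is itself a small evolved
    network; this fixed function only demonstrates the interface."""
    return math.sin(3 * x1) * math.exp(-(x2 - x1) ** 2) - 0.5 * y1 * y2

def build_substrate_weights(n_in, n_out, threshold=0.2):
    """Query the CPPN for every (input, output) node pair, with nodes laid
    out on a line in [-1, 1]; prune connections weaker than the threshold."""
    xs_in = [2 * i / (n_in - 1) - 1 for i in range(n_in)]
    xs_out = [2 * j / (n_out - 1) - 1 for j in range(n_out)]
    W = []
    for x2 in xs_out:          # output nodes sit at y = 1.0
        row = []
        for x1 in xs_in:       # input nodes sit at y = 0.0
            w = cppn(x1, 0.0, x2, 1.0)
            row.append(w if abs(w) > threshold else 0.0)
        W.append(row)
    return W

W = build_substrate_weights(5, 3)
```

Because the CPPN sees node coordinates, regularities in the geometry (symmetry, locality) show up as regularities in the generated weight pattern, which is the sense in which the encoding exploits the geometry of the input.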
Novelty Search/Quality Diversity (NS/QD) (Conti et al. 2017)
In this paper, weights are perturbed by adding Gaussian noise, then optimized for a) novelty (how different the policy's behavior is from an archive of previously seen behaviors, measured via k-nearest neighbors) and b) performance (the reward obtained from using the policy parameters). The reward-combined variants (NSR-ES and NSRA-ES) performed better than NS-ES alone on most tasks.
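The novelty score can be sketched as a kNN distance in behavior space. The 2-D behaviors (think of an agent's final position) and all names here are illustrative, not the paper's actual setup:

```python
import random

def novelty(behavior, archive, k=3):
    """Novelty = mean distance to the k nearest behaviors in the archive
    (kNN in behavior space; behaviors here are 2-D points)."""
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(behavior, other)) ** 0.5
        for other in archive
    )
    return sum(dists[:k]) / k

rng = random.Random(0)
# Archive of behaviors seen so far, all inside [-1, 1]^2.
archive = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]

# A behavior far outside the explored region scores much higher than
# one in the middle of it, so selection pressure favors exploration.
far = novelty((10.0, 10.0), archive)
near = novelty((0.0, 0.0), archive)
```

Rewarding this score alongside (or instead of) task reward is what pushes the population to keep exploring rather than collapse onto one local optimum.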
Mutating Multi-Component Networks (MMCN) (Risi and Stanley 2019)
Like the previous technique, the weights here are perturbed by adding Gaussian noise. However, this model - based on 'World Models' (Ha and Schmidhuber 2018) - consists of multiple components (a VAE and an RNN-based controller). The experiments compared mutating only the controller, mutating either component with equal probability, and mutating both. The resulting score on CarRacing-v0 (903 ± 72) was similar to World Models (906 ± 21) and DQN + Dropout (893 ± 41), albeit with higher variance.
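The "mutate one component at random" scheme can be sketched as follows (component names, weight shapes, and the noise scale are all illustrative, not taken from the paper):

```python
import random

def mutate(params, rng, sigma=0.02, p_first=0.5):
    """Perturb the weights of one randomly chosen component with Gaussian
    noise, leaving the other component untouched. Returns a new child
    (the parent is not modified) and the name of the mutated component."""
    target = "vae" if rng.random() < p_first else "controller"
    child = {
        name: [w + rng.gauss(0, sigma) for w in ws] if name == target
        else list(ws)
        for name, ws in params.items()
    }
    return child, target

rng = random.Random(0)
parent = {"vae": [0.1] * 4, "controller": [0.3] * 4}
child, mutated = mutate(parent, rng)
```

Keeping the untouched component identical to the parent's is what lets the experiments isolate which component benefits most from mutation.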
Approach
In this proposal, I view the genetic algorithm and the hypernetwork as separate components and will experiment with the Cartesian product of the two sets. For the algorithm, I plan to implement the techniques described above (NS/QD, MMCN, HyperNEAT), as well as experiment with the fitness function, which could potentially be designed to provide the network with more information about the environment.
I plan to use these techniques to train a ResNet-50 for image classification and an LSTM for language modeling; if resources allow, I would also like to experiment with a GPT-2-sized model to observe how the algorithm behaves at scale. The experiments will mainly compare the number of parameters required to achieve similar performance.
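The experiment grid is then just the Cartesian product of the two sets. A sketch (the exact set of runs may change, and the GPT-2-sized runs depend on available resources):

```python
from itertools import product

algorithms = ["NS/QD", "MMCN", "HyperNEAT"]      # GEASS variants to implement
targets = ["ResNet-50", "LSTM", "GPT-2-sized"]   # networks to generate
experiments = list(product(algorithms, targets)) # one run per (algorithm, target) pair
```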
Evaluation
Evaluation will be performed on standard benchmarks (ImageNet, CIFAR-100, MNIST, OpenWebText) using the generated network weights. The results will be compared against equivalent networks trained with backpropagation and against the technique described in HyperNetworks (Ha, Dai, and Le 2016). Hyperparameters common to networks of the same type will be shared and held fixed. Hypernetwork accuracy will be evaluated via the loss of the generated network.
Discussion
I expect to find that genetic algorithms can train hypernetworks to generate weights whose performance is comparable to SOTA backpropagation-based methods; based on past results, I anticipate that this approach will require fewer parameters and less compute to reach similar accuracy.
If successful, this work will show that pairing genetic algorithms with hypernetworks provides an efficient alternative to backpropagation for training neural networks at scale. It will also, hopefully, inspire a reemergence of biology-inspired design and broaden how the field approaches model optimization.
As with Parkinson's Law, training methods will consume as much energy as they can regardless of efficiency. The benefit to society, therefore, will be the additional compute eked out of existing devices and GPUs rather than reduced energy consumption. Furthermore, the biological inspiration behind this approach could provide insights that bridge the gap between artificial and biological intelligence, potentially contributing to advances in neuroscience and cognitive science.
Avenues of future exploration include:
- Federated learning: the highly parallel nature of genetic algorithms means they could be combined with federated learning to improve solution diversity and train across a distributed network of devices.
- Hybrid algorithms: algorithms that combine the advantages of evolutionary and gradient-based methods.
Conclusion
In summary, I propose the topic of "Hypernetwork Training with Genetic Algorithms".
- Problem: Using genetic algorithms to train hypernetworks to generate weights that produce SOTA results on standard benchmarks.
- Approach: Applying existing techniques to this task and exploring fitness functions for reward evaluation.
- Evaluation: Comparing accuracy and performance of generated networks with backpropagation-trained networks on standard benchmarks.
- Benefits: Improved efficiency, scalability, and insights into biology-inspired learning.
References
- Branavan, S. R. K., Silver, D., & Barzilay, R. (2011). Nonlinear Monte-Carlo search in civilization II. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Three, 2404-2410. AAAI Press.
- Conti, E., Madhavan, V., Such, F. P., Lehman, J., Stanley, K. O., & Clune, J. (2017). Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
- Erkoç, Z., Ma, F., Shan, Q., Nießner, M., & Dai, A. (2023). HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion.
- Ha, D., Dai, A. M., & Le, Q. V. (2016). HyperNetworks. CoRR, abs/1609.09106.
- Ha, D., & Schmidhuber, J. (2018). World Models.
- Hinton, G. (2022). The Forward-Forward Algorithm: Some Preliminary Investigations.
- Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268: 1158-1161.
- Nguyen, B., Sani, L., Qiu, X., Liò, P., & Lane, N. D. (2024). Sheaf HyperNetworks for Personalized Federated Learning.
- Peebles, W., Radosavovic, I., Brooks, T., Efros, A. A., & Malik, J. (2022). Learning to Learn with Generative Models of Neural Network Checkpoints.
- Risi, S., & Stanley, K. O. (2019). Deep Neuroevolution of Recurrent and Discrete World Models.
- Risi, S., & Togelius, J. (2014). Neuroevolution in Games: State of the Art and Open Challenges.
- Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning.
- Shin, Y., Lee, K., Lee, S., Choi, Y. R., Kim, H.-S., & Ko, J. (2024). Effective Heterogeneous Federated Learning via Efficient Hypernetwork-based Weight Generation.
- Stanley, K. O., D'Ambrosio, D. B., & Gauci, J. (2009). A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2): 185-212.
- Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2): 99-127.
- Tian, Y., & Ha, D. (2021). Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts.
- Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., & Schmidhuber, J. (2011). Natural Evolution Strategies.