The Big Picture of my current research efforts

As I am still in a phase of broad exploration around my PhD interests, I sometimes pause for a moment and try to think about the big picture. The overall theme of investigating the structure of Artificial Neural Networks (ANNs) grew organically from my (still limited) understanding of machine learning techniques and from intuitions and ideas from neuroscience and psychology. By now I have come to the conclusion that the theme in its full generality is beyond my technical and formal capabilities, and that my various empirical and methodological experiments will only converge towards a self-contained thesis under the umbrella of this theme.

Starting from my master thesis, I have been experimenting with pruning Artificial Neural Networks. In this context, pruning refers to the removal of structural elements of ANN models, where structural elements means biologically inspired elements such as neurons, connections or layers. Pruning ANNs is a niche field in Machine Learning and attracts – in my impression today – mostly young, purely empirical or simply informal researchers (including me). This impression is based on the experience that there exists a broad range of mixed terminology and that the problem definition is in most cases not formalized. As a result, most research on pruning stops after presenting empirical results of applying an existing or adapted method to a domain-specific data set, without a formal framework. Many researchers view pruning as a method for generalization, as a search for architectures or as a way to explain “artificial intelligence”. In most cases it is simply compression, and its legitimacy stems from the observation that it works rather than from any guaranteed properties. Recent developments led, at the end of 2019, to pruning utilities being included in PyTorch 1.4 (in torch.nn.utils.prune) and to frameworks such as Distiller. There are, however, trends making me confident that the field will contribute to understanding association learning from a high-level perspective.
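
As a small, hedged illustration of the kind of utilities now shipping with PyTorch, the following sketch applies magnitude-based unstructured pruning to a single linear layer via torch.nn.utils.prune; the layer size and pruning amount are arbitrary illustration values, not taken from any of the experiments mentioned above.

```python
# Hedged sketch: magnitude-based unstructured pruning with torch.nn.utils.prune
# (shipped since PyTorch 1.4). Layer size and pruning amount are arbitrary.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(100, 10)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Pruning is realized as a binary mask ("weight_mask") applied to the original
# parameter ("weight_orig"); prune.remove makes the pruned weights permanent.
prune.remove(layer, "weight")

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"sparsity of layer.weight: {sparsity:.2f}")
```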

In 2018 I started broadening my perspective from pruning ANNs to seeing it as part of Neural Architecture Search (NAS) and Hyperparameter Optimization in general. At that time, I started working on evolution-inspired optimization methods such as Evolutionary Search and Genetic Algorithms. While deepening my knowledge there, we conducted experiments in which we used ideas from Network Science to tackle ANN architectures: biological neural networks seem to be small-world networks, and we used Random Graph Generators to initialize the structure of ANNs. This idea was later (but independently) also pursued by a research group around Kaiming He, who explored randomly wired neural networks for image classification – very similar to our approach but at another (and far larger) scale. Our focus was to find graph-theoretic properties correlating with classifier metrics (see ref 1; however, it should have been called “Analysis of Structures of Artificial Neural Networks” or something more specific such as “Biologically Plausible Structures for Artificial Neural Networks”). Although we found evidence that properties such as the path length variance correlate with the performance metric, I learned that our experimental setting did not satisfy the requirements of our research questions. Throughout independent and similar experiments in several master theses we observed similar behavior, but we did not improve our general setting, which forced me to dig deeper into which hypothesis we are actually following and how to approach it.
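
To make the flavor of these experiments concrete, here is a minimal sketch (with illustrative parameters, not those of ref 1) that generates a small-world graph with a random graph generator from networkx and computes graph-theoretic properties of the kind we correlate with classifier metrics, e.g. clustering and the path length variance.

```python
# Illustrative sketch: a small-world graph from a random graph generator and
# some graph-theoretic properties (parameters are not those used in ref 1).
import statistics
import networkx as nx

graph = nx.watts_strogatz_graph(n=64, k=4, p=0.2, seed=42)

path_lengths = [
    length
    for source, targets in nx.all_pairs_shortest_path_length(graph)
    for target, length in targets.items()
    if source != target
]

print("average clustering:     ", nx.average_clustering(graph))
print("characteristic path len:", statistics.mean(path_lengths))
print("path length variance:   ", statistics.variance(path_lengths))
```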

From a biological perspective, Hilgetag et al. “provisionally conclude that enough evidence has amassed to judge that small-worldness is a nearly universal property of nervous systems” (see ref 2). This statement summarizes research efforts in neuroscience investigating the network-scientific structure of biological neural networks. Inspired by these insights, we hypothesize that the process of learning (or neural plasticity in general) leads to this structure and, furthermore, that this structure must have an influence on the computational capabilities of intelligent systems. Posing this biologically inspired hypothesis for Artificial Neural Networks is, to my current understanding, non-trivial and raises several rather difficult aspects.

Seeing an ANN simply as a high-dimensional model approximating a high-dimensional function, with the goal of having impressive inter- and extrapolation capabilities, the hypothesis is about imposing a compositional structure over the elementary functions which make up the model. Further, the hypothesis would argue that some (possibly equivalence) classes over those composed elementary functions yield better approximation or generalization capabilities than other such classes. Approaching this from an analytical perspective, however, is beyond my current capabilities, and I know of only a few research works that establish limits or guarantees on the expressiveness, approximation errors or generalization capabilities of Artificial Neural Networks. Furthermore, most high-dimensional problems (a.k.a. data sets) in today’s machine learning are of rather low dimensionality when compared to a biological intelligent agent acting as a high-dimensional function in a partially known environment. Even if there exist relations between structure and desirable model properties (generalization capabilities, expressiveness, approximation error bounds), it is most likely that they change drastically under slightly different hyperparameter choices such as non-linearities or engineering tweaks which have empirically proven to contribute successfully.

I am trying to tackle the hypothesis from different angles. From a bottom-up perspective I try to find empirical hints that certain compositional models perform inherently differently than others, and whether such observations hold across or only within specific problem domains. One fundamental question is whether a function $f: \mathbb{R}^{d_1}\rightarrow \mathbb{R}^{d_2}$ can be classified differently than $g: \mathbb{R}^{d_1}\rightarrow \mathbb{R}^{d_2}$ when $f$ may only be composed of successive applications of a non-linearity $\sigma$ over weighted summations, while $g$ is also allowed to include intermediate results in the weighted summation. For example, the first model could be composed as $f(x) = \sigma(A_1\cdot x+b_1)$ and the second model as $g(x) = \sigma(W_{2,0}\cdot\sigma(W_1\cdot x+v_1)+W_{2,1}\cdot x+v_2)$ with $x\in\mathbb{R}^{d_1}$, $A_1\in\mathbb{R}^{d_2\times d_1}$, $b_1\in\mathbb{R}^{d_2}$, $W_1\in\mathbb{R}^{d_3\times d_1}$, $v_1\in\mathbb{R}^{d_3}$, $W_{2,0}\in\mathbb{R}^{d_2\times d_3}$, $W_{2,1}\in\mathbb{R}^{d_2\times d_1}$, $v_2\in\mathbb{R}^{d_2}$, and $\sigma$ being the usual (bounded) non-linearity such as a sigmoid function, applied element-wise. Both models are flat, but possibly very wide, Artificial Neural Networks; one allows for skip-connections while the other is a classical one-layer perceptron. One could impose even more restrictions by defining block- or element-wise zeros over the layer-wise transformation matrices. It is known that both models are able to approximate any measurable function on a bounded interval up to a desired error bound when increasing the number of neurons appropriately (see refs 6, 7, 8, 9). But it is also empirically observed that structural choices such as skip-layer connections improve results in various problem domains (most prominently the ResNet architectures), where improving means e.g. faster convergence, better generalization or lower error rates.
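
For concreteness, the following is a sketch of the two compositions as PyTorch modules, with arbitrary illustrative dimensions $d_1, d_2, d_3$: F is the classical one-layer perceptron, while G adds the skip path $W_{2,1}\cdot x$ into the second weighted summation.

```python
# Sketch of the two compositions from above as PyTorch modules: F is a
# one-layer perceptron, g additionally feeds the raw input into the second
# weighted summation (a skip-connection). Dimensions are illustrative.
import torch

d1, d2, d3 = 8, 3, 16  # input, output and hidden dimensionality (arbitrary)

class F(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.A1 = torch.nn.Linear(d1, d2)  # realizes A_1 x + b_1

    def forward(self, x):
        return torch.sigmoid(self.A1(x))

class G(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.W1 = torch.nn.Linear(d1, d3)               # W_1 x + v_1
        self.W20 = torch.nn.Linear(d3, d2)              # W_{2,0} (.) + v_2
        self.W21 = torch.nn.Linear(d1, d2, bias=False)  # skip path W_{2,1} x

    def forward(self, x):
        hidden = torch.sigmoid(self.W1(x))
        return torch.sigmoid(self.W20(hidden) + self.W21(x))

x = torch.randn(5, d1)             # a batch of five inputs
print(F()(x).shape, G()(x).shape)  # both map R^{d_1} -> R^{d_2}
```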

From a top-down perspective we are investigating various methods for Neural Architecture Search (NAS), analyzing whether a particular method substantially improves over random search heuristics, and searching for patterns in architectures when classifying them by different properties such as performance metrics. Here, we often formulate the search space as a reduced space in comparison to the actual model and follow ideas from genetics, in which a lower-dimensional genotype space is mapped to a higher-dimensional phenotype space for evaluation. However, there are various possibilities to formulate a NAS and it is very difficult to bring them onto comparable ground. The experiments behind ref 1 imposed random structures on ANN models, but we also experiment with population-based NAS and deepen our understanding of possible optimization strategies such as Bayesian optimization, CMA-ES and gradient descent methods. The research field heavily lacks properly (and comparably) formulated search spaces and fully specified hyperparameter settings for reproducible research. It involves engineering tricks such as shortening evaluations with reduced data sets, using augmented training, predicting the performance of architectures, selecting subsets of architectures for full evaluations, using early stopping criteria and many more.
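
A deliberately simplified, hedged sketch of the genotype-to-phenotype idea: the genotype below is just a list of hidden-layer widths, the phenotype is the corresponding feed-forward model, and the objective is a placeholder (a real setting would train the model, possibly with the shortening tricks mentioned above). All names and values here are made up for illustration.

```python
# Hedged sketch of the genotype/phenotype idea behind population-based NAS:
# a low-dimensional genotype (here simply layer widths) is mapped to a
# concrete PyTorch model (the phenotype) and evaluated with a stand-in
# objective. In practice the evaluation would be a (possibly shortened)
# training run reporting e.g. validation accuracy.
import random
import torch

def sample_genotype(max_layers=4):
    """A genotype here is just a list of hidden-layer widths."""
    depth = random.randint(1, max_layers)
    return [random.choice([16, 32, 64, 128]) for _ in range(depth)]

def to_phenotype(genotype, d_in=784, d_out=10):
    """Map the genotype to a concrete feed-forward architecture."""
    dims = [d_in] + genotype
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [torch.nn.Linear(a, b), torch.nn.ReLU()]
    layers.append(torch.nn.Linear(dims[-1], d_out))
    return torch.nn.Sequential(*layers)

def evaluate(model):
    # Placeholder objective: negative parameter count stands in for a real
    # performance metric obtained by training the phenotype.
    return -sum(p.numel() for p in model.parameters())

population = [sample_genotype() for _ in range(10)]
best = max(population, key=lambda g: evaluate(to_phenotype(g)))
print("best genotype under the placeholder objective:", best)
```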

One promising line of research in the near future is to investigate graph embeddings as representations for ANN architectures (not to be confused with node embeddings). These graph embeddings could act as a continuous search domain and would allow population-based optimization techniques to be applied on them. A graph embedding can then be seen as a gene code which maps to a concrete structure from which an ANN can be constructed (a phenotype mapping). Formulating this problem so as to map directed acyclic graphs to a continuous domain is, however, non-trivial and most likely involves heuristics, approximation errors and representational constraints. The sketched idea could pose a competitive approach to the recently successful differentiable architecture search methods (see ref 10).
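
A very naive sketch of what a continuous representation of DAGs could look like: flattening the strictly upper-triangular adjacency matrix into a real-valued vector and decoding by thresholding. This is not the learned graph embedding we are actually after, merely an illustration of mapping between a continuous search domain and concrete DAGs.

```python
# Naive sketch: the strictly upper-triangular adjacency matrix (acyclic under
# a fixed node ordering) is flattened into a real-valued vector; decoding
# thresholds the entries. Only illustrates the genotype -> phenotype idea on
# graphs, not the graph embeddings under investigation.
import numpy as np

def encode(adjacency):
    """Flatten the strictly upper-triangular part into a continuous vector."""
    n = adjacency.shape[0]
    return adjacency[np.triu_indices(n, k=1)].astype(float)

def decode(vector, n, threshold=0.5):
    """Threshold the vector back into an upper-triangular adjacency matrix."""
    adjacency = np.zeros((n, n))
    adjacency[np.triu_indices(n, k=1)] = (vector > threshold).astype(float)
    return adjacency

n = 5
adjacency = np.triu(np.random.rand(n, n) > 0.6, k=1).astype(float)
vector = encode(adjacency)  # a point in the continuous search domain
assert np.array_equal(decode(vector, n), adjacency)
```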

So, to summarize what I’m currently up to: I am

  • developing various experiments in the field of Neural Architecture Search and gathering more and more experience with different problem domains and the vast amount of possible techniques
  • deepening my understanding of (numerical) optimization as the underlying toolbox for “learning”; some of the more practical sides of this can be found in my github repo eddy
  • deepening my understanding of Bayesian methods to grasp the relationship between Gaussian Processes and Artificial Neural Networks, as the former rest on a more rigorous statistical framework and both address similar problems (e.g. see refs 3, 4 and 5)
  • reading on sampling methods for connected directed acyclic graphs, which is actually not easy (e.g. see ref 11; a naive sketch follows below this list), and caring about graph spaces in general
  • continuing the idea of building a tighter coupling between ANNs and graph theory
  • learning a lot by visualizing classical mathematical concepts or data sets
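
Regarding the DAG-sampling point above, here is a naive rejection-sampling sketch (not the uniform sampler of ref 11): draw a random upper-triangular adjacency matrix under a fixed topological order and reject it as long as the underlying undirected graph is not connected. Simple, but the induced distribution over graphs is not uniform.

```python
# Naive rejection sampling of connected DAGs (not the uniform sampler of
# ref 11): upper-triangular adjacency matrices guarantee acyclicity; samples
# whose underlying undirected graph is disconnected are rejected.
import networkx as nx
import numpy as np

def sample_connected_dag(n, p=0.3, seed=None):
    rng = np.random.default_rng(seed)
    while True:
        adjacency = np.triu(rng.random((n, n)) < p, k=1).astype(int)
        graph = nx.from_numpy_array(adjacency, create_using=nx.DiGraph)
        if nx.is_weakly_connected(graph):
            return graph

dag = sample_connected_dag(n=8, p=0.4, seed=0)
assert nx.is_directed_acyclic_graph(dag)
print(sorted(dag.edges()))
```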

References

  1. Structural Analysis of Sparse Neural Networks
@article{stier2019structural,
  title={Structural Analysis of Sparse Neural Networks},
  author={Stier, Julian and Granitzer, Michael},
  journal={Procedia Computer Science},
  volume={159},
  pages={107--116},
  year={2019},
  publisher={Elsevier}
}
  2. Is the brain really a small-world network?
  3. Deep Neural Networks as Gaussian Processes
  4. Bayesian Learning for Neural Networks
  5. Priors for Infinite Networks
  6. Multilayer feedforward networks are universal approximators (1989)
  7. Approximation by Superpositions of a Sigmoidal Function
  8. On the approximate realization of continuous mappings by neural networks
  9. Recurrent Neural Networks Are Universal Approximators
  10. DARTS: Differentiable Architecture Search
  11. Uniform Sampling of Directed and Undirected Graphs Conditional on Vertex Connectivity