How to compare tasks completed by neural architectures objectively?

When I first saw this video of Spaun and the tasks it can complete (solving the Tower of Hanoi problem, completing Raven's matrices), I was really impressed, but then I realized I didn't really know what tasks other neural architectures were completing. Consequently, I almost asked a question here that was basically "How does Spaun compare in terms of task complexity to other neural architectures?". However, I realized it would be better if I learned how to do this evaluation myself.

So, what metrics or heuristics can be used to evaluate the complexity of a task completed by a neural architecture? How can one compare the complexity of a problem or puzzle such as the Tower of Hanoi to that of another problem, such as modelling how people solve algebraic equations, as ACT-R has accomplished?
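For puzzle-like tasks, one crude but objective anchor is minimal solution length. For the Tower of Hanoi it is provably 2^n - 1 moves for n discs, which a short recursive solver makes concrete:

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the minimal move sequence for n discs (provably 2**n - 1 moves)."""
    if n == 0:
        return []
    return (hanoi(n - 1, source, spare, target)     # clear the way to the largest disc
            + [(source, target)]                    # move the largest disc
            + hanoi(n - 1, spare, target, source))  # restack the rest on top

for n in (3, 5, 10):
    print(n, len(hanoi(n)))  # 7, 31, 1023 -- i.e. 2**n - 1
```

Of course, this only measures the puzzle's combinatorial depth, not the perceptual and motor work a model like Spaun does on top of it, which is part of why a single objective metric is so elusive.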

More than two years later, I now have a heuristic (which is as close as I can get to an objective metric) that I rely on. As I write in one of my blog posts, there are a number of tasks that humans are capable of that are still really hard for machines. For a machine (or a neural model), they require the following:

  • Balancing Planning and Exploration: Efficiently modelling the environment's future state while discovering possible future states via exploration.
  • Scaling Skills: Transferring skills between tasks and building more complex skills atop basic ones.
  • Knowledge Representation: Keeping the relations between skills, environmental variables, past knowledge and priorities usefully organised.

Any neural model that begins to leverage these things is evaluated as impressive in my eyes.

However, these requirements apply mostly to Artificial Intelligence in general. What about neural models specifically? All neural models that complete a task are, by definition, cognitive models, so in addition to the above criteria they must satisfy various Cognitive Criteria to prove the system is achieving its goals in a human-like manner.


PyTorch, MLflow & Optuna: Experiment Tracking and Hyperparameter Optimization

In this tutorial we train a PyTorch neural network model using MLflow for experiment tracking & Optuna for hyperparameter optimization. The tutorial assumes you have some prior experience with Python & Data Science, specifically (neural network) model training.

  • If you want to skip the tutorial and jump straight to the code: here it is.
  • MLflow is a platform that enables end-to-end management of Data Science projects. In this guide, we will look into some of its experiment tracking components.
  • Optuna is a modular hyperparameter optimization framework created particularly for machine learning projects. We will use it in combination with MLflow to find and track good hyperparameters for our neural network model.
  • PyTorch is a popular deep learning framework which we will use to create a simple Convolutional Neural Network (CNN) and train it to classify the numbers in the MNIST hand-written digits dataset.
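The loop this tutorial builds (suggest hyperparameters per trial, train, log the parameters and the resulting metric, keep the best run) can be sketched with a stdlib-only stand-in. The toy objective and names below are assumptions for the sake of runnability; the real tutorial uses Optuna's `trial.suggest_float` and MLflow's `mlflow.log_params` / `mlflow.log_metric` in their place:

```python
import random

def objective(params):
    """Hypothetical stand-in for a training run that returns a validation score.

    In the tutorial this is where the PyTorch CNN would be trained; a toy
    function keeps the optimization/tracking pattern runnable anywhere.
    """
    return -(params["lr"] - 0.01) ** 2  # best possible score is 0, at lr = 0.01

def run_study(n_trials, seed=0):
    """Mimics Optuna's study.optimize plus MLflow's per-run logging."""
    rng = random.Random(seed)
    runs = []  # each entry plays the role of one tracked MLflow run
    for trial in range(n_trials):
        # log-uniform sampling, like trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        params = {"lr": 10 ** rng.uniform(-5, -1)}
        score = objective(params)
        # in MLflow this would be mlflow.log_params(params); mlflow.log_metric("score", score)
        runs.append({"trial": trial, "params": params, "score": score})
    return max(runs, key=lambda r: r["score"])

best = run_study(n_trials=50)
print(best["params"]["lr"], best["score"])
```

Optuna replaces the random sampler with smarter search strategies, and MLflow persists every run so you can compare them in its UI; the control flow, however, is essentially this.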

Key Takeaways

Transfer learning, as we have seen so far, is the ability to utilize existing knowledge from the source learner in the target task. During the process of transfer learning, the following three important questions must be answered:

  • What to transfer: This is the first and the most important step in the whole process. We try to seek answers about which part of the knowledge can be transferred from the source to the target in order to improve the performance of the target task. When trying to answer this question, we try to identify which portion of knowledge is source-specific and what is common between the source and the target.
  • When to transfer: There can be scenarios where transferring knowledge for the sake of it may make matters worse than improving anything (also known as negative transfer). We should aim at utilizing transfer learning to improve target task performance/results and not degrade them. We need to be careful about when to transfer and when not to.
  • How to transfer: Once the what and when have been answered, we can proceed towards identifying ways of actually transferring the knowledge across domains/tasks. This involves changes to existing algorithms and different techniques, which we will cover in later sections of this article. Also, specific case studies are lined up in the end for a better understanding of how to transfer.

This should help us define the various scenarios where transfer learning can be applied and possible techniques, which we will discuss in the next section.
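To make "how to transfer" concrete, here is a deliberately tiny, self-contained sketch (a hypothetical one-parameter linear model, not any particular library): a slope learned on a data-rich source task is reused to initialize a data-poor target task, which then gets closer to the true answer in the same number of training steps.

```python
def fit_slope(xs, ys, steps=500, lr=0.01, w=0.0):
    """Fit y = w * x by gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

# Source task: plenty of data for y = 3x.
source_x = [i / 10 for i in range(-20, 21)]
w_source = fit_slope(source_x, [3 * x for x in source_x])

# Target task (y = 6x) has only two examples and a tiny training budget.
# "What to transfer": the learned slope. "How": use it as the starting point.
target_x, target_y = [1.0, 2.0], [6.0, 12.0]
w_scratch = fit_slope(target_x, target_y, steps=5)              # starts from 0
w_transfer = fit_slope(target_x, target_y, steps=5, w=w_source)  # starts from ~3
print(abs(w_scratch - 6), abs(w_transfer - 6))  # transfer ends closer to 6
```

The "when" question shows up here too: if the target task were y = -6x, initializing from the source slope of roughly 3 would start farther from the truth than initializing from 0, which is exactly the negative transfer the second bullet warns about.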

Training a Question-Answering Model

We will be using Hugging Face's Transformers library to train our QA model. We will also be using BioBERT, a language model based on BERT; the only difference is that it has been further trained with the MLM and NSP objectives on different combinations of general and biomedical domain corpora. Different domains have specific jargon and terms that occur very rarely in standard English, and when they do occur they can mean different things or imply different contexts. Hence, models like BioBERT, LegalBERT, etc. have been trained to learn such nuances of domain-specific text so that domain-specific NLP tasks can be performed with better accuracy.

Here we aim to use the QA model to extract relevant information from COVID-19 research literature. Hence, we will be finetuning BioBERT using Hugging Face's Transformers library on SQuADv2 data.

In the examples section of the Transformers repository, Hugging Face has already provided a script to train the QA model on SQuAD data. This script can be run easily using the below command.

You can also run the code for free in a Gradient Community Notebook from the ML Showcase.

One can understand most of the parameters from their names. For more details on the parameters and an exhaustive list of parameters that can be adjusted, one can refer to the script.

Using this script, the model can be easily finetuned to perform the QA task. However, running this script is RAM-heavy, because squad_convert_examples_to_features tries to process the complete SQuAD data at once and requires more than 12 GB of RAM. So, I have modified load_and_cache_examples and added a new function named read_saved_data which can process SQuAD data in batches. You can check out these methods below.
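The shape of that fix can be illustrated with a small, self-contained sketch (hypothetical names, not the actual Transformers code): instead of materializing features for the entire dataset at once, examples are converted one chunk at a time, so peak memory is bounded by the batch size rather than the dataset size.

```python
def convert_in_batches(examples, convert_fn, batch_size=1000):
    """Yield features chunk by chunk instead of converting everything at once.

    Hypothetical illustration of the memory fix: only `batch_size` raw
    examples (and their features) are held in memory at any moment, which
    is the idea behind batching squad_convert_examples_to_features.
    """
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield from convert_fn(batch)
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield from convert_fn(batch)

# Toy stand-in: "featurize" integers; in practice convert_fn would tokenize
# question/context pairs into model inputs.
features = list(convert_in_batches(range(10), lambda b: [x * 2 for x in b], batch_size=4))
print(features)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

In the real modification the converted batches are also cached to disk so a later read_saved_data-style loader can stream them back without redoing the tokenization.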


Subgraph isomorphism matching is one of the fundamental NP-complete problems in theoretical computer science, and applications arise in almost any situation where network modeling is done.

  • In social science, subgraph analysis has played an important role in the analysis of network effects in social networks.
  • Information retrieval systems use subgraph structures in knowledge graphs for semantic summarization, analogy reasoning, and relationship prediction.
  • In chemistry, subgraph matching is a robust and accurate method for determining similarity between chemical compounds.
  • In biology, subgraph matching is of central importance in the analysis of protein-protein interaction networks, where identifying and predicting functional motifs is a primary tool for understanding biological mechanisms such as those underlying disease, aging, and medicine.

An efficient, accurate model for the subgraph matching problem could drive research in all of these domains, by providing insights into important substructures of these networks, which have traditionally been limited by either the quality of their approximate algorithms or the runtime of exact algorithms.
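Why exact algorithms hit a runtime wall is easy to see from a brute-force matcher, which tries every injective mapping of pattern nodes onto network nodes (a toy sketch for undirected graphs, not an optimized solver such as VF2):

```python
from itertools import permutations

def is_subgraph_isomorphic(target_edges, pattern_edges):
    """Exact but exponential: test every injective node mapping.

    The number of candidate mappings grows factorially with pattern size,
    which is exactly why exact subgraph matching does not scale.
    """
    t_nodes = {u for e in target_edges for u in e}
    p_nodes = sorted({u for e in pattern_edges for u in e})
    t_set = {frozenset(e) for e in target_edges}
    for mapping in permutations(t_nodes, len(p_nodes)):
        m = dict(zip(p_nodes, mapping))
        if all(frozenset((m[u], m[v])) in t_set for u, v in pattern_edges):
            return True
    return False

# A 4-cycle contains a length-2 path but no triangle.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_subgraph_isomorphic(square, [(0, 1), (1, 2)]))          # True
print(is_subgraph_isomorphic(square, [(0, 1), (1, 2), (2, 0)]))  # False
```

Approximate and learned matchers trade this exhaustive search for heuristics or embeddings, gaining speed at the cost of the exactness guarantee.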

Materials and Methods


A total of twenty participants (14 male, 6 female) recruited from a local university participated in the present study. Participants were given monetary compensation for their participation. All participants successfully completed the entire experiment and were included in the data analyses. Participant mean age was 26.7 years (standard deviation, SD = 4.37). Participants reported being free of any medical or neurological disorders and had normal or corrected vision. Participants gave their written consent after a detailed explanation of the experimental procedure, which was reviewed and approved by the University's Institutional Review Board. No participant had experience with the AF-MATB system before this training, so the proficiency level before training was assumed to be the same for all participants.

List of Deep Learning Architectures

Now that we have understood what an advanced architecture is and explored the tasks of computer vision, let us list the most important architectures and their descriptions:

1. AlexNet

AlexNet is one of the first deep architectures, introduced by deep learning pioneer Geoffrey Hinton and his colleagues. It is a simple yet powerful network architecture, which helped pave the way for the groundbreaking research in Deep Learning we see today. Here is a representation of the architecture as proposed by the authors.
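A quick sanity check on the architecture is to track how the spatial resolution shrinks layer by layer using the standard formula floor((W - K + 2P)/S) + 1. Below is that arithmetic for AlexNet's convolution and pooling stack, assuming the commonly used 227x227 input (the paper's figure states 224x224, but 227 makes the numbers work out exactly):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv/pool layer: floor((W - K + 2P)/S) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

size = 227
size = conv_out(size, kernel=11, stride=4)  # conv1 -> 55
size = conv_out(size, kernel=3, stride=2)   # maxpool -> 27
size = conv_out(size, kernel=5, pad=2)      # conv2 -> 27
size = conv_out(size, kernel=3, stride=2)   # maxpool -> 13
size = conv_out(size, kernel=3, pad=1)      # conv3 -> 13
size = conv_out(size, kernel=3, pad=1)      # conv4 -> 13
size = conv_out(size, kernel=3, pad=1)      # conv5 -> 13
size = conv_out(size, kernel=3, stride=2)   # maxpool -> 6
print(size)  # 6; flattened: 256 channels * 6 * 6 = 9216 inputs to the first FC layer
```

The same one-line formula lets you verify the tensor shapes of any of the architectures listed below before implementing them.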



For Experiment 1, 32 participants were recruited from an in-house participant database (15 female, mean age = 30.3, range 21–51, SD = 9). For Experiment 2, participants were recruited in the context of the large-scale longitudinal ReSource Project (see Supplementary Materials for screening procedure). Baseline data from this study were used. Three hundred and thirty-two participants were recruited for the ReSource Project, with 305 participants completing the current paradigm. Of these, five participants were excluded on account of missing auxiliary data (post-scan questionnaire, structural MRI) and technical difficulties. Four participants reported difficulties (e.g. nausea or sleepiness) during the scanning session and were dropped from analysis. From the sample of 296 with complete data, a further three participants were removed due to aberrant behavioral report and/or unacceptable data quality after preprocessing (>1 voxel movement, >5% corrupted time points, design VIF > 2), leaving a final sample of 293 (170 female, mean age = 40.4, range: 20–55, SD = 9.3). All participants had normal or corrected to normal vision. The study was approved by the Ethics Committee of the University of Leipzig and Humboldt University, Berlin and was carried out in compliance with the Declaration of Helsinki. All participants gave written informed consent, were paid for their participation and were debriefed after the study was completed.

FMRI experimental procedure

Before scanning, participants underwent an automated training procedure (see Supplementary Materials for details), including a multimodal emotion induction aimed at minimizing between-participant variance in implemented emotional states. In Experiment 2, participants were also instructed in the use of four generation modalities (Semantic, Episodic, Auditory and Bodily) and instructed to select to what degree they wished to use each in the following experiment according to their own preferences. Additionally, participants were shown a number of neutral stimuli (e.g. pictures of scenery) and instructed to actively attain the sort of neutral emotional state these stimuli evoked (see Supplementary Materials for details). Participants were instructed to attempt to attain such states during the Neutral condition, and also when requested to downregulate their erstwhile generated emotional states. After the scanning session, participants were debriefed. In Experiment 1, verbal debriefing was done with an experimenter. In Experiment 2, participants reported the degree to which they used each of the generation modalities using a nine-point Likert scale.

Each trial (Figure 1A) started with a 4–6 s white fixation cross indicating the start of trial. Then a 10 s Generation phase was entered, in which subjects were shown a colored symbol indicating which emotional state to generate (Red minus = Negative, Green plus = Positive, Blue 0 = Neutral). This was followed by a 5 s Modulation phase where participants either maintained the generation of the emotional state or downregulated it so as to attain a neutral emotional state. In the Maintain condition, the instruction symbol remained the same as in the Generation phase. In the Regulation condition, the symbol changed to a Blue 0. Finally, in Experiment 1 we included a partial-trial condition where the instruction cue changed to a fixation cross (Cease condition, Experiment 1 only). For the Neutral condition the symbol did not change but remained a Blue 0. Thus, Experiment 1 consisted of a total of seven different conditions (Maintain Positive/Negative, Regulate Positive/Negative, Cease Positive/Negative and Neutral). Experiment 2 omitted the Cease condition due to time constraints and thus had a total of five conditions; task setup was otherwise identical in both experiments. Experiment 1 had two runs of five trials per condition (35 per run), while Experiment 2 had a single run of 10 trials per condition (50 total). Condition sequence was pseudorandomized, ensuring no direct repetitions of conditions occurred. Finally, a 5 s fixation cross was presented followed by a 5 s presentation of a continuous Visual Analog rating Scale ranging from ‘Extremely negative’ via ‘Neutral’ to ‘Extremely positive’ [range ± 251 from the neutral point (0)]. Initial cursor position was jittered randomly around the Neutral point. Participants responded using a button box and the right-hand index and middle fingers. Participants were instructed to report their affective state as it was at the moment of report. Stimuli were back-projected using a mirror setup.

MRI acquisition

For both experiments, MRI data were acquired on a 3T Siemens Verio Scanner (Siemens Medical Systems, Erlangen, Germany) using a 32-channel head coil. High-resolution structural images were acquired using a T1-weighted 3D-MPRAGE sequence (TR = 2300 ms, TE = 2.98 ms, TI = 900 ms, flip angle = 7°, iPat = 2, 176 sagittal slices, FOV = 256 mm, matrix size = 240 × 256, 1 × 1 × 1 mm voxels, total acquisition time = 5.10 min). For the functional imaging, we employed a T2*-weighted gradient EPI sequence that minimized distortions in medial orbital and anterior temporal regions (TR = 2000 ms, TE = 27 ms, flip angle = 90°, iPat = 2, 37 slices tilted ∼30° from the AC/PC axial plane, FOV = 210 mm, matrix size = 70 × 70, 3 × 3 × 3 mm voxels, 1 mm gap). For Experiment 2, we acquired B0 field maps using a double-echo gradient-recalled sequence with dimensions matching the EPI images (TR = 517 ms, TE = 4.92 and 7.38 ms).

FMRI preprocessing

Preprocessing was performed using a combination of SPM12 (r6225) functions and the ArtRepair toolbox (Mazaika et al., 2005) running on Matlab 2013b. Functional images were realigned (Experiment 1) or realigned and unwarped to additionally correct for distortion using B0 field maps (Experiment 2). ArtRepair procedures were then employed, including slice-wise artifact detection and repair using interpolation (art_slice, 5% cutoff), time series diagnostics (art_global) identifying and repairing via interpolation volumes showing large global intensity fluctuation (>1.3%), volume-by-volume movement exceeding 0.5 mm and overall movement (>3 mm), and despiking with a 5% signal change cutoff (art_despike). T1 structural images were registered to the mean realigned volume and segmented. Using DARTEL (Ashburner, 2007) procedures, functional images were normalized and smoothed with an isotropic kernel of 8 mm FWHM.

First-level fMRI analyses

Individual-level models included separate sets of regressors for the Generation and Modulation phase. For the Generation phase, three regressors were specified corresponding to the emotional target (Positive, Negative and Neutral) of the trial. For the Modulation phase, separate regressors were specified for each condition. Thus, the model in Experiment 1 included seven regressors [Valence (Positive and Negative) × Modulation (Maintain, Cease and Regulate) + Neutral] for the Modulation phase, for a total of 10 regressors. The model in Experiment 2, where the Cease condition was omitted, included five regressors for the Modulation phase, for a total of eight regressors.

Regressors were convolved with canonical hemodynamic response functions (HRFs) with a 10 s (Generation) or 5 s (Modulation) duration, as well as regressors specifying parametric modulations by trial-wise subjective affect ratings. An additional regressor was specified for the Rating period. Movement parameters derived from the realignment step (six regressors), their derivatives and squared values were added (24 regressors). Potential physiological confounds were controlled for by adding four additional regressors reflecting volume-wise mean signal from white matter and cerebrospinal fluid, global signal and highest-variance voxel time course.

Second-level fMRI analyses

All second-level analyses were conducted using robust regression (Wager et al., 2005), with covariates of no interest coding arousal level, age and gender. Second-level models for Experiment 2 additionally included regressors coding self-reported generation modality usage (four regressors) as continuous covariates.

All results were corrected for multiple comparisons using cluster extent family-wise error rate (FWEc) correction at an alpha of P < 0.05, unless otherwise indicated. Cluster extents were estimated using Monte Carlo simulation and estimated intrinsic smoothness [3DClustSim and 3DFWHMx from the AFNI package (Forman et al., 1995)], as implemented in NeuroElf. Note that peak-forming thresholds were adapted for Experiments 1 (P < 0.001) and 2 (P < 0.00005) to account for differences in sample size. Correlational and mediation results also used a less strict peak threshold of P < 0.0005.

All analyses were masked with a gray matter template derived from the DARTEL-created template, thresholded at 95% gray matter probability, supplemented by hand-drawn masks of brainstem nuclei due to poor differentiation of white from gray matter in these regions.

Constrained principal component analysis

In Experiment 2, we adopted a data-driven approach using constrained principal components analysis (CPCA; see Woodward et al., 2013 for details) of fMRI time series using the CPCA-fMRI package. CPCA analysis of fMRI data is a multivariate method that involves a singular value decomposition of BOLD time series to identify functional networks, followed by an estimation of BOLD change in each network over peristimulus time as a function of experimental condition. Here, we used finite impulse response (FIR) modeling to identify task-specific functional connectivity networks based on the 15 bins (i.e. 30 s, allowing for hemodynamic lag) following the onset of the generation cue. Importantly, using a FIR model allows hemodynamic response (HDR) profiles to be identified for each component separately, allowing the identification of task-relevant functional connectivity networks with dissociable temporal profiles. Finally, CPCA provides HDR estimates at the individual level, allowing the resultant predictor weights to be used to explore the correlates of individual differences in component activation.
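The core of CPCA can be sketched in a few steps: regress the BOLD data onto an FIR design matrix, then take the singular value decomposition of the task-predictable part; the right singular vectors give voxel loadings (networks) and the regression weights projected onto them give per-bin HDR estimates. A minimal sketch on toy data, assuming one FIR column per peristimulus bin; the actual CPCA-fMRI package adds scaling, rotation, and individual-level steps omitted here.

```python
import numpy as np

def fir_design(onsets, n_scans, n_bins=15):
    """FIR design matrix: one indicator column per peristimulus time bin."""
    G = np.zeros((n_scans, n_bins))
    for onset in onsets:
        for b in range(n_bins):
            if onset + b < n_scans:
                G[onset + b, b] = 1.0
    return G

def cpca(Z, G, n_comp=3):
    """Constrained PCA: SVD of the task-predictable part of the BOLD data.
    Z: scans x voxels; G: scans x FIR predictors."""
    C, *_ = np.linalg.lstsq(G, Z, rcond=None)   # regression weights (bins x voxels)
    GC = G @ C                                  # variance constrained to the task
    U, s, Vt = np.linalg.svd(GC, full_matrices=False)
    components = Vt[:n_comp]                    # voxel loadings = functional networks
    hdr = C @ Vt[:n_comp].T                     # per-bin HDR estimate per component
    return components, hdr

# Toy data: 200 scans, 50 voxels, a generation cue every 30 scans
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 50))
G = fir_design(onsets=range(0, 200, 30), n_scans=200, n_bins=15)
nets, hdr = cpca(Z, G)
```

Each column of `hdr` is a 15-bin estimated response profile for one network, which is what allows components with dissociable temporal profiles to be told apart.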

Mediation analyses

To differentiate components of the generation network involved in generation using a specific modality from components involved in generation in general, we followed previous work that identified the large-scale networks supporting emotion regulation performance via mediation modeling (Denny et al., 2014). First, regions activated during generation of emotion (relative to neutral) were identified using robust regression. Mediation effect parametric mapping, as implemented in the M3 mediation toolbox (Wager et al., 2008), was used to investigate modality-specific and modality-general pathways of emotion generation. We performed a whole-brain search for voxels whose activity during emotion generation (relative to the neutral baseline) showed a relationship with reported use of a given modality that was mediated by activity in regions independently correlated with usage of that modality in a robust regression model. Statistics were assessed using the bootstrapping approach implemented in the M3 toolbox (10 000 samples).
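The bootstrap test of an indirect (mediated) effect can be illustrated for a single voxel: resample subjects, fit the x→m path and the m→y path (controlling for x), and ask whether the confidence interval of the product a*b excludes zero. This is a toy single-path sketch, not the whole-brain M3 toolbox search; variable names and the toy data are assumptions.

```python
import numpy as np

def bootstrap_mediation(x, m, y, n_boot=2000, seed=0):
    """Bootstrap CI for the indirect effect a*b in a simple mediation model:
    x -> m (path a), m -> y controlling for x (path b)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    ab = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)                   # resample subjects
        xs, ms, ys = x[idx], m[idx], y[idx]
        a = np.polyfit(xs, ms, 1)[0]                  # x -> m
        X = np.column_stack([np.ones(n), xs, ms])
        b = np.linalg.lstsq(X, ys, rcond=None)[0][2]  # m -> y, controlling for x
        ab[i] = a * b
    lo, hi = np.percentile(ab, [2.5, 97.5])
    return ab.mean(), (lo, hi)   # mediation is significant if the CI excludes zero

# Toy example with a genuine indirect path (x = modality usage,
# m = mediator-region activity, y = voxel activity -- all hypothetical)
rng = np.random.default_rng(2)
x = rng.standard_normal(60)
m = 0.8 * x + 0.5 * rng.standard_normal(60)
y = 0.7 * m + 0.5 * rng.standard_normal(60)
est, ci = bootstrap_mediation(x, m, y)
```

The M3 toolbox additionally performs this search voxel-wise and with 10 000 bootstrap samples; the logic per voxel is the same.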

Analysis approach

The first objective of our analyses was to establish the overall neural architecture of EGE. To achieve this, we first sought to establish the validity of our experiment by investigating subjective and physiological indices of emotional states. Next, we contrasted combined positive and negative EGE with the neutral baseline, thereby identifying the overall neural basis of EGE. We next sought to test the component process mapping proposed in the introduction in two ways: first, based on the data from Experiment 1, we enacted a contrast-based decomposition, based on a model of the activation dynamics expected for each of the component processes. To complement this, we next performed a data-driven decomposition of the data from Experiment 2 using CPCA, to identify the functional networks central in EGE. Together, the results from these three analyses allowed a description of the overall network and functional subcomponents supporting EGE in general. Following this, the second objective of the analyses was to differentiate general EGE networks from those supporting specific implementations of EGE, such as the generation of a particular valence, or using a specific modality. By investigating how subjective ratings for positive and negative generation parametrically modulated signal, we could differentiate regions activated in a valence-specific manner from those supporting specifically the generation of positive and negative emotional states. Finally, by investigating the correlation of activation with reported usage of different modalities, we could identify specific regions supporting modality-specific implementation, and, using mediation analysis, identify the networks supporting EGE modality usage. Moreover, by comparing these networks we could differentiate parts of these networks supporting specific modalities from those supporting EGE in general.

SI Results

Prediction of Behavioral Accuracy Based on Neural Classification Accuracy.

The 12 Kanizsa–control pairs (Fig. S1) differ in the degree to which they result in perceptual integration. This is reflected in variations in behavioral accuracy at distinguishing between a Kanizsa and its control across these pairs. To establish a direct link between EEG classification accuracy and perceptual integration, we used variations in peak classification accuracy to predict variations in behavioral accuracy. To obtain classification accuracies for the 12 pairs, we used the same classifier as was used in the other analyses (see Fig. S3 for the training task). Importantly, we did not train separate classifiers for separate Kanizsa–control pairs; only the testing was performed separately for the 12 pairs. Because we used a single classifier that was trained indiscriminately on the entire stimulus set, it is only sensitive to differences in perceptual integration that generalize across the entire set. This is important because it prevented classification accuracy for any Kanizsa–control pair from being confounded by idiosyncratic features in that pair (such as luminance or the makeup of the inducers).

Next, we averaged across subjects and used robust linear regression to predict behavioral accuracy using classifier performance. Fig. 2C and Fig. S5 show regression slopes and corresponding R² values when predicting behavioral accuracy using peak EEG classification accuracy at 264 ms across the 12 Kanizsa–control pairs. Robust linear regression guards against violations of the assumptions underlying ordinary least squares, including the influence of outliers (38). Fig. S5A shows regressions for each of the four experimental conditions (as was done in the main text for T1 in Fig. 2C). This analysis shows that the T1 effect of Fig. 2C is replicated: peak EEG classification performance is predictive for behavior in both unmasked conditions, but unsurprisingly not in the masked conditions (where both behavior and classification accuracy were at chance).
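A common way to implement such a robust regression is iteratively reweighted least squares with Huber weights, which down-weights outlying pairs instead of letting them dominate the fit. The sketch below is a minimal stand-in for the robust regression of Wager et al. (2005), on hypothetical toy data for the 12 pairs; the exact weighting function and tuning constant used by the authors are assumptions.

```python
import numpy as np

def huber_irls(x, y, c=1.345, n_iter=50):
    """Robust simple linear regression via iteratively reweighted least
    squares with Huber weights (c = 1.345 is the conventional tuning constant)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745 or 1.0  # robust scale (MAD)
        u = np.abs(r / s)
        w = np.where(u <= c, 1.0, c / u)            # Huber weights: outliers shrink
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return beta  # [intercept, slope]

# Toy data: behavioral accuracy predicted from peak classification accuracy
# across 12 stimulus pairs, with one outlying pair at the end
rng = np.random.default_rng(42)
clf_acc = np.linspace(0.55, 0.80, 12)
behav = 0.5 + 0.6 * clf_acc + rng.normal(0, 0.01, 12)
behav[-1] += 0.3                                    # single outlier
beta = huber_irls(clf_acc, behav)
```

With an end-point outlier, ordinary least squares would tilt the slope substantially, whereas the Huber fit stays close to the underlying relationship.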

However, the unmasked short-lag condition seems to have slightly less predictive power compared with the unmasked long-lag condition (lower R² and higher P value for the top-right panel compared with the top-left panel). This is to be expected if the AB manipulation affects behavioral performance without affecting perceptual integration. If the lack of conscious access in the AB (short lag) indeed selectively affects behavioral performance but not perceptual integration itself, one would expect better predictive power when using these data to predict behavioral performance that was not affected by the AB, such as behavior at T1. The ability of peak classification accuracy in the four experimental conditions to predict T1 behavior across the 12 pairs is shown in Fig. S5B.

Indeed, short-lag T2 EEG classification is better at predicting T1 behavior than it is at predicting short-lag T2 behavior (compare the top right panel of Fig. S5B to the top right panel of Fig. S5A). There, predictive performance is very similar for the unmasked long- and short-lag conditions, as would be expected if the neural processes involved in perceptual integration are not impacted by the AB. Together, these data provide independent confirmation that peak classification accuracy at 264 ms is able to predict behavioral accuracy across the Kanizsa–control pairs, confirming its validity as a neural index of perceptual integration.

The Contribution of Frontal Cortex to Perceptual Integration.

To investigate the contribution of frontal regions to perceptual integration, we also applied the classification analysis and the brain–behavior regression from Fig. 2C to the frontal electrodes: Fp1, AF7, AF3, Fpz, Fp2, AF8, AF4, AFz, and Fz (bottom right panel of Fig. S6A; black dots show the electrode selection in the topographic map). This analysis shows some modulation of classification accuracy in frontal cortex, both in the 264- and in the 406-ms time frame—albeit much lower than what is observed in occipital cortex (Fig. S6A). Importantly, however, although classification accuracy seems to show a difference between the AB and no-AB condition, none of these modulations predicts behavioral accuracy across the Kanizsa–control pairs, neither in the T1 condition nor in any of the other conditions (Fig. S6B; compare with Fig. 2C in the main manuscript and Fig. S5A).

This shows that the frontal signal is not causally involved in the strength of perceptual integration, consistent with the distribution of the perceptual integration signal that we observed in Fig. 2 B and C. This is in line with our finding that a selection of occipital electrodes is advantageous when using independent training runs to obtain a “pure” measure of perceptual integration (Fig. S4). Because the frontal signal is nonselective with respect to the strength of perceptual integration, it is likely to reflect a generic presence/absence signal as a precursor to global ignition and conscious access later on (which is known to occur in the range of the P300).

Note that we also performed an analysis on all electrodes, while training on T1 (Fig. 5 in the main manuscript). This analysis was intended to look at the contribution to behavior of signals and mechanisms other than perceptual integration alone. There, we further showed that no notable classification advantage was gained at 264 ms over what was observed in occipital cortex when adding frontal electrodes. The pattern of results was largely the same as what was observed when restricting the analysis to electrodes in occipital cortex (compare Fig. 2 to Fig. 5 in the main manuscript). This shows that the information contained in frontal cortex in the 264-ms time frame does not meaningfully contribute to classification accuracy over and above what is already present in occipital cortex. Later in time, however, we do see a contribution of centroparietal and frontal electrodes to classification accuracy at 406 ms, on par with the outcome of the behavioral decision, the distribution of which can be observed in Fig. 5B (Bottom).

Together, these analyses show that the occipital cortex contains a signal that uniquely reflects perceptual integration, and that this signal is not modulated by the presence or absence of conscious access. Frontal cortex does contain a weak signal in the 264-ms range that seems sensitive to whether a perceptually integrated signal will be reported, but this signal is not diagnostic or selective for the strength of perceptual integration, and does not provide a classification advantage over the signals that are already present in occipital cortex.

Seen–Unseen Analysis.

A common analysis approach in consciousness research has been to perform a post hoc selection of neural data based on whether trials are behaviorally seen or unseen. Although such an analysis can in principle be useful in addition to a main analysis, it also has intrinsic pitfalls. Importantly, one cannot dissociate between the possibility that any observed effect of seen or unseen trials is a cause, a consequence, or a correlate of consciousness (also see ref. 39). Even when equating objective performance between seen and unseen conditions (40), such an approach can never determine with certainty whether the equated objective performance between seen and unseen might not be caused by uneven mixes of low-level stimulus or other bottom-up–related effects on the one hand and cognitive factors (i.e., attention) on the other (e.g., see discussion in ref. 41). Therefore, before presenting this seen–unseen analysis, we make the disclaimer that the only way of establishing cause and effect is by manipulating an independent variable (e.g., through masking or the AB) and determining the effect of that manipulation on behavior and neural processing across all trials, as is done in the main text. Again, any analysis in which a post hoc selection of neural data is made based on subject responses cannot establish with certainty whether the observed effects are caused by the manipulation in question, or are merely a consequence of coincidental differences in initial stimulus strength, noise levels in the neural machinery (e.g., waxing and waning of attention), criterion setting, incidental response errors, or any combination thereof.

This becomes apparent when inspecting the T1 plots (top row) of Fig. S7A. Here, we selected T1 trials based on whether a Kanizsa was seen or not, and looked at classification accuracy over time for these trials (classification accuracy was computed using the same classifier as was used in the core analysis of Fig. 2 in the main text; see SI Methods). Although classification accuracy is clearly modulated by visibility of T1, it is impossible to know what caused this modulation. The “unseen” stimuli may have escaped report because the subject had their eyes closed on some of these trials, was momentarily not attending, because the stimulus had less bottom-up strength than its “seen” counterparts, because subjects had a conservative response criterion, because the wrong button was accidentally pressed, or any combination of these. It is evidently questionable what one can conclude about the effect of access consciousness on perceptual integration based on such a seen–unseen analysis of T1 trials, because access consciousness was not manipulated here. Importantly, however, this is not only because we did not explicitly manipulate consciousness for T1 (although this makes the flaw in the approach more apparent), but rather because one cannot attribute cause and effect using an approach in which a dependent measure (the seen–unseen response) is used to generate experimental conditions. As a general reminder: an experimental condition should always be one that is under the control of the experimenter, not under control of the subject.

The same shortcoming applies in any post hoc seen–unseen analysis approach of neural data, even when the experiment does contain an explicit manipulation of consciousness such as masking or the AB, and even when controlling for objective performance. Indeed, as we can see in the four experimental conditions (rows 2 and 3) of Fig. S7A, there is a clear effect of seen–unseen on short-lag (AB) trials. Unseen short-lag trials (fourth column, second row) have a lower classification accuracy than seen short-lag trials (second column, second row). However, as for T1, one cannot attribute this seen–unseen modulation to differences in conscious access, as differences in bottom-up stimulus strength and attention, as well as response errors, have a big influence on whether a trial is classified as seen or unseen. The seen and unseen conditions will not be balanced with respect to these coincidental properties and can thus not be sensibly compared. Therefore, the only somewhat legitimate comparison, if any, in terms of the effect of conscious access on perceptual integration would be between short and long lag within the seen category (so between the first and second column) on the one hand, or between short and long lag within the unseen category (so between the third and fourth column) on the other.

These within-category comparisons clearly show that perceptual integration is not modulated by lag (i.e., classification accuracy is equally strong for short- and long-lag seen trials, as well as for short- and long-lag unseen trials). If anything, perceptual integration is stronger for the short-lag trials than for the long-lag trials, both within the seen and within the unseen category, although this is hard to ascertain because different stimulus counts go into these categories as a result of post hoc selection. Also consistent with the conclusion of the main text, we see that unseen short-lag trials show a clear signature of perceptual integration, further supporting the main conclusion of this study that perceptual integration can occur in the absence of conscious access. In short, despite disclaimers about a seen–unseen analysis approach, the seen–unseen data are consistent with the results in the main text: perceptual integration is not modulated by conscious access.

Interestingly—and again in line with the main text—the same does not hold for the control experiment in which weak masking was applied. If we do the same seen–unseen analysis for this control experiment, as shown in Fig. S7B, a different picture emerges. Here, we see a clear effect of masking on perceptual integration within the seen category, in contrast with the AB effect of Fig. S7A. Weakly masked seen trials result in evidently lower peak classification accuracy than unmasked seen trials, supporting the notion that masking impacts perceptual integration directly. A similar comparison could not be made in the unseen category, because not enough trials went undetected in the unmasked condition. However, it is noteworthy that weakly masked unseen trials could not be classified above chance, again in line with the conclusion from the main text that masking impacts visibility by disrupting perceptual integration directly, although, once again, it is important to realize that many other factors could have contributed to classification performance in this “weakly masked” unseen category (erroneous button presses, lapses of attention, etc.).

T1-Based Classification at 264 ms.

When training the classifier on T1 data using all electrodes, and testing this classifier on the T2 data, the 264-ms time point showed a strong main effect of masking (F1,10 = 91.63, P < 10⁻⁵), a main effect of AB (F1,10 = 8.22, P = 0.017), and a trending interaction between masking and AB (F1,10 = 4.06, P = 0.071) (Fig. 5B, Top). To test directly whether the measurement source (neural or behavioral) at 264 ms results in a differential effect on classification accuracy, we again entered the normalized measurements into a large 2 × 2 × 2 ANOVA with factors measure (behavioral/neural), AB (yes/no), and masking (yes/no) (SI Methods). There was no interaction between measure and masking (F1,10 = 0.274, P = 0.61), but importantly there was an interaction between measure and AB (F1,10 = 6.75, P = 0.027), as well as a trending three-way interaction (F1,10 = 4.50, P = 0.060). The impact of measure confirms that, even when decision, selection, and response mechanisms are allowed to contribute to classifier performance, the neural data at 264 ms cannot explain the pattern of results that is observed in behavior.
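For the two-level factors here, the measure × AB interaction reduces to a paired t test on the difference-of-differences across subjects, with F(1, n−1) equal to the squared t. A minimal sketch on hypothetical data for 11 subjects (the toy numbers below are assumptions, chosen so that the AB hits the behavioral measure harder than the neural one, as in the reported result):

```python
import numpy as np
from scipy import stats

def measure_by_ab_interaction(neural, behav):
    """Measure (neural/behavioral) x AB (long/short lag) interaction in a
    2x2 repeated-measures design. With 1 df per factor, the interaction
    F(1, n-1) equals the squared paired t on the difference-of-differences.
    `neural`, `behav`: arrays of shape (n_subjects, 2) = [long lag, short lag]."""
    ab_effect_neural = neural[:, 1] - neural[:, 0]   # AB effect on the neural measure
    ab_effect_behav = behav[:, 1] - behav[:, 0]      # AB effect on behavior
    t, p = stats.ttest_rel(ab_effect_neural, ab_effect_behav)
    return t ** 2, p

# Hypothetical normalized accuracies for 11 subjects:
# the AB reduces behavior markedly but the neural measure only slightly
rng = np.random.default_rng(3)
neural = np.column_stack([rng.normal(0.75, 0.05, 11), rng.normal(0.72, 0.05, 11)])
behav = np.column_stack([rng.normal(0.75, 0.05, 11), rng.normal(0.55, 0.05, 11)])
F, p = measure_by_ab_interaction(neural, behav)
```

The full 2 × 2 × 2 ANOVA adds the masking factor, but each of its single-df interaction terms can be understood as such a contrast on within-subject difference scores.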