Paula Parpart, founder of Brainpool.ai, researches why simpler algorithms can sometimes outperform more complex ones. Is less more?
Since the 1970s, one rare point of agreement between Nobel laureate Daniel Kahneman and prominent Max Planck director Gerd Gigerenzer has been that decision heuristics are an alternative to Bayesian rationality. What are heuristics? In cognitive science and psychology, heuristics are decision-making algorithms that follow a set of simple rules and deliberately ignore some of the information in the input data.
For example, when making real-world decisions such as choosing which coffee to buy or which apartment to rent, there are potentially thousands of features that could play into the decision, but we usually do not have the time or memory capacity to use them all. In deciding between two flats, instead of considering all available information sources such as proximity to work, proximity to schools, crime rates, neighbourhood sport facilities or market trends, a simple heuristic called “Take-The-Best” (Gigerenzer & Goldstein, 1996) would rely only on the most important cue that is able to discriminate between the flats, and ignore all other cues. If the most important cue was proximity to work, the Take-The-Best heuristic would choose the flat that is closer to work. In statistical terms, this means the model puts the largest weight on the most predictive feature and sets all others to zero. Importantly, a cue is used only if it is able to discriminate between the flats; if it does not, the heuristic moves on to the next most important cue. Hence, the heuristic incorporates a sequential search rule.
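The sequential search rule described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited papers: the binary cue coding, the cue ranking, and the names `take_the_best`, `cue_order` and `directions` are all assumptions made for the example.

```python
from typing import Sequence


def take_the_best(option_a: Sequence[float], option_b: Sequence[float],
                  cue_order: Sequence[int], directions: Sequence[int]) -> int:
    """Return 1 if option A is chosen, -1 if option B, 0 if no cue discriminates."""
    for j in cue_order:  # cues are visited from most to least important
        diff = directions[j] * (option_a[j] - option_b[j])
        if diff != 0:  # the first discriminating cue decides; the rest are ignored
            return 1 if diff > 0 else -1
    return 0  # no cue discriminates, so the decision maker would have to guess


# Hypothetical flats described by binary cues:
# [close to work, close to schools, low crime]
flat_a = [1, 0, 1]
flat_b = [1, 1, 0]
cue_order = [0, 2, 1]   # assumed importance ranking of the cues
directions = [1, 1, 1]  # +1 means a higher cue value favours the option

print(take_the_best(flat_a, flat_b, cue_order, directions))  # prints 1 (flat A)
```

Here both flats are close to work, so the first cue does not discriminate and the search moves on to the crime cue, which decides in favour of flat A. The school cue is never consulted.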
The overwhelming view in the behavioural sciences had been that heuristics are only biased, approximate solutions to optimal Bayesian inference, which follows probability theory. Fascinatingly, in contrast, a large body of research in psychology showed that heuristics, which deliberately drop information, can often generalize better to novel situations than algorithms that make full use of the available information, such as logistic regression (Czerlinski et al., 1999). This has been explained by the bias-variance trade-off in statistics: a simpler model, which is less sensitive to the training data, may be less prone to overfitting. These less-is-more effects, in which a relatively simpler model outperforms a more complex model, are prevalent throughout cognitive science, and are frequently argued to demonstrate an inherent advantage of the mind’s simplifying computation. In contrast, Parpart et al. (2017) showed that at the computational level, less is never more. The authors proved that two popular heuristics, tallying and Take-The-Best, are formally equivalent to Bayesian inference in the limit of infinitely strong priors.
Varying the strength of the prior yields a continuum of Bayesian models with the heuristics at one end and ordinary regression at the other (Fig. 2a). The central finding was that, across a variety of simulations and famous psychological datasets used to support heuristics (Fig. 2b) (Czerlinski et al., 1999), intermediate models perform better, suggesting that down-weighting information with the appropriate prior is preferable to entirely ignoring it. Heuristics will usually be outperformed by an intermediate model that takes the full information into account but weighs it appropriately. The work thereby reconciles two very prominent approaches in cognitive science: the heuristic approach to decision making and the Bayesian (probabilistic inference) approach to cognition.
Novel regularized regressions
The proof of convergence is embodied by two new regularization algorithms that have the potential to replace ridge regression, a common regularization method in machine learning, for certain types of regression problems. The first regularization algorithm is called the half-ridge model. This Bayesian derivation of the tallying heuristic extends ridge regression by assuming that the directionalities of the cues (i.e., the signs of the true weights) are known in advance. In the limit of an infinitely strong prior, the Bayesian half-ridge model converges to a simple summation of predictors, i.e., a tallying heuristic. Crucially, in this limit, the model becomes completely invariant to the training data. In particular, it ignores how strongly each feature is associated with the outcome in the training set (i.e., the weight magnitudes). Figure 2a shows the formal relationship among full Bayesian regression, ridge regression, ordinary least-squares linear regression, the Bayesian half-ridge model, and the directed tallying heuristic. The properties of the half-ridge model are further explained in Parpart et al. (2017).
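To make the two ends of this relationship concrete, the sketch below contrasts ordinary ridge regression with a directed tallying rule. It is not the half-ridge model itself, whose posterior is derived in Parpart et al. (2017). It only shows the standard closed-form ridge estimate, whose penalty plays the role of prior strength, next to a unit-weight rule that uses nothing but the assumed cue directions. The simulated data and the names `ridge_weights` and `tallying_choice` are assumptions made for illustration.

```python
import numpy as np


def ridge_weights(X, y, penalty):
    """Closed-form ridge estimate: (X'X + penalty * I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


def tallying_choice(option_a, option_b, directions):
    """Unit weights: sum the signed cue differences and keep only the sign."""
    score = np.dot(directions, np.asarray(option_a) - np.asarray(option_b))
    return int(np.sign(score))


# Simulated training data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, 0.5, -1.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

# A stronger penalty (a stronger zero-mean prior) shrinks the ridge weights
for penalty in (0.0, 10.0, 1000.0):
    print(penalty, ridge_weights(X, y, penalty))

# Tallying ignores weight magnitudes and uses only the cue directions
directions = np.sign(true_w)
print(tallying_choice([1, 0, 0], [0, 1, 1], directions))  # prints 1 (option A)
```

Ordinary ridge shrinks every weight towards zero as the penalty grows. According to the article, the half-ridge prior additionally fixes the sign of each weight, which is why its limit is a tally of the cues rather than a vector of zeros.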
The second regularization algorithm, called Covariance Orthogonalizing Regularization (COR), essentially makes features appear more orthogonal to each other (Parpart et al., 2017). This model uses a prior that suppresses information about feature covariance but leaves information about feature weights unaffected. Compared with ridge regression, the strength of the prior here yields a continuum of models defined by their sensitivity to covariation among features. Their mean posterior weight estimates vary smoothly from those of ordinary linear regression to weights that assume complete independence among features. The basic architecture of the COR model is displayed in Fig. 3.
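The two endpoints of this continuum can be illustrated with a deliberately simplified stand-in. The sketch below is not the authors' COR implementation and does not use their prior. It only damps the off-diagonal entries of the Gram matrix X'X by a hypothetical factor `keep`, so that `keep = 1` gives ordinary least squares and `keep = 0` gives the weights obtained when each feature is treated as independent of the others. The function name and the simulated data are assumptions as well.

```python
import numpy as np


def covariance_damped_weights(X, y, keep):
    """Least-squares weights after scaling the off-diagonal of X'X by `keep`.

    keep = 1.0 reproduces ordinary least squares; keep = 0.0 treats the
    features as independent (each weight comes from its own feature only).
    """
    gram = X.T @ X
    diag = np.diag(np.diag(gram))
    damped = diag + keep * (gram - diag)  # shrink only the covariance terms
    return np.linalg.solve(damped, X.T @ y)


# Simulated data with three strongly correlated features (illustrative only)
rng = np.random.default_rng(1)
shared = rng.normal(size=(200, 1))
X = shared + 0.5 * rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, 0.2]) + rng.normal(scale=0.3, size=200)

for keep in (1.0, 0.5, 0.0):
    print(keep, covariance_damped_weights(X, y, keep))
```

In this stand-in a smaller `keep` corresponds to a stronger prior in the article's terms: less of the information about feature covariance survives, while each feature's own association with the outcome is left untouched.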
Decision support applications of the regularization approach
The COR regularization method may be useful to researchers in genetics, neuroscience, machine learning, or finance, and anywhere else that overfitting and high redundancy among features are an issue. For example, neuroscientists are looking into applying it to model networks of spiking neurons in the visual cortex, where high covariance is typically an issue (Solomon et al., 2014).
This work is not only of theoretical importance, but can also inform applications and should interest those who work in data-intensive environments where time is at a premium (e.g., traders, financial forecasters, doctors, Big Data architects). Knowing when time-saving heuristics will match lengthy optimization methods is important when fast decisions are needed, such as when soldiers at a checkpoint must rapidly decide whether an approaching car is a threat. Likewise, doctors need to quickly decide whether to assign a patient to a coronary care unit or a regular nursing bed. In medicine, heuristics can sometimes match or outperform more costly and slower procedures such as MRI tests – e.g., a simple tallying heuristic was able to detect stroke patients in the ER (Gigerenzer & Gaissmaier, 2011). Crucially, the regularization approach developed here can go beyond previous decision support tools by allowing for a meta-diagnosis of when it is worth preferring a complex model over a simple one in medical contexts.
Portfolio optimization problems
Furthermore, the COR regularization can be applied in finance to predict stock performance in portfolio optimization problems. The advantage of the method lies not solely in its potential to beat existing solutions such as ridge regression in particular circumstances, but in its ability to map out when a simpler solution (a model with larger bias and fewer parameters and computations) can outperform a more complex solution (a more flexible model with more parameters and computations). For example, a simple unit weight model (tallying heuristic) that allocates financial resources equally across asset classes has been shown to be able to outperform complex optimization routines, including Markowitz’s Nobel Prize-winning mean-variance portfolio model, which in contrast needed 10 years of stock data to estimate its parameters (DeMiguel, Garlappi, & Uppal, 2009). This is an example of a heuristic (the 1/N rule) reaching a performance level equivalent to that of more complex models, but with much less data and in a shorter time.
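The difference in data requirements between the two approaches is easy to see in code. The sketch below is not the procedure used by DeMiguel et al. It uses simulated returns and one common textbook form of mean-variance weights (proportional to the inverse covariance matrix times the mean returns), and the function names are assumptions. It makes no claim about which portfolio performs better.

```python
import numpy as np


def equal_weights(n_assets):
    """1/N rule: allocate the same share to every asset; nothing is estimated."""
    return np.full(n_assets, 1.0 / n_assets)


def mean_variance_weights(returns):
    """Weights proportional to Sigma^-1 mu, rescaled to sum to one.

    Both mu (mean returns) and Sigma (covariance) must be estimated from
    historical data, which is where estimation error enters.
    """
    mu = returns.mean(axis=0)
    sigma = np.cov(returns, rowvar=False)
    raw = np.linalg.solve(sigma, mu)
    return raw / raw.sum()


# Simulated monthly returns for four hypothetical assets (illustrative only)
rng = np.random.default_rng(2)
returns = rng.normal(loc=0.01, scale=0.05, size=(120, 4))

print(equal_weights(4))
print(mean_variance_weights(returns))
```

The 1/N rule needs only the number of assets, whereas the mean-variance weights depend on four estimated means and a 4 x 4 estimated covariance matrix, and they change whenever the return history changes.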
While “less-is-more” effects like these are known, it is difficult to foresee when they will occur. However, applying the Bayesian COR model to these environments makes it possible to find out in advance how much data (i.e., how many variables) needs to be taken into account for accurate decisions to be made in these situations.
In this way, the machine learning solution can be beneficial in all those situations where data overload occurs and the objective is to save time and data. A future project may be to turn the algorithm into prediction software that can function as a decision aid for decision makers in finance and a range of other industries. An immediate next step is to run the algorithm on financial datasets to see how the model behaves in portfolio optimization problems. If we find interesting new insights, we will publish them on the blog in the future. The algorithm is accessible on GitHub.
The Author: Paula Parpart is founder of Brainpool and currently visiting the Center for Data Science at New York University. Previously she was a Teaching Fellow on the MSc Cognitive and Decision Sciences at UCL, where she also obtained a PhD in Cognitive Science. Her research covers cognitive science and data science and combines human behavioural experimentation with machine learning and computational modelling. She is interested in reverse-engineering the human brain’s learning capabilities in artificial agents, and is currently researching people’s ability to learn from only very few examples (learning-to-learn and transfer learning). More on her research can be found on her personal website.
DeMiguel, V., Garlappi, L., Nogales, F. J., & Uppal, R. (2009). A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55(5), 798-812.
Gigerenzer, G., & Gaissmaier, W. (2011). Heuristic decision making. Annual Review of Psychology, 62, 451-482.
Marewski, J. N., & Gigerenzer, G. (2012). Heuristic decision making in medicine. Dialogues in Clinical Neuroscience, 14(1), 77.
Parpart, P., Jones, M., & Love, B. (2017). Heuristics as Bayesian inference under extreme priors. Cognitive Psychology, doi: 10.1016/j.cogpsych.2017.11.006
Solomon, S. S., Chen, S. C., Morley, J. W., & Solomon, S. G. (2014). Local and global correlations between neurons in the middle temporal area of primate visual cortex. Cerebral Cortex, 25(9), 3182-3196.