Introduction
A general organizational principle of
Substantial empirical evidence in neuroeconomics and decision neuroscience suggests that the PFC computes a cost-benefit analysis in order to optimize the net value of rewards (Rangel et al., 2008; Doya, 2008; Montague et al., 2006; Bartra et al., 2013; Lopez-Persem et al., 2020; Rushworth et al., 2012). PFC subregions, such as ventromedial PFC (vmPFC) and dorsal anterior cingulate cortex (dACC), appear to encode reward signals across a wide range of value-based decision-making contexts, including foraging (Kolling et al., 2012; Shenhav et al., 2016), risky (Kolling et al., 2014), intertemporal (Wittmann et al., 2016; Boorman et al., 2013), and effort-based choice (Arulpragasam et al., 2018; Skvortsova et al., 2014). Interestingly, these regions are also activated when people seek information (Charpentier and Cogliati Dezza, 2021). VMPFC, for example, encodes the subjective value of information (Kobayashi and Hsu, 2019) as well as anticipatory information signals indicating that a reward will be received later on (Iigaya et al., 2020). VMPFC also correlates with ongoing uncertainty during exploration tasks (Trudel et al., 2021) and with the upcoming delivery of information (Charpentier et al., 2018). DACC is activated when people observe outcomes of options without actively engaging with them (Blanchard and Gershman, 2018). Its activity is also associated with perceptual uncertainty (Jepma et al., 2012), and it predicts future information sampling to guide actions (Kaanders et al., 2020).
This overlapping activity between reward and information suggests that these two adaptive signals are related. Indeed, information signals can be partly characterized by reward-related attributes such as valence and instrumentality (Kobayashi et al., 2019; Sharot and Sunstein, 2020), while reward signals also contain informative attributes (e.g., winning $50 on a lottery allows the recipient to gain the reward amount but also information about the lottery itself; Wilson et al., 2014; Smith et al., 2016). Because of this ‘shared variance’ it may not be surprising that the neural substrates underlying information processing frequently overlap with those involved in optimizing reward.
This raises an interesting question as to whether information and reward are really two distinct signals. In other words, is information gain merely a kind of reward that is processed in the same fashion as more typical rewards, or is the calculation of information value at least partially independent of reward value computations? While it is possible that information gain may be valuable in the same way as reward, it may also be the case that the apparent overlap in brain regions underlying reward and information processing may be due to the shared variance between these two distinct signals.
In order to assess whether independent information and reward signals could produce overlapping activity, we developed a reinforcement learning (RL) model which consists of information and reward value systems (Cogliati Dezza et al., 2017) independently calculating information gain and reward (see Materials and methods). Simulations of this model suggest how functional magnetic resonance imaging (fMRI) analyses might identify overlapping activity between reward and information systems if the ‘shared variance’ between them is not taken into account (Figure 1A). Even though both signals are computed independently by distinct systems, a typical model-based analysis identifies information-related activity in the reward system, and reward-related activity in the information system.
Figure 1.
Simulations of a model with independent value systems.
(A) When not controlling for shared variance between reward and information, an RL model which consists of independent reward (RelReward) and information value systems (Information Gain; see Materials and methods for more details) shows overlapping activity between reward and information signals. To simulate activity of the reward system, a linear regression predicting RelReward with RelReward as independent variable was adopted in the reward contrast; while a linear regression predicting RelReward with Information Gain as independent variable was used in the information contrast. To simulate activity of the information system, a linear regression predicting Information Gain with RelReward as independent variable was adopted in the reward contrast; while a linear regression predicting Information Gain with Information Gain as independent variable was adopted in the information contrast. The model was simulated 63 times and model parameters were selected in the range of those estimated in our human sample. The figure shows averaged betas for these linear regressions. A one-sample t-test was conducted to test significance against zero. (B) When controlling for the shared variance, reward and information activities from the same RL model no longer overlap. To account for the shared variance, RelReward and Information Gain predictors were orthogonalized using serial orthogonalization. We simulated activity for both the reward system and information system in the same fashion as explained in (A). The analysis of those activities was however different. In the information contrast, we entered the orthogonalized (with respect to RelReward) Information Gain as an independent variable, while in the reward contrast, we entered the orthogonalized (with respect to Information Gain) RelReward. In all the panels, * is p<0.05, ** is p<0.01, *** is p<0.001. RL, reinforcement learning.
The same simulations suggest how the spurious functional overlap between information and reward systems might be avoided (Figure 1B). Rather than regressing model activity against reward and information signals in isolation, orthogonalizing one regressor with respect to the other eliminates the apparent overlap in function between the systems. When information is orthogonalized with respect to reward, the reward system no longer exhibits information effects, and orthogonalizing reward with respect to information eliminates reward effects in the information system.
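The logic of this orthogonalization step can be illustrated with a small simulation. The sketch below is hypothetical Python, not the analysis code used in the study: two regressors share variance through a common component, simulated "reward system" activity is generated from only one of them, and the spurious effect of the other disappears once it is orthogonalized with respect to the true driver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical latent structure: reward and information regressors share
# variance through a common component, but the simulated system's activity
# is driven by reward alone.
common = rng.normal(size=n)
reward = common + rng.normal(size=n)      # RelReward-like regressor
info = common + rng.normal(size=n)        # Information Gain-like regressor
activity = 2.0 * reward + rng.normal(scale=0.1, size=n)

def beta(x, y):
    """Slope of a simple least-squares regression of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

# Naive analysis: information appears to drive the reward system,
# purely because the two regressors share variance.
spurious = beta(info, activity)

# Serial orthogonalization: remove the reward-explained part of the
# information regressor, then the spurious effect vanishes.
info_orth = info - beta(reward, info) * reward
controlled = beta(info_orth, activity)
```

With this construction, `spurious` comes out clearly nonzero while `controlled` is close to zero, mirroring the contrast between panels A and B of Figure 1.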
These results suggest that by controlling for the shared variance between information and reward signals in this fashion, it is possible to establish whether a dedicated and independent value system for information actually exists in human PFC. To do so, we developed a novel behavioral task that allows us to jointly investigate both reward and information-seeking behaviors. Next, we adopted computational modeling techniques to dissociate the relative contribution of reward and information in driving choice behavior and model-based fMRI to localize their activity in the brain.
Results
Reward and information jointly influence choices
Human participants made sequential choices among three decks of cards over 128 games, receiving between 1 and 100 points after each choice (Figure 2; Materials and methods). The task consisted of two phases (Figure 2A): a learning phase (i.e., forced-choice task) in which participants were instructed which deck to select on each trial (Figure 2B), and a decision phase (i.e., free-choice task) in which participants made their own choices with the goal of maximizing the total number of points obtained at the end of the experiment (Figure 2C). Logistic regression of subjects’ behavior on choices made on the first trial of the free-choice task shows that participants’ choices were driven by both rewards (mean
Figure 2.
Behavioral task and behavior.
(A) One game of the behavioral task consisted of six consecutive forced-choice trials and from 1 to 6 free-choice trials. fMRI analyses focused on the first free-choice trial (shown in yellow). (B) In the
The gambling task elicits activity in dACC and vmPFC
We first investigated whether our gambling task elicits dACC and vmPFC activity, both being regions involved in reward and information processing (Charpentier and Cogliati Dezza, 2021).
We conducted a one-sample t-test on the beta weights estimated for
We note that the cluster of activity we identify as ‘dACC’ extends into supplementary motor areas. Many fMRI studies on value-based decision-making that report similar activity patterns, however, commonly refer to activity around this area as dACC (Shenhav et al., 2014; Vassena et al., 2020). Additionally, activity in the Lower Reward – Highest Reward contrast did not survive correction for multiple comparisons. This might be due to individual differences in subjective reward value. We address this issue in the next section by adopting model-based approaches. We did, however, conduct a small volume analysis using functionally defined regions taken from the FIND lab (Stanford University; https://findlab.stanford.edu/functional_ROIs.html) corresponding to our prior hypotheses. Results show a significant cluster at voxel coordinates (–2, 12, 58) after correcting for multiple comparisons (FWE
Overall, these results indicate that our gambling task elicits activity in dACC and vmPFC, and this activity follows a symmetrically opposite pattern (Figure 2E).
Apparent shared activity between reward and information
In the previous section, we showed that our gambling task elicits reward-related activity in both dACC and vmPFC. Here, we test whether this activity relates to both reward and information signals in both regions.
We fitted an RL model with information integration (Cogliati Dezza et al., 2017) to participants’ behavior to obtain subjective evaluations of reward and information (Materials and methods; Supplementary file 1). Our model explained participants’ behavior better than an RL model without information integration (i.e., where only reward predictions influence choices; fixed-effect: BIC(gkRL)=391.2, BIC(standardRL)=428.8, for individual BICs, see Supplementary file 2; random-effect: xp(gkRL)=1, xp(standardRL)=0) and better predicted behavioral effects observed in our sample. For the latter, we simulated our model using the estimated free parameters and performed a logistic regression for each simulation predicting the model’s choices with reward and information (i.e., the number of times the option was sampled in previous trials) as fixed effects. As observed in our human sample, reward and information both significantly influenced the model’s choices (both
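For reference, the fixed-effect BIC used in model comparisons of this kind penalizes model complexity by the log of the number of observations; a minimal generic sketch (the symbols are illustrative, not tied to the gkRL fitting code):

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower values indicate a better
    trade-off between goodness of fit and model complexity."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood
```

A model with a slightly worse likelihood but fewer parameters can still win under BIC, which is why the comparison above favors the more parsimonious account only when the extra information term genuinely improves fit.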
Moreover, we inspected the accuracy of the fitting procedure by running a parameter recovery analysis. We simulated data from our model using the parameters obtained from the fitting procedure, and we fit the model to the simulated data to obtain the estimated parameters. We then ran a correlation for each pair of parameters (Wilson and Collins, 2019). This revealed high correlation coefficients for
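The parameter recovery logic can be sketched with a toy example — here recovering only a softmax inverse temperature by grid-search maximum likelihood, which is an illustrative stand-in for the full gkRL parameter set:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_choices(beta, values, rng):
    """Simulate binary softmax choices from trial-wise value differences."""
    p = 1.0 / (1.0 + np.exp(-beta * values))
    return (rng.random(values.size) < p).astype(float)

def fit_beta(choices, values, grid):
    """Recover the inverse temperature by grid-search maximum likelihood."""
    best, best_ll = grid[0], -np.inf
    for b in grid:
        p = 1.0 / (1.0 + np.exp(-b * values))
        ll = np.sum(choices * np.log(p) + (1 - choices) * np.log(1 - p))
        if ll > best_ll:
            best, best_ll = b, ll
    return best

grid = np.linspace(0.1, 5.0, 100)
true_betas = rng.uniform(0.5, 4.0, size=20)   # one per simulated subject
recovered = []
for b in true_betas:
    values = rng.normal(size=300)             # trial-wise value differences
    choices = simulate_choices(b, values, rng)
    recovered.append(fit_beta(choices, values, grid))

# Recovery quality: correlation between generating and recovered parameters.
r = np.corrcoef(true_betas, recovered)[0, 1]
```

A high correlation between `true_betas` and `recovered` indicates the fitting procedure can, in principle, identify the parameter from data of this size.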
Next, we computed subjective evaluations of reward as relative reward values (
Assessing alternative definitions of reward
Before assessing whether activity in vmPFC and dACC relates to both reward and information, we first investigated whether our chosen ‘reward computation’ (i.e., RelReward) was better able to explain activity in vmPFC than alternative reward computations. First, we compared activity in vmPFC between RelReward and expected reward values (ExpReward) of the chosen option. Results showed that RelReward correlated with vmPFC after controlling for (i.e., orthogonalizing with respect to) ExpReward (
Overlapping reward and information activity
We regressed the BOLD signal recorded on the first free-choice trial of each game on RelReward and Information Gain. RelReward and Information Gain were used as the only parametric modulators in two separate GLMs to identify BOLD activity related to reward (
Activity in vmPFC on the first free-choice trial correlated positively with RelReward (FWE
Figure 3.
Apparent overlapping activity between reward and information.
(A) VMPFC positively correlated with model-based relative reward value for the selected option (in red), while dACC negatively correlated with it (in blue). (B) DACC (in red) positively correlated with model-based information gain, while vmPFC negatively correlated with it (in blue). Activity scale represents z-score. (C) Averaged BOLD beta estimates for vmPFC in GLM1 (Reward Dim.=Reward Dimension) and GLM2 (Info Dim.=Information Dimension). (D) Averaged BOLD beta estimates for dACC in GLM1 (Reward Dim.=Reward Dimension) and GLM2 (Info Dim.=Information Dimension). In all the panels, * is
Similar results were obtained when including ExpReward, instead of RelReward, as single parametric modulator (GLM1bis: ExpReward positively correlated with vmPFC – FWE
Independent value systems for information and reward after accounting for their shared variance
In the previous section, we showed that both dACC and vmPFC activity relate to both information and reward. However, GLM1 and GLM2 consider only the variance explained by reward and information, respectively. As explained above, simulations of our RL model which consists of independent value systems suggest that fMRI analyses might reveal overlapping activity if the shared variance between the two systems is not taken into account (Figure 1A).
To eliminate the shared variance between reward and information as a possible explanation for activity in dACC and vmPFC, we repeated our analyses while controlling for possible shared signals that may underlie our results for GLMs 1 and 2. To do so, we created two additional GLMs to investigate the effect of RelReward after controlling for Information Gain (
Activity in vmPFC remained positively correlated with RelReward (
Figure 4.
Independent value systems for reward and information in PFC.
(A) After controlling for information (GLM3), vmPFC activity (in red) positively correlated with model-based relative reward value (RelReward), while no correlation was observed for dACC. (B) After controlling for reward (GLM4), dACC activity (in red) positively correlated with model-based information gain (Information Gain), while no correlation was observed for vmPFC. (C) Averaged BOLD beta estimates for vmPFC in GLM1 (Reward Dim.=Reward Dimension) and GLM2 (Info Dim.=Information Dimension). (D) Averaged BOLD beta estimates for dACC in GLM1 (Reward Dim.=Reward Dimension) and GLM2 (Info Dim.=Information Dimension). In all the panels, * is
Assessing alternative definitions of reward and information
In response to reviewers’ suggestions, we repeated the above analysis using alternative definitions of reward and information.
For GLM4, similar results as those reported above were observed when ExpReward was entered in GLM4 instead of RelReward (GLM4bis:
Using this approach, we removed variables with the highest VIFs iteratively. In order, these were: the value of the chosen reward minus the second highest reward (VIF≈2×10^5, 99.9995% of variability explained by other regressors), and the maximum reward (VIF≈320, 99.69% of variability explained by other regressors, after removing Chosen-Second). After removing these two regressors, VIFs for the remaining regressors were all under 5 (average value VIF=3.52, value of the second-best option VIF=1.99, relative value VIF=2.21, and minimum value VIF=3.42). We note that the AverageReward regressor had the highest VIF initially (VIF≈3.6×10^5). However, we elected to retain this regressor since behavior and brain activity have previously been linked to the overall level of reward across options (Kolling et al., 2012; Cogliati Dezza et al., 2017). Next, we ran an ROI analysis based on dACC and vmPFC coordinates observed in GLM1. Results showed significant activity in dACC which positively correlates with Information Gain (
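The iterative VIF-based pruning described above can be sketched as follows (a hypothetical illustration with fabricated regressors, not the actual design matrix):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X:
    1 / (1 - R^2) when regressing that column on all the others."""
    out = []
    n, k = X.shape
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

def prune_by_vif(X, names, threshold=5.0):
    """Iteratively drop the regressor with the highest VIF above threshold."""
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Third column is a near-linear combination of the first two.
X = np.column_stack([a, b, a + b + rng.normal(scale=0.01, size=200)])
Xp, kept = prune_by_vif(X, ["a", "b", "redundant"])
```

After pruning, only the non-collinear regressors survive, each with VIF below the conventional threshold of 5.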
For GLM3, similar results as those reported in the previous section were observed when accounting for covariates and an alternative definition of information. In particular, we entered the relative value of information (as it is possible that vmPFC computes the ‘relativeness’ of the chosen option rather than its reward value) and first choice reaction time as covariates (
Interactions of observed activity with analysis type
To directly test our hypothesis that shared activity between reward and information in dACC and vmPFC is the product of confounded reward and information signals, we conducted a three-way ANOVA with ROI (dACC, vmPFC), Value Type (Information Gain, RelReward), and Analysis Type (confounded {GLM1&2}, non-confounded {GLM3&4}) as factors, and we tested the three-way interaction term. If independent reward and information value signals are encoded in the brain, the two-way interaction (ROI×Value Type) should be significantly modulated by the type of analysis adopted. Results showed a significant three-way interaction
Finally, we checked whether accounting for confounded reward and information signals had significant effects in each region separately. To do so, we ran separate two-way ANOVAs with Value Type (Information Gain, RelReward) and Analysis Type (confounded {GLM1&2}, non-confounded {GLM3&4}) for the dACC ROI and for the vmPFC ROI. Results showed a significant two-way interaction for the dACC ROI (
Altogether these findings suggest a coexistence of two
dACC encodes information after accounting for choice difficulty or switching behavior
Activity in dACC has often been associated with task difficulty/conflict (Shenhav et al., 2014; Botvinick et al., 2001) or switching to alternative options (Domenech et al., 2020). To investigate whether this was the case in our task, we first correlated the standardized estimates of Information Gain with choice reaction times on the first free-choice trials. The correlation was run for each subject, and correlation coefficients were tested against 0 using a Wilcoxon signed-rank test. Overall, correlation coefficients were not significantly different from 0 (mean
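This per-subject correlation analysis can be sketched as follows, using simulated null data (no built-in relationship between information gain and reaction time, matching the null result reported here; subject and trial counts are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical per-subject data: Information Gain estimates and reaction
# times drawn independently, i.e., the null scenario.
n_subjects, n_trials = 20, 60
coefs = []
for _ in range(n_subjects):
    info_gain = rng.normal(size=n_trials)
    rt = rng.normal(loc=0.8, scale=0.2, size=n_trials)
    r, _ = stats.pearsonr(info_gain, rt)
    coefs.append(r)

# Test the distribution of per-subject coefficients against zero.
w, p = stats.wilcoxon(coefs)
```

Because each subject contributes a single coefficient, the signed-rank test makes no normality assumption about how those coefficients are distributed across subjects.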
Finally, in our task, the frequency of choosing the most informative option was higher than the frequency of choosing the two other alternatives (in the unequal condition—when participants were forced to sample options a different number of times, see Materials and methods; mean=64.6%, SD=18%). It is possible, therefore, that in our task, the default behavior was selecting the most informative options, and the switch behavior (or moving away from a default option) was selecting less informative (but potentially more rewarding) options. If this is correct, regions associated with exploration or switching behaviors (e.g., frontopolar cortex; Daw et al., 2006; Zajkowski et al., 2017) should be activated when participants select a non-default option (i.e., not choosing the most informative option).
We conducted a one-sample t-test on the beta weights estimated for
Figure 5.
NoDefault vs. default behavior, instrumental information and combination of reward and information signals in subcortical regions.
(A) Activity in the frontopolar region—a region often associated with exploration—correlated with the contrast NoDefault behavior (not choosing the most informative options) – Default behavior (choosing the most informative options). (B) Activity in dACC correlated with Information Gain after controlling for the variance explained by the instrumental value of information. (C) Activity in vmPFC and dACC correlated with the instrumental value of information after accounting for the variance explained by Information Gain. (D) Activity in the ventral putamen (striatum region) correlated with response probabilities derived from the RL model. (E) RelReward, Information Gain, and response probabilities overlap in the striatum region (in white). Activity scale represents z-score. dACC, dorsal anterior cingulate cortex; RL, reinforcement learning; vmPFC, ventromedial prefrontal cortex.
We would like to acknowledge, however, that while dACC activity associated with Information Gain in our task is not affected by proxies of exploratory decisions (e.g., the switch-stay analysis and the Default vs. NoDefault analysis), our task cannot dissociate decisions to explore to gain information (i.e., directed exploration) from Information Gain. This is because Information Gain and directed exploration in our task describe the same thing—picking the option about which least is known.
Activity in dACC signals both the non-instrumental and instrumental value of information
In the previous sections, we showed that the value of information was independently encoded in dACC after accounting for reward, choice difficulty, and switching strategy. However, in our task, different motives may drive participants to seek information (Sharot and Sunstein, 2020). Information can be sought for its usefulness (i.e., instrumental information): the acquired information can help with the goal of maximizing points by the end of each game. Alternatively, information can be sought for its non-instrumental benefits including novelty, curiosity, or uncertainty reduction. Here, we tested whether the value of information independently encoded in dACC relates to the instrumental value of information, to its non-instrumental value, or to both.
We computed the instrumental value of information (Instrumental Information) by implementing a Bayesian learner and estimating the Bayes optimal long-term value for the option chosen by participants on the first free-choice trial (Materials and methods). We first entered Instrumental Information and Information Gain in a mixed logistic regression predicting first free choices (in the Unequal Information Condition; Materials and methods) with Instrumental Information and Information Gain as fixed effects, subjects as random intercepts and 0 + Instrumental Information + Information Gain | subjects as random slopes. Choices equal 1 when choosing the most informative options (i.e., the option never selected during the forced-choice task), and 0 when choosing options selected four times during the forced-choice task. We found a positive effect of Information Gain (beta coefficient=71.7±16.76 (SE),
We then entered Instrumental Information and Information Gain as parametric modulators into two independent GLMs. We investigated the effects of Information Gain after controlling for (orthogonalized with respect to) Instrumental Information (
Reward and information signals combine in the striatum
While distinct brain regions independently encode reward and information, these values appear to converge at the level of the basal ganglia. In a final analysis (
Discussion
Information and reward are key behavioral drives. Here, we show that dedicated and independent value systems encode these variables in the human PFC. When the shared variance between reward and information was taken into account, dACC and vmPFC distinctly encoded information value and relative reward value of the chosen option, respectively. These value signals were then combined in subcortical regions that could implement choices. These findings are direct empirical evidence for a dedicated information value system in human PFC independent of reward value.
Activity in the brain suggests that the opportunity to gain information relies on similar neural circuitry as the opportunity to gain rewards (Bromberg-Martin and Hikosaka, 2009; Kang et al., 2009; Kobayashi and Hsu, 2019; Charpentier et al., 2018; Smith et al., 2016; Bromberg-Martin and Hikosaka, 2011; Tricomi and Fiez, 2012; Gruber et al., 2014; Jessup and O’Doherty, 2014; Blanchard et al., 2015) even when information has no instrumental benefits (Tricomi and Fiez, 2012; Gruber et al., 2014; Charpentier et al., 2018). Here, we show that the overlapping activity in PFC between these two adaptive signals elicited by our task design is only observed if their shared variance is not taken into account. In particular, in two independent GLMs—one with relative reward value and the other one with information value as a single parametric modulator—activity associated with reward and information activated both vmPFC and dACC. This overlapping activity might be explained by the fact that information signals are partly characterized by reward-related attributes such as valence and instrumentality (Kobayashi et al., 2019; Sharot and Sunstein, 2020), while reward signals also contain informative attributes (e.g., winning $50 on a lottery allows the recipient to gain the reward amount but also information about the lottery itself; Wilson et al., 2014; Smith et al., 2016).
When eliminating the variance shared between reward and information as a possible explanation of activity, we showed that dACC activity correlated with information value but
These findings support theoretical accounts such as active inference (Friston, 2010; Friston, 2003; Friston, 2005) and certain RL models (e.g., upper confidence bound; Auer et al., 2002; Wilson et al., 2014; Cogliati Dezza et al., 2017) which predict independent computations for information value (epistemic value) and reward value (extrinsic value) in the human brain. Consistent with our findings, the activity of single neurons in the monkey orbitofrontal cortex independently and orthogonally reflects the output of the two value systems (Blanchard et al., 2015). Therefore, our results may highlight a general coding scheme that the brain adopts during decision-making evaluation.
Moreover, our results are in line with recent findings in the monkey literature that identified populations of neurons in ACC which selectively encode the non-instrumental value of information (White et al., 2019) and are involved in tracking how each piece of information would reduce uncertainty about future actions (Hunt et al., 2018). Additionally, they are also consistent with computational models of PFC which predict that dACC activity can be primarily explained as indexing prospective information about an option independent of reward value (Alexander and Brown, 2011; Alexander and Brown, 2018; Behrens et al., 2007). DACC has often been associated with conflict (Botvinick et al., 2001) and uncertainty (Silvetti et al., 2013), and recent findings suggest that activity in this region corresponds to unsigned prediction errors, or ‘surprise’ (Vassena et al., 2020). Our results extend this perspective by showing that the activity observed in dACC during decision-making can be explained as a subjective representation of decision variables (i.e., the information value signal) elicited in uncertain or novel environments.
It is worth highlighting that other regions might be involved in processing information-related components of the value signal not elicited by our task. In particular, rostrolateral PFC signals the change in relative uncertainty associated with the exploration of novel and uncertain environments (Badre et al., 2012; Tomov et al., 2020). Neural recordings in monkeys also showed an interconnected cortico-basal ganglia network that resolves uncertainty during information-seeking (White et al., 2019). Taken together, these findings, among others (Charpentier and Cogliati Dezza, 2021), highlight an intricate and dedicated network for information, independent of reward. Further research is therefore necessary to map this independent network in the human brain and understand to what extent this network relies on neural computations so far associated with reward processing (e.g., dopaminergic modulations; Bromberg-Martin and Hikosaka, 2009; Bromberg-Martin and Hikosaka, 2011; Vellani et al., 2021).
Our finding that vmPFC positively correlates with the relative reward value of the chosen option is in line with previous research that identifies vmPFC as a region involved in value computation and reward processing (Smith and Delgado, 2015). VmPFC appears not only to code reward-related signals (Chib et al., 2009; Kim et al., 2011; Hampton et al., 2006) but to specifically encode the relative reward value of the chosen option (Boorman et al., 2009), in line with the results of our study.
Our results further suggest that these independent value systems interact in the striatum, consistent with its hypothesized role in representing expected policies (Friston et al., 2015) and information-related cues (Bromberg-Martin and Hikosaka, 2009; Charpentier et al., 2018; Kobayashi and Hsu, 2019; Bromberg-Martin and Monosov, 2020). The convergence of reward and information signals in the striatum region is also consistent with the identification of basal ganglia as a core mechanism that supports stimulus-response associations in guiding actions (Samejima et al., 2005) as well as recent findings demonstrating distinct corticostriatal connectivity for affective and informative properties of a reward signal (Smith et al., 2016). Moreover, activity in this region tracked the softmax choice probability derived from our RL model, consistent with previous modeling work that identified the basal ganglia as the output of the probability distribution expressed by the softmax function (Humphries et al., 2012).
Taken together, by showing the existence of independent value systems in the human PFC, this study provides empirical evidence in support of theoretical work aimed at developing a unifying framework for interpreting brain function. Additionally, we identified a dedicated value system for information, independent of reward. Overall, our results suggest a new perspective on how to view decision-making processes in the human brain under realistic scenarios, with potential implications for the interpretation of PFC activity in both healthy and clinical populations.
Materials and methods
Participants
Twenty-one right-handed, neurologically healthy young adults were recruited for this study (12 women; aged 19–29 years, mean age=23.24). Of these, one participant was excluded from the analysis due to problems in the registration of the structural T1-weighted MPRAGE sequence. The sample size was based on previous studies (e.g., Kolling et al., 2012; Boorman et al., 2013; Shenhav et al., 2014). All participants had normal color vision and were not under psychoactive treatment. The entire group belonged to the Belgian Flemish-speaking community. The experiment was approved by the Ethical Committee of the Ghent University Hospital and conducted according to the Declaration of Helsinki. Informed consent was obtained from all participants prior to the experiment.
Procedure
Participants performed a gambling task where on each trial choices were made among three decks of cards (Cogliati Dezza et al., 2017; Figure 2). The gambling task consisted of 128 games. Each game contained two phases: a
On each trial, the payoff was generated from a Gaussian distribution with a generative mean between 10 and 70 points and a standard deviation of 8 points. The generative mean for each deck was set to a base value of either 30 or 50 points and adjusted independently by ±0, 4, 12, or 20 points with equal probability, to avoid the possibility that participants might be able to discern the generative mean for a deck after a single observation. The generative mean for each option was stable within a game but varied across games. In 50% of the games, the three options had the same generative mean (e.g., 50, 50, and 50), while they had different means in the other half of the games. In 25% of these latter games, two options had the same generative mean with high values and the third option had a different generative mean with low values (e.g., 70, 70, and 30). In 75% of these latter games, two options had the same generative mean with low values and the third option had a different generative mean with high values (e.g., 30, 30, and 70).
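The payoff-generation scheme can be sketched as follows (a hypothetical illustration of the base-value and adjustment logic described above; it does not reproduce the full game-condition structure):

```python
import numpy as np

def draw_generative_means(rng):
    """One game's generative means: a base value of 30 or 50 points,
    with each deck independently shifted by 0, 4, 12, or 20 points
    in either direction with equal probability."""
    base = rng.choice([30, 50])
    shifts = rng.choice([0, 4, 12, 20], size=3) * rng.choice([-1, 1], size=3)
    return base + shifts

def draw_payoff(mean, rng):
    """Trial payoff: Gaussian around the deck's generative mean (SD = 8),
    clipped to the 1-100 point range participants could receive."""
    return int(np.clip(np.rint(rng.normal(mean, 8)), 1, 100))
```

Under this scheme the generative means land between 10 and 70 points, as stated above, while single-trial payoffs span the full 1-100 range.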
Participants’ payoff on each trial ranged between 1 and 100 points and the total number of points was summed and converted into a monetary payoff at the end of the experimental session (0.01 euros every 60 points). Participants were told that during the forced-choice task, they may sample options at different rates, and that the decks of cards did not change during each game, but were replaced by new decks at the beginning of each new game. However, they were not informed of the details of the reward manipulation or of the underlying generative distribution adopted during the experiment. Participants underwent a training session outside the scanner in order to make the task structure familiar to them.
The forced-choice task lasted about 8 s and was followed by a blank screen for a variable jittered time window (1–7 s). The temporal jitter allowed us to obtain neuroimaging data at the onset of the first free-choice trial, right before the option was selected (decision window). After participants performed the first free-choice trial, a blank screen was again presented for a variable jittered time window (1–6 s) before feedback indicating the number of points earned was given for 0.5 s, followed by another blank screen for a variable jittered time window. As the first free-choice trial was the main trial of interest for the fMRI analysis, subsequent free-choice trials were not jittered.
Image acquisition
Data were acquired using a 3T Magnetom Trio MRI scanner (Siemens) with a 32-channel radio-frequency head coil. In an initial scanning sequence, a structural image was acquired.
Behavioral analysis
Expected reward value and information value
To estimate participants’ expected reward value and information value, we adopted a previously implemented version of a reinforcement learning (RL) model that learns both reward values and the information gained about each deck during previous experience: the gamma-knowledge Reinforcement Learning model (gkRL; Cogliati Dezza et al., 2017; Cogliati Dezza et al., 2019). This model was previously validated for this task and explained participants’ behavior better than alternative RL models (Friston, 2010).
Expected reward values were learned by gkRL adopting on each trial a simple δ learning rule (Rescorla and Wagner, 1972):

$$R_{t+1}(c) = R_t(c) + \alpha \, \bigl( r_t - R_t(c) \bigr) \quad (1)$$

where $R_t(c)$ is the expected reward value for deck $c$ on trial $t$, $r_t$ is the payoff observed on trial $t$, and $\alpha$ is the learning rate.
Information was computed as follows:

$$I_t(c) = \bigl( n_t(c) \bigr)^{\gamma} \quad (2)$$

where $I_t(c)$ is the amount of information associated with deck $c$ on trial $t$, $n_t(c)$ is the number of times deck $c$ has been sampled up to trial $t$, and $\gamma$ is a free parameter scaling the weight given to information.
Before selecting an option, gkRL subtracts the information gained $I_t(c)$ from the expected reward value $R_t(c)$:

$$V_t(c) = R_t(c) - I_t(c) \quad (3)$$

where $V_t(c)$ is the final value associated with deck $c$ on trial $t$.
In order to generate choice probabilities based on expected reward and information values (i.e., final choice values), the model uses a softmax choice function (Daw and Doya, 2006). The softmax rule is expressed as:

$$P_t(c) = \frac{e^{\beta V_t(c)}}{\sum_{c'} e^{\beta V_t(c')}} \quad (4)$$

where $\beta$ is the inverse temperature that determines the degree to which choices are directed toward the highest-valued option. The free parameters $\alpha$, $\gamma$, and $\beta$ were estimated for each participant by minimizing the negative log likelihood of the observed choices.
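A minimal sketch of the gkRL computations (Equations 1–4) follows. Note that the count-based form of the information term is an assumption about the implementation, and the numbers are illustrative:

```python
import numpy as np

def gkrl_choice_probs(rewards_per_deck, alpha, gamma, beta):
    """Sketch of gkRL: delta-rule reward learning (Eq. 1), a count-based
    information term (Eq. 2, assumed form), their difference (Eq. 3),
    and a softmax over the final values (Eq. 4)."""
    n_decks = len(rewards_per_deck)
    R = np.zeros(n_decks)                 # expected reward value per deck
    n = np.zeros(n_decks)                 # observations per deck
    for c, rewards in enumerate(rewards_per_deck):
        for r in rewards:
            R[c] += alpha * (r - R[c])    # Eq. 1: delta rule
            n[c] += 1
    I = n ** gamma                        # Eq. 2 (assumed count-based form)
    V = R - I                             # Eq. 3: subtract information
    expV = np.exp(beta * (V - V.max()))   # Eq. 4: numerically stable softmax
    return expV / expV.sum()

# Forced-choice history: deck 0 sampled twice, deck 1 once, deck 2 three times
p = gkrl_choice_probs([[50, 52], [48], [55, 53, 51]],
                      alpha=0.5, gamma=0.5, beta=0.2)
```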
Instrumental value of information
In order to approximate the instrumental utility of options in our task, we turn to Bayesian modeling. In the simplest case, a decision-maker’s choice when confronted with multiple options depends on its beliefs about the relative values of those options. This requires the decision-maker to estimate, based on prior experience, relevant parameters such as the mean value and variance of each option. On the one hand, the mean and variance of an option can be estimated through direct experience with that option via repeated sampling. On the other hand, subjects may also estimate long-term reward contingencies: even if an option has a specific mean reward during one game in our task, subjects may learn an estimate of the range of rewards that options can have even before sampling from any options. Similarly, although subjects may learn an estimate of the variance of a specific option during the forced-choice period, over many games subjects may learn the range of variances that options tend to have.
To model this, we developed a Bayesian learner that estimates, during each game, the probability distribution over reward and variance for each specific option in that game, and, over the entire experiment, estimates the global distribution over mean reward and variance based on observed rewards from all options. A learner’s belief about an option can be modeled as a joint probability distribution over likely values for the mean reward and variance of that option.
To model the training received by each subject prior to participation in the experiment, the Bayesian learner was simulated on forced choices from 10 random games generated from the same routine used to generate trials during the experiment. After each chosen payoff was displayed, the global probability distribution over mean and variance was updated according to Bayes’ rule:

$$P_{t+1}(\mu, \sigma) \propto N(r_t \mid \mu, \sigma) \, P_t(\mu, \sigma) \quad (5)$$

where $N(r \mid \mu, \sigma)$ is the probability of observing reward $r$ under a normal distribution with mean $\mu$ and standard deviation $\sigma$.
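Equation 5 can be implemented numerically as a grid update over candidate (mean, standard deviation) pairs. The grid ranges below are illustrative; the text does not specify the discretization actually used:

```python
import numpy as np

mus = np.arange(1, 101, dtype=float)    # candidate mean rewards
sigmas = np.arange(2, 21, dtype=float)  # candidate standard deviations
MU, SIG = np.meshgrid(mus, sigmas, indexing="ij")

def normal_pdf(r, mu, sigma):
    return np.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def update_belief(prior, r):
    """Eq. 5: posterior over (mu, sigma) is proportional to the Gaussian
    likelihood of the observed reward times the prior."""
    post = prior * normal_pdf(r, MU, SIG)
    return post / post.sum()

belief = np.full(MU.shape, 1.0 / MU.size)   # uniform prior
for r in [48, 53, 50]:                      # rewards observed for one option
    belief = update_belief(belief, r)

expected_mu = (belief * MU).sum()           # posterior expected mean reward
```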
Following the initial training period, the model performed the experiment using the games experienced by the subjects themselves; that is, during the forced-choice period, the model made the same choices and observed the same point values seen during the experiment. To model option-specific estimates, the model maintained three probability distributions over mean and variance, one for each option.
The Bayesian learner described above learns to estimate the probability distribution over the mean and variance for each option during the forced-choice component of the experiment. If the learner’s only concern in the free-choice phase is to maximize reward for the next choice, it would select the option with the highest expected value. However, in our task, subjects are instructed to maximize their total return for a variable number of trials with the same set of options. In some circumstances, it is better to select from under-sampled decks that may ultimately have a higher value than the current best estimate.
To model this, we implemented a forward tree search algorithm (Ross et al., 2022; Ghavamzadeh et al., 2015) which considers all choices and possible outcomes (states) reachable from the current state, updates the posterior probability distribution for each subsequent state as described above, and repeats this from the new state until a fixed number of steps have been searched. By conducting an exhaustive tree search to a given search depth, it is possible to determine the Bayes optimal choice at the first free-choice trial in our experiment.
In practice, however, it is usually infeasible to perform an exhaustive search for all but the simplest applications (limited branching factor, limited horizon). In our experiment, the outcome of a choice was an integer from 1 to 100 (number of points), and the model could select from three different options, yielding a branching factor of 300. The maximum number of free-choice trials available on a given game was 6, meaning that a full search would consider 300^6 possible states at the terminal leaves of the tree. In order to reduce the time needed to perform a forward tree search of depth 6, we applied a coarse discretization to the possible reward outcomes.
The value of a state was modeled as the number of points received for reaching that state, plus the maximum expected value of subsequent states that could be reached. Thus, the value of leaf states was simply the expected value of the probability distribution over means (numerically integrated over $\mu$ and $\sigma$), while the value of each preceding state was the points received at that state plus the maximum expected value of the leaf states reachable from it:
$$V(s) = r(s) + \max_{c} \, \mathbb{E}\bigl[ V(s') \mid s, c \bigr] \quad (6)$$

where $r(s)$ is the number of points received for reaching state $s$ and the expectation is taken over the states $s'$ reachable from $s$ by choosing option $c$.
Recursively applying Equation 6 from the leaf states back to the first free-choice trial allows us to approximate the Bayes optimal long-term value of each option (i.e., the Bayes Instrumental Value), which includes both its reward benefit and its information benefit. The instrumental value of information is the difference between this quantity and the reward value obtained from an option without receiving information (i.e., the Reward Value without information), so the latter was also computed. To do so, the Bayesian procedure was repeated with the model constrained not to update its belief distribution based on the information provided on the first free-choice trial. The expected instrumental value of information (Instrumental Information) for each option on the first free-choice trial, given the forced-choice trials specific to that game, was then computed as:
$$\text{Instrumental Information}(c) = \text{Bayes Instrumental Value}(c) - \text{Reward Value without information}(c) \quad (7)$$
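As a toy illustration of the tree-search backup (Equation 6) and the difference in Equation 7, the sketch below uses a deliberately simplified belief representation: each option has a small discrete set of candidate mean rewards, and observing an outcome fully identifies that option's mean. The actual model instead maintains full posteriors over mean and variance and discretizes the outcome space:

```python
# Toy beliefs for three options: candidate mean reward -> probability.
beliefs = [
    {40: 0.5, 60: 0.5},  # under-sampled, uncertain option
    {50: 1.0},           # well-known option
    {50: 1.0},           # well-known option
]

def search_value(beliefs, depth):
    """Eq. 6 backup: a state's value is the best option's expected
    immediate reward plus the value of the state it leads to."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for c, b in enumerate(beliefs):
        v = 0.0
        for mu, p in b.items():          # possible (discretized) outcomes
            nxt = list(beliefs)
            nxt[c] = {mu: 1.0}           # posterior after observing mu
            v += p * (mu + search_value(nxt, depth - 1))
        best = max(best, v)
    return best

def option_value(beliefs, c, depth, update=True):
    """Value of choosing option c now and acting optimally afterwards.
    With update=False, the first outcome does not update the belief,
    giving the Reward Value without information."""
    v = 0.0
    for mu, p in beliefs[c].items():
        nxt = list(beliefs)
        if update:
            nxt[c] = {mu: 1.0}
        v += p * (mu + search_value(nxt, depth - 1))
    return v

depth = 3  # remaining free-choice trials in this toy game
instrumental_info = [
    option_value(beliefs, c, depth) - option_value(beliefs, c, depth, update=False)
    for c in range(3)
]
```

In this toy setting, the uncertain option carries a positive instrumental value of information, while the fully known options carry none.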
Finally, because subjects were not told the reward distributions used in the task, and might therefore develop different beliefs, the above procedure may not reflect each individual’s subjective estimate; rather, it reflects an objective estimate of the instrumental value of information.
fMRI analysis
The first four volumes of each functional run were discarded to allow for steady-state magnetization. The data were preprocessed with SPM12 (Wellcome Department of Imaging Neuroscience, Institute of Neurology, London, UK). Functional images were motion corrected (by realigning to the first image of the run). The structural image was coregistered to the functional images.
All the fMRI analyses focus on the time window associated with the onset of the first free-choice trial, before the choice was actually made (see Procedure). The rationale for our model-based analysis of fMRI data is as follows (also summarized in Supplementary file 4). First, in order to link participants’ behavior with neural activity,
Additional GLMs were then used for the control analyses:
To determine the regions associated with Reward and Information Gain, beta weights for the first (single-modulator GLMs) or second (two-modulator GLMs) parametric modulators were entered into a second-level (random effects) paired-sample t-test. In order to determine activity related to the combination of information and reward value,
Activity for these GLMs is reported either in the Results section or in Supplementary file 5.
In order to denoise the fMRI signal, 24 nuisance motion regressors were added to the GLMs, where the standard realignment parameters were nonlinearly expanded by incorporating their temporal derivatives and the corresponding squared regressors (Friston et al., 1996). Furthermore, in GLMs with two parametric modulators, regressors were standardized to avoid the possibility that parameter estimates were affected by different scaling of the models’ regressors, along with the variance they might explain (Erdeniz et al., 2013). During the second-level analyses, we corrected for multiple comparisons in order to avoid false positives (Chumbley and Friston, 2009). We corrected at the cluster level using both FDR and FWE. Both corrections gave similar statistical results; therefore we report only the FWE corrections.
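The 24-regressor motion expansion described above can be sketched as follows, assuming the construction named in the text (the six realignment parameters, their temporal derivatives, and the squares of both):

```python
import numpy as np

def motion_regressors_24(rp):
    """Expand 6 realignment parameters into 24 nuisance regressors:
    [params, temporal derivatives, params^2, derivatives^2].
    rp: array of shape (n_scans, 6)."""
    rp = np.asarray(rp, dtype=float)
    deriv = np.vstack([np.zeros((1, rp.shape[1])), np.diff(rp, axis=0)])
    return np.hstack([rp, deriv, rp ** 2, deriv ** 2])  # (n_scans, 24)
```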
© 2022, Cogliati Dezza et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.