Two varieties of reinforcement learning: Striatal & Prefrontal/Parietal?

Recent work has leveraged increasingly sophisticated computational models of neural processing as a way of predicting the BOLD response on a trial-by-trial basis. The core idea behind much of this work is that reinforcement learning is a good model for the way the brain learns about its environment; the specific idea is that expectations are compared with outcomes so that a "prediction error" can be calculated and minimized through reshaping expectations and behavior. This simple idea leads to exceedingly powerful insights into the way the brain works, with numerous applications to improving learning in artificial agents, to understanding the role of exploration in behavior and development, and to understanding how the brain exerts adaptive control over behavior.
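To make the core idea concrete, here is a minimal sketch of a prediction-error update (a Rescorla-Wagner-style rule; the learning rate and the outcome sequence are my own illustrative choices, not values from any paper):

```python
# Minimal sketch of the prediction-error idea (illustrative values).
# The value estimate V is nudged toward each observed outcome r by the
# prediction error delta = r - V, scaled by a learning rate alpha.

def update_value(V, r, alpha=0.1):
    """Move the expectation V toward the outcome r."""
    delta = r - V          # prediction error: outcome minus expectation
    return V + alpha * delta

V = 0.0
for r in [1, 1, 0, 1]:     # a hypothetical sequence of reward outcomes
    V = update_value(V, r)
```

Over many such trials, V converges toward the average reward, and the prediction errors shrink: exactly the "minimized through reshaping expectations" dynamic described above.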

So far, however, neuroimaging and electrophysiology suggest that these prediction error signals can be found throughout much of the brain, including large swaths of parietal, frontal, and striatal areas.

This is where a 2010 Neuron paper by Gläscher, Daw, Dayan & O'Doherty comes to the rescue. Traditionally, reinforcement learning has been viewed as a somewhat monolithic entity, in which expected rewards are compared with reward outcomes to generate a "reward prediction error." It's easy to imagine that most of the brain might light up in response to rewards. But Gläscher et al. take this a step further, and dissociate between two flavors of reinforcement learning (RL):

Model-based RL: learns about the association of states with one another by producing an internal model of state transitions, without respect to reward

Model-free RL: learns about the direct associations of states with rewards

This distinction is important: model-free RL can learn the average reward expected from a second-order stimulus (state...reward), but it cannot condition that expectation on the actions available in that state, even though those actions might move the agent into a new state where expected reward is higher. In contrast, model-based RL can learn about exactly that kind of state-action-state transition.
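A toy sketch of the two learning signals might look like the following (the table sizes, learning rate, and function names are my own illustration, not the authors' fitted model). The point is that the model-free update needs a reward, while the model-based update runs on observed state transitions alone:

```python
# Illustrative contrast between the two learning signals (a sketch).
import numpy as np

n_states, n_actions = 3, 2
alpha = 0.1  # learning rate (illustrative)

# Model-free: a value table updated by a reward prediction error.
Q = np.zeros((n_states, n_actions))

def model_free_update(s, a, r):
    delta_r = r - Q[s, a]          # reward prediction error
    Q[s, a] += alpha * delta_r

# Model-based: transition probabilities T(s, a, s') updated by a
# state prediction error; no reward is involved.
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)

def model_based_update(s, a, s_next):
    # Decaying all transitions from (s, a) and crediting the observed
    # one is equivalent to T[s, a, s_next] += alpha * (1 - T[s, a, s_next]),
    # i.e., an update driven by the state prediction error (surprise).
    T[s, a] *= (1 - alpha)
    T[s, a, s_next] += alpha
```

Note that `model_based_update` keeps each row of `T` a proper probability distribution (the decay and the credit sum to the original mass plus alpha times the surprise).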

To assess whether these conceptually different flavors of RL have any neural basis, Gläscher et al. used fMRI to scan subjects performing a "sequential two-choice Markov decision task." It should come as no surprise that this task has many similarities to those used in the hierarchical control literature, which I've covered briefly over the last week.

The task is devised as follows: subjects first observe a series of three stimuli appear on the screen, each one appearing after the subject responds to the previous; they're told how to respond to each, and no reward is provided at the end of a series. This phase allows the model-based system to learn which stimuli tend to follow which others, and with what probability - i.e., state transitions - but provides no information for the model-free system, because no reward is provided, and reward prediction error is the core calculation performed by that system.

In a second phase of the experiment, subjects were explicitly trained on the reward outcomes associated with the possible stimuli that could occur last in the series of three. In the third and final phase, they completed the same serial choice task as in phase 1, except that they weren't told how to respond to each stimulus, and reward was provided according to the contingencies they had learned in phase 2. Thus, subjects had to find their way "through the decision tree" to acquire maximal reward - putatively by integrating a reward prediction error, now calculated by the model-free system, with the "state" prediction error learned in the phase where no rewards were provided.

Indeed, subjects seemed to perform this kind of integration across RL systems, as indicated by significantly above-chance performance on their first trial in the third phase (p<.05, one-tailed). The authors then modeled individual subjects' choices in the third phase at the trial-by-trial level to see if these were best captured by a combination of model-free and model-based learning, or by either one alone. The combined model fit best, suggesting that subjects were integrating the two RL systems during behavior.
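The kind of hybrid model at issue here can be sketched as a weighted mix of the two systems' action values fed through a softmax choice rule (hedged: the weight `w` and inverse temperature `beta` below are illustrative placeholders, not the authors' fitted parameters, and the exact combination rule in the paper may differ):

```python
# Sketch of a hybrid choice model: a weighted mix of model-free (q_mf)
# and model-based (q_mb) action values, passed through a softmax.
import numpy as np

def choice_probs(q_mf, q_mb, w=0.5, beta=3.0):
    """Return choice probabilities over actions.

    w    : weight on the model-based values (illustrative)
    beta : softmax inverse temperature (illustrative)
    """
    q = w * np.asarray(q_mb) + (1 - w) * np.asarray(q_mf)
    e = np.exp(beta * (q - q.max()))   # subtract max for numerical stability
    return e / e.sum()
```

Fitting `w` per subject is one way to ask how much each system contributes to behavior: w near 1 means choices track the learned transition model, w near 0 means they track direct stimulus-reward associations.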

Although the task was probabilistic, with the same probabilities used across subjects, different subjects experienced different sequences of state transitions in phase 1; the authors found that incorporating those individual experiences into their models yielded a significantly better fit to subjects' trial-by-trial behavior.

fMRI demonstrated that across phases 1 and 3, estimates of prediction error from the model-based system (aka "state prediction error") predicted neural activity in lateral prefrontal cortex (the dorsal bank of posterior IFG) and in the posterior intraparietal sulcus. ROI analyses indicated that these areas also showed significant effects in phase 1 alone, consistent with the idea that they implement a model-based RL system even in the absence of rewards. Estimates of prediction error from the model-free system (aka reward prediction error) in the third phase showed no consistent modulation in cortex, but only in the ventral striatum - an area long implicated in classical, model-free reinforcement learning.

One area is not like the others, however: only the posterior parietal cortex showed significantly greater correspondence to the model-based estimates of prediction error than to those from the model-free system. And only activity in this same region significantly correlated with optimal behavior in the third phase, suggesting parietal cortex is critical for the kind of model-based prediction error investigated here. What's surprising is that the same cannot be said of lateral prefrontal cortex, which many would have believed to be involved in model-based learning. The authors are a little more willing to interpret their (multiple-comparison uncorrected) lateral prefrontal results than I am.

Under this more skeptical read, it may suggest that lateral prefrontal cortex plays a different or more integrative role across both model-free and model-based learning. I think this conclusion is largely consistent with the authors' read, and consistent with some modeling work emphasizing the abstraction of prefrontal representations, but also puts prefrontal cortex at an uncomfortable distance from the "selection for action" control representations typically ascribed to it.

