Predictive coding in machines and brains
The name “predictive coding” has been applied to a number of engineering techniques and scientific theories. All these techniques and theories involve predicting future observations from past observations, but what exactly is meant by “coding” differs in each case. Here is a quick tour of some flavors of “predictive coding” and how they’re related.
What is “coding”?
In signal processing and related fields, the term “coding” generally means putting a signal into some format where the signal will be easier to handle for some task.
In any coding scheme, there is an encoder, which puts the input signal into the new format, and a decoder, which puts the encoded signal back into the original format (or as close as possible to the original). The “code” is the space of possible encoded signals. For example, the American Standard Code for Information Interchange (ASCII) consists of a set of 7-bit representations of commonly used characters (usually stored one per 8-bit byte). Your computer doesn’t understand characters, but it does understand bits, so we use a bit-based coding scheme to represent characters in a computer.
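To make the encoder/decoder picture concrete, here is a tiny Python sketch (purely illustrative) that uses ASCII as the code: characters go in, bit strings come out, and the decoder recovers the original text.

```python
# A tiny illustration of an encoder/decoder pair, with 7-bit ASCII as the code.
def encode(text):
    return [format(ord(c), "07b") for c in text]   # each character -> 7 bits

def decode(bits):
    return "".join(chr(int(b, 2)) for b in bits)

bits = encode("predict")
print(bits)           # ['1110000', '1110010', '1100101', ...]
print(decode(bits))   # 'predict'
```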
Some other common types of coding are:
- Source coding (also known as “data compression”). The encoder compresses the input into a small bitstream; the decoder decompresses that bitstream to get back the input. Examples: arithmetic coding (used in JPEG), Huffman coding (used to compress neural networks).
- Channel coding (also known as “error correction”). The encoder adds redundancy to protect a message from noise in the channel through which it will be transmitted; the decoder maps the noisy received signal back to the most likely original message, with the help of the added redundancy. Examples: low-density parity-check codes (used in your cell phone and your hard drive), Reed-Solomon codes (used in CDs, for those old enough to remember this technology). (A toy repetition-code sketch follows this list.)
- Encryption. (Encryption could also be called “receiver coding” because it considers possible receivers, but nobody uses this terminology.) The encoder encrypts the message into a format such that only a receiver with the appropriate keys can open the message; the decoder decrypts the encrypted message into the readable original. Examples: RSA (used in HTTPS), DES (once widely used, but now considered broken).
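As a toy illustration of channel coding, here is a sketch of a 3x repetition code in Python. Real systems use far more efficient codes like the LDPC and Reed-Solomon codes above, but the idea is the same: the encoder adds redundancy, and the decoder uses that redundancy to undo the channel’s noise.

```python
# A minimal sketch of channel coding with a 3x repetition code.
def encode(bits):
    return [b for b in bits for _ in range(3)]           # repeat each bit 3 times

def decode(received):
    triples = [received[i:i + 3] for i in range(0, len(received), 3)]
    return [int(sum(t) >= 2) for t in triples]           # majority vote per triple

coded = encode([1, 0, 1])        # -> [1, 1, 1, 0, 0, 0, 1, 1, 1]
coded[4] = 1                     # simulate a bit flip in the noisy channel
print(decode(coded))             # -> [1, 0, 1], the original message survives
```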
Aside: The “source coding” and “channel coding” lingo comes from Claude Shannon’s wonderful 1948 paper “A Mathematical Theory of Communication”.
Predictive coding for data compression
OK, that’s coding in general. So what’s predictive coding?
The term “predictive coding” was coined in 1955 by Peter Elias. Specifically, what Elias proposed was a method called linear predictive coding (LPC) for communication systems.
In LPC, the next sample of a signal is predicted using a linear function of the previous $n$ samples. Then the error between the predicted sample and the actual sample is transmitted, along with the coefficients of the linear predictor. Predicting a sample from nearby previous samples works because in signals like speech, nearby samples are strongly correlated with each other.
The idea behind transmitting the error in LPC is that if we have a good predictor, the error will be small; thus it will require less bandwidth to transmit than the original signal. (So here “coding” specifically refers to “source coding”, or compression.)
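Here is a minimal sketch of the LPC idea in NumPy, just to make the transmit-the-error story concrete. It fits the predictor coefficients by least squares over the whole signal and skips everything a real codec needs (quantization, framing, entropy coding of the residual), so treat it as an illustration rather than a working codec.

```python
# A minimal, illustrative sketch of linear predictive coding with NumPy.
import numpy as np

def lpc_encode(signal, order=4):
    """Fit linear predictor coefficients by least squares; return them along
    with the prediction errors (the residual that would be transmitted)."""
    # Each row of X holds the `order` samples preceding the corresponding sample in y.
    X = np.stack([signal[i:len(signal) - order + i] for i in range(order)], axis=1)
    y = signal[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coeffs            # small if the signal is predictable
    return coeffs, residual

def lpc_decode(coeffs, residual, first_samples):
    """Rebuild the signal from the coefficients, residual, and initial samples."""
    order = len(coeffs)
    signal = list(first_samples)
    for e in residual:
        signal.append(np.dot(coeffs, signal[-order:]) + e)
    return np.array(signal)

# Toy usage: a noisy sine wave is highly predictable, so the residual is tiny
# compared to the signal, and the decoder reconstructs the signal exactly.
t = np.linspace(0, 1, 800)
x = np.sin(2 * np.pi * 8 * t) + 0.01 * np.random.randn(len(t))
coeffs, residual = lpc_encode(x, order=4)
x_hat = lpc_decode(coeffs, residual, x[:4])
print(np.std(residual) / np.std(x))      # residual energy << signal energy
print(np.max(np.abs(x - x_hat)))         # ~0: lossless given the exact residual
```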
LPC has had a long history of successes for audio compression. The Speak & Spell used LPC to store and synthesize speech sounds. The old Game Boy soundchip mostly used simple square wave beeps ‘n’ boops to make music, but sometimes used a special case of LPC known as DPCM to store certain sounds, like Pikachu’s voice in Pokémon Yellow. (See this video for a great overview of this and other old-school soundchips. Audio compression was crucial back when game cartridges had limited data storage.) The speech codec used in Skype uses LPC, combined with a bunch of other gadgets.
Predictive coding is also used in video compression, under the name “motion compensation”. Like adjacent audio samples, adjacent video frames are strongly correlated, and so can be predicted from each other. If you haven’t already, it’s good to take a moment to have your mind blown by H.264 video compression—the digital equivalent of shrinking a 3000-pound car to 0.4 pounds.
And of course, linear models are not the only way to do predictive coding; nonlinear models like neural networks can be used as well. A speech codec using WaveNet has been reported to get lower bitrate for the same quality as some traditional speech codecs.
Predictive coding for representation learning
Linear predictive coding and friends can be thought of as special cases of something called an autoregressive model. An autoregressive model splits a complicated joint probability distribution over sequences into a product of simpler conditional distributions, one per time step, each of which is much easier to handle.
Let $\mathbf{x} = \{x_1, x_2, \dots\}$ denote our sequence of interest. In an autoregressive model, the joint distribution $p(\mathbf{x})$ is defined as

$$p(\mathbf{x}) = \prod_{t} p(x_t | x_{t-1}, x_{t-2}, \dots).$$
The $p(x_t|x_{t-1}, x_{t-2}, \dots)$ term can be implemented in different ways. In something simple like an $n$-gram model for text, $p(x_t|x_{t-1}, x_{t-2}, \dots)$ is just a lookup table containing the probability of the next letter being $x_t$, given that the previous letters were $x_{t-1}, x_{t-2}, \dots$. Another way to implement $p(x_t|x_{t-1}, x_{t-2}, \dots)$ is to feed $x_{t-1}, x_{t-2}, \dots$ into a neural network, which outputs a feature vector $h$, and then predict $x_t$ using a linear model on top of $h$, like a softmax classifier (for discrete $x_t$) or a linear regression model (for real-valued $x_t$).
Such a model is called “autoregressive” because if you want to sample from the distribution it defines, you must regress on (= “regressive”) the model’s own (= “auto”) previous outputs, feeding each sampled output back in to predict the next one.
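As a toy illustration of both pieces, here is a character-level bigram model (the simplest lookup-table case, with $n = 2$) plus a sampling loop that feeds the model’s own outputs back in. The corpus is made up for illustration; a neural implementation would just swap the lookup table for a network with a softmax on top.

```python
# A minimal sketch: p(x_t | x_{t-1}) as a lookup table of character counts,
# plus autoregressive sampling that conditions on the model's own outputs.
import random
from collections import Counter, defaultdict

def fit_bigram(text):
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    # Normalize the counts into conditional distributions p(x_t | x_{t-1}).
    return {prev: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for prev, ctr in counts.items()}

def sample(model, start, length):
    out = [start]
    for _ in range(length):
        dist = model[out[-1]]            # condition on the model's own last output
        out.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return "".join(out)

model = fit_bigram("the cat sat on the mat and the dog sat on the rug")
print(model["t"])               # distribution over the characters that follow 't'
print(sample(model, "t", 30))   # babble generated by feeding outputs back in
```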
It turns out that if you train a neural network as an autoregressive model, the internal representations learned by the network will work really well for supervised downstream tasks. Work like ULMFiT and World Models showed that this trick can make neural nets a lot more data-efficient.
Notice that using the network’s internal representations is somewhat different from what we did in LPC: whereas in LPC the outputs are the errors (because the purpose is compression), in these autoregressive feature extractors the outputs are the features (because the purpose is extracting discriminative features). So in both cases we are predicting future observations, but the “coding scheme” and the purpose of predicting the future are different.
A note about audio. Autoregressive modeling with big neural nets operating on the raw audio signal works—WaveNet did this for generative modeling, and the representations could be used on a downstream task (speech recognition)—but it is expensive to do because audio signals are high-dimensional.
Two methods have been developed to make it easier to do autoregressive modeling for audio: contrastive predictive coding and autoregressive predictive coding.
The first method, contrastive predictive coding (CPC), works by first encoding the input signal into a much lower-dimensional sequence of feature vectors using a convolutional neural network, and then training an autoregressive model on top of this sequence. If the model were simply trained to predict the encoder’s future outputs, the encoder could learn to output all 0s to make the objective trivially easy to optimize; the autoregressive component is instead trained using a contrastive loss: it must guess whether a sample is actually the next sample in the sequence, or a fake, which prevents the encoder from collapsing to a trivial representation. The technique also works well for other high-dimensional signals, like images (represented as a sequence of pixels). Note that strictly speaking CPC is not a true autoregressive model because we can’t draw samples from it.
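To give a flavor of the contrastive objective, here is a NumPy sketch of an InfoNCE-style loss with a bilinear score, in the spirit of CPC. All the names and shapes here (`c_t`, `z_true`, `z_fake`, the random vectors standing in for learned encodings) are illustrative assumptions; a real implementation would backpropagate this loss into the convolutional encoder and the autoregressive model.

```python
# An illustrative InfoNCE-style contrastive loss in NumPy.
import numpy as np

def info_nce_loss(context, positive, negatives, W):
    """Cross-entropy of picking out the true future encoding from the fakes,
    scored with a bilinear form z^T W c."""
    candidates = np.vstack([positive, negatives])        # (1 + K, dim_z)
    scores = candidates @ W @ context                    # one score per candidate
    scores -= scores.max()                               # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())    # log-softmax over candidates
    return -log_probs[0]                                 # the true future is index 0

rng = np.random.default_rng(0)
dim_z, dim_c, num_negatives = 16, 8, 10
W = 0.1 * rng.normal(size=(dim_z, dim_c))                # learned bilinear weights
c_t = rng.normal(size=dim_c)                             # autoregressive context
z_true = rng.normal(size=dim_z)                          # encoding of the true future
z_fake = rng.normal(size=(num_negatives, dim_z))         # encodings from elsewhere
print(info_nce_loss(c_t, z_true, z_fake, W))             # loss to minimize
```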
Autoregressive predictive coding (APC) takes a slightly different approach. Instead of modeling the raw audio, it extracts low-dimensional frequency-domain features, and then does plain old autoregressive modeling on top of those features. The disadvantages are that 1) some low-level information in the original signal may get thrown out, and 2) you now need to hand-craft some feature extraction, since what works for audio will not necessarily work for other modalities. (Incidentally, I think “autoregressive predictive coding” is not a very good name, because WaveNet is already an “autoregressive” “predictive coding” model. Engineers are not that great at naming things. Oh well.)
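Here is a rough sketch of that two-stage recipe under heavy simplification: magnitude-spectrum frames from a short-time Fourier transform play the role of the hand-crafted features (the APC papers use log Mel spectrograms), and a one-step linear predictor stands in for the recurrent network they actually train.

```python
# An illustrative two-stage recipe: hand-crafted spectral features, then a
# (very crude) autoregressive predictor on top of those features.
import numpy as np

def stft_features(signal, frame_len=128, hop=64):
    """Magnitude spectrum of overlapping, windowed frames."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))           # (num_frames, freq_bins)

rng = np.random.default_rng(0)
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000) + 0.1 * rng.normal(size=16000)
feats = stft_features(audio)
past, future = feats[:-1], feats[1:]                     # predict frame t+1 from frame t
W, *_ = np.linalg.lstsq(past, future, rcond=None)        # one-step linear predictor
print(np.mean((past @ W - future) ** 2))                 # prediction error on features
```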
A nice paper comparing CPC and APC was recently published at the ICML 2020 Workshop on Self-supervision in Audio and Speech—check it out here, if you’re interested in speech models.
The drawback of using these predictive coding models for representation learning is that the representations we get are “unidirectional”: that is, they extract information only from the past, and not the future, to represent the current input. That’s a problem because the future is often very informative for interpreting the present. If you think of a phrase like “milk the cow”, we know that “milk” is a verb, and not a noun, from the words that follow it.
One way to overcome this problem is to do autoregression in both directions and concatenate the representations from the forward and backward models, as is done in ELMo. Alternatively, models like BERT use a bidirectional context and minimize a contrastive or denoising loss instead. But for tasks in which observations need to be processed in “real time”, as is usually the case in control problems, a unidirectional context makes more sense.
Predictive coding for computational efficiency
Another really neat thing forward predictive models can do is tell you roughly whether an observation is “difficult” or not. The idea is this: if an input is surprising—if, for example, your autoregressive model assigns low probability to it—it contains more information, and it is therefore probably worth more attention.
This observation suggests yet another use for predictive coding: we can allocate less computation to more predictable inputs, a type of “conditional computation”. Jürgen Schmidhuber’s “Neural Sequence Chunker” is an early instance of this idea, in which predictable inputs are ignored and not sent to a subsequent neural network, and (shameless plug!) recently I wrote a paper describing a slightly more general version of the idea.
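Here is a sketch of what surprisal-gated computation might look like, with made-up ingredients: a `cheap_model` that assigns a probability to each new input given the previous one, an `expensive_model` that does the heavy processing, and a hand-picked surprisal threshold. This is just the general shape of the idea, not the algorithm from either paper.

```python
# An illustrative sketch of conditional computation gated by surprisal.
import numpy as np

def process_sequence(xs, cheap_model, expensive_model, threshold):
    outputs, prev = [], None
    for x in xs:
        # Surprisal of the current input under the cheap predictive model.
        surprisal = -np.log(cheap_model(prev, x)) if prev is not None else np.inf
        if surprisal > threshold:
            outputs.append(expensive_model(x))   # surprising: spend the compute
        else:
            outputs.append(outputs[-1])          # predictable: reuse the last result
        prev = x
    return outputs

# Toy usage with stand-in models: repeated inputs are "predictable".
cheap = lambda prev, x: 0.9 if x == prev else 0.1
heavy = lambda x: x * 100                        # placeholder for expensive work
print(process_sequence([1, 1, 1, 2, 2, 3], cheap, heavy, threshold=1.0))
# -> [100, 100, 100, 200, 200, 300]
```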
While unsurprising inputs merit less computation, the inverse is not necessarily true: surprising inputs do not always merit more computation. Alex Graves in his Adaptive Computation Time paper gives an excellent example of unpredictable inputs—specifically, random ID numbers in Wikipedia metadata—which a neural network model does not bother to allocate more computation to, because the model realizes that throwing more processing time at the problem just won’t help. This implies that we can do better than just using predictability by learning when to use more or less computation. Still, predictability is a good inductive bias for conditional computation in neural nets—and it appears that human brains may use this inductive bias for similar purposes.
Predictive coding in the brain
Artificial neural networks in machine learning were inspired by scientific theories about the structure of the human brain. For predictive coding, it was the other way around: engineers (starting with Elias) developed predictive coding to solve certain problems in signal processing, and only afterwards did scientists realize that the brain might be doing something like what engineers had developed.
One such early inkling was described in Jeffrey Elman’s classic “Finding Structure in Time”. Elman trained a recurrent neural network to predict the next letter in a stream of letters. The streams were formed by taking sentences and removing the spaces between words. This is sort of analogous to the scenario of children learning language. Children are not told where the boundaries between words are; presumably they only hear a relatively unbroken stream of phonemes when adults speak.
What Elman found was that letters with high surprisal corresponded closely to the location of word boundaries. Predictive coding might therefore be one of the ingredients for language learning in humans: children could in theory infer word boundaries by noting which phonemes have high surprisal. (This method isn’t foolproof: my uncle reports that as a kid he thought “tractorworking” was a word because he so often heard “tractor” and “working” together, e.g. “there’s a tractor working in the field”.)
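Here is a toy version of this finding, with a character bigram model standing in for Elman’s recurrent network and made-up two-letter “words”. Inside a word the next letter is fully determined, but at a word boundary it isn’t, so the boundaries show up as spikes in surprisal.

```python
# A toy version of surprisal-based word segmentation (illustrative corpus).
import random
import numpy as np
from collections import Counter, defaultdict

random.seed(0)
words = ["do", "re", "mi", "fa"]
text = "".join(random.choice(words) for _ in range(500))   # unsegmented stream

counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def surprisal(prev, nxt):
    return -np.log(counts[prev][nxt] / sum(counts[prev].values()))

for prev, nxt in zip(text[:12], text[1:13]):
    print(f"{prev}->{nxt}: {surprisal(prev, nxt):.2f}")
# Within-word transitions (d->o, r->e, ...) have surprisal 0.00, while the
# transitions across word boundaries come out around log(4), about 1.4.
```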
Another piece of evidence for predictive coding in humans comes from studies of reading time. van Schijndel and Linzen put it nicely: “One of the most robust findings in the reading literature is that more predictable words are read faster than less predictable words… Word predictability effects fit into a picture of human cognition in which humans constantly make predictions about upcoming events and test those predictions against their perceptual input.”
There is an even more general “predictive coding hypothesis” which claims that the brain does predictive coding at every level: sensory signals are predicted by the neurons that receive them, and the prediction errors become the input to other neurons, which also make predictions and errors, and so on. This hypothesis is somewhat controversial, though, as can be seen from the many responses to Andy Clark’s “Whatever next? Predictive brains, situated agents, and the future of cognitive science” (the responses go from pg. 24 onward—hear also Grace Lindsay’s podcast episode on predictive coding, which discusses this essay).
The future of predictive coding
Regardless of the extent to which it actually happens in human brains, predictive coding is a very powerful idea. As researchers from OpenAI put it recently, it is “a universal unsupervised learning algorithm”. Indeed, OpenAI’s GPT-3—which is trained solely using next-step-prediction—can do all kinds of things it was never explicitly taught to do, like word arithmetic and Tom Swifty puns.
I suspect that more and more AI systems will have something like a predictive coding component built in. Forward predictive models are already needed for things like model-based reinforcement learning, where a model of the environment is used to plan by simulating and optimizing over possible trajectories; so, why not take advantage of the rich representations learned by those predictive models, and use them as inputs to subsequent processing?