
The Futile Quest for Autonomous Intelligence

Carlos E. Perez

Richard Sutton (of “The Bitter Lesson” fame) has a five-to-ten-year plan for AI research: The Alberta Plan for AI Research.

He proposes four principles:

1- Emphasis on ordinary observation (observe, reward, action)

2- Temporal uniformity

3- Scales with compute power

4- An environment includes other agents.

Here are the steps of his grand plan:

1. Representation I: Continual supervised learning with given features.

2. Representation II: Supervised feature finding.

3. Prediction I: Continual Generalized Value Function (GVF) prediction learning.

4. Control I: Continual actor-critic control.

5. Prediction II: Average-reward GVF learning.

6. Control II: Continuing control problems.

7. Planning I: Planning with average reward.

8. Prototype-AI I: One-step model-based RL with continual function approximation.

9. Planning II: Search control and exploration.

10. Prototype-AI II: The STOMP progression.

11. Prototype-AI III: Oak.

12. Prototype-IA: Intelligence amplification.
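Several of the prediction steps in this plan center on generalized value functions learned by temporal-difference methods. As a rough sketch of the core idea (my own illustration, not code from the Alberta Plan), here is a one-step TD(0) update for a linear GVF prediction, where the “cumulant” generalizes the reward signal:

```python
import numpy as np

def td0_update(w, x, x_next, cumulant, gamma, alpha):
    """One TD(0) update for a linear GVF prediction.

    w: weight vector; x, x_next: feature vectors of successive
    observations; cumulant: the signal being predicted (reward is
    the special case); gamma: continuation/discount factor.
    """
    delta = cumulant + gamma * np.dot(w, x_next) - np.dot(w, x)
    return w + alpha * delta * x

# Toy usage: learn to predict a constant cumulant of 1.0 with gamma = 0.
w = np.zeros(4)
x = np.ones(4) / 4.0               # fixed features for simplicity
for _ in range(2000):
    w = td0_update(w, x, x, cumulant=1.0, gamma=0.0, alpha=0.1)
prediction = np.dot(w, x)          # converges near 1.0
```

The same update rule, run continually over a stream of observations, is what “continual GVF prediction learning” refers to; only the cumulant and continuation function change from one predictive question to the next.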

My first impression is that it’s an “autonomy-first” design. It’s continually learning from the environment. It scales without algorithm changes but only through the increase of resources. It amplifies its learning through social interaction with other agents.

It’s a novel approach that takes to heart “The Bitter Lesson.” This system appears to evolve on its own and becomes more capable through the addition of resources and continual learning.

One very odd requirement is what Sutton describes as temporal uniformity. That is, the system does not prioritize resource usage and acts with the same sense of “urgency” at all times. It’s a “calm” stance that makes no decisions about compute resource allocation. This is indeed an intriguing architecture that may be worthwhile to monitor closely.

Yann LeCun (of Convolutional Neural Networks fame) proposes a new architecture to tackle what he describes as “autonomous intelligence.” He proposes his VICReg architecture (https://openreview.net/forum?id=xm6YD62D1Ub ), which is an extended version of Barlow Twins.

His system assumes Self-Supervision.

The Paradigm Shift of Self-Supervised Learning

In addition, LeCun assumes an intrinsic motivator in the form of an internal objective function (i.e., reward).

Bypassing the Agent Alignment Problem via Intrinsic Motivation

I do like his proposal because it draws explicit lines in the sand, rather than vague ones. He specifically mentions a covariance and a regularization architecture. Hypothesis-making in science is about making explicit predictions, not gut feelings, that may be validated through experiments. LeCun has thus pre-registered his conjectures. This is something worth applauding.

The striking aspect of LeCun’s formulation is that he does not favor Transformer Models:

Here is LeCun’s (@ylecun) argument against language models in more detail:

AI And The Limits Of Language | NOEMA

The first part seems to accuse wordcels of shallow understanding: “classrooms are filled with jargon-spouting students.” He correctly observes, “But the ability to explain a concept linguistically is different from the ability to use it practically.” Artificial fluency reflects mastery of conjuring up explanations, not mastery of usage.

LeCun writes in a manner reminiscent of Heidegger: “This broader, context-sensitive kind of learning and know-how is the more basic and ancient kind of knowledge, one which underlies the emergence of sentience in embodied critters and makes it possible to survive and flourish.” To his credit, it is about time that deep learning researchers acknowledged the importance of embodiment.

Embodied Learning is Essential to Artificial Intelligence

As well as the limitation of AGI development in the absence of a body:

How Can We Circumvent Moravec’s Paradox?

LeCun criticizes language: “Language is a very low-bandwidth method for transmitting information: isolated words or sentences, shorn of context, convey little. Moreover, because of the sheer number of homonyms and pronouns, many sentences are deeply ambiguous.”

But I will argue, based on Hoffmeyer’s concept of semiotic freedom, that precisely this ambiguity in language is a reflection of its utility. The symbol grounding required for language understanding is a manifestation of human-level cognition. So I don’t believe you achieve AGI without human language. I find it critically important to frame cognition through a language metaphor rather than only the embodied metaphor promoted by 4E enactivism (which LeCun fails to mention). 4E is important, but it’s not the entire picture.

The Language-Turn Metaphor and AGI

This is the high level depiction of LeCun’s proposal:

He includes a disclaimer that this is just a proposal, so many key modules still need to be elaborated. But let’s examine the parts that do have detail. Key in his neural architecture is a move towards covariance.

Joint-Embedding Predictive Architecture (JEPA) Architecture

LeCun’s lean toward covariance has a lot of merit. There was a time when I thought it was important for neural networks to discover invariant features. Then I realized that this was too restrictive and can’t realistically be achieved, so I came to see covariance as more pragmatic.

Invariance is what remains the same under a change of reference frame. Covariance is what transforms along with the reference frame: it can be transformed into different forms and transformed back into the same thing.

In the old days, the argument for convolutional networks was that they were invariant to translations, and thus could learn a more efficient transformation by ignoring information. Unfortunately, you could not scale this to a network that supported many, often conflicting, invariances.
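The distinction can be made concrete with a toy example (my own illustration). A convolution is actually translation-equivariant, i.e., covariant: shifting the input shifts the feature map by the same amount. Only when you pool the positions away do you get invariance, at the cost of discarding the positional information:

```python
import numpy as np

def conv1d(x, k):
    """'Same'-padded 1-D convolution (correlation) with kernel k."""
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

x = np.array([0., 1., 2., 3., 0., 0., 0., 0.])
k = np.array([1., 2., 1.])
shift = 2
x_shifted = np.roll(x, shift)

# Equivariance (covariance): shifting the input shifts the feature map.
equivariant = np.allclose(conv1d(x_shifted, k),
                          np.roll(conv1d(x, k), shift))

# Invariance: global pooling (here a sum) discards position entirely.
invariant = np.isclose(conv1d(x, k).sum(), conv1d(x_shifted, k).sum())
```

The pooled value is the same for every shift, which is exactly the “ignoring information” the paragraph above describes: efficient for one symmetry, but a network cannot stack many such discards for conflicting symmetries without throwing everything away.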

The way systems like Dall-E work is that they learn to transform text and images into the same thing (i.e., a latent embedding). So when you give one text, it transforms it into an embedding from which you can extract a semantically similar image.

The value of networks that have a bias towards certain kinds of invariances is that their embeddings are more faithful under perturbations. The embedding in Dall-E2 is a blended embedding that conforms to the symmetries in both language and images. It’s a chimeric embedding, and hence very good at conceptual blending. It doesn’t do as well with relationships between objects; the larger Imagen does much better there. But the differences between Dall-E2 and Imagen are noticeable: if you want art, go with Dall-E2.
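The joint-embedding idea itself fits in a few lines. This is a toy sketch (the encoders here are stand-in random linear maps, not anything from Dall-E’s actual implementation): two modality-specific encoders map into one shared latent space, and retrieval is nearest-neighbor by cosine similarity in that space:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Stand-in encoder: a linear map followed by L2 normalization,
    so that a dot product is a cosine similarity."""
    z = W @ x
    return z / np.linalg.norm(z)

# Hypothetical encoders for two modalities sharing one 8-d latent space.
W_text = rng.normal(size=(8, 16))
W_image = rng.normal(size=(8, 32))

text = rng.normal(size=16)                         # a "caption"
images = [rng.normal(size=32) for _ in range(5)]   # candidate "images"

z_text = encode(text, W_text)
z_images = np.stack([encode(im, W_image) for im in images])

# Retrieval: pick the image whose embedding is closest to the text's.
best = int(np.argmax(z_images @ z_text))
```

In a trained system the encoders are learned so that matching text/image pairs land close together; the chimeric quality described above comes from both modalities being forced into that single space.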

Systems that exploit covariance can do very impressive things but can also be brittle. What we need are more robust systems, and to get there we need to move toward even greater decoupling. Chimeras are nice, but they are dangerous. The path I see toward more robust systems is through conversational interfaces between complementary modules. The brain implements this on several levels, and perhaps that’s what newer network architectures should do.

In a recent interview, Yann LeCun has become very opinionated about the path to AGI. Of Transformer models like GPT-3, he says, “I think it’s missing essential pieces.” With respect to DeepMind’s favored approach of Reinforcement Learning, he says, “most of the learning we do, we don’t do it by actually taking actions, we do it by observing.” Of Generative Adversarial Networks (GANs), he says, “It has been a complete failure” (note: he once called them the greatest invention). Finally, with respect to probability models, he opines that it is “too much to ask for a world model to be completely probabilistic; we don’t know how to do it.”

But one important point is that he says, “Geoff Hinton had done something similar — I mean, certainly, him more than me, we see time running out. We’re not young.” Hence the urgency on his part and Hinton’s: the status quo is not enough! It’s clear that both believe they are close to achieving AGI but do not know if it’s within their lifetimes.

With regards to probability models, he recommends “to drop this entire idea.” I’ve said the same years ago:

Should Probabilistic Inference be Thrown Under the Bus?

His proposal: “JEPA architecture actually tries to find a tradeoff, a compromise, between extracting representations that are maximally informative about the inputs but also predictable from each other with some accuracy or reliability.” He mentions that backprop was elucidated previously by Henry J. Kelley, Arthur Bryson, and Lev Pontryagin in control theory, and before that goes back to the Lagrangian. LeCun claims he’s an “energy-based Bayesian.”
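The VICReg loss makes this tradeoff concrete: an invariance term pulls paired embeddings together (predictable from each other), a variance term keeps every embedding dimension informative, and a covariance term decorrelates dimensions so information isn’t duplicated. A simplified numpy sketch (the weighting coefficients follow the paper’s defaults, but details are condensed):

```python
import numpy as np

def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss on two (n, d) batches of embeddings of
    two views of the same inputs."""
    n, d = za.shape
    # Invariance: paired embeddings should match each other.
    sim = np.mean((za - zb) ** 2)

    # Variance: hinge keeps each dimension's std above 1 (anti-collapse).
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    # Covariance: penalize off-diagonal covariance (decorrelation).
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d

    return (sim_w * sim
            + var_w * (var_term(za) + var_term(zb))
            + cov_w * (cov_term(za) + cov_term(zb)))

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
# A collapsed representation (all embeddings identical constants) is
# penalized far more than a healthy, spread-out one.
loss_collapsed = vicreg_loss(np.zeros((256, 8)), np.zeros((256, 8)))
loss_healthy = vicreg_loss(z, z + 0.01 * rng.normal(size=(256, 8)))
```

The variance and covariance terms are what let JEPA-style training avoid the trivial solution (map everything to one point) without needing negative pairs, which is the sense in which the representation stays “maximally informative.”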

I have no idea what LeCun means by energy here in the context of cognition. The value of probability theory is that it’s useful for systematic decision-making. Energy is also involved in decision-making, in that it’s important to understand the cost of a decision. But one can’t make the analogy that probability and energy should be treated similarly.
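For readers unfamiliar with the energy-based framing: an energy function scores how compatible a configuration is (lower is better), and probabilities, when you insist on them, are recovered from energies via the Gibbs distribution p ∝ exp(−E/T). The appeal of working with energies directly is that a decision only needs the argmin, never the normalization. A minimal sketch:

```python
import numpy as np

def gibbs(energies, temperature=1.0):
    """Turn energies into normalized probabilities: p ∝ exp(-E/T)."""
    logits = -np.asarray(energies, dtype=float) / temperature
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

E = np.array([0.0, 1.0, 4.0])       # lower energy = more compatible
p = gibbs(E)                        # p[0] > p[1] > p[2]

# For decision-making, the argmin of energy already suffices;
# no normalization over all configurations is required.
best = int(np.argmin(E))
```

This is presumably the bridge LeCun has in mind when calling himself an “energy-based Bayesian”: the ranking information is the same, but the energy view drops the (often intractable) requirement that scores sum to one.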

So we have here two luminaries of machine learning with proposals for autonomous intelligence. The problem, as I see it, with pursuing autonomy is that it requires enormous computation. Think of the billions of years of co-evolution that led to complex mammals. That’s an enormous amount of computation to select the heuristics required for autonomous living things.

Nano-Intentionality and Deep Learning

We are ultimately hindered by the availability of computational resources.

Cognitive psychology derives its knowledge from studying human minds. But AI minds are not human minds and thus can develop an entirely different psychology. We should not assume that AI will develop the same way a human mind does. No serious person can claim that Von Neumann computers work like human minds (though, of course, Von Neumann made that analogy!). As deep learning develops, we will realize that it has many cognitive skills that do not exist in human minds. These are a kind of alien cognition.

At present, deep learning is bounded from the bottom by its inability to perform autonomous cognition. It’s also bounded at the top by its inability to formulate new abstractions. Humans, by contrast, are good at autonomy, and some educated humans are good at creating new abstractions.

An open question in Artificial General Intelligence work is whether autonomous cognition is a prerequisite for abstract cognition. I don’t recall anyone rendering a strong argument that it is. Most work in 4E and ecological psychology does not involve abstraction.

Given that human cognition is the only known model of general intelligence, we cannot see autonomous cognition being decoupled from abstract cognition. Nobody knows if these two are strongly or weakly coupled.

But allow me to take the less-intuitive argument that autonomous and abstract cognition can be decoupled or, at the extreme, are orthogonal to each other. How does this change our research agenda in our search for useful artificial intelligence tools?

The impact of computer hardware architecture on our deep learning models is often overlooked. I mean here that the gains of deep learning are a consequence of hardware, but that hardware is different from biological hardware. We brush this discrepancy off as “substrate independence”.

But brains behave the way they do because of their existing architecture; there’s no way to ignore this reality. Deep Learning outperforms other methods because it leverages hardware that supports its computations. The computations DL performs may be entirely different from what a brain computes. But they share one commonality: size (i.e., quantity) has a quality all its own. With massive computation, you get emergent behavior.

Now, human evolution has encumbered us with all kinds of cognitive biases. These biases have been both detrimental and beneficial. Furthermore, we’ve been able to offload our cognition through language technologies (i.e., writing, print, media, etc.). We have terrible memories for the kind of media we are inundated with; our memories evolved to navigate through forests. Our coordination instincts are only competent below the Dunbar number. We have limited minds in an exponentially growing information civilization.

We are, however, like other animals, autonomous creatures.

The scaling hypothesis is likely to be correct, but it’s not a simple matter. Not all architectures are scalable; you need to discover novel architectures that do scale.

As a good example, consider bacteria. There’s a reason why the cellular architecture of bacteria does not lead to complex multi-cellular creatures such as ourselves. Bacteria are robust and a source of biological innovation, yet they have the wrong kind of scaling. Eukaryotic cells, however, are roughly ten times bigger than bacteria and have an architecture that allows them to scale into multicellular creatures like ourselves. A building block determines the limits of scalability.

Today’s Deep Learning building blocks may have their own limits of scalability. Think for a moment about the Von Neumann CPU architecture. For two decades we’ve been stuck under 4 GHz of clock speed; there’s a physical limit being imposed, since clocking faster generates too much heat. Similarly, Deep Learning may be hitting its own physical limit. Google’s PaLM architecture with 540 billion parameters required a lot of engineering to make happen. We are at the phase where we need to cobble together multiple machines (see: Pathways Language Model (PaLM)).

Yet our brains are contained in a small space and powered with just the energy of a lightbulb. Why is this so? Because biology’s building blocks are made of highly energy-efficient and scalable stuff! Cells are pluripotent molecular machinery. A bee has only a million neurons. Yet present AI is incapable of achieving the same level of autonomous cognition that is present in a bee. Biological technology is indeed magical when you compare it with human technology.

Brains are built up of eukaryotic cells configured in a high-dimensional graph. Imagine the complexity required for a stem cell to develop into an entire body. Connections in the body are mostly 3-dimensional, but in the brain they are effectively high-dimensional due to the spindle shape of neurons. It’s the rich interconnectivity of neurons that leads to the brain’s holographic memory capacity. But memory requires processing. What are neurons processing to leverage that storage? How do neurons go about their work to do something of use?

The neurons in the brain work in concert on many scales (as reflected by the many different brain frequencies). One can lose consciousness through anesthesia, which works by jamming the high-frequency signaling between neurons. The complexity of the biological brain is a consequence of (1) the complexity of individual neurons, (2) their high-dimensional connectivity to other neurons, and (3) their multi-scale processing capabilities across many resonating frequencies. What else did I forget?

The flaw of Deep Learning architectures may be that they are formulated on continuous mathematics. They depend on this because their learning mechanism requires the calculation of gradients. Neurons in the brain might use an entirely different error-adapting mechanism. Perhaps the commonality is that both continuous functions and neurons have to “feel” their environment, and this implies an interpretation of continuous change.
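To make that dependence concrete, here is the smallest possible example of the mechanism (my own illustration): a gradient-descent step on a one-parameter least-squares model. The whole procedure presumes the loss is a continuous, differentiable function of the parameter, which is exactly the assumption a biological neuron need not satisfy:

```python
def grad_step(w, x, y, lr=0.1):
    """One gradient-descent step on the loss (w*x - y)^2.

    The error signal is the analytic derivative of a continuous
    loss; without differentiability there is no such signal.
    """
    grad = 2 * x * (w * x - y)   # d/dw of (w*x - y)^2
    return w - lr * grad

w = 0.0
for _ in range(100):
    w = grad_step(w, x=1.0, y=3.0)
# w converges toward 3.0, the value that zeroes the gradient.
```

Every deep network training run is this loop scaled up by billions of parameters, which is why the continuity assumption sits at the foundation of the whole edifice.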

The point here is that continuous functions are at the core of Deep Learning architectures. This may not be true for biological neurons, and that may explain the difference in their ultimate scalability. We could be where we are now because GPU architectures happened to be very good at processing models built on continuous functions. That is, our models of cognition are a consequence of the technology ubiquitous in our time.

The Futile Quest for Autonomous Intelligence was originally published in Intuition Machine on Medium.