Paper Review: Extending Machine Language Models toward Human-Level Language Understanding

Find the paper here.

Paper Gist:
The paper introduces the Integrated Understanding System (IUS) as an effort to build better language understanding models. The authors argue that, until now, models have focused only on the language itself and have ignored how humans understand and communicate about the world around them. They say that to understand language better we need to ground it in visual input and in the objects and situations we interact with. They also suggest a fast learning system based on the medial temporal lobe (MTL) that lets the system learn from one-off interactions, and they discuss future directions of research with their proposed architecture.

Paper Highlights

The Human Integrated Understanding System (IUS):
The authors propose a theory of the brain basis of understanding and suggest how that architecture can serve as a direction for future language understanding research. Along the lines of Complementary Learning Systems theory (CLS), they suggest two learning systems: one neocortical and the other in the medial temporal lobe (MTL, explained later). In the neocortical system, they propose reciprocally interconnected pairs of pools (or, you can think of them as subsystems) that give us representations of the input. They are:
a) visual representations
b) speech/language representations
c) object representations
d) situation representations
Both the object and situation representations take their cues from the visual and speech representations, as explained below.

Figure legend (architecture diagram from the paper):
Blue system: neocortical learning system
Red system: MTL learning system
Blue ovals: recurrent connections within the pools
Blue arrows: connections between the pools
Red arrows: modifiable connections in the MTL
Green arrows: connections between the MTL and the neocortical system
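
To make the wiring more concrete, here is a minimal sketch of how such reciprocally interconnected pools could settle on a joint interpretation through mutual constraint satisfaction. The pool names, sizes, connection list, and random weights are my own stand-ins; the paper describes the architecture only conceptually and does not give an algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pool sizes; the paper does not specify dimensionalities.
POOLS = {"visual": 16, "language": 16, "object": 32, "situation": 32}

# Bidirectional connections between pools (the blue arrows in the figure).
CONNECTIONS = [("visual", "object"), ("language", "object"),
               ("visual", "situation"), ("language", "situation"),
               ("object", "situation")]

# One weight matrix per connection; random here, but slowly learned
# (neocortical learning) in the actual proposal.
W = {pair: rng.normal(scale=0.1, size=(POOLS[pair[0]], POOLS[pair[1]]))
     for pair in CONNECTIONS}

def settle(clamped, steps=30):
    """Let every pool repeatedly take cues from its neighbours until the
    joint state stabilises (mutual constraint satisfaction)."""
    act = {name: np.zeros(size) for name, size in POOLS.items()}
    act.update(clamped)                      # external input stays clamped
    for _ in range(steps):
        for name in POOLS:
            if name in clamped:
                continue
            net = np.zeros(POOLS[name])
            for (a, b), w in W.items():      # sum input from connected pools
                if a == name:
                    net += w @ act[b]
                elif b == name:
                    net += w.T @ act[a]
            act[name] = np.tanh(net)         # keep activations bounded
    return act

# Example: clamp a visual input and read out the induced object and situation states.
state = settle({"visual": rng.normal(size=POOLS["visual"])})
print({name: vec[:3].round(2) for name, vec in state.items()})
```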

Object representations: The authors describe a brain area that produces a sort of embedding for the object under consideration. These embeddings are then used as input to a recurrent, bidirectional model in the brain that represents other properties of the object, such as its name. This pool is interconnected with the other pools: it is connected to both the visual and speech systems, so when we see something we can retrieve its name, and when we hear its name we can retrieve its other properties.

Situation representations: This pool represents the situation conveyed through the visual and language input to the system, and it can work with or without language input. It processes situations so that the representation of an event in an event sequence stays the same no matter how the situation was conveyed, for example through watching a movie, reading about it, or even remembering it. The authors focus on using situation representations alongside object and visual representations because of how language can portray different situations: “the boy ran to the dog” and “the boy ran from the dog” describe very different events. In some ways, the situation representation can be thought of as context for the input.
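
The modality invariance of the situation representation can be illustrated with a tiny toy (my own construction, not the paper's mechanism): treat a situation as a set of role-filler bindings, so the same event extracted from text or from video maps to the same representation, while “ran to” and “ran from” do not.

```python
# Toy illustration: a situation as role-filler bindings, so it is the same
# no matter which modality produced it, while "to the dog" and "from the dog"
# bind the dog to different roles.
def situation(bindings):
    return frozenset(bindings.items())

# The same event extracted from a sentence and from (hypothetical) video frames.
from_text  = situation({"agent": "boy", "action": "run", "goal": "dog"})
from_video = situation({"agent": "boy", "action": "run", "goal": "dog"})

# A superficially similar sentence with a different meaning.
ran_from   = situation({"agent": "boy", "action": "run", "source": "dog"})

print(from_text == from_video)   # True  -> modality-invariant
print(from_text == ran_from)     # False -> "to" vs "from" differ
```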

Complementary learning systems: The subsystems discussed so far constitute the neocortical learning system. The representations in these subsystems change very rapidly, whereas the overall brain state changes only at event boundaries. Also, humans can learn something new after seeing it only once or twice. So, to enable quick learning and to maintain a kind of brain state, the authors propose the MTL as a complementary learning system. The MTL can learn quickly from new experience: it converts the experience into an MTL encoding, somewhat like an explicit memory, enabling a form of multimodal association. It then returns the encoding to the other systems and helps them learn from it, although the paper does not explain this mechanism in detail.
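
Here is a minimal sketch, under my own assumptions, of the MTL as a fast, one-shot episodic store: a new multimodal experience is written after a single exposure, and a partial cue later retrieves the whole episode so it can be passed back to the slower systems. The class name `MTLMemory` and the nearest-neighbour retrieval are illustrative choices, not something the paper specifies.

```python
import numpy as np

class MTLMemory:
    """Fast-learning episodic store: each experience is written after a single
    exposure; a partial cue retrieves the nearest stored pattern, which can
    then be replayed to the slow (neocortical) learner."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, multimodal_pattern):
        # One-shot learning: just store the pattern, no gradient steps needed.
        self.keys.append(np.asarray(multimodal_pattern, dtype=float))
        self.values.append(np.asarray(multimodal_pattern, dtype=float))

    def recall(self, partial_cue):
        # Pattern completion: return the stored experience closest to the cue.
        cue = np.asarray(partial_cue, dtype=float)
        sims = [cue @ k / (np.linalg.norm(cue) * np.linalg.norm(k) + 1e-8)
                for k in self.keys]
        return self.values[int(np.argmax(sims))]

rng = np.random.default_rng(1)
mtl = MTLMemory()

# A single joint (visual + language) experience, e.g. seeing a new object
# while hearing its name; the vector here is a random stand-in.
experience = rng.normal(size=8)
mtl.write(experience)

# Later, a noisy/partial cue (say, the visual part only) retrieves the whole episode.
cue = experience + rng.normal(scale=0.3, size=8)
print(np.allclose(mtl.recall(cue), experience))  # True
```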

Toward an artificial Integrated Understanding System: Although some efforts have been made to work with both image and language input, the gap between these systems and human performance is still very large. For the future, the authors recommend extending the IUS to video, language, and other modalities. If given the capability to understand the results of its actions, the system might even learn notions of cause and effect, of agency, and of self and others. For better generalization and grounding, the system would work best in a multimodal interactive environment. No such environments rich with multimodal data exist yet, but if they did, letting situated agents learn in them could be a great catalyst for grounding language.

My opinion:
a) First off, I would like to thank the authors for the paper. It gives insights into how the brain works based on the existing literature and is a good entry point for finding more literature on the topic.
b) The suggested architecture gives a very high-level view; to build such an architecture, many subproblems will have to be solved:
Multimodal learning: We don't know how the mutual constraint satisfaction (mentioned in the paper) will work across different modalities. The paper says that input to the visual area retrieves the corresponding information from the other pools, but we don't know how to enforce this bidirectionality of retrieval between the subsystems.
Multitask learning: How do we train all these systems together? They learn independently and together at the same time, and I have not seen an existing architecture that does this (I might be wrong).
What will be the learning algorithm? The most obvious approach seems to be curriculum learning, but what has to be taught will depend on the end goal and on how many modalities are involved. Will it learn in an unsupervised way, and if so, how do we know that it is learning correctly?
Interleaved learning: How do the neocortical subsystems learn from the information returned by the MTL? The authors suggest interleaved learning, but how do we achieve it? Each subsystem will have different input streams, connections from other subsystems, and responses from the MTL. Learning from so many different sources while also retrieving and fetching the corresponding information will be hard (see the sketch after this list).
c) Building the “situation subsystem” will be really hard. As soon as we move away from static image question answering to video and language input, the role of the situation subsystem changes drastically: it will have to understand cause and effect through actions. That may require another subsystem, something like a meta-learner that works with situations and cannot be part of the neocortical learning system, because the context in those pools changes very fast.
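
As promised above, here is a minimal sketch of one common way to interleave: replaying a few stored episodes from an MTL-like buffer alongside every new example when updating the slow learner. The linear model, buffer, and replay ratio are my own stand-ins; the paper does not commit to a specific algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Slow "neocortical" learner: a linear map y = W x trained by SGD, with old
# experiences replayed from an MTL-like buffer so earlier knowledge keeps
# being rehearsed instead of being overwritten.
dim = 6
W = np.zeros((dim, dim))
replay_buffer = []                # stands in for MTL episodic traces

def sgd_step(x, y, lr=0.05):
    global W
    W -= lr * np.outer(W @ x - y, x)

def interleaved_update(new_example, replays_per_new=3):
    """One neocortical update: the new item plus a few replayed old items."""
    x, y = new_example
    sgd_step(x, y)
    replay_buffer.append(new_example)
    for _ in range(min(replays_per_new, len(replay_buffer) - 1)):
        xr, yr = replay_buffer[rng.integers(len(replay_buffer))]
        sgd_step(xr, yr)

# Usage: a stream of (input, target) experiences arriving one at a time.
for _ in range(100):
    x = rng.normal(size=dim)
    interleaved_update((x, x))    # toy target: the identity mapping
print(np.round(np.diag(W), 2))    # diagonal approaches 1 as the map is learned
```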

I would like to thank the authors again for such an insightful paper and for the effort they put into it.

