Can an AI Learn Like a Toddler?
Large language models (LLMs) like OpenAI’s ChatGPT are AI systems trained on vast amounts of text (trillions of words) that can respond in human-like language. So, when you ask ChatGPT to write you a haiku about children, it draws on that training data and produces something like this:
Giggles in the sun,
Tiny footsteps dance and play,
Childhood’s joyous run.
The AI can do this because it knows the rules of a haiku (5, 7, 5 syllables), and it knows which words are most often associated with childhood and children.
Humans, on the other hand, learn language through immersion and interaction with other people, encountering only millions of words per year.
Due to that data gap, researchers were skeptical about whether AI could tell us much about human learning and development. A team at New York University tested the idea, training an AI model on only the input a single child receives as it grows.
The team trained a multimodal AI system on video recorded through the eyes and ears of a single child, captured by a head-mounted camera from the time the child was six months old through their second birthday. They then examined what the AI learned from the child’s everyday experiences.
The researchers found that a neural network can, in fact, learn a substantial number of words and concepts from limited slices of the child’s experience (the camera captured only about 1% of the child’s waking hours).
“By using AI models to study the real language-learning problem faced by children, we can address classic debates about what ingredients children need to learn words—whether they need language-specific biases, innate knowledge, or just associative learning to get going,” said Brenden Lake, an assistant professor in NYU’s Center for Data Science and Department of Psychology and the paper’s senior author.
The footage, more than 60 hours’ worth, contained approximately a quarter of a million word instances (i.e., the total number of words communicated, many of them repeated). Each word instance was linked with a video frame of what the child saw when that word was spoken. The footage covered a range of activities across development, including mealtimes, reading books, and playing.
The NYU researchers then trained a multimodal neural network with two separate modules: a vision encoder that takes in single video frames and a language encoder that takes in the transcribed child-directed speech. The two encoders were combined and trained with an algorithm called contrastive learning, which aims to learn useful input features and their cross-modal associations. The intuition is that when a parent says something within the child’s view, some of those words likely refer to something the child can see, so the model can build comprehension by linking visual and linguistic cues.
“This provides the model a clue as to which words should be associated with which objects,” explains Wai Keen Vong, a research scientist at NYU’s Center for Data Science and the paper’s first author. “Combining these cues is what enables contrastive learning to gradually determine which words belong with which visuals and to capture the learning of a child’s first words.”
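To make the idea of contrastive learning concrete, here is a minimal sketch in PyTorch. It is not the researchers’ actual code; the function name, embedding sizes, and temperature value are all made up for illustration. The key idea it shows is the general one: embeddings of a frame and an utterance recorded at the same moment are pulled together, while mismatched pairs in the same batch are pushed apart.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(frame_embeddings, word_embeddings, temperature=0.07):
    """CLIP-style contrastive loss over a batch of (video frame, utterance) pairs.

    Embeddings that came from the same moment in the recording are pulled
    together; embeddings from different moments are pushed apart.
    """
    # Normalize so similarity reduces to a dot product (cosine similarity).
    frames = F.normalize(frame_embeddings, dim=-1)
    words = F.normalize(word_embeddings, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares frame i with utterance j.
    logits = frames @ words.T / temperature

    # The matching pairs lie on the diagonal of the matrix.
    targets = torch.arange(len(frames))

    # Symmetric cross-entropy: match frames to utterances and utterances to frames.
    loss_f2w = F.cross_entropy(logits, targets)
    loss_w2f = F.cross_entropy(logits.T, targets)
    return (loss_f2w + loss_w2f) / 2

# Toy usage with random tensors standing in for encoder outputs.
frames = torch.randn(8, 256)   # 8 video frames, 256-dim vision embeddings
words = torch.randn(8, 256)    # 8 transcribed utterances, 256-dim text embeddings
print(contrastive_loss(frames, words))
```

In practice, the frame and utterance embeddings would come from the trained vision and language encoders; the loss simply rewards the model for ranking the true frame-utterance pairing above the alternatives in each batch.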
After training, the team tested the model with the same kind of evaluations used to measure word learning in infants: they presented a target word alongside an array of four images and asked the model to select the image that matched the word. The results showed that the model learned a substantial number of words and concepts and could generalize some of them to visual instances different from those seen during training.
They published their research in the latest issue of the journal Science and describe their work in greater detail in the video below.