A neural network uses text captions to create outlandish images – such as armchairs in the shape of avocados – demonstrating it understands how language shapes visual culture.
OpenAI, an artificial intelligence company that recently partnered with Microsoft, developed the neural network, which it calls DALL-E. It is a version of the company’s GPT-3 language model, which can generate expansive written works from short text prompts; DALL-E instead produces images.
“The world isn’t just text,” says Ilya Sutskever, co-founder of OpenAI. “Humans don’t just talk: we also see. A lot of important context comes from looking.”
DALL-E is trained using a set of images already associated with text prompts, and then uses what it learns to try to build an appropriate image when given a new text prompt.
It does this by first interpreting the text prompt, then building the image element by element based on what it has understood. If it is also given parts of a pre-existing image alongside the text, it takes those visual elements into account as well.
“We can give the model a prompt, like ‘a pentagonal green clock’, and given the preceding [elements], the model is trying to predict the next one,” says Aditya Ramesh of OpenAI.
For instance, if given an image of the head of a T. rex, and the text prompt “a T. rex wearing a tuxedo”, DALL-E can draw the body of the T. rex underneath the head and add appropriate clothing.
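In broad strokes, this can be thought of as next-token prediction over a single sequence that begins with text tokens and continues with image tokens. The sketch below is an illustrative assumption of that idea in PyTorch, not OpenAI’s actual code: the class names, sizes and greedy decoding loop are invented for the example.

```python
# A minimal sketch: a transformer trained to predict the next token in a
# sequence made of text tokens followed by image tokens. Illustrative only.
import torch
import torch.nn as nn

class TextToImageTransformer(nn.Module):
    def __init__(self, vocab_size=16384, seq_len=256 + 1024, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=12)
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq) of text tokens followed by image tokens
        b, t = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        mask = torch.triu(torch.ones(t, t, device=tokens.device), 1).bool()
        x = self.blocks(x, mask=mask)   # causal mask: each position sees only earlier ones
        return self.to_logits(x)        # next-token prediction at every position

@torch.no_grad()
def generate_image_tokens(model, text_tokens, n_image_tokens=1024):
    """Greedily extend the text prompt with image tokens, one at a time."""
    seq = text_tokens
    for _ in range(n_image_tokens):
        logits = model(seq)[:, -1]               # prediction for the next element
        next_tok = logits.argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=1)  # append and repeat
    return seq[:, text_tokens.shape[1]:]         # image tokens, to be decoded into pixels
```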
The neural network, which is described today on the OpenAI website, can trip up on poorly worded prompts and struggles to position objects relative to each other – or to count.
“The more concepts that a system is able to sensibly blend together, the more likely the AI system both understands the semantics of the request and can demonstrate that understanding creatively,” says Mark Riedl at the Georgia Institute of Technology in the US.
“I’m not really sure how to define what creativity is,” says Ramesh, who admits he was impressed with the range of images DALL-E produced.
For each prompt, the model produces 512 candidate images. These are then filtered by a separate model developed by OpenAI, called CLIP, which selects the 32 it judges to be the “best” results.
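That filtering step amounts to scoring each candidate image against the prompt in a shared embedding space and keeping the highest scorers. The snippet below is a self-contained sketch of the reranking logic only; the encoders are random placeholders standing in for CLIP’s text and image towers.

```python
# Rerank candidate images by similarity to the prompt and keep the top 32.
# The embeddings here are random stand-ins, not a real CLIP model.
import numpy as np

rng = np.random.default_rng(0)

def embed_text(prompt: str, dim: int = 512) -> np.ndarray:
    """Placeholder text encoder (a real system would use CLIP's text tower)."""
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

def embed_images(images: np.ndarray, dim: int = 512) -> np.ndarray:
    """Placeholder image encoder (a real system would use CLIP's image tower)."""
    flat = images.reshape(len(images), -1)
    proj = rng.normal(size=(flat.shape[1], dim))
    vecs = flat @ proj
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def rerank(prompt: str, candidates: np.ndarray, keep: int = 32) -> np.ndarray:
    """Score every candidate against the prompt and return the top `keep`."""
    text_vec = embed_text(prompt)
    image_vecs = embed_images(candidates)
    scores = image_vecs @ text_vec           # cosine similarity (vectors are unit length)
    best = np.argsort(scores)[::-1][:keep]   # indices of the highest-scoring images
    return candidates[best]

candidates = rng.random((512, 64, 64, 3))    # 512 stand-in "images"
top_32 = rerank("a pentagonal green clock", candidates)
print(top_32.shape)                          # (32, 64, 64, 3)
```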
CLIP is trained on 400 million images available online. “We find image-text pairs across the internet and train a system to predict which pieces of text will be paired with which images,” says Alec Radford of OpenAI, who developed CLIP.
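The pairing task Radford describes is typically framed as a contrastive objective: within a batch of image-text pairs, each image should match its own caption better than any other. The following is a rough sketch of such a loss, an assumption based on the quote rather than OpenAI’s published code.

```python
# A rough contrastive-loss sketch: true image-text pairs sit on the diagonal
# of the similarity matrix and should score higher than all mismatches.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """image_embeds, text_embeds: (batch, dim); row i of each describes the same pair."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T        # pairwise similarities
    targets = torch.arange(len(logits))          # matching pairs lie on the diagonal
    # symmetric cross-entropy: pick the right caption for each image, and vice versa
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```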
“This is really impressive work,” says Serge Belongie at Cornell University, New York. He says further work is needed to examine the ethical implications of such a model, such as the risk of creating entirely fake images, for example ones involving real people.
Effie Le Moignan at Newcastle University, UK, also calls the work impressive. “But the thing with natural language is although it’s clever, it’s very cultural and context-appropriate,” she says.
For instance, Le Moignan wonders whether DALL-E, confronted by a request to produce an image of Admiral Nelson wearing gold lamé pants, would put the military hero in leggings or underpants – potential evidence of the gap between British and American English.