Rokublog: OpenAI has introduced two new machine-learning models, both of them neural networks: CLIP, which classifies images against arbitrary text descriptions of categories, and DALL·E, a generative model that creates images from text. They were built to improve how well software can recognize things like cats, dogs or political figures in photos.
The big idea here is that these models can learn ‘from experience’, without humans having to manually label the objects in each photo.
If you want to sort your recent photos into cats and dogs, or pick out political figures, OpenAI’s new models could help you do that automatically. OpenAI describes them as the first models of their kind.
Here’s how CLIP works: you give the model an image along with a set of text descriptions of possible categories, and it scores how well each description matches the image. Rather than a classifier trained on a fixed label set, it pairs an image encoder with a text encoder, trained together on image-caption pairs, so it can be pointed at new categories simply by writing them out as text.
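To make that concrete, here is a minimal sketch of zero-shot classification with the publicly released CLIP weights, using the Hugging Face transformers wrapper; the file name and the candidate labels are placeholders you would swap for your own:

```python
# Zero-shot image classification with CLIP (Hugging Face checkpoint).
# "photo.jpg" and the candidate labels below are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")              # the photo to classify
labels = ["a photo of a cat",                # candidate categories,
          "a photo of a dog",                # written as plain text
          "a photo of a rocket"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one image-text similarity score per candidate label;
# a softmax turns those scores into a probability over the candidate set.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in sorted(zip(labels, probs.tolist()), key=lambda x: -x[1]):
    print(f"{label}: {p:.3f}")
```

Because the categories are just strings, you can swap them for entirely different ones without retraining anything, which is what makes this ‘zero-shot’ behaviour possible.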
Lots of AI problems can be solved by learning from examples. But what if the labels on those examples contain errors, or worse, the only supervision available is a piece of text rather than a labeled image? That’s where OpenAI’s new models come into play.
Rather than a hand-labeled dataset, OpenAI trained CLIP on a very large collection of images paired with the text that accompanied them on the web. The goal was to train it to classify objects as well as possible in photos that don’t come with any labels at all.
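Under the hood, OpenAI describes the training objective as contrastive: an image encoder and a text encoder are trained together so that each image’s embedding lands closest to the embedding of its own caption, and far from everyone else’s. Here is a toy sketch of that symmetric contrastive loss; the batch size, embedding size and random tensors are stand-ins for real encoder outputs:

```python
# Toy sketch of a CLIP-style symmetric contrastive loss over a batch of
# image-text pairs. The random tensors stand in for encoder outputs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalise embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity of every image in the batch to every caption in the batch.
    logits = image_features @ text_features.t() / temperature

    # Matching pairs sit on the diagonal, so the target for row i is class i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)      # image -> matching caption
    loss_txt = F.cross_entropy(logits.t(), targets)  # caption -> matching image
    return (loss_img + loss_txt) / 2

# Example call with random "embeddings" for a batch of 8 pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Nothing in that loss requires a human-chosen category label; the caption that already travels with each image does the supervising.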
This approach was designed for the ‘long tail’ of unlabeled data: the countless images online that have no labels at all. CLIP can also perform well with very little labeled data. In one case, it made predictions after seeing just 2.5% of an object in a photo.
For example, when asked to categorize the image below, it answered: “rocket”, “vehicle”, “artificial” and “flying”. It had only seen 0.2% of the objects in this photo.
You can check out more of these predictions here.
“The result is a system that can learn from errors and create accurate predictions. It’s an example of AI generating new structures from scratch — like a self-learning code,” said Jason Yosinski, OpenAI researcher. “This means we can learn what new classes are possible, rather than being limited by what’s already been identified. This could lead to new concepts and paradigms for developing machine learning models for all kinds of tasks, not just image classification.”
DALL·E is a somewhat different approach. It was also trained on pairs of images and text, but instead of classifying the objects in photos, it generates new images from a written description.
Conventional classifiers are trained to make accurate predictions from a sample of labeled data that guides their training. But doing that is complicated: every instance in the dataset has to be labeled by hand, not just the ones the model gets wrong. OpenAI’s blog post explains that CLIP sidesteps this requirement by learning from the text already paired with images on the web.
DALL·E’s purpose was to find a solution that doesn’t require human labeling either: it creates images directly from text. The important thing is that it learns patterns in its training data that lead to realistic results.
For instance, it can produce an image of a car from text as simple as “car”. It can even produce images from text like “black cat or dog”, though prompts that combine concepts like that are especially difficult.
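For the curious, OpenAI’s technical write-up describes DALL·E as a transformer that treats the text prompt and the image, represented as a grid of discrete codes, as one long token sequence, and learns to predict the image codes that follow the text. The sketch below is a heavily simplified toy of that idea; the vocabulary sizes, the tiny two-layer transformer and the random “codes” are placeholders, not OpenAI’s actual model:

```python
# Heavily simplified toy of the DALL·E idea: a decoder-style transformer that
# treats a caption and an image (as a grid of discrete codes) as one sequence
# and learns to predict the image codes that follow the text.
# All sizes and the random data below are illustrative placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 1000, 512   # toy text vocabulary / image code book
TEXT_LEN, IMAGE_LEN = 16, 64          # e.g. a 16-token caption, an 8x8 code grid
D_MODEL = 128

class ToyTextToImage(nn.Module):
    def __init__(self):
        super().__init__()
        # Text tokens and image codes share one embedding table; image codes
        # are offset so they sit past the text vocabulary.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.to_image_logits = nn.Linear(D_MODEL, IMAGE_VOCAB)

    def forward(self, text_tokens, image_codes):
        seq = torch.cat([text_tokens, image_codes + TEXT_VOCAB], dim=1)
        x = self.embed(seq) + self.pos(torch.arange(seq.size(1), device=seq.device))
        # Causal mask: each position may only attend to earlier positions.
        n = seq.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf"), device=seq.device), diagonal=1)
        h = self.transformer(x, mask=mask)
        # Each image code is predicted from the position just before it.
        preds = self.to_image_logits(h[:, TEXT_LEN - 1:-1])
        return nn.functional.cross_entropy(
            preds.reshape(-1, IMAGE_VOCAB), image_codes.reshape(-1))

model = ToyTextToImage()
captions = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))   # fake captions
codes = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN))    # fake image code grids
print(model(captions, codes).item())                     # one training-step loss
```

In the real system, an image is first compressed into such discrete codes by a separately trained autoencoder; to generate a picture, the transformer samples new codes one at a time after the prompt and the autoencoder decodes them back into pixels. That is roughly how a phrase like “black cat or dog” becomes an image.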