A machine learning algorithm has created tiny (64×64 pixels) 32-frame videos based on text descriptions [sciencemag.org]:
The researchers trained the algorithm on 10 types of scenes, including "playing golf on grass" and "kitesurfing on the sea," which it then roughly reproduced. Picture grainy VHS footage. Nevertheless, a simple classification algorithm correctly guessed the intended action among six choices about half the time [aaai.org]. (Sailing and kitesurfing were often mistaken for each other.) What's more, the network could also generate videos for nonsensical actions, such as "sailing on snow [toronto.edu]" and "playing golf at swimming pool [toronto.edu]," the team reported this month at a meeting of the Association for the Advancement of Artificial Intelligence in New Orleans, Louisiana.
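The evaluation described above amounts to a simple protocol: feed each generated clip to an action classifier and count how often its guess matches the prompt the clip was generated from. A minimal sketch of that protocol follows; the action list, the `evaluate` helper, and the toy classifier are all illustrative stand-ins, not code from the paper.

```python
# Classifier-based check of generated videos: accuracy over many clips
# measures how recognizable the generated actions are. Names below are
# hypothetical, not taken from the AAAI paper.
from collections import Counter

ACTIONS = ["playing golf", "kitesurfing", "sailing",
           "swimming", "biking", "running"]

def evaluate(clips, classifier):
    """Return (accuracy, confusion counts) for generated clips.

    `clips` is a list of (intended_action, clip) pairs; `classifier`
    maps a clip to one of the six action labels.
    """
    correct = 0
    confusions = Counter()
    for intended_action, clip in clips:
        predicted = classifier(clip)
        if predicted == intended_action:
            correct += 1
        else:
            confusions[(intended_action, predicted)] += 1
    return correct / len(clips), confusions

# Toy stand-in classifier that mislabels kitesurfing as sailing,
# mimicking the confusion the researchers observed.
def toy_classifier(clip):
    return "sailing" if clip == "kitesurfing" else clip

# Pretend each "clip" is just its own label, one per action.
clips = [(action, action) for action in ACTIONS]
accuracy, confusions = evaluate(clips, toy_classifier)
```

With a real classifier and real generated clips, the reported figure of roughly 50% accuracy against a one-in-six chance baseline is what indicates the videos carry recognizable action content.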
[...] Currently, the videos are only 32 frames long—lasting about 1 second—and the size of a U.S. postage stamp, 64 by 64 pixels. Anything larger reduces accuracy, says Yitong Li, a computer scientist at Duke University in Durham, North Carolina, and the paper's first author. Because people often appear as distorted figures, a next step, he says, is using human skeletal models to improve movement.
Tuytelaars also sees applications beyond Hollywood. Video generation could lead to better compression if a movie can be stored as nothing but a brief description. It could also generate training data for other machine learning algorithms. For example, realistic video clips might help autonomous cars prepare for dangerous situations they would not frequently encounter. And programs that deeply understand the visual world could spin off useful applications in everything from refereeing to surveillance. They could help a self-driving car predict where a motorbike will go, for example, or train a household robot to open a fridge, Pirsiavash says.
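The compression idea above can be made concrete with back-of-envelope arithmetic: compare the raw size of one of these clips against the size of the text prompt that would stand in for it. This is only an illustration of the upper bound, ignoring model size and any practical codec.

```python
# Raw size of a 32-frame, 64x64, RGB clip (one byte per channel)
# versus the UTF-8 size of a text prompt describing it.
frames, height, width, channels = 32, 64, 64, 3
raw_bytes = frames * height * width * channels  # 393,216 bytes uncompressed

prompt = "kitesurfing on the sea"
prompt_bytes = len(prompt.encode("utf-8"))      # 22 bytes

ratio = raw_bytes / prompt_bytes                # tens of thousands to one
```

The caveat, of course, is that the description is lossy in the extreme: the receiver regenerates *a* plausible clip, not the original one.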
An AI-generated Hollywood blockbuster may still be beyond the horizon, but in the meantime, we finally know what "kitesurfing on grass" looks like.