Abstract
---
Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called "gist," are used to sketch the text-conditioned background color and object layout structure. Dynamic features are considered by transforming the input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible, diverse, and smooth short-duration videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.
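The abstract describes two text-conditioning paths: a static "gist" that sketches background and layout, and a dynamic path that transforms the input text into an image filter. The snippet below is a minimal PyTorch sketch of the second idea, not the authors' implementation; the module name `Text2Filter`, all dimensions, and the grouped-convolution trick for per-sample filtering are illustrative assumptions.

```python
# Sketch: generate a convolutional filter from a text embedding and apply it
# to the "gist" image, so dynamic text information modulates the static sketch.
# Names and sizes are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Text2Filter(nn.Module):
    """Maps a text embedding to the weights of a conv filter applied to the gist."""
    def __init__(self, text_dim=128, channels=3, kernel_size=3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # One filter bank per sample: (C_out=C, C_in=C, k, k) flattened.
        self.fc = nn.Linear(text_dim, channels * channels * kernel_size * kernel_size)

    def forward(self, gist, text_emb):
        b, c, h, w = gist.shape
        k = self.kernel_size
        filters = self.fc(text_emb).view(b * c, c, k, k)
        # Grouped-conv trick: fold the batch into the channel dimension and use
        # groups=b so each sample is convolved with its own text-derived filter.
        out = F.conv2d(gist.view(1, b * c, h, w), filters,
                       padding=k // 2, groups=b)
        return out.view(b, c, h, w)

# Toy usage with random tensors standing in for real inputs.
gist = torch.randn(4, 3, 64, 64)        # text-conditioned static sketch
text_emb = torch.randn(4, 128)          # encoded input sentence
dynamic = Text2Filter()(gist, text_emb) # text-filtered features for the video generator
print(dynamic.shape)                    # torch.Size([4, 3, 64, 64])
```

The grouped convolution is one convenient way to apply a different, input-dependent filter to every sample in a batch without a Python-level loop; any mechanism that conditions the generator on both the gist and the text would serve the same role.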
Year | Venue | DocType
---|---|---|
2018 | National Conference on Artificial Intelligence | Conference

Volume | Citations | PageRank
---|---|---|
abs/1710.00421 | 5 | 0.41

References | Authors
---|---|
22 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Yitong Li | 1 | 44 | 7.98 |
Renqiang Min | 2 | 149 | 17.61 |
Dinghan Shen | 3 | 108 | 10.37 |
David E. Carlson | 4 | 182 | 15.35 |
L. Carin | 5 | 4603 | 339.36 |