
lukoshkin / text2video

License: MIT
Text to Video Generation Problem

Programming Languages

Python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to text2video

WeTextProcessing
Text Normalization & Inverse Text Normalization
Stars: ✭ 213 (+660.71%)
Mutual labels:  text-processing
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model in PyTorch. To provide a better understanding of the model, a Tweets dataset provided by Kaggle is used.
Stars: ✭ 45 (+60.71%)
Mutual labels:  text-processing
corpusexplorer2.0
Corpus linguistics has never been so easy...
Stars: ✭ 16 (-42.86%)
Mutual labels:  text-processing
gb-dl
A Python-based utility to download courses from infosec4tc.teachable.com, academy.ehacking.net, and stackskills.com for personal offline use.
Stars: ✭ 33 (+17.86%)
Mutual labels:  dl
perke
A keyphrase extractor for Persian
Stars: ✭ 60 (+114.29%)
Mutual labels:  text-processing
SENet-for-Weakly-Supervised-Relation-Extraction
No description or website provided.
Stars: ✭ 39 (+39.29%)
Mutual labels:  dl
rbbcc
BCC port for MRI - this is an unofficial bonsai project.
Stars: ✭ 45 (+60.71%)
Mutual labels:  dl
estratto
Parsing fixed-width file content made easy
Stars: ✭ 12 (-57.14%)
Mutual labels:  text-processing
ConTexto
A Python library for text mining and NLP
Stars: ✭ 43 (+53.57%)
Mutual labels:  text-processing
SuperCombinators
[Deprecated] A Swift parser combinator framework
Stars: ✭ 19 (-32.14%)
Mutual labels:  text-processing
r4strings
Handling Strings in R
Stars: ✭ 39 (+39.29%)
Mutual labels:  text-processing
sliceslice-rs
A fast implementation of single-pattern substring search using SIMD acceleration.
Stars: ✭ 66 (+135.71%)
Mutual labels:  text-processing
Dogs-Cats
Binary image classification of cats and dogs
Stars: ✭ 22 (-21.43%)
Mutual labels:  dl
ConvGRU-pytorch
Convolutional GRU
Stars: ✭ 109 (+289.29%)
Mutual labels:  convgru
articulated-animation
Code for the paper Motion Representations for Articulated Animation
Stars: ✭ 849 (+2932.14%)
Mutual labels:  video-generation
MoCoGAN-HD
[ICLR 2021 Spotlight] A Good Image Generator Is What You Need for High-Resolution Video Synthesis
Stars: ✭ 224 (+700%)
Mutual labels:  video-generation
frangipanni
Program to convert lines of text into a tree structure.
Stars: ✭ 1,176 (+4100%)
Mutual labels:  text-processing
s3-concat
Concatenate Amazon S3 files remotely using flexible patterns
Stars: ✭ 32 (+14.29%)
Mutual labels:  text-processing
dif
'dif' is a Linux preprocessing front end to gvimdiff/meld/kompare
Stars: ✭ 18 (-35.71%)
Mutual labels:  text-processing
syn
syn - the thesaurus
Stars: ✭ 45 (+60.71%)
Mutual labels:  text-processing

Video Generation Based on Short Text Description (2019)

Over a year later (in 2020), I decided to add a README to the repository, since some people find it useful even without a description. I hope this step will make the results of my work more accessible to those who are interested in the problem and stumble upon the repository while browsing the topic on GitHub.

Example of Generated Video

Unfortunately, I have not saved the videos generated by the network, since all the results remained on the work laptop that I handed over at the end of the internship. The only thing left is the recording I made on my cellphone (sorry if it makes your eyes bleed).

What is on the gif? There are 5 blocks of images stacked horizontally. Each block contains 4 objects selected from the '20bn-something-something-v2' dataset, all belonging to the same category, "Pushing [something] from left to right" 1 (~1000 samples). They are a book (top-left window), a box (top-right window), a mug (bottom-left window), and a marker (bottom-right window), each pushed along the surface by hand. The numbers of occurrences of the corresponding objects in the data subset are 57, 43, 9, and 55.
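For the record, such statistics can be gathered directly from the dataset annotations. Below is a minimal sketch, assuming the standard 20bn-something-something-v2 JSON annotation format; the file name and the keys ('template', 'placeholders') are assumptions to verify against your copy of the dataset:

import json
from collections import Counter

# Assumed annotation file of the something-something-v2 release.
with open('something-something-v2-train.json') as fp:
    annotations = json.load(fp)

CATEGORY = 'Pushing [something] from left to right'
# Count how often each object occurs in the chosen category.
objects = Counter(
    ann['placeholders'][0].lower()
    for ann in annotations
    if ann['template'] == CATEGORY and ann['placeholders']
)
print(objects.most_common())  # e.g., [('book', 57), ('marker', 55), ...]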

The generated videos are diverse (thanks to the zero-gradient penalty) and of about the same quality as the videos from the training data. However, no tests were conducted on the validation data. As for the gif below, since all the objects belong to the same category, only single-word conditioning (i.e., on the object) is used. Still, the repository provides tools for encoding the whole sentence.
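The penalty mentioned above presumably refers to the zero-centered gradient penalty of Mescheder et al. (2018). Here is a generic PyTorch sketch of the idea, not the exact code of this repository; the discriminator interface and the gamma weight are placeholders:

import torch

def r1_gradient_penalty(discriminator, real_videos, word_emb, gamma=10.):
    """Zero-centered gradient penalty on real samples (R1)."""
    real_videos = real_videos.detach().requires_grad_(True)
    scores = discriminator(real_videos, word_emb)
    # Differentiate the scores w.r.t. the inputs, keeping the graph
    # so that the penalty itself can be backpropagated.
    grads, = torch.autograd.grad(
        scores.sum(), real_videos, create_graph=True)
    # Penalize any non-zero gradient on the data manifold.
    return .5 * gamma * grads.pow(2).flatten(1).sum(1).mean()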


1 Yep, exactly "from left to right" and not the other way around, as the gif reads (it is a typo there). However, for validation purposes, it is useful to make new labels with the reversed direction of movement or with new (but "similar", e.g., in the space of embeddings) objects from the unchanged category.
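To illustrate the first suggestion, reversing the direction in a label can be as simple as the following (the helper is hypothetical, not part of the repository):

def reverse_direction(label):
    """Swap the movement direction in a category label."""
    if 'left to right' in label:
        return label.replace('left to right', 'right to left')
    return label.replace('right to left', 'left to right')

print(reverse_direction('Pushing [something] from left to right'))
# Pushing [something] from right to left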

Navigating Through SRC Files

data_prep2.py - Video and text processing (based on text_processing.py)
blocks.py - Building blocks used in models.py
visual_encoders.py - Advanced building blocks for the image and video discriminators
process3-5.ipynb - Pipeline for the training process on multiple GPUs
(3-5 is the hardcoded range of GPUs involved; see the sketch after this list)
pipeline.ipynb - Previously served the same purpose as process3-5.ipynb,
but with the hardcoded range 0-2. Now it is an unfinished implementation of mixed batches
legacy (=obsolete) - Early attempts and ideas
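As for the hardcoded GPU ranges, the first cell of such a notebook presumably pins the visible devices along the following lines (a sketch under that assumption, not the notebooks' actual code):

import os

# Expose only GPUs 3-5 to this process; PyTorch then re-indexes them as cuda:0-2.
# This must be set before CUDA is first initialized (e.g., before importing torch).
os.environ['CUDA_VISIBLE_DEVICES'] = '3,4,5'

import torch
print(torch.cuda.device_count())  # prints 3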

There is also a collection of references to articles relevant (as of 2019) to the text2video generation problem.

Update of 2021

This year, I have decided to make the results reproducible. It has turned out that if one has not dealt with this repo before, it is tough for them to get it up and running. Quite surprising, huh? Especially considering that I uploaded everything hastily and as it was. Now, at least, you can follow the instructions below; this is how you reach the state of what I left in 2019. To move further, a great deal of effort is required. Good luck!

Setting Everything Up

  1. Clone the repository
git clone https://github.com/lukoshkin/text2video.git
  2. Retrieve the Docker image
cd docker
docker build -t lukoshkin/text2video:base .

or

docker pull lukoshkin/text2video:base

If using Singularity, one can obtain the image by typing

singularity build t2v-base.simg docker://lukoshkin/text2video:base
  3. Get access to a GPU. For cluster folks, it may look like:
salloc -p gpu_a100 -N 1 -n 4 --gpus=1 --mem-per-gpu=20G --time=12:00:00
## prints the name of allocated node, e.g., gn26
ssh gn26
  4. Cd to the directory where everything is located and create a container. (9999 is the port exposed for Jupyter outside the container; that is, if running directly on your computer, localhost:9999 is your access point in a browser. You may need one more port for TensorBoard as well. Note that 8888 is the default port Jupyter tries first; if it is busy, specify a port manually in the jupyter command with the --port option.)
nvidia-docker run --name t2v \
  -p 9999:8888 -v "$PWD":/home/depp/project \
  -d lukoshkin/text2video:base \
  'jupyter-notebook --ip=0.0.0.0 --no-browser'

Singularity users would do it like this:

singularity exec \
  --no-home -B "$PWD:$HOME" --nv t2v-base.simg \
  jupyter notebook --ip 0.0.0.0 --no-browser

If you are accustomed to working in JupyterLab, feel free to use it; just rewrite the commands above to the proper form first.

For running everything on an HPC cluster, one should forward the ports. Type one of the two commands below. Which one depends on whether you ssh to compute nodes (gn26) on your server and whether you have set up an alias for the latter in your .ssh/config.

ssh -NL 9999:gn26:8888 nickname
ssh -NL 9999:localhost:8888 user@server