Computer Generated Movie Titles and Summaries
Olin College of Engineering Software Design Depth Final, 2020
Our goals for this project are to:
- Scrape summaries and synopses from a selection of IMDb’s movies (last 30 years) to build a training set
- Use some form of machine learning to generate fake IMDb pages from the aforementioned set
Dependencies
The following libraries are necessary for our project to run:
- Numpy
- BS4 (BeautifulSoup)
- Scipy
- Scikit-learn
- Pillow
- H5py
- Pandas
- Virtualenv
- tensorflow
- keras
You can install these libraries using the commands below:
$ pip3 install numpy
$ pip3 install scipy
$ pip3 install bs4
$ pip3 install pandas
$ pip3 install pillow
$ pip3 install h5py
$ pip3 install scikit-learn
$ pip3 install -U virtualenv
$ pip3 install tensorflow
(If this is your first time using Keras, you will also have to set up the environment; the instructions can be found here.)
$ pip3 install keras
System Architecture:
The architecture for this project is easily split into two distinct sections: the Web Scraper and the Generator.
Webscraper.py
The Internet Movie Database (IMDb) is a continually growing database that stores descriptive information about each movie, including cast members, synopses, and plot summaries. Because of how IMDb appends movies to its site, each movie is assigned a call number that identifies the title and forms part of its IMDb URL. All of these call numbers, titles, genres, and other descriptive labels are stored in IMDb’s downloadable “interface” CSVs, ready to be scraped and organized by the Web Scraper.
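As a rough illustration of the approach (not the project’s exact code), fetching one title’s page from its call number might look like the sketch below; the call number tt0111161 and the https://www.imdb.com/title/<call number>/ URL pattern are stand-ins for the example.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Hypothetical example: "tt0111161" stands in for any call number from the interface files.
call_number = "tt0111161"
url = f"https://www.imdb.com/title/{call_number}/plotsummary"

# Some sites reject the default Python user agent, so send a browser-like header.
request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(urlopen(request).read(), "html.parser")

# IMDb's markup changes over time; inspect the parsed page before writing real selectors.
print(soup.title.get_text())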
Generator.py
After the information is scraped from IMDb by genre, the Generator portion handles cleaning and preparing the data (eliminating extraneous vocabulary from the network’s input), assembling word sequences, creating the model to be trained, and training the model on those word sequences. Once the model has been trained, the program can quickly output a new string of text from an input of “seed” text.
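The exact architecture lives in Generator.py; as a sketch of the general word-level approach, a small Keras next-word-prediction model could look like the following, where vocab_size and seq_length are placeholder values rather than the project’s own settings.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size = 5000   # distinct words kept after cleaning (placeholder value)
seq_length = 50     # words per training sequence (placeholder value)

model = Sequential([
    Input(shape=(seq_length,)),
    Embedding(vocab_size, 50),                # map word indices to dense vectors
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(100, activation="relu"),
    Dense(vocab_size, activation="softmax"),  # probability of each word being next
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()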
To Run:
To run this code you will need the title basics interface file from IMDb’s downloadable interface files (they are quite large and unusable by hand), ~2 GB of storage, and a reasonably capable system, since the training algorithm is computationally intensive.
Go to IMDb’s datasets page and download title.basics.tsv.gz. Once downloaded, extract the file into the program’s directory and rename it (including the extension) to “basemoviedata.csv”.
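If you would rather script the extraction and renaming than do it by hand, a minimal Python equivalent (assuming the archive sits in the program’s directory) is:

import gzip
import shutil

# Decompress IMDb's archive into the filename the scraper expects.
with gzip.open("title.basics.tsv.gz", "rb") as src, open("basemoviedata.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)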
After downloading this file, specify the genre(s) you would like to scrape from IMDb in the array at the beginning of the file. Run Webscraper.py and allow the program to scrape and sort titles and plot summaries by movie genre before running Generator.py.
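The variable name below is only illustrative; edit whatever the array at the top of the file is actually called:

# Genres to scrape and sort -- edit this list before running Webscraper.py.
GENRES = ["Horror", "Comedy"]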
The first time Generator.py is run, it will create text sequences for the desired genre before creating and training the actual model. Since this program attempts to create movies that adhere to specific genres, a new model is created and trained for each specified genre. However, because of the time required to build a single model, and to prevent accidentally starting a 20+ hour run, for each model you would like to create you must replace the “__GENRE_NAME_HERE__” string with the desired genre and data type.
Instantiate the Generator() class with __GENRE_NAME_HERE__ as the only input. Once the model(s) have been created, you can run Generator.go() with __GENRE_NAME_HERE__ as the only input and text will be generated!
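Following the interface described above, a session might look like this (“Horror” stands in for whichever genre you scraped and trained):

from Generator import Generator

genre = "Horror"     # placeholder -- use a genre you scraped with Webscraper.py

Generator(genre)     # first run: builds sequences, then creates and trains the model
Generator.go(genre)  # once the model exists: generates new text for that genre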
Minor Notes about the Generator:
By default, the network is set to a batch size of 128, 100 epochs, and an output of 50 words, all of which can be changed to the user’s liking before the program is run. While these numbers may look arbitrary, with the default settings each model (assuming you have ~10,000 sequences) will take over an hour to train before remotely coherent text can be generated from it.
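The names below are illustrative placeholders for the settings described above; the real variables live in Generator.py:

BATCH_SIZE = 128     # sequences per training batch
EPOCHS = 100         # full passes over the training sequences
OUTPUT_LENGTH = 50   # words generated per run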