Computer Generated Movie Titles and Summaries
Olin College of Engineering Software Design Depth Final, 2020: created by declan & alex
Table of Contents
- What’s the Big Idea, bud?
- Our Process
- Implementation Information
- Results
- Ethical Considerations
- README
What’s the Big Idea, bud?
We set out to learn about neural networks and machine learning by trying to implement an algorithm that could generate realistic movie titles and summaries. We found movie summaries and titles from IMDb’s website and separated them by genre. Once categorized, our neural network takes the movie data for a particular genre and attempts to pick out the important aspects of the data. From there, it generates new text that matches key elements of the training data.
How did this project come to be?
We set out with the goals of:
- Scraping titles and summaries from IMDb’s movie webpages to build a training set
- Using some form of machine learning to generate fake IMDb information from that set
What is this?
Using text data from IMDb, we created and trained genre-specific models that are able to generate new titles and summaries for movies that have never existed.
Why?
As a proof of concept of a working neural network after our exploratory deep dive this semester.
Our Process
The development of each section of this project began with very different strategies, but in the end synchronized into a collaborative coding experience.
The Web Scraper was developed experimentally, but its progression followed a linear path. We started by poking around the HTML of IMDb’s movie pages to understand the template they use and the pattern behind how the URLs are generated. From this, a baseline web scraper was developed. While it could only scrape the single page it was given, it pulled the information we needed from that page and gave us the generalized URL ‘prefix’ and ‘suffix.’
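A minimal sketch of the prefix/suffix idea, not the project's actual scraper. The URL pieces, the `plot-summary` class name, and the helper names are assumptions made for illustration, and the standard library's `html.parser` stands in for BeautifulSoup so the example runs without extra installs.

```python
from html.parser import HTMLParser

# Assumed URL pieces; the write-up only says a general prefix and suffix exist.
URL_PREFIX = "https://www.imdb.com/title/"
URL_SUFFIX = "/plotsummary"

# Tags that never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def build_url(call_number):
    """Join prefix, call number (e.g. 'tt0000001'), and suffix into a page URL."""
    return f"{URL_PREFIX}{call_number}{URL_SUFFIX}"


class SummaryParser(HTMLParser):
    """Collect the text of every element whose class list has TARGET_CLASS."""

    TARGET_CLASS = "plot-summary"  # hypothetical class name, not IMDb's markup

    def __init__(self):
        super().__init__()
        self._depth = 0       # > 0 while inside a matching element
        self._chunks = []     # text pieces of the summary being read
        self.summaries = []   # finished summaries, one string each

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or self.TARGET_CLASS in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace before storing the summary.
            self.summaries.append(" ".join("".join(self._chunks).split()))
            self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)
```

Feeding a downloaded page's HTML to `SummaryParser().feed(...)` fills `summaries`; the class name would have to be replaced with whatever the real page template uses.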
The next task was finding out how these movies were organized. While we could theoretically try to scrape every movie URL and sort the results on the fly, IMDb has preventative measures against DDoS attacks that block too many requests from a single IP over a period of time, so we needed to be precise about which movies we were scraping. This led us to IMDb’s interface datasets: documents that contain almost every movie that has ever been produced, along with the title, genre(s), and call number that IMDb uses to refer to each movie. Using this dataset, we can sort the movies by genre and store their corresponding call numbers for accessing IMDb.
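The genre sort described above could look roughly like the sketch below. It relies on the column names IMDb publishes for title.basics.tsv (`tconst`, `titleType`, `genres`, with `\N` marking a missing value); the function name, the restriction to `titleType == "movie"`, and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def call_numbers_by_genre(tsv_file, title_type="movie"):
    """Map each genre to the list of call numbers (tconst) tagged with it."""
    by_genre = defaultdict(list)
    # The file is tab-separated and titles may contain quote characters,
    # so quoting is switched off.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["titleType"] != title_type or row["genres"] == r"\N":
            continue
        # A title can carry several comma-separated genres.
        for genre in row["genres"].split(","):
            by_genre[genre].append(row["tconst"])
    return dict(by_genre)


# Invented sample rows with only the columns the function reads.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\tgenres\n"
    "tt0000001\tmovie\tFirst Film\tWestern,Drama\n"
    "tt0000002\tshort\tA Short\tWestern\n"
    "tt0000003\tmovie\tThird Film\tWestern\n"
    "tt0000004\tmovie\tNo Genre\t\\N\n"
)

genres = call_numbers_by_genre(io.StringIO(SAMPLE))
```

With the real dataset, the same function can be handed an open file instead of the `io.StringIO` sample. Renaming the file to a `.csv` extension does not change its tab-separated contents, so the `delimiter="\t"` setting still applies.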
Now that we had a program capable of scraping a page given its URL as input, and a program capable of returning a list of call numbers for the movies in any genre, all we needed was a simple pickling program to store this data. While there were plenty more changes to this code throughout the project, they were largely for compatibility and redundancy with other systems and did not alter the Web Scraper’s overall flow of control. Additionally, we had to keep our IMDb requests under approximately 6,000 per hour.
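One way to respect the request budget and store the results is sketched below. Spacing requests evenly (3600 s / 6,000 requests = 0.6 s apart) is an assumption about how the limit could be enforced, and `Throttle`, `save_scrape`, and `load_scrape` are hypothetical names rather than the project's own.

```python
import pickle
import time

MAX_REQUESTS_PER_HOUR = 6000
MIN_INTERVAL = 3600 / MAX_REQUESTS_PER_HOUR  # 0.6 seconds between requests


class Throttle:
    """Block so that consecutive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval=MIN_INTERVAL,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def save_scrape(data, path):
    """Pickle scraped data (e.g. a dict of call number -> summary text)."""
    with open(path, "wb") as handle:
        pickle.dump(data, handle)


def load_scrape(path):
    """Load data previously written by save_scrape."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Calling `throttle.wait()` immediately before each page request keeps a long scrape under the hourly ceiling without tracking a running count.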
The Generator section was developed more methodically,
Implementation Information:
Dependencies:
- Numpy
- BS4 (BeautifulSoup)
- Scipy
- Scikit-learn
- Pillow
- H5py
- Pandas
- Virtualenv
- Tensorflow
- Keras
System Architecture:
The architecture for this project is easily split into two distinct sections: the Web Scraper and the Generator.
Webscraper.py
The Internet Movie Database (IMDb) is a steadily growing database that stores descriptive information about each movie, including cast members, synopsis, and plot summary. Because of how IMDb adds movies to its site, each movie is given a call number that is used to refer to the title and to build its IMDb URL. All of these call numbers, titles, genres, and other descriptive labels are stored in IMDb’s downloadable “interface” datasets, ready to be scraped and organized by the Web Scraper.
Generator.py
After the information is scraped from IMDb by genre, the Generator portion handles four steps: cleaning and preparing the data (which eliminates extraneous vocabulary from the network), assembling word sequences, creating the model to be trained, and training the model on the word sequences. Once the model has been trained, the program is able to quickly output a new string of text from an input of “seed” text.
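The cleaning and sequence-assembly steps can be illustrated with a short sketch. The exact cleaning rules and sequence length used in Generator.py are not stated here, so lowercasing, stripping punctuation, and the sliding-window length below are assumptions, as are the function names.

```python
import string

# Characters removed during cleaning: ASCII punctuation plus curly quotes.
_STRIP = str.maketrans("", "", string.punctuation + "‘’“”")


def clean_text(text):
    """Lowercase the text, drop punctuation, and split it into word tokens."""
    return text.lower().translate(_STRIP).split()


def build_sequences(tokens, length):
    """Slide a window over the tokens.

    Each sequence holds `length` input words followed by the one word
    the model should learn to predict next.
    """
    return [tokens[i:i + length + 1] for i in range(len(tokens) - length)]


tokens = clean_text("The sheriff can't ride, Tom!")
sequences = build_sequences(tokens, 3)
```

Dropping punctuation this way turns "can't" into "cant", which keeps the vocabulary small at the cost of the unpunctuated output the network then produces.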
The first time that Generator.py is run, it will create text sequences for the desired genre before creating and training the actual model. Since this program is attempting to create movies that adhere to specific genres, a new model will be created and trained for each specified genre. However, because of the time required to make a single model, and to prevent accidentally starting a 20+ hour run, you must replace the “__GENRE_NAME_HERE__” string with the desired genre and datatype for each model that you would like to create.
A Timely Note about the Generator:
By default, the network is set to a batch size of 128, 100 epochs, and an output of 250 words, all of which can be changed to the user’s liking before the program is run. While these numbers may seem meaningless to an onlooker, with the default settings each model (assuming you have ~10,000 sequences) will take over three (3) hours to train before remotely coherent text can be generated from it.
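To put the defaults in perspective, the arithmetic below estimates how many batches a full run processes. It assumes each sequence is one training sample and that a final partial batch still counts as a step; the per-batch time is simply the quoted three hours divided by that count, so treat it as a rough lower bound rather than a measurement.

```python
import math


def batches_per_run(n_sequences, batch_size, epochs):
    """Total number of batches processed over a whole training run."""
    steps_per_epoch = math.ceil(n_sequences / batch_size)
    return steps_per_epoch * epochs


total_batches = batches_per_run(10_000, 128, 100)  # 79 steps per epoch * 100 epochs
seconds_per_batch = 3 * 3600 / total_batches       # three hours spread over every batch
```

Since the step count scales linearly with epochs, halving the epoch count halves the number of batches processed.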
Project Results:
Samples of WebScraper Outputs:
Samples of Generator Outputs:
The direct output was:
in the disembodied of the curse of the flying to join him in the mojave desert to mount sheik does in a machinegun executive entertainer furth the villain of the largest isle where the cult crimes carries the original white girl that serves as a false herd short betty works in the navy the kid and gets his sweetheart sidekick walker is sent to investigate the man veteran as sworn to win the plot of the other company is the perfect opportunity to introduce his fortune and heads for haverlyte the event the sioux attack to grass district is bobby leads islandfever they be a canadian hero the other ocean and her abandoned a young pony express rider reporter for the cantina the crook is caught in the south billie region a actual hero diana summoned and is completely than a british villain and a brave handsome take in a western café cigarette wanting a virginia bishop and cant be projected to believe the money of the royal canadian mounted dog california brander is know himself to the sleepy little are so of the deed helen store she is protected by bandit coast colony younger brother based hang he is murdered he meets her brother tom norris in and hatches he has broken team and blames fickle ann slaps the sheriff and lets them the archaeologist in marys cavalry faith has a young man falsely a vulture to make her resentment a hardworking mexican girl is overlooked to string harry
After ~5 minutes of light cleaning:
Disembodied by the curse, the Flying join him in the mojave desert to the mount Sheik. A machine gun executive entertainer, the villain of the largest isle where the cult carries out crimes has the original white girl that serves. A false herd passes, Short Betty works in the navy with the kid when he gets his sweetheart sidekick, Walker. They’re sent to investigate a veteran who has sworn to win the plot of the other company. This is the perfect opportunity to introduce his fortune and heads for Haverlyte, the event the sioux attack.
To the grass district, Bobby leads Islandfever, a canadian hero, to the other ocean and her abandoned young Pony Express Rider reporter for the cantina. Meanwhile, the crook is caught in the south billie region when an actual hero, Diana, is summoned. Everything is complete, a british villain and a brave handsome woman take a western café cigarette, wanting a Virginia Bishop.
Projected to believe the money of the royal canadian mounted dog, the california brander is known to himself. Sleepy Little has the deed to Helen’s store, but she is protected by the bandit coast colony, and her younger brother. Based and hanged, he is murdered before he meets her brother, Tom Norris. In the hatches he has broken teams and blames fickle Ann, as she slaps the sheriff. This lets the archaeologist in Mary’s Cavalry have faith, just as a young man falsely had in a vulture to make him resent. (A hardworking mexican girl is overlooked to string up Harry.)
Ethical Considerations:
The program that we created is easily seen as two pieces that work together. While the first piece, the web scraper, is easily seen as ethically acceptable, the second half, the generator, sits in a grey area.
The generator side of the program is made to train a relatively lightweight model and produce more content in the same likeness. To make sure that our program was training along the right path, we periodically checked its outputs against its inputs for clear evidence of learning. While making synopses for movies that do not exist doesn’t seem like an ethical concern, it’s the training program itself that attracts the most concern. Instead of using the program on impersonal data, it could be applied to someone’s online presence or even online interactions.
Our program is nowhere near the level of replicating the choices of a real person in any facet other than writing; however, if someone had a large amount of data about a single person, then they could target them in concerning ways. The ethical concerns that our program could raise are minimized by the nature of our program. The generator is a relatively simple neural network that we were only able to create thanks to the widely available and easy-to-use Keras libraries and documentation. While an individual could attempt to alter this code for malicious purposes, the effort expended to do so would be better placed in creating their own network that is tailored to their own liking.
README: Computer Generated Movie Titles and Summaries
Description
This project has two parts. The first is a web scraping script that accesses IMDb plot summary pages, scrapes text, and saves that text to files. The second is a neural network that uses the saved text files as training data. It identifies the key aspects of the training data and uses those features to generate a product that imitates the original text.
Getting Started
To get to a usable state, you must install the required libraries using the command below:
$ sudo pip install -r requirements.txt
Once complete, go to IMDb’s Interfaces page and navigate to their datasets page, and download title.basics.tsv.gz. Once downloaded, extract the file to the program’s directory, and rename the file and extension to “basemoviedata.csv”.
(Note that if this is your first time using Keras, you will have to set up the virtual environment as well. The instructions are found here.)
$ python3 -m venv kerasenv
$ cd kerasenv
$ source bin/activate
Go back to the main directory and run the following command:
$ pip3 install keras
Usage:
In order to run this code you will need the basics label interface from IMDb’s downloadable interface files (they are quite large and manually unusable), ~2 GB of storage, and a decently capable system to run the program on; this largely stems from the training algorithm being computationally intensive.
- Go to IMDb’s Interfaces page and navigate to their datasets page, and download title.basics.tsv.gz. Once downloaded, extract the file to the program’s directory, and rename the file and extension to “basemoviedata.csv”.
- Run webscrapper.py with the genre(s) you’d like to scrape.
- Next, run Generator.py with the genre that you’d like to train the model for.
- Once the model(s) have been created you need only run
Generator.go()
with the genre name to get a new output.
LICENSE
MIT License
Copyright (c) 2020 Alexander Butler, Declan Ketchum
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.