Data Analyst: Lu Liu
Visual / UI Designer: Lu Liu
Software Development: Lu Liu
Duration: 3 months
Naming a movie requires deliberation because a film's title plays a significant role in attracting audiences at first glance. The title can be
a question the film answers: What Ever Happened to Baby Jane? (Robert Aldrich, 1962); a plot summary: Alice Doesn't Live Here Anymore (Martin Scorsese, 1974); or a reflection of the characters' feelings: Shame (Steve McQueen, 2011). Every single word in a title matters; even the use of a symbol needs careful consideration.
Puzzle of Naming is an interactive data visualization for exploring the hidden mechanisms behind film titles, especially in the context of release time and related topics. The primary analysis approach is to investigate every word in the titles with various Natural Language Processing methods, such as Latent Dirichlet Allocation and Doc2vec. All information extracted from the titles, such as length, part of speech, and semantics, is converted into visual representations.
Data Mining and Data Analysis
I grabbed the raw data from Kaggle, an online community of data scientists and machine learning practitioners. The dataset contains over 45,000 movies featured in the Full MovieLens dataset, described by 24 fields including 'popularity', 'release date', and 'original language'. Python is the crucial language for data ETL (extract, transform, load) in this project to get the data ready for visualization. After preprocessing, about 32,000 records remained for visualization development.
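A minimal sketch of the cleaning step with pandas. The column names ('title', 'release_date', 'original_language') mirror the Kaggle movies dataset, but the tiny in-memory table here is illustrative, not the real export:

```python
import pandas as pd

# Toy stand-in for the raw Kaggle export; the real file would be
# loaded with something like pd.read_csv("movies_metadata.csv").
raw = pd.DataFrame({
    "title": ["Bird", "Shame", None, "The Clay Bird"],
    "release_date": ["1990-05-01", "2011-09-02", "1975-01-01", "bad-date"],
    "original_language": ["en", "en", "en", "bn"],
})

# Drop rows with missing titles and unparseable release dates —
# the kind of filtering that shrank 45,000 records to ~32,000.
clean = raw.dropna(subset=["title"]).copy()
clean["release_date"] = pd.to_datetime(clean["release_date"], errors="coerce")
clean = clean.dropna(subset=["release_date"])
clean["year"] = clean["release_date"].dt.year

print(clean[["title", "year"]])
```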
I first ran Tokenization, POS (part-of-speech) Tagging, and Sentiment Analysis with NLTK (the Natural Language Toolkit for Python). Due to the limitations of the resources NLTK provides, such as WordNet (slow, and no neural network models), the outputs from this step were not reliable enough on their own.
Word2vec is a well-known NLP technique that generates representation vectors for words in a multidimensional space. Doc2vec is an NLP tool heavily based on word2vec that represents an entire document as a vector while preserving the word order within the document.
I trained word vectors on my title dataset (with Gensim) and then intersected them with pre-trained word and phrase vectors trained on part of the Google News dataset (about 100 billion words) for more precise results, given the sample size.
One example result is the film titled "Bird"; its most similar titles in the database include:
The Clay Bird
The Pink Panther
LDA (Latent Dirichlet Allocation)
Besides the "word embedding space" constructed from the similarity between the titles themselves, I also used an LDA model to create a "topic space". LDA is a statistical topic model for discovering the latent topics in a collection of text documents; it assumes that each document is a mixture of a small number of topics and that each term's presence in the corpus is attributable to one of those topics. After training the LDA model on my dataset of film titles and overviews, I got two outputs: 1) latent topics with keywords for each topic, and 2) titles and their attribution to each topic.
Below is the list of 8 topics generated by my LDA model, with some of their keywords. The topic names listed on the left were assigned subjectively.
Topic 0: Wonder / Imagination
Topic 1: Superhero / Campaign
Topic 2: Teenagers / Growth
Topic 3: Music / Stage
Topic 4: Comedy / Hollywood
Topic 5: Narratives / History
Topic 6: Detectives / Suspects
Topic 7: Love / Family
The primary visualization consists of two interfaces: Overall Display and Detailed Information. The core visualization of this project was developed in Processing (a Java-based development environment), and the user interface was designed in Sketch.
The big picture displays all the titles as dots in one window. The connections among the titles themselves and their relationships to the relevant topics are presented by their positions (coordinates) in a 3D world, using t-SNE to reduce the dimensionality from 8 to 3. In addition, all titles under the same topic share one distinct color and are enclosed within a convex hull. Users can check the detailed information of a selected title by clicking it, including the overview of the corresponding film, its genres, and similar titles (similar titles are connected by colorful gradient lines). A navigation system is also set up in the application.
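The 8-to-3 reduction step can be sketched with scikit-learn's t-SNE; the random matrix below stands in for each title's 8-topic attribution vector from the LDA model:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the per-title topic distributions: 200 titles,
# each an 8-dimensional vector of topic weights summing to 1.
rng = np.random.default_rng(0)
topic_vectors = rng.random((200, 8))
topic_vectors /= topic_vectors.sum(axis=1, keepdims=True)

# Reduce 8 dimensions to 3 for positioning dots in the 3D world.
tsne = TSNE(n_components=3, perplexity=30, random_state=0)
coords = tsne.fit_transform(topic_vectors)

print(coords.shape)  # (200, 3): one 3D position per title
```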
Evaluations & Observations
In order to aptly visualize all the information within one application and learn more about how audiences generally interpret titles,
I planned two user studies. The Pre-User Study was conducted at the very beginning of this project, and the Post-User Study was conducted to understand the users' pain points during exploration. Also, since the application offers no ready-made conclusions about the rules of naming films, the feedback from the user studies is valuable as inspiration and direction for further research into naming patterns.
For example, when playing with the timeline in the application, it is not hard to observe that the dominant topic changes all the time. The figure below (drawn with the visualization tool Charticulator) shows how the distribution of the generated underlying topics shifts over the decades. A peak of the topic Superhero / Campaign appears between 1940 and 1980. One way to explain this peak is to check what happened in history back then, especially war-related events; from my perspective, World War II and the Cold War are two plausible causes of this increase.
Another observation concerns the most dominant genres of the film titles under specific latent topics. It can serve as a qualitative evaluation of the LDA model used in this project, because the inferred underlying topics appear well explained by the pre-assigned genres. Moreover, this match between genres and latent topics supports the claim that the titles and overviews describe their films well, even though they are sometimes quite abstract and short.