Investigating Dialogue in Top Ten American Films

Unique Words, Dirty Words, and Word Rummaging

This is a draft study of word counts in American film dialogue, using a dataset built from community-created SRT files in the OpenSubtitles repository. The list of films was generated from Wikipedia "year in film" articles from 1927 to 2022, taking the ten top-grossing movies for each year. Wikipedia "decade in film" lists from the 1970s through the 2010s rounded out the list. After adjusting for a number of factors, including removal of silent films, unavailable SRT files, and duplicate detections, a total of 868 films were included in the study.

There are three interactive charts created from the dataset. The first ranks films by their total unique word counts and their average unique word count per hour, selectable by decade. The second studies the infamous 'seven dirty words', which were part of a 1972 George Carlin comedy routine and later figured in a Supreme Court decision that helped define the extent to which the federal government could regulate speech over public airwaves. And finally, there is a tool that lets you generate charts based on your own word searches.

The winner for top average unique word count among the 800+ films in this study is the 1934 film Cleopatra, with 650 unique words per hour!

Written by Waldemar Young, Vincent Lawrence, and Bartlett Cormack, the Cecil B. DeMille historical epic probably benefits from a dozen or so Roman names including my favorite, Mark Antony's best friend and first general, Enobarbus. Say that three times in a sentence.

At the other end, bringing up the rear at an average of a mere 55 words per hour, is the fantasy epic The Lord of the Rings: The Return of the King. I suppose this is unsurprising when you consider the length of that particular film, clocking in at 201 minutes, and the nature of fantasy dialogue being a bit less conversational. Most of its character get-togethers involve such high stakes that exchanges tend to be brief. And because this word study uses NLTK's stopwords list to remove the most common English terms like "I," "the," and "and," famous turns of phrase get reduced to a single unique word.
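As a rough illustration of that reduction, here is a minimal Python sketch of how a unique-words-per-hour figure could be computed. It is not the study's actual code: the function names are hypothetical, the tokenizer is a simple regular expression, and the small `STOPWORDS` set is only a stand-in for NLTK's much longer English stopword list. The sample line is the piece of dialogue quoted just below.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list; the study uses
# NLTK's list, this short set only keeps the sketch self-contained.
STOPWORDS = {"i", "can", "t", "it", "for", "you", "but", "the", "and", "a"}


def unique_words(dialogue):
    """Return the set of distinct non-stopword tokens in a block of dialogue."""
    # Lowercase, then keep runs of letters; "can't" splits into "can" and "t".
    tokens = re.findall(r"[a-z]+", dialogue.lower())
    return {tok for tok in tokens if tok not in STOPWORDS}


def unique_words_per_hour(dialogue, runtime_minutes):
    """Normalize the unique word count by the film's runtime in hours."""
    return len(unique_words(dialogue)) / (runtime_minutes / 60)


line = "I can't carry it for you, but I can carry you."
print(unique_words(line))  # {'carry'}
```

With stopwords removed, the whole sentence contributes one unique word, which helps explain why a long film with terse dialogue lands at the bottom of a per-hour ranking.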

"I can't carry it for you, but I can carry you."

Given how many genres The Movie Database attaches to a single film, it's somewhat impractical to color code by genre. But it's still useful to uncheck all genres except a particular one. For example, you can look at just Science Fiction, and you will find that two comedies take the top spots for most unique words per hour: Son of Flubber (1963) with 511 words and The Nutty Professor (1996) with 469 words. And in the 'completely makes sense' spot for fewest words in Sci-Fi is A Quiet Place Part II (2021) at 82 words per hour.

The 'seven dirty words' visualization is the original inspiration for this study, given a question I was hoping to answer. The film industry began moving its blockbuster business away from R-rated properties in the late 1990s and early 2000s. I called it 'The Little Mermaid Effect': in 1989 a kids' movie could be the number six grossing film, and studios realized how many potential ticket sales go unsold for an R-rated summer blockbuster, given that kids are unable to attend.

I figured that around the same time there was a 'Die Hard' peak in 1988, a film whose dialogue makes prolific and enduring use of profanity. But the data doesn't show this clearly in the top ten films since the 1990s. If you exclude Bad Boys II (2003) as an outlier, the numbers do trend downward to the present. But those Bad Boys keep messing with my theory, returning in Bad Boys for Life (2020).

To answer this question I need many more films from each year to get a better sense of trends in profanity usage. The chart intentionally leaves out all films prior to 1968, since the top ten films before that year contain zero profanity. And an earlier version of the study included The Wolf of Wall Street (2013), which holds the F-bomb record at an astounding 500+ utterances.

I hope you find words whose frequencies are interesting to explore, and I'm sure this will improve as more films are included in the study. It's probably in this space that I see the most opportunity for advancement in the project. It would be compelling to build a community-generated database of curated SRT files, covering as many films and as many translations as possible, to create a queryable API of unique word counts and maybe phrase counts as well.
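To make the idea of a word search a little more concrete, here is a minimal Python sketch of the kind of query such a tool might run. Everything in it is an assumption for illustration: the function name, the `(title, year, dialogue)` tuple layout, and the placeholder films and dialogue are invented and do not come from the study's dataset.

```python
import re
from collections import Counter


def word_frequency_by_year(films, word):
    """Count occurrences of `word` per release year.

    `films` is an iterable of (title, year, dialogue_text) tuples; this
    layout is hypothetical, not the study's actual schema.
    """
    target = word.lower()
    totals = Counter()
    for _title, year, dialogue in films:
        # Simple tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", dialogue.lower())
        totals[year] += sum(1 for tok in tokens if tok == target)
    return dict(totals)


# Placeholder data, invented purely for the example.
films = [
    ("Film A", 1988, "What a party. Nice party."),
    ("Film B", 1988, "No party here."),
    ("Film C", 2003, "We ride together."),
]
print(word_frequency_by_year(films, "party"))  # {1988: 3, 2003: 0}
```

Grouping by year keeps the result easy to chart, and years with no matches still appear with a count of zero so gaps in the timeline stay visible.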

Such a project may have some copyright issues to contend with, but hopefully, because the data is organized around word search rather than script or subtitle file hosting, it will be considered transformative enough to qualify as fair use.

References

Data

Wikipedia film lists by year, 1984 in Film as an example
Wikipedia film lists by decade, 1980s in Film as an example
Python OpenSubtitles, a Python wrapper for the OpenSubtitles API
SRT parser modified from English Words from SRT files
– for film poster images and runtimes
