Investigating Dialogue in Top Ten American Films
Unique Words, Dirty Words, and Word Rummaging
This is a draft study of word counts in American film dialogue, using a dataset built from community-created SRT subtitle files in the OpenSubtitles repository. The film list was generated from Wikipedia "year in film" articles from 1927 to 2022, taking the ten highest-grossing movies for each year. Wikipedia "decade in film" lists from the 1970s to the 2010s rounded out the list. After adjusting for a number of factors, including the removal of silent films, unavailable SRT files, and duplicate detections, 868 films in total were included in the study.
Three interactive charts were created from the dataset. The first ranks films by their total unique word count and by their average unique word count per hour, selectable by decade. The second examines the infamous "seven dirty words", which featured in a 1972 George Carlin comedy routine and later figured in a Supreme Court decision that helped define the extent to which the federal government could regulate speech over public airwaves. Finally, there is a tool that lets you generate charts based on your own word searches.
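To make the two ranking measures concrete, here is a minimal Python sketch of how a unique word count and a unique-words-per-hour figure might be derived from the text of a single SRT file. It is an illustration only, not the method used to build the dataset: the function name `unique_word_stats`, the tokenizing rule (lowercased runs of letters, with straight apostrophes kept), and the use of the last subtitle's end time as the film's running time are all assumptions.

```python
import re

# Matches SRT timing lines such as "00:01:02,500 --> 00:01:05,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)
# Inline markup sometimes found in subtitles, e.g. <i>...</i> or {\an8}.
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")
# Assumed tokenizer: runs of letters, optionally joined by straight apostrophes.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")


def _seconds(h, m, s, ms):
    """Convert the four captured timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def unique_word_stats(srt_text):
    """Return (unique_word_count, unique_words_per_hour) for one SRT file.

    Assumption: running time is approximated by the latest subtitle end time.
    """
    words = set()
    last_end = 0.0
    for line in srt_text.splitlines():
        line = line.strip()
        timing = TIMING.match(line)
        if timing:
            # Groups 5-8 hold the end timestamp of the cue.
            last_end = max(last_end, _seconds(*timing.groups()[4:]))
            continue
        if not line or line.isdigit():
            # Skip blank separators and cue index lines.
            continue
        words.update(WORD.findall(TAG.sub("", line).lower()))
    hours = last_end / 3600
    per_hour = len(words) / hours if hours else 0.0
    return len(words), per_hour
```

Under these assumptions, a per-decade ranking would amount to running the function over each film's subtitle file and sorting the results. Different choices for tokenizing or for measuring running time would shift the numbers, so the figures from this sketch should not be expected to match the charts exactly.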