Comparing the First and Last Books of 20 Authors
By Julian Yanover
How do the first books of acclaimed writers, published in their youth, compare to their last works released 20, 40, or even 60 years later?
We wanted to find out whether an author’s vocabulary changes over time. To do that, we had to find a way to measure the quantity and variety of vocabulary relative to the full length of the book.
Does vocabulary become enriched due to experience? Or does it decrease?
With these initial questions, we began our research using artificial intelligence and natural language processing software, new tools that make this previously unthinkable task possible.
This is what we found.
What We Analyzed
- 20 authors.
- 40 books.
We selected 20 authors, taking one of their earliest works and one of their last published works.
Thus, for instance, we examined Doris Lessing’s “The Grass Is Singing” (1950) and “Ben, in the World” (2000), Ray Bradbury’s “Fahrenheit 451” (1953) and “Farewell Summer” (2006), and Virginia Woolf’s “The Voyage Out” (1915) and “Between the Acts” (1941).
It’s important to note that we always sought the longest time gap between books that we could find.
How We Processed It
- Database with 3,509,555 words.
- Natural language processing (NLP) software and artificial intelligence to identify and unify similar words through lemmatization.
- Our own scoring system to standardize criteria and evaluate lexical diversity.
The books were digitally stored, word by word, in a database.
We performed lemmatization of each word, which involves reducing all the variants that a word might have to obtain its base lemma. For instance, buy, buys and bought are grouped under the lemma buy. This technique is a tool of natural language processing that ensures the final result is more accurate by only counting lemmas and not all the variations of a word.
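To make the idea concrete, here is a minimal sketch of counting lemmas instead of surface forms. The lookup table `TOY_LEMMAS` and the helper names are hypothetical and hand-written for illustration only; they are not the software used in the study, where an NLP library would supply lemmas for the whole vocabulary.

```python
from collections import Counter
from typing import Iterable

# Hypothetical, hand-written lemma table for illustration only.
# A real NLP library would cover the full vocabulary.
TOY_LEMMAS = {
    "buys": "buy",
    "bought": "buy",
    "buying": "buy",
    "books": "book",
}


def lemmatize(word: str) -> str:
    """Return the base lemma, or the lowercased word if it is not in the table."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def count_lemmas(words: Iterable[str]) -> Counter:
    """Count occurrences per lemma rather than per surface form."""
    return Counter(lemmatize(w) for w in words)


words = ["She", "buys", "books", "he", "bought", "a", "book", "they", "buy"]
counts = count_lemmas(words)

print(len({w.lower() for w in words}))  # 9 distinct surface forms
print(len(counts))                      # 6 distinct lemmas
print(counts["buy"])                    # 3: buys, bought, buy
```

Counting lemmas keeps a book from looking more varied simply because it uses many inflections of the same word.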
The first problem we faced was that longer books generally contain more unique words, but those words make up a smaller percentage of the total. This is natural: in a book of 100,000 words we’ll encounter more terms, but there will also be more repetition than in a 5,000-word book.
We worked out a way to solve this with the help of artificial intelligence.
The AI presented us with some possible methodologies, like dividing the number of unique lemmas by the square root of the total words in the book or similarly dividing it by its logarithm. However, although the calculation was less influenced by the book’s length, it was still affected by the total number of words in each publication.
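The following sketch shows the raw ratio and the two length adjustments as they are described above, applied to the word and lemma counts listed for the two Agatha Christie books in the full table. The function names are ours, chosen for illustration, and the natural logarithm is an assumption since the base is not stated.

```python
import math


def type_token_ratio(unique_lemmas: int, total_words: int) -> float:
    """Raw share of unique lemmas; shrinks as a text gets longer."""
    return unique_lemmas / total_words


def root_ratio(unique_lemmas: int, total_words: int) -> float:
    """Unique lemmas divided by the square root of the total word count."""
    return unique_lemmas / math.sqrt(total_words)


def log_ratio(unique_lemmas: int, total_words: int) -> float:
    """Unique lemmas divided by the (natural) logarithm of the total word count."""
    return unique_lemmas / math.log(total_words)


# (total words, unique lemmas) taken from the full table
styles = (56415, 4605)     # The Mysterious Affair at Styles
elephants = (23439, 2163)  # Elephants Can Remember

for name, (total, lemmas) in {"Styles": styles, "Elephants": elephants}.items():
    print(
        name,
        round(type_token_ratio(lemmas, total), 4),
        round(root_ratio(lemmas, total), 2),
        round(log_ratio(lemmas, total), 2),
    )
```

On these two books the raw ratio favors the shorter novel while the square-root version favors the longer one, which illustrates how much the choice of adjustment still depends on length.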
We also considered comparing only samples of the same size from each text, but in some cases this left 90% of the book out of the analysis.
Finally, we took the sliding window approach. We developed a script that analyzed each book’s language diversity in segments of 1,000 words and took the average variety of lemmas as the result score for each work. This way, the entire book was included in the analysis, and the factors that could skew the final figures were minimized.
In this manner, we obtained a lexical diversity score for each book, on a scale of 1 to 100, where a higher score indicates a broader use of language.
The Case of Agatha Christie
The case of Agatha Christie deserves special mention.
This detective novel author made headlines in 2009 when a group of researchers suggested, after analyzing several of her works and noting how her vocabulary narrowed in her books as she aged, that she must have suffered from Alzheimer’s.
This news was the inspiration for this article.
It also provided an opportunity to run the code we created on two of her books and see for ourselves whether we found any noteworthy results, and if they aligned with the findings of those researchers.
We analyzed “The Mysterious Affair at Styles”, her first Hercule Poirot novel, published in 1920 when she was 30 years old, and “Elephants Can Remember” from 1972, also featuring Poirot, published when she was 82.
While “The Mysterious Affair at Styles” achieved a lexical diversity score of 84.9, “Elephants Can Remember”, published 52 years later, managed only 67.4, a very wide difference of 17.5 points.
As you will see in the conclusions below, within our study of 20 authors, the next largest difference recorded between a single writer’s works was 13.7 points, and it was an increase in Margaret Drabble’s language variety over the years, not a decrease as in Christie’s case.
Increased Lexical Diversity Over Time
Margaret Drabble stands out with the most significant increase in lexical diversity. “The Millstone”, published when she was just 26, had a score of 78.7. Decades later, “The Sea Lady”, published when she was 67, showed a substantial leap to 92.4, suggesting a significant broadening of her linguistic canvas over time.
Beryl Bainbridge also showed an upward trajectory in lexical diversity. Her early work “The Dressmaker”, published when she was 41 years old, had a lexical diversity score of 87.3. In contrast, “The Girl in the Polka Dot Dress”, published posthumously, registered a notable increase to 96.7, demonstrating a more varied use of language in her later years.
Similarly, Mary Shelley, the famed author of “Frankenstein”, showed growth in her lexical repertoire. Her debut novel, written at the young age of 21, had a lexical diversity score of 90.3. This score rose to 93.7 in her later novel “Falkner”, reflecting a more varied use of language as she matured.
Decreased Lexical Diversity Over Time
On the flip side, some authors saw a decline in lexical diversity as they aged.
Doris Lessing’s “The Grass Is Singing”, written at 31, had a lexical diversity score of 84.2, which fell to 73 when “Ben, in the World” was published as she was turning 81. This suggests a narrowing in the variety of her word choices in her later years.
Kazuo Ishiguro also experienced a decline in lexical diversity over his career. His early work “A Pale View of Hills” scored 76.9 at age 28, but by “Never Let Me Go,” published when he was 51, his score had dipped to 72.9, hinting at a more restrained use of language in his more mature work.
Alice Munro, a master of the short story, exhibited a decline in lexical diversity between her early and later works. Her collection “Dance of the Happy Shades”, published when she was 37, had a lexical diversity score of 83.3. However, by the time “Dear Life” was published in 2012, when Munro was 81, the score had decreased to 79.4. While this decline is relatively modest, it could suggest a honing of her narrative voice and a possible shift towards a more concise and distilled mode of storytelling in her later years.
Below is the full chart covering all the authors studied.
Comparative Chart
[Chart: lexical diversity scores for each author’s first and last book, with each author labeled by years of birth and death.]
Full table
Author | Book | Published | Age | Total words | Unique words | Unique lemmas | Lexical Diversity |
---|---|---|---|---|---|---|---|
Agatha Christie | The Mysterious Affair at Styles | 1920 | 30 | 56415 | 5855 | 4605 | 84.9 |
Agatha Christie | Elephants Can Remember | 1972 | 82 | 23439 | 2669 | 2163 | 67.4 |
Alice Munro | Dance of the Happy Shades | 1968 | 37 | 77116 | 7772 | 6102 | 83.3 |
Alice Munro | Dear Life | 2012 | 81 | 90058 | 7677 | 5903 | 79.4 |
Beryl Bainbridge | The Dressmaker | 1973 | 41 | 47042 | 6175 | 4236 | 87.3 |
Beryl Bainbridge | The Girl in the Polka Dot Dress | 2011 | Posthumous | 47052 | 7162 | 4962 | 96.7 |
Charles Dickens | The Pickwick Papers | 1836 | 24 | 297901 | 19893 | 13998 | 90.4 |
Charles Dickens | Our Mutual Friend | 1865 | 53 | 324706 | 17474 | 12734 | 81.5 |
Charlotte Brontë | Jane Eyre | 1847 | 31 | 186223 | 13337 | 9593 | 91.1 |
Don DeLillo | Americana | 1971 | 35 | 125134 | 13289 | 10072 | 92.4 |
Don DeLillo | Falling Man | 2007 | 71 | 65940 | 7388 | 5629 | 87.9 |
Doris Lessing | The Grass Is Singing | 1950 | 31 | 29974 | 4411 | 3493 | 84.2 |
Doris Lessing | Ben, in the World | 2000 | 81 | 56700 | 4778 | 3497 | 73 |
Edith Wharton | The Valley of Decision | 1902 | 40 | 153357 | 14219 | 10431 | 96.6 |
Edith Wharton | The Gods Arrive | 1932 | 70 | 126124 | 11542 | 8841 | 88.4 |
Emily Brontë | Wuthering Heights | 1847 | 29 | 116505 | 9623 | 6832 | 96.1 |
Graham Greene | Stamboul Train | 1932 | 28 | 73180 | 7079 | 5300 | 86.4 |
Graham Greene | Monsignor Quixote | 1982 | 78 | 58270 | 5931 | 4604 | 80.3 |
Harper Lee | To Kill a Mockingbird | 1960 | 34 | 99261 | 9065 | 6972 | 85.5 |
Herman Melville | Moby Dick | 1851 | 32 | 208414 | 19698 | 14746 | 95.5 |
Jane Austen | Pride and Prejudice | 1813 | 38 | 122325 | 6750 | 4912 | 83 |
Kazuo Ishiguro | A Pale View of Hills | 1982 | 28 | 52289 | 4512 | 3225 | 76.9 |
Kazuo Ishiguro | Never Let Me Go | 2005 | 51 | 96375 | 5983 | 4397 | 72.9 |
Margaret Atwood | The Edible Woman | 1969 | 30 | 102417 | 9866 | 7386 | 90.9 |
Margaret Atwood | The Testaments | 2019 | 80 | 106975 | 10141 | 7696 | 86.5 |
Margaret Drabble | The Millstone | 1965 | 26 | 66526 | 6303 | 4889 | 78.7 |
Margaret Drabble | The Sea Lady | 2006 | 67 | 108003 | 13196 | 9622 | 92.4 |
Mary Shelley | Frankenstein | 1818 | 21 | 74936 | 7182 | 5271 | 90.3 |
Mary Shelley | Falkner | 1837 | 40 | 150084 | 12347 | 9456 | 93.7 |
Muriel Spark | The Comforters | 1957 | 39 | 60545 | 7580 | 4983 | 83.4 |
Muriel Spark | The Finishing School | 2004 | 86 | 30247 | 4478 | 3606 | 89.6 |
Paul Auster | City of Glass | 1985 | 38 | 45768 | 5334 | 4200 | 81.6 |
Paul Auster | Sunset Park | 2010 | 63 | 81034 | 8989 | 6932 | 89.9 |
Philip Roth | Goodbye, Columbus | 1959 | 26 | 78964 | 7964 | 2696 | 82.1 |
Philip Roth | Nemesis | 2010 | 77 | 57591 | 7221 | 5634 | 87.2 |
Ray Bradbury | Fahrenheit 451 | 1953 | 33 | 45760 | 5225 | 3924 | 82.9 |
Ray Bradbury | Farewell Summer | 2006 | 86 | 26730 | 3897 | 2964 | 83.9 |
Stephen King | Carrie | 1974 | 27 | 60459 | 7814 | 6022 | 90 |
Stephen King | The Institute | 2019 | 72 | 175852 | 12758 | 9771 | 88 |
Ursula K. Le Guin | A Wizard of Earthsea | 1968 | 39 | 61409 | 5429 | 4007 | 80.3 |
Ursula K. Le Guin | The Other Wind | 2001 | 72 | 69283 | 6086 | 4618 | 79.8 |
V. S. Naipaul | The Mystic Masseur | 1957 | 25 | 61522 | 5348 | 4124 | 76 |
V. S. Naipaul | Magic Seeds | 2004 | 72 | 90197 | 7547 | 5728 | 79.7 |
Virginia Woolf | The Voyage Out | 1915 | 33 | 135994 | 10491 | 7760 | 87.8 |
Virginia Woolf | Between the Acts | 1941 | 59 | 44552 | 7212 | 5597 | 94.1 |
William Shakespeare | Macbeth | 1623 | Posthumous | 18120 | 3366 | 2790 | 91 |
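The per-author changes discussed in the sections above can be recomputed from the table. The sketch below copies the first-book and last-book scores of the 20 authors who have two books listed and ranks them by the difference; the variable names are ours, for illustration.

```python
# (first-book score, last-book score) copied from the full table
SCORES = {
    "Agatha Christie": (84.9, 67.4),
    "Alice Munro": (83.3, 79.4),
    "Beryl Bainbridge": (87.3, 96.7),
    "Charles Dickens": (90.4, 81.5),
    "Don DeLillo": (92.4, 87.9),
    "Doris Lessing": (84.2, 73.0),
    "Edith Wharton": (96.6, 88.4),
    "Graham Greene": (86.4, 80.3),
    "Kazuo Ishiguro": (76.9, 72.9),
    "Margaret Atwood": (90.9, 86.5),
    "Margaret Drabble": (78.7, 92.4),
    "Mary Shelley": (90.3, 93.7),
    "Muriel Spark": (83.4, 89.6),
    "Paul Auster": (81.6, 89.9),
    "Philip Roth": (82.1, 87.2),
    "Ray Bradbury": (82.9, 83.9),
    "Stephen King": (90.0, 88.0),
    "Ursula K. Le Guin": (80.3, 79.8),
    "V. S. Naipaul": (76.0, 79.7),
    "Virginia Woolf": (87.8, 94.1),
}

# Change in score from first to last book, rounded to one decimal place
changes = sorted(
    ((author, round(last - first, 1)) for author, (first, last) in SCORES.items()),
    key=lambda item: item[1],
)

for author, delta in changes:
    print(f"{author:20s} {delta:+.1f}")

increases = sum(1 for _, delta in changes if delta > 0)
decreases = sum(1 for _, delta in changes if delta < 0)
print(increases, "increases,", decreases, "decreases")  # 9 increases, 11 decreases
```

Sorted this way, Christie’s drop of 17.5 points sits at one end of the list and Drabble’s gain of 13.7 points at the other.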
Final thoughts
This analysis does not suggest that a decrease in lexical diversity diminishes the quality of a work; on the contrary, it may indicate a more focused and refined literary voice. Similarly, an increase does not necessarily mean improved writing, only a diversification in language use. The shifts in lexical diversity are but one aspect of an author's evolving craft, reflecting changes in creative priorities, experiences, and possibly the influence of the times in which they write.
About
Julian Yanover has been a web developer for over 20 years and is the director of MyPoeticSide.com.
MyPoeticSide.com covers all topics related to poetry and literature and also hosts a community of poets who share their work on the platform every day.
With more than 15 years of work, MyPoeticSide.com continues to expand its content to provide its users with objective and quality information.