Comparing the First and Last Books of 20 Authors

By Julian Yanover

How do the first books of acclaimed writers, published in their youth, compare to their last works released 20, 40, or even 60 years later?

We wanted to find out whether an author’s vocabulary changes over time. To do that, we needed a way to measure the quantity and variety of the vocabulary relative to the full length of each book.

Does vocabulary grow richer with experience, or does it shrink?

With these initial questions, we began our research using artificial intelligence and natural language processing software, new tools that make this previously unthinkable task possible.

This is what we found.


What We Analyzed

  • 20 authors.
  • 40 books.

We selected 20 authors, taking one of their earliest works and one of their last published works.

Thus, for instance, we examined Doris Lessing’s “The Grass Is Singing” (1950) and “Ben, in the World” (2000), Ray Bradbury’s “Fahrenheit 451” (1953) and “Farewell Summer” (2006), and Virginia Woolf’s “The Voyage Out” (1915) and “Between the Acts” (1941).

It’s important to note that we always sought the longest time gap between books that we could find.

How We Processed It

  • Database with 3,509,555 words.
  • Natural language processing (NLP) software and artificial intelligence to identify and unify similar words through lemmatization.
  • Our own scoring system to standardize criteria and evaluate lexical diversity.

The books were digitally stored, word by word, in a database.

We lemmatized each word, which means reducing all the variants a word might have to its base lemma. For instance, buy, buys and bought are grouped under the lemma buy. Lemmatization is a standard natural language processing technique, and counting lemmas rather than every variation of a word makes the final result more accurate.
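To illustrate the idea, here is a minimal sketch in Python. The lookup table `TOY_LEMMAS` and the helper names are hypothetical stand-ins for a real lemmatizer; the article does not say which NLP software was used, so this only shows how grouping variants changes the count.

```python
from collections import Counter

# Toy lemma table: a hypothetical stand-in for a real lemmatizer.
TOY_LEMMAS = {
    "buys": "buy",
    "bought": "buy",
    "buying": "buy",
    "books": "book",
}


def lemmatize(word: str) -> str:
    """Return the base lemma, or the lowercased word if it is not in the table."""
    lowered = word.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def count_lemmas(words: list[str]) -> Counter:
    """Count how often each lemma occurs in a list of words."""
    return Counter(lemmatize(w) for w in words)


words = "She buys books and he bought a book".split()
unique_words = {w.lower() for w in words}
lemma_counts = count_lemmas(words)

# Eight distinct surface forms collapse into six lemmas.
print(len(unique_words), len(lemma_counts))
```

In this toy sentence, "buys" and "bought" are counted once under buy, and "books" and "book" once under book, so the lemma count is lower than the raw unique-word count.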


A glimpse at the database with over 3 million stored and processed words

The first problem we faced was that longer books generally contain more unique words, but those words make up a smaller percentage of the total. This is natural: a 100,000-word book will contain more distinct terms than a 5,000-word book, but also far more repetition.
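The effect is easy to see with figures from the full table in this article. The sketch below divides unique lemmas by total words for four books; the function name `raw_ratio` is ours and is not part of the original script.

```python
# (title, total words, unique lemmas), copied from the full table in this article.
books = [
    ("The Pickwick Papers", 297901, 13998),
    ("The Mysterious Affair at Styles", 56415, 4605),
    ("Elephants Can Remember", 23439, 2163),
    ("Macbeth", 18120, 2790),
]


def raw_ratio(unique_lemmas: int, total_words: int) -> float:
    """Share of the text made up of distinct lemmas (no length correction)."""
    return unique_lemmas / total_words


for title, total, lemmas in books:
    print(f"{title}: {raw_ratio(lemmas, total):.1%}")
```

The longest book gets the lowest raw ratio and the shortest gets the highest. Note also that “Elephants Can Remember” has a higher raw ratio than “The Mysterious Affair at Styles”, even though its final lexical diversity score is much lower, which is why an uncorrected ratio cannot be used to compare books of different lengths.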

We found a way to solve this in collaboration with artificial intelligence.

The AI presented us with some possible methodologies, like dividing the number of unique lemmas by the square root of the total words in the book or similarly dividing it by its logarithm. However, although the calculation was less influenced by the book’s length, it was still affected by the total number of words in each publication.
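The two length corrections mentioned above can be sketched as follows. The function names are our own, and we assume the natural logarithm since the article does not specify a base. The square-root form is sometimes called root type-token ratio.

```python
import math


def root_normalized(unique_lemmas: int, total_words: int) -> float:
    """Unique lemmas divided by the square root of the total word count."""
    return unique_lemmas / math.sqrt(total_words)


def log_normalized(unique_lemmas: int, total_words: int) -> float:
    """Unique lemmas divided by the natural log of the total word count (base assumed)."""
    return unique_lemmas / math.log(total_words)


# Figures from the full table: a very long book and a very short one.
pickwick = (13998, 297901)  # unique lemmas, total words
macbeth = (2790, 18120)

print(root_normalized(*pickwick), root_normalized(*macbeth))
print(log_normalized(*pickwick), log_normalized(*macbeth))
```

With both formulas the much longer book comes out clearly ahead, although the two works have nearly identical final scores in the table (90.4 and 91). This matches the observation that these calculations still depend on the total number of words.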

We also considered comparing only samples of the same size from each text, but in some cases this left 90% of the book out of the analysis.

Finally, we took the sliding window approach. We developed a script that analyzed each book’s language diversity in segments of 1,000 words and took the average variety of lemmas as the result score for each work. This way, the entire book was included in the analysis, and the factors that could skew the final figures were minimized.
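A minimal sketch of this kind of windowed average is shown below. It is not the script used for the study: the function names, the `step` parameter, and the choice to drop a trailing partial window are our assumptions. The article also does not describe how the average was mapped onto the 1 to 100 scale, so this sketch returns the raw average ratio.

```python
def window_scores(lemmas: list[str], window: int = 1000, step: int = 1000) -> list[float]:
    """Unique-lemma ratio for each full window of `window` lemmas.

    step == window gives back-to-back segments; step < window gives
    overlapping (sliding) windows. A trailing partial window is dropped.
    """
    if not lemmas:
        return []
    if len(lemmas) < window:
        # Text shorter than one window: score the whole text once.
        return [len(set(lemmas)) / len(lemmas)]
    return [
        len(set(lemmas[start:start + window])) / window
        for start in range(0, len(lemmas) - window + 1, step)
    ]


def average_diversity(lemmas: list[str], window: int = 1000, step: int = 1000) -> float:
    """Mean of the per-window ratios, so every window weighs the same."""
    scores = window_scores(lemmas, window, step)
    return sum(scores) / len(scores) if scores else 0.0


# Tiny demonstration with a window of 3 lemmas instead of 1,000.
sample = ["a", "b", "a", "c", "d", "d"]
print(average_diversity(sample, window=3, step=3))
print(average_diversity(sample, window=3, step=1))
```

Because every window has the same size, a long book simply contributes more windows to the average rather than a lower ratio, which is what removes the length effect.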

In this manner, we obtained a lexical diversity score for each book, on a scale of 1 to 100, where a higher score indicates a broader use of language.

The Case of Agatha Christie

The case of Agatha Christie deserves special mention.

The detective novelist made headlines in 2009 when a group of researchers analyzed several of her works and found that her vocabulary narrowed as she aged. They suggested that she may have suffered from Alzheimer’s.

This news was the inspiration for this article.

It also provided an opportunity to run the code we created on two of her books and see for ourselves whether we found any noteworthy results, and if they aligned with the findings of those researchers.

We analyzed “The Mysterious Affair at Styles”, her first Hercule Poirot novel, published in 1920 when she was 30 years old, and “Elephants Can Remember”, also featuring Poirot, published in 1972 when she was 82.

While “The Mysterious Affair at Styles” achieved a lexical diversity score of 84.9, “Elephants Can Remember”, published 52 years later, only managed a score of 67.4, a very wide gap of 17.5 points.

As you will see in the results below, the next largest difference between a single writer’s two works in our study of 20 authors was 13.7 points, and it was an increase in Margaret Drabble’s language variety over the years, not a decrease as in Christie’s case.

Increased Lexical Diversity Over Time

Margaret Drabble stands out with the largest increase in lexical diversity. “The Millstone”, published when she was just 26, had a score of 78.7. Four decades later, “The Sea Lady”, published when she was 67, jumped to 92.4, suggesting a significant broadening of her linguistic canvas over time.

Beryl Bainbridge also showed an upward trajectory in lexical diversity. Her early work “The Dressmaker”, published when she was 41 years old, had a lexical diversity score of 87.3. In contrast, “The Girl in the Polka Dot Dress”, published posthumously, registered a notable increase to 96.7, demonstrating a more varied use of language in her later years.

Similarly, Mary Shelley, the famed author of “Frankenstein”, showed growth in her lexical repertoire. Her debut novel, published when she was just 21, had a lexical diversity score of 90.3. This score rose to 93.7 in her later novel “Falkner”, reflecting a more varied use of language as she matured.

Decreased Lexical Diversity Over Time

On the flip side, some authors saw a decline in lexical diversity as they aged.

Doris Lessing’s “The Grass Is Singing”, published when she was 31, had a lexical diversity score of 84.2, which fell to 73 in “Ben, in the World”, published as she was turning 81. This suggests a narrowing in the variety of her word choices in her later years.

Kazuo Ishiguro also experienced a decline in lexical diversity over his career. His early work “A Pale View of Hills” scored 76.9 at age 28, but by “Never Let Me Go,” published when he was 51, his score had dipped to 72.9, hinting at a more restrained use of language in his more mature work.

Alice Munro, a master of the short story, exhibited a decline in lexical diversity between her early and later works. Her collection “Dance of the Happy Shades”, published when she was 37, had a lexical diversity score of 83.3. However, by the time “Dear Life” was published in 2012, when Munro was 81, the score had decreased to 79.4. While this decline is relatively modest, it could suggest a honing of her narrative voice and a possible shift towards a more concise and distilled mode of storytelling in her later years.

Below is the full chart with all the authors we studied.

Comparative Chart

Lexical Diversity Score (chart axis from 50 to 100)

Agatha Christie (1890 – 1976)
  • The Mysterious Affair at Styles (1920): 84.9
  • Elephants Can Remember (1972): 67.4
Alice Munro (1931 – )
  • Dance of the Happy Shades (1968): 83.3
  • Dear Life (2012): 79.4
Beryl Bainbridge (1932 – 2010)
  • The Dressmaker (1973): 87.3
  • The Girl in the Polka Dot Dress (2011): 96.7
Charles Dickens (1812 – 1870)
  • The Pickwick Papers (1836): 90.4
  • Our Mutual Friend (1865): 81.5
Charlotte Brontë (1816 – 1855)
  • Jane Eyre (1847): 91.1
Don DeLillo (1936 – )
  • Americana (1971): 92.4
  • Falling Man (2007): 87.9
Doris Lessing (1919 – 2013)
  • The Grass Is Singing (1950): 84.2
  • Ben, in the World (2000): 73
Edith Wharton (1862 – 1937)
  • The Valley of Decision (1902): 96.6
  • The Gods Arrive (1932): 88.4
Emily Brontë (1818 – 1848)
  • Wuthering Heights (1847): 96.1
Graham Greene (1904 – 1991)
  • Stamboul Train (1932): 86.4
  • Monsignor Quixote (1982): 80.3
Harper Lee (1926 – 2016)
  • To Kill a Mockingbird (1960): 85.5
Herman Melville (1819 – 1891)
  • Moby Dick (1851): 95.5
Jane Austen (1775 – 1817)
  • Pride and Prejudice (1813): 83
Kazuo Ishiguro (1954 – )
  • A Pale View of Hills (1982): 76.9
  • Never Let Me Go (2005): 72.9
Margaret Atwood (1939 – )
  • The Edible Woman (1969): 90.9
  • The Testaments (2019): 86.5
Margaret Drabble (1939 – )
  • The Millstone (1965): 78.7
  • The Sea Lady (2006): 92.4
Mary Shelley (1797 – 1851)
  • Frankenstein (1818): 90.3
  • Falkner (1837): 93.7
Muriel Spark (1918 – 2006)
  • The Comforters (1957): 83.4
  • The Finishing School (2004): 89.6
Paul Auster (1947 – )
  • City of Glass (1985): 81.6
  • Sunset Park (2010): 89.9
Philip Roth (1933 – 2018)
  • Goodbye, Columbus (1959): 82.1
  • Nemesis (2010): 87.2
Ray Bradbury (1920 – 2012)
  • Fahrenheit 451 (1953): 82.9
  • Farewell Summer (2006): 83.9
Stephen King (1947 – )
  • Carrie (1974): 90
  • The Institute (2019): 88
Ursula K. Le Guin (1929 – 2018)
  • A Wizard of Earthsea (1968): 80.3
  • The Other Wind (2001): 79.8
V. S. Naipaul (1932 – 2018)
  • The Mystic Masseur (1957): 76
  • Magic Seeds (2004): 79.7
Virginia Woolf (1882 – 1941)
  • The Voyage Out (1915): 87.8
  • Between the Acts (1941): 94.1
William Shakespeare (1564 – 1616)
  • Macbeth (1623): 91

Full table

Author | Book | Published | Age | Total words | Unique words | Unique lemmas | Lexical Diversity
Agatha Christie | The Mysterious Affair at Styles | 1920 | 30 | 56415 | 5855 | 4605 | 84.9
Agatha Christie | Elephants Can Remember | 1972 | 82 | 23439 | 2669 | 2163 | 67.4
Alice Munro | Dance of the Happy Shades | 1968 | 37 | 77116 | 7772 | 6102 | 83.3
Alice Munro | Dear Life | 2012 | 81 | 90058 | 7677 | 5903 | 79.4
Beryl Bainbridge | The Dressmaker | 1973 | 41 | 47042 | 6175 | 4236 | 87.3
Beryl Bainbridge | The Girl in the Polka Dot Dress | 2011 | Posthumous | 47052 | 7162 | 4962 | 96.7
Charles Dickens | The Pickwick Papers | 1836 | 24 | 297901 | 19893 | 13998 | 90.4
Charles Dickens | Our Mutual Friend | 1865 | 53 | 324706 | 17474 | 12734 | 81.5
Charlotte Brontë | Jane Eyre | 1847 | 31 | 186223 | 13337 | 9593 | 91.1
Don DeLillo | Americana | 1971 | 35 | 125134 | 13289 | 10072 | 92.4
Don DeLillo | Falling Man | 2007 | 71 | 65940 | 7388 | 5629 | 87.9
Doris Lessing | The Grass Is Singing | 1950 | 31 | 29974 | 4411 | 3493 | 84.2
Doris Lessing | Ben, in the World | 2000 | 81 | 56700 | 4778 | 3497 | 73
Edith Wharton | The Valley of Decision | 1902 | 40 | 153357 | 14219 | 10431 | 96.6
Edith Wharton | The Gods Arrive | 1932 | 70 | 126124 | 11542 | 8841 | 88.4
Emily Brontë | Wuthering Heights | 1847 | 29 | 116505 | 9623 | 6832 | 96.1
Graham Greene | Stamboul Train | 1932 | 28 | 73180 | 7079 | 5300 | 86.4
Graham Greene | Monsignor Quixote | 1982 | 78 | 58270 | 5931 | 4604 | 80.3
Harper Lee | To Kill a Mockingbird | 1960 | 34 | 99261 | 9065 | 6972 | 85.5
Herman Melville | Moby Dick | 1851 | 32 | 208414 | 19698 | 14746 | 95.5
Jane Austen | Pride and Prejudice | 1813 | 38 | 122325 | 6750 | 4912 | 83
Kazuo Ishiguro | A Pale View of Hills | 1982 | 28 | 52289 | 4512 | 3225 | 76.9
Kazuo Ishiguro | Never Let Me Go | 2005 | 51 | 96375 | 5983 | 4397 | 72.9
Margaret Atwood | The Edible Woman | 1969 | 30 | 102417 | 9866 | 7386 | 90.9
Margaret Atwood | The Testaments | 2019 | 80 | 106975 | 10141 | 7696 | 86.5
Margaret Drabble | The Millstone | 1965 | 26 | 66526 | 6303 | 4889 | 78.7
Margaret Drabble | The Sea Lady | 2006 | 67 | 108003 | 13196 | 9622 | 92.4
Mary Shelley | Frankenstein | 1818 | 21 | 74936 | 7182 | 5271 | 90.3
Mary Shelley | Falkner | 1837 | 40 | 150084 | 12347 | 9456 | 93.7
Muriel Spark | The Comforters | 1957 | 39 | 60545 | 7580 | 4983 | 83.4
Muriel Spark | The Finishing School | 2004 | 86 | 30247 | 4478 | 3606 | 89.6
Paul Auster | City of Glass | 1985 | 38 | 45768 | 5334 | 4200 | 81.6
Paul Auster | Sunset Park | 2010 | 63 | 81034 | 8989 | 6932 | 89.9
Philip Roth | Goodbye, Columbus | 1959 | 26 | 78964 | 7964 | 2696 | 82.1
Philip Roth | Nemesis | 2010 | 77 | 57591 | 7221 | 5634 | 87.2
Ray Bradbury | Fahrenheit 451 | 1953 | 33 | 45760 | 5225 | 3924 | 82.9
Ray Bradbury | Farewell Summer | 2006 | 86 | 26730 | 3897 | 2964 | 83.9
Stephen King | Carrie | 1974 | 27 | 60459 | 7814 | 6022 | 90
Stephen King | The Institute | 2019 | 72 | 175852 | 12758 | 9771 | 88
Ursula K. Le Guin | A Wizard of Earthsea | 1968 | 39 | 61409 | 5429 | 4007 | 80.3
Ursula K. Le Guin | The Other Wind | 2001 | 72 | 69283 | 6086 | 4618 | 79.8
V. S. Naipaul | The Mystic Masseur | 1957 | 25 | 61522 | 5348 | 4124 | 76
V. S. Naipaul | Magic Seeds | 2004 | 72 | 90197 | 7547 | 5728 | 79.7
Virginia Woolf | The Voyage Out | 1915 | 33 | 135994 | 10491 | 7760 | 87.8
Virginia Woolf | Between the Acts | 1941 | 59 | 44552 | 7212 | 5597 | 94.1
William Shakespeare | Macbeth | 1623 | Posthumous | 18120 | 3366 | 2790 | 91
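For readers who want to check the differences quoted earlier, the sketch below recomputes the change in score (last book minus first book) for the 20 authors with two books in the table. The scores are copied from the table; the variable names are ours.

```python
# (first book score, last book score) for each of the 20 two-book authors.
scores = {
    "Agatha Christie": (84.9, 67.4),
    "Alice Munro": (83.3, 79.4),
    "Beryl Bainbridge": (87.3, 96.7),
    "Charles Dickens": (90.4, 81.5),
    "Don DeLillo": (92.4, 87.9),
    "Doris Lessing": (84.2, 73.0),
    "Edith Wharton": (96.6, 88.4),
    "Graham Greene": (86.4, 80.3),
    "Kazuo Ishiguro": (76.9, 72.9),
    "Margaret Atwood": (90.9, 86.5),
    "Margaret Drabble": (78.7, 92.4),
    "Mary Shelley": (90.3, 93.7),
    "Muriel Spark": (83.4, 89.6),
    "Paul Auster": (81.6, 89.9),
    "Philip Roth": (82.1, 87.2),
    "Ray Bradbury": (82.9, 83.9),
    "Stephen King": (90.0, 88.0),
    "Ursula K. Le Guin": (80.3, 79.8),
    "V. S. Naipaul": (76.0, 79.7),
    "Virginia Woolf": (87.8, 94.1),
}

# Change in score, rounded to one decimal to avoid floating-point noise.
changes = {
    author: round(last - first, 1) for author, (first, last) in scores.items()
}

# Rank authors by the size of the change, largest first.
ranked = sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)
increases = sum(1 for change in changes.values() if change > 0)
decreases = sum(1 for change in changes.values() if change < 0)

for author, change in ranked:
    print(f"{author}: {change:+.1f}")
print(f"{increases} increases, {decreases} decreases")
```

The ranking starts with Agatha Christie (a drop of 17.5 points), followed by Margaret Drabble (a rise of 13.7) and Doris Lessing (a drop of 11.2). Overall, 9 authors show an increase and 11 show a decrease.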

Final thoughts

This analysis does not suggest that a decrease in lexical diversity diminishes the quality of a work. On the contrary, it may indicate a more focused and refined literary voice. Similarly, an increase does not necessarily equate to improved writing, only to a diversification in language use. The shifts in lexical diversity are but one aspect of an author’s evolving craft, reflecting changes in creative priorities, experiences, and possibly the influence of the times in which they write.

About

Julian Yanover has been a web developer for over 20 years and is the director of MyPoeticSide.com.

MyPoeticSide.com covers all topics related to poetry and literature and also hosts a community of poets who share their work on the platform every day.

With more than 15 years of work, MyPoeticSide.com continues to expand its content to provide its users with objective and quality information.