We compared 25 works on World Book Day and English Language Day

By Julian Yanover

On April 23, the anniversary of William Shakespeare's death, both World Book Day and English Language Day are celebrated. We combined the two celebrations in this analysis, which reviews the vocabulary of 25 books to identify the most complex and the most accessible texts for readers.

With this goal in mind, we used artificial intelligence and natural language processing software to dissect each work and obtain answers.

This is what we found.

What We Analyzed

We’ve curated a list of 25 of the most acclaimed English-language books, striving for a mix that showcases diversity in terms of time periods and stylistic approaches.

How We Processed It

  • Database with 4,493,150 words.
  • Natural language processing (NLP) software and artificial intelligence to identify and unify similar words through lemmatization.
  • Our own scoring system to standardize criteria and evaluate lexical diversity.

The books were digitally stored, word by word, in a database.

We lemmatized each word, which means reducing all of a word's variants to its base form, or lemma. For instance, buy, buys and bought are grouped under the lemma buy. This natural language processing technique makes the final result more accurate, because we count lemmas rather than every variation of a word.
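As a rough illustration of the idea (not the software used for this analysis), the sketch below groups word variants under a lemma with a tiny hand-written lookup table. The table `LEMMA_LOOKUP` and the helper `lemmatize` are hypothetical names made up for this example; a real pipeline would rely on an NLP library's lemmatizer.

```python
from collections import Counter

# Hypothetical, hand-written lookup table for illustration only.
# A real pipeline would use an NLP library's lemmatizer instead.
LEMMA_LOOKUP = {
    "buys": "buy",
    "bought": "buy",
    "buying": "buy",
    "books": "book",
}


def lemmatize(word: str) -> str:
    """Return the lemma for a word, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMA_LOOKUP.get(lowered, lowered)


words = ["Buy", "buys", "bought", "book", "books"]
lemmas = [lemmatize(w) for w in words]

unique_words = {w.lower() for w in words}
unique_lemmas = set(lemmas)
lemma_counts = Counter(lemmas)

print(len(unique_words), "unique words ->", len(unique_lemmas), "unique lemmas")
print(dict(lemma_counts))
```

Counting lemmas instead of surface forms keeps a text from looking richer just because it conjugates the same verb in several ways.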

[Image: a glimpse at the database with over 4 million stored and processed words]

The first problem we faced was that longer books generally contained more unique words, but those unique words made up a smaller share of the total. This is natural: in a book of 100,000 words we will encounter more distinct terms, but there will also be more repetition than in a 5,000-word book.
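The effect can be seen in the figures from our own ranking. The short sketch below takes the total words and unique lemmas listed for the shortest and the longest work and computes the plain ratio between them. The function name `type_token_ratio` is our label for this illustration, not part of the original script.

```python
def type_token_ratio(unique_lemmas: int, total_words: int) -> float:
    """Share of the text made up of distinct lemmas."""
    return unique_lemmas / total_words


# (total words, unique lemmas) as listed in the ranking
books = {
    "Macbeth": (18120, 2790),
    "The Pickwick Papers": (297901, 13998),
}

ratios = {
    title: type_token_ratio(lemmas, total)
    for title, (total, lemmas) in books.items()
}

for title, ratio in ratios.items():
    print(f"{title}: {ratio:.1%} of the words are distinct lemmas")
```

The longer book has about five times as many unique lemmas, yet its ratio is roughly a third of the shorter one's, which is why a plain ratio cannot be used to compare books of different lengths.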

We found a way to solve this in collaboration with artificial intelligence.

The AI suggested some possible methodologies, such as dividing the number of unique lemmas by the square root of the total words in the book, or by the logarithm of that total. However, although these calculations were less influenced by the book’s length, they were still affected by the total number of words in each publication.
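A minimal sketch of those two normalizations, applied to two works that ended up with the same final score (82) but have very different lengths. The function names are ours, and the use of the natural logarithm is an assumption, since the base is not specified.

```python
import math


def root_normalized(unique_lemmas: int, total_words: int) -> float:
    """Unique lemmas divided by the square root of the total word count."""
    return unique_lemmas / math.sqrt(total_words)


def log_normalized(unique_lemmas: int, total_words: int) -> float:
    """Unique lemmas divided by the natural log of the total word count.

    The choice of natural log is an assumption made for this sketch.
    """
    return unique_lemmas / math.log(total_words)


# (total words, unique lemmas) as listed in the ranking
macbeth = (18120, 2790)
pickwick = (297901, 13998)

for name, (total, lemmas) in [("Macbeth", macbeth), ("The Pickwick Papers", pickwick)]:
    print(
        f"{name}: root={root_normalized(lemmas, total):.1f}, "
        f"log={log_normalized(lemmas, total):.1f}"
    )
```

Under both formulas the longer book comes out ahead, so length still leaks into the result, only in the opposite direction from the plain ratio.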

We also considered comparing only same-size samples from each text, but in some cases this left 90% of the book out of the analysis.

Finally, we took the sliding window approach. We developed a script that measured each book’s lemma diversity in segments of 1,000 words and took the average across segments as the score for each work. This way, the entire book was included in the analysis, and the factors that could skew the final figures were minimized.
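The following is a sketch of how such a windowed measure could look, not the script itself. It assumes back-to-back segments by default (the `step` parameter allows overlapping windows), drops a trailing partial segment, and returns the average fraction of distinct lemmas per segment. How that average is mapped onto the final scale is not covered here, and the function name `windowed_lexical_diversity` is ours.

```python
from statistics import mean


def windowed_lexical_diversity(lemmas, window=1000, step=None):
    """Average fraction of distinct lemmas across fixed-size windows.

    Assumptions of this sketch: `step` defaults to `window` (back-to-back
    segments), and a trailing partial window is dropped.
    """
    if step is None:
        step = window
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not lemmas:
        return 0.0
    if len(lemmas) < window:
        # Text shorter than one window: fall back to the whole text.
        return len(set(lemmas)) / len(lemmas)
    ratios = [
        len(set(lemmas[start:start + window])) / window
        for start in range(0, len(lemmas) - window + 1, step)
    ]
    return mean(ratios)


# Two synthetic "books" with the same vocabulary habits but different lengths.
pattern = ["sea", "whale", "ship", "captain"]
short_text = pattern * 5    # 20 lemmas
long_text = pattern * 50    # 200 lemmas

print("plain ratio:", len(set(short_text)) / len(short_text),
      len(set(long_text)) / len(long_text))
print("windowed:   ", windowed_lexical_diversity(short_text, window=10),
      windowed_lexical_diversity(long_text, window=10))
```

In this toy case the plain ratio drops tenfold for the longer text, while the windowed average is the same for both, because every window is measured at the same size.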

In this manner, we obtained a lexical diversity score for each book, on a scale of 1 to 100, where a higher score indicates a broader use of language.

Up next, we display the full results.

Ranking

  • The Girl in the Polka Dot Dress - Beryl Bainbridge. Score 87. 47,052 words, 7,162 unique words, 4,962 unique lemmas.
  • The Valley of Decision - Edith Wharton. Score 87. 153,357 words, 14,219 unique words, 10,431 unique lemmas.
  • Wuthering Heights - Emily Brontë. Score 87. 116,505 words, 9,623 unique words, 6,832 unique lemmas.
  • Moby Dick - Herman Melville. Score 86. 208,414 words, 19,698 unique words, 14,746 unique lemmas.
  • Americana - Don DeLillo. Score 84. 125,134 words, 13,289 unique words, 10,072 unique lemmas.
  • Jane Eyre - Charlotte Brontë. Score 82. 186,223 words, 13,337 unique words, 9,593 unique lemmas.
  • Macbeth - William Shakespeare. Score 82. 18,120 words, 3,366 unique words, 2,790 unique lemmas.
  • The Edible Woman - Margaret Atwood. Score 82. 102,417 words, 9,866 unique words, 7,386 unique lemmas.
  • The Pickwick Papers - Charles Dickens. Score 82. 297,901 words, 19,893 unique words, 13,998 unique lemmas.
  • Frankenstein - Mary Shelley. Score 82. 74,936 words, 7,182 unique words, 5,271 unique lemmas.
  • Carrie - Stephen King. Score 81. 60,459 words, 7,814 unique words, 6,022 unique lemmas.
  • The Voyage Out - Virginia Woolf. Score 79. 135,994 words, 10,491 unique words, 7,760 unique lemmas.
  • Stamboul Train - Graham Greene. Score 78. 73,180 words, 7,079 unique words, 5,300 unique lemmas.
  • To Kill a Mockingbird - Harper Lee. Score 77. 99,261 words, 9,065 unique words, 6,972 unique lemmas.
  • The Mysterious Affair at Styles - Agatha Christie. Score 77. 56,415 words, 5,855 unique words, 4,605 unique lemmas.
  • The Grass Is Singing - Doris Lessing. Score 76. 29,974 words, 4,411 unique words, 3,493 unique lemmas.
  • The Comforters - Muriel Spark. Score 75. 60,545 words, 7,580 unique words, 4,983 unique lemmas.
  • Dance of the Happy Shades - Alice Munro. Score 75. 77,116 words, 7,772 unique words, 6,102 unique lemmas.
  • Pride and Prejudice - Jane Austen. Score 75. 122,325 words, 6,750 unique words, 4,912 unique lemmas.
  • Fahrenheit 451 - Ray Bradbury. Score 75. 45,760 words, 5,225 unique words, 3,924 unique lemmas.
  • Goodbye, Columbus - Philip Roth. Score 74. 78,964 words, 7,964 unique words, 2,696 unique lemmas.
  • City of Glass - Paul Auster. Score 74. 45,768 words, 5,334 unique words, 4,200 unique lemmas.
  • A Wizard of Earthsea - Ursula K. Le Guin. Score 73. 61,409 words, 5,429 unique words, 4,007 unique lemmas.
  • The Millstone - Margaret Drabble. Score 71. 66,526 words, 6,303 unique words, 4,889 unique lemmas.
  • A Pale View of Hills - Kazuo Ishiguro. Score 69. 52,289 words, 4,512 unique words, 3,225 unique lemmas.

Key takeaways

Top 3 books

The top three scorers in lexical diversity are all works by female authors: Beryl Bainbridge, Edith Wharton, and Emily Brontë. This might suggest a particularly rich use of language among female authors in different periods, highlighting their contribution to the literary landscape with a dense and varied vocabulary.

Also, despite the varying lengths of their narratives, from Bainbridge's relatively short "The Girl in the Polka Dot Dress" to Wharton's lengthier "The Valley of Decision," these authors manage to maintain a high level of lexical diversity. This showcases their skill in crafting detailed and complex narratives without diluting the richness of their language.

Moreover, the diversity in publication dates among the top-scoring novels suggests that lexical richness is not confined to a specific era. For instance, "Wuthering Heights" was published in the 19th century, while "The Girl in the Polka Dot Dress" is a more contemporary work. This cross-era representation underscores the timeless value of a rich vocabulary in storytelling.

Beginner readers might start with the books at the bottom of the ranking and gradually move up, so that the reading slowly becomes more complex.

Full data table

Book | Author | Published | Author's age at publication | Total words | Unique words | Unique lemmas | Vocabulary score
A Pale View of Hills | Kazuo Ishiguro | 1982 | 28 | 52,289 | 4,512 | 3,225 | 69
A Wizard of Earthsea | Ursula K. Le Guin | 1968 | 39 | 61,409 | 5,429 | 4,007 | 73
Americana | Don DeLillo | 1971 | 35 | 125,134 | 13,289 | 10,072 | 84
Carrie | Stephen King | 1974 | 27 | 60,459 | 7,814 | 6,022 | 81
City of Glass | Paul Auster | 1985 | 38 | 45,768 | 5,334 | 4,200 | 74
Dance of the Happy Shades | Alice Munro | 1968 | 37 | 77,116 | 7,772 | 6,102 | 75
Fahrenheit 451 | Ray Bradbury | 1953 | 33 | 45,760 | 5,225 | 3,924 | 75
Frankenstein | Mary Shelley | 1818 | 21 | 74,936 | 7,182 | 5,271 | 82
Goodbye, Columbus | Philip Roth | 1959 | 26 | 78,964 | 7,964 | 2,696 | 74
Jane Eyre | Charlotte Brontë | 1847 | 31 | 186,223 | 13,337 | 9,593 | 82
Macbeth | William Shakespeare | 1623 | Posthumous | 18,120 | 3,366 | 2,790 | 82
Moby Dick | Herman Melville | 1851 | 32 | 208,414 | 19,698 | 14,746 | 86
Pride and Prejudice | Jane Austen | 1813 | 38 | 122,325 | 6,750 | 4,912 | 75
Stamboul Train | Graham Greene | 1932 | 28 | 73,180 | 7,079 | 5,300 | 78
The Comforters | Muriel Spark | 1957 | 39 | 60,545 | 7,580 | 4,983 | 75
The Edible Woman | Margaret Atwood | 1969 | 30 | 102,417 | 9,866 | 7,386 | 82
The Girl in the Polka Dot Dress | Beryl Bainbridge | 2011 | Posthumous | 47,052 | 7,162 | 4,962 | 87
The Grass Is Singing | Doris Lessing | 1950 | 31 | 29,974 | 4,411 | 3,493 | 76
The Millstone | Margaret Drabble | 1965 | 26 | 66,526 | 6,303 | 4,889 | 71
The Mysterious Affair at Styles | Agatha Christie | 1920 | 30 | 56,415 | 5,855 | 4,605 | 77
The Pickwick Papers | Charles Dickens | 1836 | 24 | 297,901 | 19,893 | 13,998 | 82
The Valley of Decision | Edith Wharton | 1902 | 40 | 153,357 | 14,219 | 10,431 | 87
The Voyage Out | Virginia Woolf | 1915 | 33 | 135,994 | 10,491 | 7,760 | 79
To Kill a Mockingbird | Harper Lee | 1960 | 34 | 99,261 | 9,065 | 6,972 | 77
Wuthering Heights | Emily Brontë | 1847 | 29 | 116,505 | 9,623 | 6,832 | 87

About

Julian Yanover has been a web developer for over 20 years and is the CEO of MyPoeticSide.com.

MyPoeticSide.com covers all topics related to poetry and literature and also hosts a community of poets who share their work on the platform every day.

With more than 15 years of work, MyPoeticSide.com continues to expand its content to provide its users with objective and quality information.