We compared 25 works on World Book Day and English Language Day
By Julian Yanover
On April 23, commemorating the death of William Shakespeare, both World Book Day and English Language Day are celebrated. We combined these two celebrations in this analysis, which reviews the vocabulary of 25 books to identify the most complex and accessible texts for readers.
With this goal in mind, we used artificial intelligence and natural language processing software to dissect each work and obtain answers.
This is what we found.
What We Analyzed
We’ve curated a list of 25 of the most acclaimed English-language books, striving for a mix that showcases diversity in terms of time periods and stylistic approaches.
How We Processed It
- Database with 4,493,150 words.
- Natural language processing (NLP) software and artificial intelligence to identify and unify similar words through lemmatization.
- Our own scoring system to standardize criteria and evaluate lexical diversity.
The books were digitally stored, word by word, in a database.
We lemmatized each word, that is, we reduced all of its inflected variants to a single base form, the lemma. For instance, buy, buys and bought are grouped under the lemma buy. This natural language processing technique makes the final result more accurate, because it counts lemmas rather than every variation of a word.
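The grouping described above can be illustrated with a minimal sketch. It assumes a tiny hand-written lookup table (`TOY_LEMMAS`, hypothetical) in place of the NLP software we actually used, which is not reproduced here; the point is only to show how counting lemmas differs from counting word forms.

```python
import re

# Hypothetical lookup table standing in for a real lemmatizer's output.
TOY_LEMMAS = {"buys": "buy", "bought": "buy", "buying": "buy"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lemmatize(tokens, lemma_map=TOY_LEMMAS):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [lemma_map.get(token, token) for token in tokens]


tokens = tokenize("She buys bread, he bought bread, and they buy bread.")
lemmas = lemmatize(tokens)

unique_words = len(set(tokens))    # buys, bought and buy count separately
unique_lemmas = len(set(lemmas))   # all three collapse into "buy"
print(unique_words, unique_lemmas)
```

This is why the table reports fewer unique lemmas than unique words for every book.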
The first problem we faced was that longer books generally contain more unique words, but those words make up a smaller share of the total. This is natural: in a book of 100,000 words we’ll encounter more distinct terms, but there will also be more repetition than in a 5,000-word book.
We worked out a solution in collaboration with artificial intelligence.
The AI presented us with some possible methodologies, such as dividing the number of unique lemmas by the square root of the total number of words in the book, or by the logarithm of that total. However, although these calculations were less influenced by the book’s length than a plain ratio, they were still affected by the total number of words in each publication.
We also considered comparing only samples of the same size from each text, but in some cases this left 90% of the book out of the analysis.
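As a rough illustration of these length-normalized measures, the sketch below computes them for two books taken from the results table, Macbeth and The Pickwick Papers, which end up with the same final score of 82. The function names are our own labels for this example, and the use of the natural logarithm is an assumption.

```python
import math


def plain_ratio(unique_lemmas, total_words):
    """Unique lemmas as a share of all words (falls quickly as books get longer)."""
    return unique_lemmas / total_words


def root_ratio(unique_lemmas, total_words):
    """Unique lemmas divided by the square root of the total word count."""
    return unique_lemmas / math.sqrt(total_words)


def log_ratio(unique_lemmas, total_words):
    """Unique lemmas divided by the natural log of the total word count (assumed base)."""
    return unique_lemmas / math.log(total_words)


# (unique lemmas, total words) as listed in the results table.
books = {
    "Macbeth": (2790, 18120),
    "The Pickwick Papers": (13998, 297901),
}

for title, (lemma_count, word_count) in books.items():
    print(
        title,
        round(plain_ratio(lemma_count, word_count), 3),
        round(root_ratio(lemma_count, word_count), 1),
        round(log_ratio(lemma_count, word_count), 1),
    )
```

The plain ratio favors the much shorter Macbeth, while the square-root and logarithm versions favor the much longer Pickwick Papers, so none of the three gives the two books comparable values.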
Finally, we settled on a sliding window approach. We developed a script that measured each book’s lexical diversity in segments of 1,000 words and took the average variety of lemmas across those segments as the score for the work. This way, the entire book was included in the analysis, and the effect of length on the final figures was minimized.
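The windowed calculation can be sketched as follows. This is a simplified illustration rather than our production script: the `step` parameter, the choice to drop a trailing partial segment, and the fallback for very short texts are all assumptions made for the example.

```python
def windowed_diversity(lemmas, window=1000, step=None):
    """Average share of distinct lemmas across fixed-size windows.

    `step` defaults to `window` (back-to-back segments); a smaller step
    makes the windows overlap. A trailing partial segment is ignored.
    """
    if not lemmas:
        return 0.0
    step = step or window
    if len(lemmas) < window:
        # Assumption: texts shorter than one window are scored as a whole.
        return len(set(lemmas)) / len(lemmas)
    ratios = []
    for start in range(0, len(lemmas) - window + 1, step):
        segment = lemmas[start:start + window]
        ratios.append(len(set(segment)) / window)
    return sum(ratios) / len(ratios)


# Toy data: the same four lemmas repeated, in a short and a long "book".
cycle = ["sea", "ship", "whale", "storm"]
short_book = cycle * 3
long_book = cycle * 300

# The whole-text ratio drops with length; the windowed score does not.
print(len(set(short_book)) / len(short_book), len(set(long_book)) / len(long_book))
print(windowed_diversity(short_book, window=4), windowed_diversity(long_book, window=4))
```

The function returns a raw ratio between 0 and 1; how that ratio is converted to the final scale is not covered by this sketch.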
In this manner, we obtained a lexical diversity score for each book, on a scale of 1 to 100, where a higher score indicates a broader use of language.
Up next, we display the full results.
Ranking
Key takeaways
The top three scorers in lexical diversity are all works by female authors: Beryl Bainbridge, Edith Wharton, and Emily Brontë. This might suggest a particularly rich use of language among female authors in different periods, highlighting their contribution to the literary landscape with a dense and varied vocabulary.
Also, despite the varying lengths of their narratives, from Bainbridge's relatively short "The Girl in the Polka Dot Dress" to Wharton's lengthier "The Valley of Decision," these authors manage to maintain a high level of lexical diversity. This showcases their skill in crafting detailed and complex narratives without diluting the richness of their language.
Moreover, the diversity in publication dates among the top-scoring novels suggests that lexical richness is not confined to a specific era. For instance, "Wuthering Heights" was published in the 19th century, while "The Girl in the Polka Dot Dress" is a more contemporary work. This cross-era representation underscores the timeless value of a rich vocabulary in storytelling.
A beginner reader might start with the books at the bottom of the ranking table and gradually move up, making the reading progressively more complex.
Interactive table
Book | Author | Published | Age | Total words | Unique words | Unique lemmas | Vocabulary Score |
---|---|---|---|---|---|---|---|
A Pale View of Hills | Kazuo Ishiguro | 1982 | 28 | 52289 | 4512 | 3225 | 69 |
A Wizard of Earthsea | Ursula K. Le Guin | 1968 | 39 | 61409 | 5429 | 4007 | 73 |
Americana | Don DeLillo | 1971 | 35 | 125134 | 13289 | 10072 | 84 |
Carrie | Stephen King | 1974 | 27 | 60459 | 7814 | 6022 | 81 |
City of Glass | Paul Auster | 1985 | 38 | 45768 | 5334 | 4200 | 74 |
Dance of the Happy Shades | Alice Munro | 1968 | 37 | 77116 | 7772 | 6102 | 75 |
Fahrenheit 451 | Ray Bradbury | 1953 | 33 | 45760 | 5225 | 3924 | 75 |
Frankenstein | Mary Shelley | 1818 | 21 | 74936 | 7182 | 5271 | 82 |
Goodbye, Columbus | Philip Roth | 1959 | 26 | 78964 | 7964 | 2696 | 74 |
Jane Eyre | Charlotte Brontë | 1847 | 31 | 186223 | 13337 | 9593 | 82 |
Macbeth | William Shakespeare | 1623 | Posthumous | 18120 | 3366 | 2790 | 82 |
Moby Dick | Herman Melville | 1851 | 32 | 208414 | 19698 | 14746 | 86 |
Pride and Prejudice | Jane Austen | 1813 | 38 | 122325 | 6750 | 4912 | 75 |
Stamboul Train | Graham Greene | 1932 | 28 | 73180 | 7079 | 5300 | 78 |
The Comforters | Muriel Spark | 1957 | 39 | 60545 | 7580 | 4983 | 75 |
The Edible Woman | Margaret Atwood | 1969 | 30 | 102417 | 9866 | 7386 | 82 |
The Girl in the Polka Dot Dress | Beryl Bainbridge | 2011 | Posthumous | 47052 | 7162 | 4962 | 87 |
The Grass Is Singing | Doris Lessing | 1950 | 31 | 29974 | 4411 | 3493 | 76 |
The Millstone | Margaret Drabble | 1965 | 26 | 66526 | 6303 | 4889 | 71 |
The Mysterious Affair at Styles | Agatha Christie | 1920 | 30 | 56415 | 5855 | 4605 | 77 |
The Pickwick Papers | Charles Dickens | 1836 | 24 | 297901 | 19893 | 13998 | 82 |
The Valley of Decision | Edith Wharton | 1902 | 40 | 153357 | 14219 | 10431 | 87 |
The Voyage Out | Virginia Woolf | 1915 | 33 | 135994 | 10491 | 7760 | 79 |
To Kill a Mockingbird | Harper Lee | 1960 | 34 | 99261 | 9065 | 6972 | 77 |
Wuthering Heights | Emily Brontë | 1847 | 29 | 116505 | 9623 | 6832 | 87 |
About
Julian Yanover has been a web developer for over 20 years and is the CEO of MyPoeticSide.com.
MyPoeticSide.com covers all topics related to poetry and literature and also hosts a community of poets who share their work on the platform every day.
With more than 15 years of work, MyPoeticSide.com continues to expand its content to provide its users with objective and quality information.