This past fall, Datamuse hired three talented computer science students from the University of Waterloo in Ontario, Canada. Over the course of their three-month co-op, they improved our word-finding apps and built a new game. These projects required a diverse set of skills, spanning topics in natural language processing (NLP), data visualization, distributed data processing, design, and web application development. In this post, we will showcase eight of the projects the students completed during the fall term. We're grateful to Hannah, Max, and Nicole for all their hard work over the past few months!
1. Concept clusters
Peter Mark Roget, an English physician in the 19th century, is the mastermind behind the thesaurus that we all know and love. Outside of his medical practice, he dedicated his life to curating and categorizing words, eventually publishing the "Thesaurus of English Words and Phrases," which grouped 15,000 words into more than 1,000 categories. This work laid the foundation for the synonym dictionaries that writers use today to find alternative words. While the internet now provides many tools (including our own OneLook Thesaurus) that can find synonyms and related words, it's hard to find the kind of organized taxonomy that Roget created. That's why we added "concept clusters" to OneLook – groups of related words and phrases that are automatically derived from data. If you've used OneLook, you may have noticed these clusters at the bottom of every results page and in every word card that appears when you click on a result. This past fall, our co-op students Max and Nicole used modern AI tools (specifically the GPT-3 large language model from OpenAI) to come up with new titles for our 8,000+ clusters, which cover nearly a million words and phrases. They evaluated the titles for accuracy, specificity, and conciseness, and the resulting set of titles had 70% fewer duplicates than before. They also used hierarchical clustering techniques to organize the concept clusters into 30 broad top-level subjects, like biology and art, which are now displayed on the OneLook homepage in our new "subject index." It's now easier to explore words in highly specialized topics such as "turning point in life" (with 54 words), "delivering a sermon" (50 words), or "manipulating audio signals" (92 words).
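For readers curious about what hierarchical clustering looks like in practice, here is a minimal sketch. It is not the code used on OneLook: the function name `agglomerate`, the toy two-dimensional vectors, and the choice of average linkage are all assumptions made for illustration. The idea is to start with every cluster title in its own group and repeatedly merge the two closest groups until only the desired number of top-level subjects remains.

```python
from itertools import combinations
from math import dist


def agglomerate(points, target_groups):
    """Merge the two closest groups repeatedly until target_groups remain.

    points: dict mapping a cluster title to a (hypothetical) embedding vector.
    Uses average linkage; returns a sorted list of sorted title lists.
    """
    groups = [[title] for title in points]

    def linkage(a, b):
        # Average pairwise distance between the members of two groups.
        total = sum(dist(points[x], points[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(groups) > target_groups:
        # Pick the pair of groups (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return sorted(sorted(g) for g in groups)


# Toy 2-D vectors, invented for illustration only.
toy = {
    "cell biology": (0.9, 0.1),
    "plant anatomy": (0.8, 0.2),
    "oil painting": (0.1, 0.9),
    "sculpture": (0.2, 0.8),
}
subjects = agglomerate(toy, 2)
print(subjects)
```

With these toy vectors, the two biology-flavored titles end up in one group and the two art-flavored titles in the other. A real run would use vectors derived from the words in each cluster and would stop at a much larger number of groups.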
2. Word usage timelines
When learning a new word or phrase, it's helpful to understand not only its meaning, but also how, when, where, and why it's used. The "when" aspect is especially important for fiction writers who want to avoid using words that are out of place for the time period they are writing about. To help with this, Nicole and Max used data from Google Books NGrams to build a lightweight interactive timeline that shows how frequently a word has been used over time. These timelines can be found in the "word cards" that appear when you click on a word on OneLook Thesaurus or in certain parts of RhymeZone. Selecting one of the bars on the timeline shows a word that is more appropriate for that time range, if one is available.
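To make the timeline idea concrete, here is a small sketch of how per-year counts could be turned into the bars of a timeline. It is an illustration under assumptions, not the code behind the feature: the function name `decade_timeline`, the input dictionaries, and the decade-sized buckets are all hypothetical. Dividing by the total number of words printed in each period matters because far more text survives from recent decades than from earlier ones.

```python
from collections import defaultdict


def decade_timeline(yearly_counts, yearly_totals):
    """Collapse per-year counts into per-decade relative frequencies.

    yearly_counts: {year: occurrences of the word in that year}
    yearly_totals: {year: total number of words printed in that year}
    Returns {decade_start: occurrences per million words}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, total in yearly_totals.items():
        decade = year - year % 10
        totals[decade] += total
        hits[decade] += yearly_counts.get(year, 0)
    # Skip decades with no text at all to avoid dividing by zero.
    return {d: 1_000_000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}


# Invented numbers, for illustration only.
counts = {1901: 2, 1905: 3, 1912: 10}
totals = {1901: 1_000_000, 1905: 1_000_000, 1912: 2_000_000}
bars = decade_timeline(counts, totals)
print(bars)
```

Each entry in `bars` corresponds to one bar on the timeline, with its height given in occurrences per million words.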
3. Verbloom
The success of Wordle has shown that there’s a tremendous appetite for fun, quick word games that are updated daily. We wanted to create such a game that also teaches vocabulary in some way: most word games, after all, are more focused on letters than on meanings. Verbloom is a game that asks you to find the three words that are most related to each other within a crowd of unrelated words. Finding the answer will often lead you to discover surprising meanings for words you thought you knew well. The entire co-op team – Hannah, Max, and Nicole – collaborated to create this game, from its beautiful visual design to its front end, back end, and daily game generation algorithm. We hope you’ll enjoy a game or two. We expect to offer subject-themed versions of this game for educators in 2023.
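The puzzle at the heart of Verbloom can be stated in a few lines of code. The sketch below is not the game's generation algorithm; it only shows one plausible way to check which three words in a crowd are most related, assuming a table of pairwise relatedness scores is available. The function name `tightest_triple`, the scores, and the word list are all made up for illustration.

```python
from itertools import combinations


def tightest_triple(words, relatedness):
    """Return the three words whose pairwise relatedness sums highest.

    relatedness: {frozenset({word_a, word_b}): score}; missing pairs count as 0.
    """
    def score(triple):
        return sum(
            relatedness.get(frozenset(pair), 0.0)
            for pair in combinations(triple, 2)
        )

    return max(combinations(words, 3), key=score)


# Hypothetical scores, for illustration only.
scores = {
    frozenset({"bank", "shore"}): 0.8,
    frozenset({"shore", "coast"}): 0.9,
    frozenset({"bank", "coast"}): 0.7,
    frozenset({"pencil", "tuba"}): 0.1,
}
crowd = ["bank", "pencil", "shore", "tuba", "coast"]
answer = tightest_triple(crowd, scores)
print(answer)
```

Here the answer hinges on the less obvious sense of "bank" as the edge of a river, which is the kind of surprise the game is built around. Checking every triple is fine for a small crowd of words, since the number of triples grows only cubically.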
4. Idiom usage examples
RhymeZone and OneLook, like many dictionaries, provide usage examples that show how a word is used in context. These examples come from real sources such as famous quotes, books, and Wikipedia articles, and are chosen based on research we conducted in 2020 to best illustrate the meaning of the word. However, until recently, our usage examples only worked for single-word terms and certain two-word compounds, and did not include idiomatic phrases like "time of one's life," "bee in your bonnet," or "small potatoes." These idioms are particularly important for our readers who are non-native English speakers, so adding usage examples for them was a top priority. Max and Hannah reimplemented our data pipeline to cover idiomatic phrases and their variations efficiently. Now, when you search for "time of one's life," you will find a usage example featuring George from Harry Potter and the Goblet of Fire saying "Yeah, we're having the time of our lives here."
5. Word card redesign
How can we organize the most important information about a word in a way that promotes learning and discovery? "Word cards" appear when you click on a word on OneLook Thesaurus or in the Thesaurus tab on RhymeZone, and in the future they will be available in other parts of our apps. Word cards display definitions, usage examples, and links to additional resources for learning. Nicole redesigned the look and feel of the word cards, giving them a fresh, icon-based layout. The cards also have interactive elements: for example, clicking on a definition reorders the usage examples to match it.
6. Word of the Day
Several dictionary websites offer a “word of the day” feature. Ours will be different from others – our words will be selected based on what’s currently happening in world news. Nicole implemented a data pipeline that crawls recent news stories to find words that are both interesting and meaningfully connected to current events, maximizing the chance that the selected word teaches the user something new. Look for this feature in early 2023.
7. Humour ranking
What words are inherently funny? Why is “gobbledygook” a funnier way to say “nonsense” than “nonsense” itself? Why is “yam” a funny vegetable, but “celery” is not? Believe it or not, NLP researchers have considered these questions. We wanted to add the ability to sort words by funniness to our recently added list of sort orders in OneLook Thesaurus (and in RhymeZone’s thesaurus tab). So Max implemented a model from this paper that ranks words by their inherent funniness, and hilarity ensued! To try it, do a search on the Thesaurus, open the “Closest meaning first” menu, and select “Most funny sounding”. (Max also created a standalone website called FunnyBone that shows off the work; check it out!)
8. OneLook Thesaurus API tests and explainer video
OneLook Thesaurus has grown in complexity over the years as it offers ever more ways to find words, phrases, and ideas. To help us rein in this complexity, Hannah created a battery of acceptance tests that exercises different aspects of our backend, helping us prevent software regressions from reaching our production service. This has already been useful in keeping the service stable, and it will be invaluable as we add more languages in 2023. She and the others also helped edit this video that goes over some of the quiet joys of using the English version of OneLook Thesaurus. Give it a watch!
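For anyone unfamiliar with acceptance tests, the sketch below shows the general shape of a table-driven check. It does not reflect the actual test suite: `run_acceptance_cases`, `fake_lookup`, and the canned results are hypothetical stand-ins, and a real suite would call the running service rather than a stub. Each case pairs a query with words that must appear in the results, and any case with missing words is reported as a regression.

```python
def run_acceptance_cases(lookup, cases):
    """Run each (query, must_include) case and collect the failures.

    lookup: callable that takes a query string and returns a list of words.
    cases: list of (query, set of words expected somewhere in the results).
    Returns a list of (query, missing_words) for every failing case.
    """
    failures = []
    for query, must_include in cases:
        missing = set(must_include) - set(lookup(query))
        if missing:
            failures.append((query, sorted(missing)))
    return failures


# Stand-in backend (hypothetical): a real suite would query the service.
def fake_lookup(query):
    canned = {"happy": ["glad", "joyful", "cheerful"], "big": ["large", "huge"]}
    return canned.get(query, [])


cases = [
    ("happy", {"glad", "joyful"}),
    ("big", {"large", "enormous"}),
]
failures = run_acceptance_cases(fake_lookup, cases)
print(failures)
```

Keeping the cases in a plain list makes it cheap to add a new one each time a bug is fixed, so the same regression cannot slip back in unnoticed.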