When is Fuzzy Search Too Fuzzy? Elephind Tells Us!

About two weeks ago, in my never-ending quest for more resources and repositories for researching genealogy through newspapers, I wrote an article, Elephind – One To Watch, about a dynamic service that searches multiple online newspaper repositories at one time. I was very interested in the service and have been in contact with its creators to learn more about its capabilities and technologies.

After reading my post, several genealogy bloggers also wrote short articles about Elephind to tell their readers about the site. One of those articles was particularly interesting: it compared the results of a search on Elephind with the same search on the Library of Congress's Chronicling America newspaper research site. Prompted by a comment the Elephind folks left on that article, I suggested that they write a guest post describing search technologies and the impact of fuzzy search on newspaper research.

So I am pleased to present the following guest post from Meredith Palmer of DL Consulting, the creators of Elephind:
______________________________________________________________________________________________

Search engine logic:  When is “fuzzy” search a bit too fuzzy?

By Meredith Palmer, DL Consulting

As a genealogist, family historian, or ancestor hunter, the almighty search engine is likely your most important forensic tool in the initial stages of discovery. These days, almost every website you come across, especially one housing historical records, offers some type of search functionality. It is the only way to wade through pages of documents efficiently and organize information in ways that make finding that hidden gem possible.

Faceted, federated, fuzzy… these are all terms that describe functions search engines can perform, and each type has an important role to play in returning relevant results. Faceted search is something you are probably already familiar with: it is the function of filtering search results by criteria such as date, title, or subject. Federated search is like an “uber-search”. It allows you to search multiple searchable resources with a single query. And finally, fuzzy search is… well… fuzzy. It is a technique that helps us out by searching for words similar to the word we query, broadening the search to include likely alternative spellings. But is it really helpful?

That depends. As Stefan Boddie recently described in response to a blog post by Phillip Trauring on his website, www.bloodandfrogs.com, fuzzy search can sometimes be very helpful, but it may also be too clever for its own good, generating lots of false positive results. As Stefan points out, fuzzy search is useful for sorting through poor OCR text because it is intended to find close matches, on the assumption that the words it finds are distorted versions of your query. The problem is that the results are also likely to include distorted versions of words that are similar to, but not the same as, your query. Suddenly, you have hundreds of thousands of results to sift through.

Chronicling America is an example of a website that uses fuzzy searching, at least for certain searches. To see how the search function on this site would behave, I experimented with a search for my grandfather’s family name, Meerse. Sixty pages of results were returned, most of them matching a word merely similar to Meerse, including “Monroe”, “Messrs”, “Melbourne”, “course”, “license” and many others. In a way, fuzzy search is like the wide-angle lens on a camera. Turning the lens widens your view of the landscape to include everything around you. If that’s too much to look at all at once, you need to narrow the view again, if the website you are using allows you to do that.
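To make the wide-angle effect concrete, here is a minimal sketch of one common way to measure word similarity: edit distance, the number of single-character insertions, deletions, or substitutions needed to turn one word into another. This is an illustration only. It is an assumption that a fuzzy search engine works this way, not a description of how Chronicling America actually matches words, and the names edit_distance, fuzzy_matches, and page_words are invented for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query: str, words: list[str], max_distance: int) -> list[str]:
    """Return the words within max_distance edits of the query, ignoring case."""
    query = query.lower()
    return [w for w in words if edit_distance(query, w.lower()) <= max_distance]


# Hypothetical words as they might appear in OCR text (assumption for illustration).
page_words = ["Meerse", "Messrs", "Monroe", "Melbourne", "course", "license"]

# Widening the allowed distance is like turning the wide-angle lens.
for limit in (0, 1, 2, 3, 4):
    print(limit, fuzzy_matches("Meerse", page_words, limit))
```

A limit of 0 is an exact match. Each step up admits more words that are not the name you were looking for, which is where the false positives come from.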

All historical news aggregators, such as Chronicling America, Trove, Papers Past, and CDNC, as well as the pay-per-view news banks, perform searches in slightly different ways. To make searching these sites easier and to provide a fast, federated search across all of them, Stefan Boddie and his colleagues built Elephind.com. Elephind incorporates many different digital newspaper collections and allows a user to query all of them at once. As Stefan says to Mr. Trauring, “Our goal is to make the search functions in Elephind better than those in the underlying collections like Chronicling America and Trove…” Therefore, the site is not set to conduct fuzzy search except on request.
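The idea of a federated search can be sketched in a few lines: send one query to every collection, then merge the results and label each with its source. The sketch below is a toy built on assumptions. The collections are small in-memory lists with invented names and text, and the functions search_collection and federated_search are hypothetical. It does not describe how Elephind is actually built.

```python
import re

# Hypothetical collections: each maps a collection name to its page texts.
# Names and contents are invented for illustration.
COLLECTIONS = {
    "Collection A": [
        "Mr. Meerse arrived on Tuesday",
        "Messrs. Smith and Jones opened a store",
    ],
    "Collection B": [
        "The Meerse family farm was sold",
        "Weather report for Melbourne",
    ],
}


def search_collection(pages: list[str], query: str) -> list[str]:
    """Exact, case-insensitive whole-word search within one collection."""
    wanted = query.lower()
    return [p for p in pages if wanted in re.findall(r"[a-z]+", p.lower())]


def federated_search(query: str, collections: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Run the same query against every collection and merge the results,
    tagging each hit with the collection it came from."""
    results = []
    for name, pages in collections.items():
        for page in search_collection(pages, query):
            results.append((name, page))
    return results


for source, page in federated_search("Meerse", COLLECTIONS):
    print(f"{source}: {page}")
```

Because every collection is searched with the same exact-match rule here, the results behave consistently no matter which source they come from.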

If you would like to test out the search capabilities of Elephind and are interested in giving your feedback, I will pose the same questions Stefan asked Mr. Trauring. Do you think fuzzy search as implemented in Chronicling America is a good idea? Would it be better if searches were for exact matches by default, with some sort of “search for similar words” option the user could choose when desired? What other search features would make it easier to use these collections? Feel free to leave a reply here or contact me at Meredith@dlconsulting.com with your comments.
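For readers who want to picture the second question, here is a minimal sketch of "exact by default, similar words on request". It uses get_close_matches from Python's standard difflib module as the similarity measure. That choice, the toy index, the search function, and the 0.6 cutoff are all assumptions made for illustration, not a description of any real site's search.

```python
from difflib import get_close_matches

# Hypothetical toy index: each word maps to the pages where OCR found it.
INDEX = {
    "meerse": ["page-3"],
    "messrs": ["page-1", "page-7"],
    "melbourne": ["page-2"],
}


def search(query: str, index: dict[str, list[str]],
           similar: bool = False, cutoff: float = 0.6) -> dict[str, list[str]]:
    """Return pages for the exact word; add similar words only when asked."""
    query = query.lower()
    words = [query] if query in index else []
    if similar:
        # get_close_matches returns candidates whose similarity ratio is at
        # least `cutoff`, best match first.
        for word in get_close_matches(query, list(index), n=10, cutoff=cutoff):
            if word not in words:
                words.append(word)
    return {word: index[word] for word in words}


print(search("Meerse", INDEX))                # exact matches only
print(search("Meerse", INDEX, similar=True))  # the user opted in to similar words
```

The design choice is that the narrow, predictable search is what you get unless you ask for more, and the cutoff gives a site one knob for deciding how wide "similar" should be.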



Join the Conversation

2 Comments

  1. I find your site no better than the “fuzzy” search. My query was “Newnom” and, as with every other site, it mistook “Newsom” for “Newnom”. Why OCR is incapable of distinguishing N from S is beyond me.
    It would also be nice if the search results were highlighted on the page so we don’t have to read every word on the page.
