Original Photo by JohnStover
There are literally hundreds and thousands of ‘search engines’ out there. Some of these search engines are for finding stuff on the Internet, like Google, Bing and Yahoo. Some search engines are more specialized, like the search box you see on a single website that searches only that website. Search is an incredibly complex topic, with an astounding number of factors that contribute to finding the one important piece of content you are looking for. Frankly, Google spoiled all of us. I expect to find exactly what I’m looking for out of the millions of pages of stuff all over the Internet by simply typing a single word into a single little box. If I don’t find what I want on the first page of results, I might try changing my search a little bit or adding two words, but I won’t keep trying for long.
The Internet contains at least 27.5 billion pages, as of Tuesday, 03 August, 2010, according to http://www.worldwidewebsize.com. Not only do I expect to find exactly what I want on the Internet, but if I use the search on your website, I get EXTREMELY frustrated when it doesn’t find exactly what I want when I want it. How is this possible? I know what I want is on your website somewhere. Figure out what I want and show it to me! And please do it in under a second if it’s not too much trouble!
In the beginning, search was simple. Search was based on keyword matching. If I typed in a keyword, the ‘search engine’ scanned the content and found instances of that word and showed me hyperlinks with those results. I could search for ‘blog’ and the search would show me any page that had the word ‘blog’ in it. That was perfect! It’s all anyone needed. Then websites started to grow in complexity. Soon, each website had thousands of pages. If I did a simple keyword search, I would get hundreds of results. This wasn’t useful anymore. Search had to get better.
Then search introduced major improvements, starting with Boolean search operators. I could search for “SharePoint AND WordPress”. I could search for “SharePoint NOT WordPress”. I had some control over exactly what I was searching for. I also got search result sorting. I could sort all of the results to see the most recently created pages at the top. After all, if the page was newer, then it clearly was more relevant, right?
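To make the Boolean operators concrete, here is a minimal sketch in Python. It is not how any real search engine implements them; the page names, the sample text and the helper functions are all made up for illustration.

```python
import re

# Hypothetical sample pages; the names and text are invented for this sketch.
PAGES = {
    "page1": "SharePoint and WordPress can both host a blog",
    "page2": "SharePoint search configuration tips",
    "page3": "WordPress themes for your blog",
}

def words(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search_and(pages, a, b):
    """Names of pages that contain both keywords (a AND b)."""
    return sorted(
        name for name, text in pages.items()
        if a.lower() in words(text) and b.lower() in words(text)
    )

def search_not(pages, a, b):
    """Names of pages that contain a but do not contain b (a NOT b)."""
    return sorted(
        name for name, text in pages.items()
        if a.lower() in words(text) and b.lower() not in words(text)
    )

print(search_and(PAGES, "SharePoint", "WordPress"))  # ['page1']
print(search_not(PAGES, "SharePoint", "WordPress"))  # ['page2']
```

Notice that this still scans every page on every search, which is exactly the approach that stops working once a site has thousands of pages.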
That statement introduces a very important topic: RELEVANCE. Relevance denotes how well the results meet the need of the user searching; see the all-knowing Wikipedia for more details at http://en.wikipedia.org/wiki/Relevance_(information_retrieval). Relevance is determined by the search algorithm. That’s right; a computer programmer wrote a mathematical formula that uses the available information to determine the relevance of the content to your search word. In reality, that algorithm was written by a very large team of programmers, analysts, mathematicians, executives and many others. And the search is getting more complicated and far better every day.
Most modern search engines are made up of two primary components: the INDEX and the QUERY. The index is just like the index at the back of a book. Rather than scanning all of the content in real time, the search engine builds a big index of all of the content. This is much faster than scouring through the content in real time. Furthermore, the index can be optimized for the type(s) of searches being performed. Your individual website search is responsible for searching your website. Facebook search searches Facebook – the profiles, comments, photos, tags, etc. Google and Bing try to search everything – your website, my website, her website, their website. Your website search should search ALL of your content – web pages, HTML, PDF files, Word docs, PowerPoint files, Excel files, images, comments. The index should include ALL of your content.
So how is the index built? Usually indexes are built by a Web crawler – some type of automated software that scours all of the links and content on your site. The indexer uses a word breaker to split the content into individual words. In the English language, there are many characters that break words apart. Spaces, hyphens, periods, colons, semicolons and exclamation points all separate words in English. When you get into multilingual content, the story gets even more complicated because other languages don’t even use the same characters. So the crawler goes through all of the content and builds this enormous index for use in queries. The index contains the words, counts, metadata, information about where the words were found, information about the pages, information about the documents, titles, cached portions of pages and much more.
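Here is a rough sketch of a word breaker and a tiny index in Python. It assumes English-only content and a simplified set of separator characters, and it stores nothing but word counts per page, so treat it as an illustration of the idea rather than what a real crawler builds. The page names and text are made up.

```python
import re
from collections import defaultdict

# Assumed English word-break characters: whitespace, hyphens, periods,
# commas, colons, semicolons, exclamation points and question marks.
WORD_BREAK = re.compile(r"[\s\-.,:;!?]+")

def break_words(text):
    """Split text on the word-break characters and lowercase each word."""
    return [w.lower() for w in WORD_BREAK.split(text) if w]

def build_index(pages):
    """Map each word to a dict of {page name: number of occurrences}."""
    index = defaultdict(dict)
    for name, text in pages.items():
        for word in break_words(text):
            index[word][name] = index[word].get(name, 0) + 1
    return index

# Hypothetical pages, invented for this sketch.
PAGES = {
    "home": "Welcome! Read our blog: search-engine tips.",
    "about": "Our blog is about search; search is hard.",
}

index = build_index(PAGES)
print(index["search"])  # {'home': 1, 'about': 2}
```

Once the index exists, answering a query is a single dictionary lookup instead of a scan through every page, which is where the speed comes from.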
When a user enters a query, the search engine uses its algorithm to provide the most relevant information possible. What determines relevancy? There are many factors that should determine relevancy…
- Content Type. What type of content is the word found on? PowerPoint files typically have fewer words. If your keyword is one of the 20 words on a slide, that file is likely more relevant than a Word document or web page that has 2000 words.
- Location. If your keyword is found on the homepage or main landing page, it is likely more relevant than if it is found on a page 30 nodes away through some obscure navigation.
- Popularity and linking. How popular is the page? How many other pages and documents link to the page? How frequently is the page visited?
- Analytics. How frequently is the page visited with similar queries? If 50 other people searched for the same keyword(s) you searched for, which pages did they eventually go to?
- Words. How many times is the keyword on the page? How many
- Metadata. Is your keyword in the metadata or just the main content area? Is your keyword in the page title?
- Language Detection. Is my browser set to Spanish? Should documents in Spanish show up with a higher ranking in the search results?
- Variants (Word Stemming). What if I search for the word “Flying”? Should the search engine also search for Fly and Flew and Flown? What if it’s a different language? Should the search engine be aware of other word variations?
- Human Influence. What about best bets, synonyms and keyword mapping? If someone is on the Association site and searches for the word Meeting, do you want to artificially influence the search results to show ‘Sign up for the Annual Conference’ as the first result? I bet the conference organizers do!
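A few of the factors above can be combined into a toy ranking function. The sketch below, in Python, uses only three of them: keyword density (so a short slide scores higher than a long document for the same single hit), a bonus when the keyword is in the title, and a best bet that is pinned to the top. The weight, the best-bet table and the sample pages are all assumptions made up for illustration; real relevance formulas are far more involved.

```python
import re

# Hypothetical tuning value and best-bet mapping, invented for this sketch.
TITLE_BONUS = 0.1
BEST_BETS = {"meeting": "Sign up for the Annual Conference"}

def tokens(text):
    """Lowercase the text and return its words as a list."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(page, keyword):
    """Keyword density in the body, plus a bonus for a title match."""
    kw = keyword.lower()
    body = tokens(page["body"])
    density = body.count(kw) / len(body) if body else 0.0
    bonus = TITLE_BONUS if kw in tokens(page["title"]) else 0.0
    return density + bonus

def rank(pages, keyword):
    """Titles of matching pages, best score first, with any best bet pinned on top."""
    scored = [(score(p, keyword), p["title"]) for p in pages]
    ranked = [t for s, t in sorted(scored, key=lambda x: -x[0]) if s > 0]
    pinned = BEST_BETS.get(keyword.lower())
    if pinned:
        ranked = [pinned] + [t for t in ranked if t != pinned]
    return ranked

# Hypothetical pages, invented for this sketch.
PAGES = [
    {"title": "Annual Report Slides", "body": "budget meeting agenda"},
    {"title": "Meeting Minutes",
     "body": "the meeting started late and the board discussed the budget at length"},
    {"title": "Sign up for the Annual Conference",
     "body": "register for the conference today"},
    {"title": "Contact Us", "body": "email the office"},
]

print(rank(PAGES, "Meeting"))
# ['Sign up for the Annual Conference', 'Annual Report Slides', 'Meeting Minutes']
```

The conference page never mentions the word Meeting, yet it comes first because a human decided it should. The slide deck then outranks the minutes because its one hit is one of only three words.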
As you can see, the effectiveness of the search engine depends on the ability to determine relevance and then use that relevance to rank the search results. Modern search engines are available both inherently integrated and completely independent from your website content management technology. WordPress, for example, has a built-in search that is pretty simple (and thus largely ineffective). It’s great for finding a keyword, but I would hardly call it a search engine. Both Microsoft and Google provide real search solutions. They have solutions for you at every level: your desktop, your enterprise, your website, and the Internet. We are focusing primarily on your website and to a lesser extent your enterprise. The Google Search Appliance is a great solution with excellent relevancy that can be customized for your particular website needs. The Google Search Appliance and Google Mini require annual maintenance fees.
Microsoft provides a free solution to search for your website and for the enterprise. That’s right; Microsoft provides enterprise level search capabilities for FREE. Microsoft Search Server 2010 Express provides the search capabilities described in this overview for FREE. While this solution may not be the perfect fit for every website, I think it is at least worth evaluating. You can download the software for free, install it, and configure it in a matter of minutes. If it works for you, implementing it with your website is as simple as replacing the search box.