There are literally hundreds, if not thousands, of "search engines" out there. Some of them search the whole Internet, like Google, Bing and Yahoo. Others are more specialized, like the search box on a single website that searches only that site. Search is an incredibly complex topic, with an astounding number of factors that contribute to finding the one important piece of content you are looking for. Frankly, Google spoiled all of us. I expect to find exactly what I'm looking for, out of the millions of pages of stuff all over the Internet, by simply typing a single word into a single little box. If I don't find what I want on the first page of results, I might try changing my search a little bit or adding a word or two, but I won't keep trying for long.
The Internet contains at least 27.5 billion pages, as of Tuesday, 03 August, 2010, according to http://www.worldwidewebsize.com. Not only do I expect to find exactly what I want on the Internet, but if I use the search on your website, I get EXTREMELY frustrated when it doesn't find exactly what I want, when I want it. How is this possible? I know what I want is on your website somewhere. Figure out what I want and show it to me! And please do it in under a second if it's not too much trouble!
In the beginning, search was simple. Search was based on keyword matching. If I typed in a keyword, the "search engine" scanned the content, found instances of that word and showed me hyperlinks to those results. I could search for "blog" and the search would show me any page that had the word "blog" in it. That was perfect! It's all anyone needed. Then websites started to grow in complexity. Soon, each website had thousands of pages. If I did a simple keyword search, I would get hundreds of results. This wasn't useful anymore. Search had to get better.
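To make that concrete, here is a minimal Python sketch of plain keyword matching. The page URLs and text are invented for illustration, and `keyword_search` is a hypothetical helper, not the code of any real search product.

```python
# Hypothetical mini-site: page URL -> page text (made up for this example).
PAGES = {
    "/home": "Welcome to my blog about SharePoint",
    "/about": "About the author",
    "/wordpress": "Moving a blog from WordPress to SharePoint",
}


def keyword_search(pages, keyword):
    """Return the URLs of every page whose text contains the keyword."""
    needle = keyword.lower()
    # Scan every page in full on every search: simple, but slow on big sites.
    return [url for url, text in pages.items() if needle in text.lower()]


print(keyword_search(PAGES, "blog"))  # ['/home', '/wordpress']
```

Note that this is a raw substring match that scans all of the content on every search, which is exactly why it stops being useful once a site grows.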
Search got some major improvements. Boolean search operators were introduced. I could search for "SharePoint AND WordPress". I could search for "SharePoint NOT WordPress". I had some control over exactly what I was searching for. I also got search result sorting. I could sort all of the results to see the most recently created pages at the top. After all, if the page was newer then it clearly was more relevant, right?
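Here is a rough Python sketch of what AND, NOT and sort-by-date amount to. The pages, dates and function names are all assumptions made up for the example; a real engine parses the query string rather than taking the words as separate arguments.

```python
from datetime import date

# Hypothetical pages: (url, text, created). All values invented for the example.
PAGES = [
    ("/a", "SharePoint and WordPress compared", date(2010, 5, 1)),
    ("/b", "SharePoint search tips", date(2010, 7, 15)),
    ("/c", "WordPress themes", date(2010, 6, 10)),
    ("/d", "Migrating WordPress content into SharePoint", date(2010, 8, 1)),
]


def words(text):
    """Lower-cased set of the whitespace-separated words in a text."""
    return set(text.lower().split())


def search_and(pages, a, b):
    """'a AND b': pages that contain both words."""
    return [p for p in pages if {a.lower(), b.lower()} <= words(p[1])]


def search_not(pages, a, b):
    """'a NOT b': pages that contain a but not b."""
    return [p for p in pages
            if a.lower() in words(p[1]) and b.lower() not in words(p[1])]


def newest_first(pages):
    """Sort results so the most recently created page comes first."""
    return sorted(pages, key=lambda p: p[2], reverse=True)


hits = newest_first(search_and(PAGES, "SharePoint", "WordPress"))
print([p[0] for p in hits])                                         # ['/d', '/a']
print([p[0] for p in search_not(PAGES, "SharePoint", "WordPress")])  # ['/b']
```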
That statement introduces a very important topic: RELEVANCE. Relevance denotes how well the results meet the need of the user searching; see the all-knowing Wikipedia for more details at http://en.wikipedia.org/wiki/Relevance_(information_retrieval). Relevance is determined by the search algorithm. That's right; a computer programmer wrote a mathematical formula that uses the available information to determine the relevance of the content to your search word. In reality, that algorithm was written by a very large team of programmers, analysts, mathematicians, executives and many others. And search is getting more complicated and far better every day.
Most modern search engines are made up of two primary components: the INDEX and the QUERY. The index is just like the index at the back of a book. Rather than scanning all of the content in real time, the search engine builds a big index of all of the content ahead of time. Looking something up in the index is much faster than scouring through the content on every search. Furthermore, the index can be optimized for the type(s) of searches being performed. Your individual website search is responsible for searching your website. Facebook search searches Facebook: the profiles, comments, photos, tags, etc. Google and Bing try to search everything: your website, my website, her website, their website. Your website search should search ALL of your content: web pages, HTML, PDF files, Word docs, PowerPoint files, Excel files, images, comments. The index should include ALL of your content.
So how is the index built? Usually indexes are built by a Web crawler, some type of automated software that scours all of the links and content on your site. The indexer uses a word breaker to split the content into individual words. In the English language, there are many characters that break words apart. Spaces, hyphens, periods, colons, semicolons and exclamation points all separate words in English. When you get into multi-lingual content, the story gets even more complicated because other languages don't even use the same characters. So the crawler goes through all of the content and builds this enormous index for use in queries. The index contains the words, counts, metadata, information about where the words were found, information about the pages, information about the documents, titles, cached portions of pages and much more.
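As a small illustration of word breaking and indexing, the Python sketch below splits text on English separators and records where each word was found. The regular expression is a simplifying assumption (real word breakers are language-aware), and the pages and function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: any run of letters, digits or apostrophes is a word, so spaces,
# hyphens, periods, colons, semicolons and exclamation points all break words.
WORD = re.compile(r"[A-Za-z0-9']+")


def break_words(text):
    """Split text into lower-cased words."""
    return [w.lower() for w in WORD.findall(text)]


def build_index(pages):
    """Map each word to {url: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for pos, word in enumerate(break_words(text)):
            index[word].setdefault(url, []).append(pos)
    return index


# Hypothetical content, invented for the example.
PAGES = {
    "/one": "Search is hard: really hard!",
    "/two": "Full-text search; fast search.",
}
index = build_index(PAGES)

print(break_words("Full-text search; fast search."))
# ['full', 'text', 'search', 'fast', 'search']
print(index["search"])
# {'/one': [0], '/two': [2, 4]}
```

Answering a query is now a dictionary lookup instead of a scan of every page, and the position lists give you both the counts and the locations of each word.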
When a user enters a query, the search engine uses its algorithm to provide the most relevant information possible. What determines relevancy? There are many factors that should determine relevancy...
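Real ranking formulas are far more involved than anything shown here, so treat the following Python sketch purely as an illustration. It assumes just two factors, how often the query words appear and whether they appear in the title, and the weight of 3 for title hits is an arbitrary, made-up number.

```python
def score(page, query_words, title_weight=3):
    """Count query-word hits; a hit in the title counts title_weight times."""
    title = page["title"].lower().split()
    body = page["body"].lower().split()
    total = 0
    for word in query_words:
        word = word.lower()
        total += title_weight * title.count(word) + body.count(word)
    return total


def rank(pages, query):
    """Return URLs of matching pages, highest score first."""
    query_words = query.split()
    scored = [(score(p, query_words), p["url"]) for p in pages]
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Pages with a score of zero did not match at all, so leave them out.
    return [url for s, url in ordered if s > 0]


# Hypothetical pages, invented for the example.
PAGES = [
    {"url": "/a", "title": "Contact us", "body": "search this site for a phone number"},
    {"url": "/b", "title": "Search tips", "body": "how to search and refine a search"},
    {"url": "/c", "title": "About", "body": "company history"},
]

print(rank(PAGES, "search"))  # ['/b', '/a']
```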
As you can see, the effectiveness of the search engine depends on the ability to determine relevance and then use that relevance to rank the search results. Modern search engines are available both built into your website content management technology and completely independent of it. WordPress, for example, has a built-in search that is pretty simple (and thus largely ineffective). It's great for finding a keyword, but I would hardly call it a search engine. Both Microsoft and Google provide real search solutions. They have solutions for you at every level: your desktop, your enterprise, your website, and the Internet. We are focusing primarily on your website and to a lesser extent your enterprise. The Google Search Appliance is a great solution with excellent relevancy that can be customized for your particular website's needs. The Google Search Appliance and Google Mini require annual maintenance fees.
Microsoft provides a free solution to search for your website and for the enterprise. That's right; Microsoft provides enterprise level search capabilities for FREE. Microsoft Search Server 2010 Express provides the search capabilities described in this overview for FREE. While this solution may not be the perfect fit for every website, I think it is at least worth evaluating. You can download the software for free, install it, and configure it in a matter of minutes. If it works for you, implementing it with your website is as simple as replacing the search box.
Hi John. Thank you for this post. Fascinating and relevant (heh) information.
I also read both of your posts on installing Search Server Express 2010. Loved them. Thank you. Very helpful. The only thing you seem to have left out of both of those posts (and this one) is the final step of HOW to actually integrate SSE functionality into a web site. You say it's as simple as replacing the search box? Can you give a quick example of how that would look in practice? I am not certain I understand those steps.
Thanks a bunch, your posts are very informative.
-Steve