One of the key potential uses of Search Server 2010 Express is to provide a great search engine for your existing public-facing website. I work with a lot of different associations that run a lot of different CMS platforms. While I’m a huge fan of using the CMS capabilities of SharePoint 2010 for a variety of reasons, there isn’t a single platform that is right for everyone. There isn’t a single auto make and model for everyone, and there isn’t a single pair of shoes that will work for everyone, so why would the CMS industry be any different? However, a powerful search IS relevant to everyone (pun intended!).
In Part 1, we walked through a generic install. Once you have Search Server 2010 Express up and running, it is extremely simple to configure a new content source. If you are coming directly from the vanilla install, you should see a screen that links you directly to the Search Administration page.
If you are jumping straight into Central Admin instead, you can reach the Search Administration page this way: under Application Management, click Manage Service Applications, and then click Search Service Application. While the concept of Service Applications is beyond the scope of this particular post, know that in larger environments (such as SharePoint 2010) you can run multiple Search Service Applications.
In the left nav, under Crawling, click Content Sources. You will be taken to the Manage Content Sources page. You can use this page to add, edit, or delete content sources, and to manage crawls.
Before we go any further, what is a Content Source? For that matter, what is Content? In the context of Microsoft SharePoint and Search Server, Content is any item that can be indexed. This can be HTML, a Web page, a Microsoft Office Word document, a text file, a PDF file, business data, or even an e-mail message. Content lives somewhere, such as a Web site, a file share, a Notes database, a SQL database, or a SharePoint site. A Content Source specifies the settings that define what content should be indexed and on what schedule it should be crawled.
You should notice on the Manage Content Sources page that there is at least one Content Source already defined: Local SharePoint sites. Because we used the wizard to manage the install in Part 1, all local SharePoint sites are already defined as a Content Source.
To create a new Content Source (such as our external site), click New Content Source at the top. You will see the Add Content Source page:
Content Source Name – A title that you are giving as a reference to manage this Content Source.
Content Source Type – Type of Content that you will be crawling. This is an important setting because it tells the crawler not only what type of content will be located there, but also how to actually communicate with the Content Source. For example, communicating with a File Share uses a completely different protocol than communicating with a web site. Note that if you select a different type, the Crawl Settings change to show the details specific to that type of Content Source. The default types of Content Sources supported are listed below. Note that I said ‘default’: you can work with vendors or write your own custom interface to crawl and index content types not supported out of the box.
- SharePoint Sites
- Web Sites
- File Share
- Exchange Public Folders
- Line of Business Data
- Custom Repository
Start Addresses – the addresses from which the search system should start crawling. For SharePoint sites and Web sites, these are traditional URLs. For File Shares, these will be UNC paths that are accessible from the server. You can supply more than one Start Address for a Content Source. If, for example, I wanted a single Content Source to manage the various SusQtech websites that I am crawling, I could add http://www.susqtech.com/, http://www.sharepointacademy.org, http://www.sharepointconference.org, and http://www.thesug.org. I can then manage all of these URLs as a single Content Source. I could also opt to create multiple Content Sources so that I can manage each of the crawl schedules and details independently.
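If you keep your Start Addresses in a list before pasting them into the form, a quick sanity check can catch a URL that ended up under a File Share source (or the reverse). The following is only a rough Python sketch of that idea, not anything Search Server runs itself: the function names are my own, the server name \\fileserver\docs is made up, and it only handles the three types discussed above.

```python
from urllib.parse import urlparse


def is_web_url(address: str) -> bool:
    """True for a traditional http/https URL with a host name."""
    parts = urlparse(address)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_unc_path(address: str) -> bool:
    """True for a UNC path of the form \\\\server\\share (optionally with folders)."""
    if not address.startswith("\\\\"):
        return False
    segments = [s for s in address[2:].split("\\") if s]
    return len(segments) >= 2  # need at least a server and a share


def check_start_addresses(source_type: str, addresses: list) -> list:
    """Return the addresses that do not fit the given Content Source Type.

    Assumption: only the three types discussed in this post are handled.
    """
    if source_type in ("SharePoint Sites", "Web Sites"):
        checker = is_web_url
    elif source_type == "File Share":
        checker = is_unc_path
    else:
        raise ValueError("type not covered by this sketch: " + source_type)
    return [a for a in addresses if not checker(a)]


# Hypothetical usage: the file server name is invented for illustration.
bad = check_start_addresses(
    "File Share", [r"\\fileserver\docs", "http://www.thesug.org"]
)
print(bad)
```

An empty list means every address looks right for the chosen type; anything returned is worth a second look before you save the Content Source.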
Crawl Settings – used to specify the behavior of crawling for this Content Source.
Crawl Schedules – used to schedule the crawls for this Content Source. You can configure two different crawl schedules: full and incremental. Why would you ever want an incremental crawl instead of a full one? An incremental crawl is meant to crawl only the content modified since the last crawl, and so it takes less bandwidth, server memory, and CPU cycles. I typically configure a Full crawl during the off hours on the weekend and Incremental crawls every night during the week. Keep in mind that you may need more frequent incremental crawls – such as every hour for your public-facing website – if you are continuously adding new content.
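To make the full versus incremental distinction concrete, here is a small Python sketch of the schedule I described. It is purely illustrative and is not how Search Server detects changes internally. The assumptions are mine: the full crawl runs on Saturday night, nothing runs on Sunday, and each page is represented by its URL and a last-modified time (the /news page is a made-up example).

```python
from datetime import datetime


def crawl_type_for(night: datetime):
    """Pick the crawl type for a nightly run (assumed schedule, see above)."""
    weekday = night.weekday()  # Monday=0 ... Sunday=6
    if weekday == 5:
        return "full"  # Saturday: weekend off hours
    if weekday < 5:
        return "incremental"  # Monday through Friday nights
    return None  # Sunday: no crawl in this sketch


def items_to_crawl(items: dict, crawl_type: str, last_crawl: datetime) -> list:
    """Return the URLs a crawl would visit.

    items maps each URL to the time it was last modified. A full crawl
    visits everything; an incremental crawl visits only what changed
    after last_crawl.
    """
    if crawl_type == "full":
        return sorted(items)
    return sorted(url for url, modified in items.items() if modified > last_crawl)


# Hypothetical pages: the /news URL and both timestamps are invented.
pages = {
    "http://www.susqtech.com/": datetime(2010, 6, 1, 8, 0),
    "http://www.susqtech.com/news": datetime(2010, 6, 7, 9, 0),
}
monday = datetime(2010, 6, 7, 23, 0)
print(crawl_type_for(monday))
print(items_to_crawl(pages, crawl_type_for(monday), datetime(2010, 6, 5, 23, 0)))
```

The point of the second function is the size of the result: on a site where little changes from day to day, the incremental list stays short, which is where the savings in bandwidth and CPU come from.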
Content Source Priority – normal or high. The crawler will prioritize ‘high’ items when you have multiple content sources that must be crawled.
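If it helps to picture what the priority setting does, think of it as a sort. The Python sketch below is only my illustration of that idea (the crawler's real scheduling is internal to the product), and the Content Source names other than Local SharePoint sites are invented.

```python
def crawl_order(sources: list) -> list:
    """Order (name, priority) pairs so that 'high' sources come first.

    sorted() is stable, so sources with the same priority keep the
    order in which they were listed.
    """
    rank = {"high": 0, "normal": 1}
    return sorted(sources, key=lambda source: rank[source[1]])


# Hypothetical Content Sources waiting to be crawled.
waiting = [
    ("Local SharePoint sites", "normal"),
    ("SusQtech websites", "high"),
    ("Team file share", "normal"),
]
for name, priority in crawl_order(waiting):
    print(priority, name)
```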
Start Full Crawl – a checkbox to start a full crawl immediately.