Microsoft Search Server 2010 Express Part 2: External Content Source

One of the key potential uses of Search Server 2010 Express is to provide a great search engine for your existing public-facing website. I work with many associations that run many different CMS platforms. While I’m a huge fan of the CMS capabilities of SharePoint 2010 for a variety of reasons, no single platform is right for everyone. There isn’t a single auto make and model for everyone, and there isn’t a single pair of shoes that will work for everyone, so why would the CMS industry be any different? However, a powerful search IS relevant to everyone (pun intended!).

In Part 1, we walked through a generic install. Once you have Search Server 2010 Express up and running, it is extremely simple to configure a new content source. If you are coming directly from the vanilla install, you should see a screen that links you directly to the Search Administration page.

If you are just jumping into Central Admin, the path to the Search Administration page is as follows: under Application Management, click Manage Service Applications, and then click Search Service Application. While the concept of Service Applications is beyond the scope of this particular post, know that in larger environments (such as SharePoint 2010) you can run multiple Search Service Applications.

In the left nav, under Crawling, click Content Sources. You will be taken to the Manage Content Sources page. You can use this page to add, edit, or delete content sources, and to manage crawls.

Before we go any further, what is a Content Source? For that matter, what is Content? In the context of Microsoft SharePoint and Search Servers, Content is any item that can be indexed. This can be an HTML Web page, a Microsoft Office Word document, a text file, a PDF file, business data, or even an e-mail message. Content lives somewhere, such as a Web site, a file share, a Notes database, a SQL database, or a SharePoint site. A Content Source specifies the settings that define what content should be indexed and on what schedule it should be crawled.

You should notice on the Manage Content Sources page that there is at least one Content Source already defined: Local SharePoint sites. Because we used the wizard to manage the install in Part 1, all local SharePoint sites are already defined as a Content Source.

To create a new Content Source (such as our external site), click New Content Source at the top. You will see the Add Content Source page, which has the following settings.

Content Source Name – A title you give this Content Source so that you can refer to it when managing it.

Content Source Type – The type of content that you will be crawling. This is an important setting because it tells the crawler not only what type of content is located there, but also how to actually communicate with the Content Source. For example, communicating with a File Share uses a completely different protocol than communicating with a web site. The default types of Content Sources supported are listed below. Note that I said ‘default’. You can work with vendors or write your own custom interface to crawl and index content types not supported out of the box. Also note that if you select a different type, the Crawl Settings change to show the details specific to that type of Content Source.

    • SharePoint Sites
    • Web Sites
    • File Share
    • Exchange Public Folders
    • Line of Business Data
    • Custom Repository

Start Addresses – the addresses from which the search system should start crawling. For SharePoint sites and Web sites, these are traditional URLs. For File Shares, these will be UNC paths that are accessible from the server. You can supply more than one Start Address for a Content Source. If, for example, I wanted a single Content Source to manage the various SusQtech websites that I am crawling, I could add http://www.susqtech.com/, http://www.sharepointacademy.org, http://www.sharepointconference.org, and http://www.thesug.org. I can then manage all of these URLs as a single Content Source. I could also opt to create multiple Content Sources so that I can manage each of the crawl schedules and details independently.
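To make the URL versus UNC distinction concrete, here is a minimal Python sketch that sorts candidate start addresses into the two forms described above. It is only an illustration and not the validation that Search Server itself performs; the function name classify_start_address and the \\fileserver\docs share are hypothetical.

```python
from urllib.parse import urlsplit

def classify_start_address(address: str) -> str:
    """Return 'web', 'unc', or 'unknown' for a candidate start address.

    Illustration only: this is an assumption about what a sensible check
    looks like, not Search Server's own validation logic.
    """
    if address.startswith("\\\\"):
        # UNC paths look like \\server\share[\folder...]; require both
        # a server name and a share name.
        segments = [s for s in address[2:].split("\\") if s]
        return "unc" if len(segments) >= 2 else "unknown"
    url = urlsplit(address)
    if url.scheme in ("http", "https") and url.netloc:
        return "web"
    return "unknown"

# Example addresses: the first is from the post, the second is hypothetical.
for candidate in ["http://www.susqtech.com/", r"\\fileserver\docs"]:
    print(candidate, "->", classify_start_address(candidate))
```

A check like this can be handy for catching a missing http:// prefix before pasting a long list of addresses into the Start Addresses box.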

Crawl Settings – used to specify the behavior of crawling for this Content Source.

Crawl Schedules – used to schedule the crawls for this Content Source. This allows you to configure two different crawl schedules: full and incremental. Why would you ever want an incremental crawl instead of a full one? Incremental crawls are supposed to crawl only content modified since the last crawl, and thus take less bandwidth, server memory, and CPU cycles. I typically configure these schedules with a Full crawl during off hours on the weekend and Incremental crawls every night during the week. Keep in mind that you may need more frequent incremental crawls, such as every hour for your public-facing website, if you are continuously adding new content.
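As a rough illustration of the schedule I described above (full on the weekend, incremental on weeknights), here is a minimal Python sketch. It does not reflect how Search Server stores or evaluates schedules; the single 2 a.m. nightly slot and the name crawl_type_for are assumptions made for the example.

```python
from datetime import datetime

# Assumption for illustration: crawls run in a single nightly slot at 2 a.m.
NIGHTLY_HOUR = 2

def crawl_type_for(when: datetime) -> str:
    """Return 'full', 'incremental', or 'none' for the given moment."""
    if when.hour != NIGHTLY_HOUR:
        return "none"          # outside the nightly slot, nothing is scheduled
    if when.weekday() >= 5:    # Saturday (5) or Sunday (6)
        return "full"
    return "incremental"       # Monday through Friday

print(crawl_type_for(datetime.now()))
```

If you need hourly incremental crawls for a busy site, the same idea applies: the rule simply returns "incremental" for more hours of the day.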

Content Source Priority – normal or high.  The crawler will prioritize ‘high’ items when you have multiple content sources that must be crawled.
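Conceptually, the priority setting works like an ordering over the content sources waiting to be crawled. The Python sketch below shows that idea only; the crawler's real scheduling is internal to Search Server, and the ContentSource class, the crawl_order function, and the source names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ContentSource:
    name: str
    priority: str = "normal"  # "normal" or "high", as on the settings page

def crawl_order(sources):
    """Return the sources with 'high' priority first.

    sorted() is stable, so sources that share a priority keep the order
    in which they were listed.
    """
    return sorted(sources, key=lambda s: 0 if s.priority == "high" else 1)

# Hypothetical content sources for illustration.
pending = [
    ContentSource("Local SharePoint sites"),
    ContentSource("Public website", priority="high"),
    ContentSource("File share archive"),
]
for source in crawl_order(pending):
    print(source.priority, "-", source.name)
```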

Start Full Crawl – a checkbox to start a full crawl immediately.

By John Stover

9 comments

  1. Have you ever tried to index MOSS 2007 or WSS 3.0 content?

    My idea is to extend an existing MOSS 2007 environment using Search Server 2010…

  2. I need to crawl a SQL server database but I am not sure that we can do this with Search Server 2010.

    1. Can we crawl a SQL Server database? If yes, then what is the procedure to do that?

    Please help me with this.

    Thanks & Regards.
    Vikas Chandgothia

  3. Yes, you can crawl external data sources. To do this, you need to configure Business Connectivity Services to expose the SQL data. Business Connectivity Services (BCS) enables SharePoint integration with external data, including line of business applications. BCS builds on top of the Business Data Catalog (BDC) technology delivered in Microsoft Office SharePoint Server 2007. The simplest approach would be to use SharePoint Designer 2010 to map the external data, and then Search will crawl and expose it very easily!

  4. I have been trying to set up Search Server Express 2010 on my SPFoundation 2010 server running on WinServer2008R2 to index my Exchange 2010 Public Folders.

    I don’t seem to be able to specify the http://…. path correctly for the indexer to find the public folders. I have tried many different formats and all return the same error. I have tried specifying a rule and get the same error.
    I have also set the “Default content access account” to the domain administrator and to my own accounts (with domain admin rights), and it always errors out after about 1 minute with the following:

    Error: Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. ….

    My public folder path in Outlook is: “Public Folders\All Public Folders\TeamScope CRM”

    Q1) What should the format/syntax be for indexing this Public Folder and all subfolders?

    Q2) How do I reset the “Default content access account” back to “NT AUTHORITY\NETWORK SERVICE”? I have no idea what the password is, and when I specify this account it requires that I know the password.

  5. John,
    Great coverage of a topic that seems to be getting very little attention on the web.
    Similar to the comment about adding Search Server to Foundation, have you tried this yet? I tried it on a test server today, and it seems the site collection in my default site was overwritten with the search center. There must be a way to avoid that.
    Thanks,
    Tom

  6. Hi, John.

    After the initial setup of MS Search Server 2010 Express, I add a content source (http://www.nais.org) and run a full crawl. But the crawl always stops after two minutes. It only gets about five or six pages before the status switches to “Completing” and then “Idle.” The same thing happens when I change the content source to a different URL in the same domain (http://sss.nais.org).

    The most information I have found on the problem is here
    http://social.technet.microsoft.com/Forums/en/sharepoint2010setup/thread/9ae514a1-3d19-4d85-b458-6be9743e1d7b
    but it’s not definitive.

    Have you run across this problem before?

    Thanks,
    – Cameron

  7. It’s important to note that SharePoint is capable of solving many business problems, but in many cases it requires configuration. SharePoint is capable of indexing and searching business data, but you must configure Business Connectivity Services (or the Business Data Catalog in the prior version). SharePoint is capable of indexing email and file shares, but you have to configure it. On the same note, you’re correct that you need to configure SharePoint for PDF indexing (using the Adobe iFilter you mentioned or others, like Foxit).
