Enterprise Search Engine Insights – Indexing

Enterprise Search refers to the software tools for making unstructured content from multiple enterprise-type sources, such as databases and intranets, searchable to users who are looking for information in these sources. Although publicly available information may be combined with internal unstructured information, the Enterprise Search Engine operates primarily on internal enterprise content.

Enterprise Search can be contrasted with web search and desktop search:

  • Web search applies search technology to documents and web pages on the open web
  • Desktop search applies search technology to the content on a single computer (which may comprise mirrored shared network drives)

Enterprise search systems index data and documents from a variety of sources such as:

  • file systems
  • intranets
  • document management systems
  • content management systems
  • e-mail
  • databases

Structuring data for enterprise search

Many organizations try to structure their unstructured information with tags, taxonomies, ontologies and categories. The objective of adding a structure layer is to improve the search results by allowing users to include the tags and categories in their queries. Our position is that tagging is costly, time-consuming and prone to errors. Tagging is unnecessary when you have an unsupervised learning engine like Alexandria.Works.

In most Enterprise Search systems, the indexing of the information goes through several steps (see Figure 1 below):

  • Content submission: The mechanism that makes the information that needs to be searchable available to the Enterprise Search Engine. It exists in two models: push and pull. In the push model, a source system pushes new content directly to the Search Engine (e.g. through an API). This model is used when real-time indexing is important, because the source content is modified frequently and the changes must be searchable immediately. In the pull model, the Search Engine itself gathers content from the sources using a connector. The index is therefore updated less frequently than in the push model, and more time passes before new content becomes searchable. Pull models in Enterprise Search typically run overnight.
  • Pre-processing: The content from the different sources may have many different formats or document types, such as PDF, XML, HTML, Office document formats or plain text. The pre-processing phase processes the incoming documents to plain text using document filters.
  • Indexing: During the actual indexing phase, the Search Engine analyses the documents and converts them to a structure that is searchable. The way in which these indexes are structured varies widely depending on the type of engine. After this processing phase the content becomes searchable.
  • Query Processing: Using a specific user interface or an API, queries are submitted to the Search Engine. The engine uses the submitted words or phrases to query the index and return a result set.
  • Ranking: Ranking is a post-processing step during which the search engine determines the sequence in which the search results should be presented to the user. Ranking can be based on several parameters, such as the location in the document where the query words are found (e.g. in the title or file name), the age of the document (more recent documents might be promoted), other metadata, or trending topics in the content management system. The concept of the half-life of knowledge plays a role in this context. (We will write about the half-life of information in a future blog post; in the meantime you can find its definition here.)

Figure 1 The Processing Phases of Enterprise Search
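
To make the phases above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular product (including Alexandria.Works) works internally: the index is a plain inverted index, which is only one of the many index structures mentioned above, the document filter handles just HTML and plain text, and the ranking formula (a boost per query term found in the title, multiplied by an exponential half-life decay on document age) is an assumed example. All names (`TinySearchEngine`, `preprocess`, `tokenize`, `half_life_days`) are hypothetical.

```python
import re
from collections import defaultdict
from datetime import date
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (a stand-in document filter)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def preprocess(raw, fmt):
    """Pre-processing: reduce an incoming document to plain text."""
    if fmt == "html":
        extractor = _TextExtractor()
        extractor.feed(raw)
        extractor.close()
        return " ".join(extractor.parts)
    return raw  # anything else is treated as plain text in this sketch


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinySearchEngine:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.docs = {}                    # doc id -> metadata used for ranking

    def push(self, doc_id, title, raw, fmt, modified):
        """Push model: the source hands over one document, indexed immediately."""
        text = preprocess(raw, fmt)
        self.docs[doc_id] = {"title": tokenize(title), "modified": modified}
        for term in tokenize(title) + tokenize(text):
            self.postings[term].add(doc_id)

    def pull(self, source):
        """Pull model: a connector walks a source in batch (e.g. overnight)."""
        for doc in source:
            self.push(**doc)

    def search(self, query, today, half_life_days=365.0):
        terms = tokenize(query)
        if not terms:
            return []
        # Query processing: keep the documents that contain every query term.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))

        # Ranking (assumed formula): each query term found in the title adds 1
        # to the base score of 1; the score halves every half_life_days of age.
        def score(doc_id):
            meta = self.docs[doc_id]
            title_hits = sum(t in meta["title"] for t in terms)
            age_days = (today - meta["modified"]).days
            return (1 + title_hits) * 0.5 ** (age_days / half_life_days)

        return sorted(hits, key=score, reverse=True)


engine = TinySearchEngine()
engine.push("a", "Travel policy", "<p>Expenses for travel</p>", "html", date(2024, 1, 1))
engine.pull([
    {"doc_id": "b", "title": "Minutes", "raw": "travel budget discussed",
     "fmt": "text", "modified": date(2024, 1, 1)},
    {"doc_id": "c", "title": "Old minutes", "raw": "travel budget",
     "fmt": "text", "modified": date(2020, 1, 1)},
])
print(engine.search("travel", today=date(2024, 6, 1)))  # ['a', 'b', 'c']
```

Document "a" ranks first because the query word appears in its title, and "c" ranks last because it is more than four years older than "b". A real engine would add format-specific filters for PDF and Office documents and a far richer scoring model.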

Security and authorizations

Security and authorizations to access documents in the result set can be handled in 2 ways by the Search Engine:

  • with early binding
  • with late binding

Early binding means that authorization and access levels are checked at the indexing stage, i.e. before the index is built. This means that only the documents to which the user (who is indexing) has access are indexed. This results in faster processing at query time but may be less accurate if the access levels were changed in the source system after the indexing phase. It also means that one index per user or per group of users needs to be created.

Late binding means that authorization and access levels are checked at query stage, meaning that first the full result set of the query is computed and after that the user’s rights on every document in the result set are checked. This method slows down the processing of the query but ensures up-to-date security.
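
The trade-off between the two approaches can be sketched in a few lines of Python. This is a toy illustration, not a real product's security model: the documents, the groups, the `ACL` dictionary and all function names are hypothetical, and the "index" is a simple term-to-document mapping.

```python
# Hypothetical data: document texts and the groups allowed to read each one.
DOCS = {
    "handbook": "holiday policy for all staff",
    "payroll": "salary policy and holiday pay",
}
ACL = {  # stands in for the access rights held by the source system
    "handbook": {"staff", "hr"},
    "payroll": {"hr"},
}


def build_index(doc_ids):
    """Map each term to the ids of the given documents that contain it."""
    index = {}
    for doc_id in doc_ids:
        for term in DOCS[doc_id].split():
            index.setdefault(term, set()).add(doc_id)
    return index


def build_early_bound_indexes(acl):
    """Early binding: check access at indexing time, one index per group."""
    groups = set().union(*acl.values())
    return {
        group: build_index([d for d, allowed in acl.items() if group in allowed])
        for group in groups
    }


def search_early(indexes, group, term):
    """No access check at query time; the group's index is already filtered."""
    return sorted(indexes[group].get(term, set()))


def search_late(full_index, acl, group, term):
    """Late binding: compute the full result set, then check the current ACL."""
    hits = full_index.get(term, set())
    return sorted(d for d in hits if group in acl[d])


early = build_early_bound_indexes(ACL)  # snapshot of the rights at indexing time
full = build_index(DOCS)                # one shared index over all documents

# After indexing, the source system restricts the handbook to HR only.
ACL["handbook"] = {"hr"}

print(search_early(early, "staff", "holiday"))      # ['handbook'] (stale access)
print(search_late(full, ACL, "staff", "holiday"))   # [] (reflects the change)
```

The early-bound search still returns the handbook to staff until the indexes are rebuilt, while the late-bound search pays for one access check per hit on every query.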

If you want to stay informed about this Enterprise Search Engine Insights series, just subscribe to our Newsletter, and we will keep you posted.

Request a demo

Schedule a custom demo. We will show you a customized version of the application, fitted to your specific needs, at your convenience.

Or, download our White Paper.

