This blogpost on Enterprise Search is for Knowledge Managers and others responsible for making information more accessible in their organizations. In this article we list the most common types of search and explain the differences between them. After reading this blogpost you will have the right information to make an informed decision on which Enterprise Search Engine to choose in different contexts.
For the enterprise to gain the greatest leverage from its information assets, knowledge workers must be able to share and reuse the appropriate data in a timely way. Perhaps the biggest threat lurking in the dark, very tangible yet hard to quantify, is the opportunity cost to the organization: not shown as a cash outflow, but implied by not allocating resources to the best alternative because of a knowledge deficit.
The inability to locate and retrieve mission-critical enterprise information because of inadequate search should create a clear economic incentive. However, while the cost of not finding information is well-documented, it is hidden within the enterprise and therefore rarely perceived as having an impact on the bottom line.
- Wasted Time – Time wasted by knowledge workers through inefficient searching for information
- Wasted resources – Unused or underused valuable knowledge resources (often paid subscriptions); moreover, content loses its informative value over time
- Work duplication – Duplicating work because (1) the original work cannot be found, or (2) no one knows that the work already exists
- Knowledge Drain – Losing information insight caused by employee turnover (the collective memory simply vanishes)
- Opportunity Costs – Profit wasted or cost incurred by making decisions on incorrect, incomplete or outdated information
The status quo is changing. Driven by the sheer size of the ongoing data explosion and the scarcity of qualified data scientists, organizations acknowledge that the time wasted on searching must be reduced. But above all, the insight is growing that the inefficiencies caused by unnecessary intellectual rewrites, substandard performance and the inability to detect essential sources of knowledge can no longer be tolerated.
Consequently, a powerful search solution is becoming an increasingly important element of the digital transformation. ‘Digital Labor’ is one of the four pillars of digital transformation, and finding information easily in the collective memory of the enterprise is essential to it. In this blogpost we give an overview of current search mechanisms with their advantages and disadvantages.
Enterprise search refers to the software tools for making content from multiple enterprise-type sources, such as databases and intranets, searchable to users who are looking for information in these sources.
Enterprise search can be contrasted with web search and desktop search:
- Web search applies search technology to documents on the open web.
- Desktop search applies search technology to the content on a single computer.
Enterprise search systems index data and documents from a variety of sources such as:
- file systems
- document management systems
Enterprise search systems may process only structured data, only unstructured information, or both.
In most Enterprise Search systems, the processing of the information goes through several steps:
- Content submission: the mechanism that makes the information that needs to become searchable available to the Enterprise Search system; this exists in 2 models: push and pull. In the push model, a source system pushes new content directly to the Search Engine (e.g. through an API). This model is used when real-time indexing is important. In the pull model, the Search Engine itself gathers content from the sources using a connector.
- Pre-processing: the content from the different sources may have many different formats or document types, such as PDF, XML, HTML, Office document formats or plain text. The pre-processing phase converts the incoming documents to plain text using document filters.
- Indexing: during the indexing phase, the Search Engine analyses the documents and converts them to a structure that is searchable. The way in which these indexes are structured varies widely depending on the type of engine. After this processing phase the content becomes searchable.
- Query Processing: using a specific user interface or an API, queries are submitted to the Search Engine. The engine uses the submitted words or phrase to query the index and return a result set.
- Ranking: Ranking is a post-processing step during which the search engine determines the sequence in which the search results should be presented to the user. Ranking can be based on several parameters like the location in the document where query words are found (e.g. in the title or file name), age of the document (more recent documents might be promoted) or other metadata, trending topics in the content management system, etc.
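The indexing, query-processing and ranking steps above can be sketched with a toy inverted index. This is a minimal Python illustration of the principle, not any particular engine's implementation; the document IDs and texts are invented:

```python
from collections import defaultdict

# Toy corpus; document IDs and texts are invented for illustration.
docs = {
    "D1": "The quick brown fox jumps over the lazy dog",
    "D2": "The orange rolled down the stairs",
}

# Indexing: build an inverted index mapping each word to the set of
# documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def query(words):
    """Query processing and ranking: look up every query word in the
    index and rank documents by how many query words they match."""
    hits = defaultdict(int)
    for w in words.lower().split():
        for doc_id in index[w]:
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(query("lazy dog"))  # only D1 contains both words
```

Real engines add many refinements on top of this skeleton (stemming, field weights, recency boosts), but the basic index-then-rank flow is the same.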
Security and authorizations to access documents in the result set can be handled in 2 ways by the Search Engine: with early binding or late binding:
- Early binding means that authorization and access levels are checked at indexing stage. This results in faster processing at query time but may be less accurate if the access levels were changed in the source system after the indexing phase.
- Late binding means that authorization and access levels are checked at query stage, meaning that first the result set of the query is computed and after that the user’s rights on every document in the result set is checked. This method slows down the processing of the query but ensures up-to-date security.
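The late-binding variant can be sketched in a few lines: the result set is computed first, then each document is checked against the user's current rights. The ACL data and user names below are invented for illustration:

```python
# document -> set of users allowed to read it (invented example data)
acl = {"D1": {"alice", "bob"}, "D2": {"alice"}}

def filter_results(result_set, user):
    """Late binding: keep only the documents this user may see,
    checked against the up-to-date ACL at query time."""
    return [doc for doc in result_set if user in acl.get(doc, set())]

print(filter_results(["D1", "D2"], "bob"))    # bob may only see D1
print(filter_results(["D1", "D2"], "alice"))  # alice sees both
```

With early binding, by contrast, the allowed users would be stored alongside each document at indexing time; that check is faster at query time, but stale rights can linger until the next re-index.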
Traditional Search Algorithms
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval. In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
In the query processing phase, bag-of-words Search Engines look for the exact query words in the documents in your corpus. They assume that the sequence in which words appear has no importance. They consider an OR-relationship between the words you entered, whereas you probably intended to look for content in which all the entered words appear together.
You expect that the sequence in which words are entered is relevant and determines the outcome of the search. The standard search mechanisms that we all know and use every day assume that you know all relevant words in the documents you are searching in. When we don’t find what we are looking for, we try different words, synonyms or related concepts. Experienced search users will also try combinations of words with ANDs and ORs. But most users don’t even know that feature exists.
How does it work – Example
Query: “orange fox”
Corpus: document D1 in the corpus contains a sentence “The quick brown fox jumps over the lazy dog.”, document D2 in the corpus contains the sentence “The orange rolled down the stairs.”
Outcome: the search engine returns document D1 in the result set because it found a match between “fox” in the query and “fox” in the document, but also document D2 because of the word “orange”.
Shortcomings: the user was probably looking for “orange fox” as a whole, and therefore neither D1 nor D2 is a correct result.
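The OR-behaviour shown in this example can be reproduced with a few lines of Python. This is a sketch of the bag-of-words principle only, not of any specific product:

```python
corpus = {
    "D1": "The quick brown fox jumps over the lazy dog.",
    "D2": "The orange rolled down the stairs.",
}

def bag_of_words(text):
    """Lowercased set of words; grammar and word order are discarded."""
    return {w.strip(".").lower() for w in text.split()}

def search(query_text):
    """Return every document sharing at least one word with the query."""
    q = bag_of_words(query_text)
    return [d for d, text in corpus.items() if q & bag_of_words(text)]

print(search("orange fox"))  # both D1 and D2 match, although neither
                             # contains the phrase "orange fox"
```

Because a single shared word is enough for a match, the result set grows with the corpus while precision drops; this is exactly the shortcoming described above.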
Document Tagging is a technique to enhance documents with metadata. Tags are keywords you assign to files. Document Tagging was introduced to overcome the limitations of structuring documents in folders. A document can be stored in 1 folder only, whereas it can have many different tags simultaneously.
Later the technique was also adopted by some Enterprise Search mechanisms to increase the quality of the search results: tags help to determine how relevant a document is to the submitted query. Document search results become consistent and relevant, saving time and generating useful business knowledge.
But here too the exact match between the words used in the queries and the tags used on the documents is important to produce meaningful results.
How does it work – Example
Preparation: at some point in the past the organization decided to structure its information; in order to do that, a list of tags was set up to reflect the content of the documents; consequently every document in the organization’s content management systems was tagged with 1 or more of these tags
Query: “forest tag: wildlife”
Corpus: document D in the corpus contains a sentence “The quick brown fox jumps over the lazy dog.” AND was tagged with “wildlife” and “pangram”
Outcome: the search engine returns document D in the result set because it found a match between the tag “wildlife” in the query and the tag on the document; notice that the search word “forest” does not appear in the document itself
Shortcomings: it is not easy to set up consistent tag lists in an organization and it takes a lot of discipline and time to tag all documents correctly.
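Tag-augmented matching can be sketched as follows: a query term may hit either the document text or its tags. The document ID and tags follow the example above; the code itself is an invented illustration:

```python
docs = {"D": "The quick brown fox jumps over the lazy dog."}
tags = {"D": {"wildlife", "pangram"}}  # tags assigned by the organization

def search(words, query_tags):
    """A document matches if a query word appears in its text OR one of
    the requested tags was assigned to it."""
    results = []
    for doc_id, text in docs.items():
        text_words = {w.strip(".").lower() for w in text.split()}
        if (set(words) & text_words) or (set(query_tags) & tags[doc_id]):
            results.append(doc_id)
    return results

print(search(["forest"], ["wildlife"]))  # 'forest' is absent from the
                                         # text, but the tag matches
```

Note that the match between query tag and document tag is still exact string matching, which is why consistent tag lists matter so much.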
Taxonomy is a practice in which objects are arranged and classified to provide order.
In the context of Document Tagging, Taxonomies are used to ensure that documents are tagged (classified) in a meaningful way when added or checked into the file system or content management software with the objective of retrieving the documents more easily. Building and managing one or more Taxonomies is considered essential in the process of managing knowledge and making information more accessible. Enterprise Taxonomies are hierarchical classifications of entities of interest to the enterprise, organization or administration. Setting up Taxonomies requires specific skills and methods to avoid clutter.
Using Taxonomies in Enterprise Search helps to overcome the limitations of exact word matching: using the structure in a Taxonomy, a search engine can also look for related words when executing the query.
How does it work – Example
Preparation: at some point in the past the organization decided to structure the terms relevant to the business domain in structured lists in which every branch contains related words (synonyms or other); e.g.
Laziness: Passive, Lazy, Lethargic, Sleepy, Sloth, Potato
Dog: Dog, Puppy, Bitch, Stud dog, Canine, Lassie, Pavlov, Jofi the Chow Chow
Query: “Lethargic canine”
Corpus: document D in the corpus contains a sentence “The quick brown fox jumps over the lazy dog.”
Outcome: with the help of the taxonomy, the search engine finds the equivalence between “lethargic” and “Lazy” in the “Laziness” branch of the taxonomy, and the equivalence between “Canine” and “Dog” in the “Dog” branch of the taxonomy; consequently the search engine decides that “lazy dog” in the corpus corresponds to “lethargic canine” and returns the document in the result set
Shortcomings: it is not easy to set up consistent taxonomies in an organization.
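The taxonomy lookup in this example amounts to query expansion: each query word is replaced by its whole taxonomy branch before matching. A minimal Python sketch, using a simplified version of the branches above:

```python
# Simplified taxonomy branches from the example above.
taxonomy = {
    "laziness": {"passive", "lazy", "lethargic", "sleepy", "sloth"},
    "dog": {"dog", "puppy", "canine", "stud dog"},
}

def expand(word):
    """Return the word plus all terms in any branch containing it."""
    related = {word}
    for branch in taxonomy.values():
        if word in branch:
            related |= branch
    return related

def search(query_text, corpus):
    """Match documents against the expanded query vocabulary."""
    expanded = set()
    for w in query_text.lower().split():
        expanded |= expand(w)
    return [d for d, text in corpus.items()
            if expanded & {w.strip(".").lower() for w in text.split()}]

corpus = {"D": "The quick brown fox jumps over the lazy dog."}
print(search("lethargic canine", corpus))  # matches via 'lazy' and 'dog'
```

The engine never needs to "understand" the words: it only needs the curated branches, which is both the strength and the maintenance burden of the approach.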
An ontology, sometimes called a ‘knowledge graph’, is a controlled vocabulary used to describe a specialized field. It basically encodes the insight of an expert who records the features deemed necessary to work with, say, papers covering investment banking, and who details the different relationships between the attributes associated with each term. Setting up an ontology or a taxonomy takes anything between a few weeks and many months, depending on the business case. Afterwards, constant maintenance is needed to keep it from becoming obsolete. Its success crucially depends on the willingness and the discipline inside the company to use the scheme consistently when adding new documents to the system. In the day-to-day reality of an organization, this assumption proves too bold, time and again.
How does it work – Example
Preparation: in many cases ontologies are not created by individual organizations but are either publicly available or can be purchased if a specific business domain needs to be covered. In both cases tweaking will be necessary to match the needs of the enterprise.
In the example, the ontology defines relations like ‘lives in’, ‘behaves like’, ‘parent-child’, etc.
Query: “Playful kit”
Corpus: document D1 in the corpus contains a sentence “The quick brown fox jumps over the lazy dog.”
Outcome: with the help of the ontology, the search engine finds the equivalence between “kit” and “fox” through the parent-child relationship, and the equivalence between “playful” and “jumping” through the “behaves like” relationship; consequently the search engine decides that “fox jumps” in the corpus corresponds to “playful kit” in the query and returns the document in the result set
Shortcomings: it is hard to set up a complete ontology for each knowledge domain, or to find one in the public domain that covers the knowledge domains of your organization.
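The difference from a plain taxonomy is that ontology relations are typed. A toy sketch of the lookup in the example above; the relation names follow the example, the entries and code are invented for illustration:

```python
# (term, relation) -> related term; a real ontology would be a full graph.
ontology = {
    ("kit", "parent-child"): "fox",        # a kit is a young fox
    ("playful", "behaves like"): "jumps",
}

def related_terms(word):
    """A word plus everything it maps to through any relation."""
    terms = {word}
    for (term, _relation), target in ontology.items():
        if term == word:
            terms.add(target)
    return terms

query = "playful kit"
expanded = set()
for w in query.split():
    expanded |= related_terms(w)

doc_words = set("The quick brown fox jumps over the lazy dog".lower().split())
print(sorted(expanded & doc_words))  # → ['fox', 'jumps']
```

Because the relations are typed, a real engine can also reason about *how* two terms are related (is-a, behaves-like, lives-in), not merely that they are related.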
What do these methods have in common?
All tag lists, taxonomies and ontologies are constructed by experts who select today the features and concepts they expect to be needed in the future. This means that these mechanisms require continuous maintenance to ensure that they reflect up-to-date knowledge of the subjects. But not only the lists, taxonomies and ontologies themselves need to be maintained: the organization must also make sure that all the information and documents in the company’s content management systems are tagged and classified correctly and are kept up-to-date.
Another problem with the bag-of-words techniques is that traditional search assumes that one knows how to phrase a question. What if a knowledge worker is unfamiliar with the data? What if more is needed than a simple answer to a straightforward keyword question, and the user needs to explore different documents in different sets of data, using vastly varying terminology? You may not notice it anymore because you are so used to it, but before formulating a query we all try to imagine the terminology a database uses to tag the information we need. And as a result, we often spend too much time searching around before we discover the page we are looking for.
Imagine a city that makes all its communal decrees available to the public, to allow citizens to query them and learn what to do and what not to do. Imagine that a citizen would like to know if he can use his drone in the city area. Which word(s) should that citizen use in his search query? The word ‘drone’, right? Wrong: the administrative language used by the city’s administration almost invariably talks (and writes) about ‘remotely operated independent flying objects’. That is quite different from ‘drone’. As a result, the traditional search engine will not return the relevant documents, and the citizen will have a problem finding an answer to his question.
(If you’re interested to know how this can be solved: keep on reading.)
Some Search Engines deliver personalized search results. The results one user sees are different from what other users see, even when they search for the same words; e.g. users see results based on their previous activity or on trending content in the content management system. Users should be aware that they may receive only information and opinions that conform to and reinforce their own beliefs: they might sit in a ‘filter bubble’.
Players in the Traditional Enterprise Search Market
In the course of the past decade we observed many moves in the Enterprise Content market when it comes to Search. In 2011 Hewlett-Packard acquired Autonomy (which had previously bought Verity and Interwoven); Oracle acquired Endeca. However, the acquiring firms didn’t invest much in these technologies to bring them to the next level. While this squeeze was underway, other vendors like Attivio basically abandoned their original enterprise search approach. The merger of Hewlett Packard Enterprise’s software business with Micro Focus, announced in 2016, allowed HP to offload the remaining bits of its unfortunate purchase of Autonomy.
Google stopped supporting the Google Search Appliance in the first half of 2019.
Every data store comes with its own search engine. Over the past years SOLR/Lucene has gained the upper hand as the main third-party tool in enterprise search. SOLR/Lucene is a flexible open-source library used by or built into many well-known applications such as Facebook, Foursquare, GitHub, Mozilla, Netflix and Quora. Elasticsearch, another full-featured search suite built on Lucene, is available under the terms of an Apache License. LucidWorks is a well-known enterprise search technology company, specifically marketing the SOLR/Lucene solution. Although Lucene is certainly best of breed in its category, it still relies on the bag-of-words paradigm and carries the inherent limitations of that technology. Function words (stop words) are usually removed. It basically applies a statistical toolbox to an inverted index of the most informative words. To increase search performance, support from external, manually created taxonomies or ontologies is required.
Microsoft acquired FAST in 2008 to support its development of performant search and retrieval solutions for SharePoint. Metadata is a key cog in the FAST search engine: administrators need to develop pertinent metadata and ensure that every user consistently labels the content in the approved manner to get the full benefit of the search function. In 2015 Microsoft added Equivio, a text analytics service for legal and compliance, to its portfolio. In the same year Microsoft released Delve, now rebranded to “Intelligent Search & Discovery”. Although Microsoft continues to invest, large numbers of users are disappointed with the results, according to recent surveys (2017).
The 2017 Forrester Research/ARMA International Survey explains: “It is simply not realistic to expect broad sets of employees to navigate extensive classification options while referring to a records schedule that may weigh in at more than 100 pages.” Unfortunately, it is only through the consistent use of tags that SharePoint search or similar engines hope to improve quality. The concerns with tags and taxonomies are reflected in Gartner’s Hype Cycle of 2017. ‘Enterprise Taxonomy and Ontology Management’ is positioned in the dreaded Trough of Disillusionment. (In the 2018 version both Taxonomies and Ontologies disappeared altogether from the chart.)
Intelligent Search by Alexandria.Works
Text as a network
Alexandria.Works (AW) is a completely different type of search engine.
Alexandria.Works is a Natural Language Processing (NLP) information discovery tool. It is based on Eric Van Horenbeeck’s PhD in computational linguistics and was re-written by Tom Pauwaert, MSc and Robbe Block, PhD to resolve the technical challenges of translating a complex network and its interactions into a performant, scalable architecture.
Alexandria.Works sees text as a complex network, in contrast to the mainstream bag-of-words methods. All mainstream solutions rely on taxonomies, tags, training sets or database schemas to improve the simple word-matching technique. Alexandria.Works doesn’t cut away anything: it discovers the semantic relations in unstructured text at indexing time. Subjects of discourse (topics) spanning several documents are constructed at the same moment. No schema is imposed on the data. The user’s current information need determines what is relevant; the query activates the appropriate relations. That’s why Alexandria.Works has no need for schemas, rules or tags and is language independent.
This results in a completely different user experience. No need to guess the search words. Simply type your question. In all probability, this will be close to the data you’re looking for. Why is this so?
Language is not a game where users generate random tokens when they want to convey a meaningful message. If you looked from a distance at a network of, say, a million newspaper articles, you would observe several dense areas. Zooming in on one of them, you would guess that it’s about foreign policy, even if you don’t recognize the persons mentioned there. Moving to another area, you would notice it to be about sports, again without knowing what an umpire is. If you were looking for a pancake recipe, you would leave this area quickly, scanning for a place where food is the main topic. That is roughly how Alexandria.Works works.
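The general ‘text as a network’ idea can be illustrated with a toy word co-occurrence graph. To be clear, this is NOT Alexandria.Works’ actual algorithm, which is far more sophisticated; the sentences and code are invented purely to show how dense neighbourhoods in such a network hint at topics:

```python
from collections import defaultdict
from itertools import combinations

# Invented mini-corpus: two 'animal' sentences and one 'food' sentence.
sentences = [
    "fox jumps over dog",
    "dog chases fox",
    "flour eggs milk pancake",
]

# Link every pair of words that co-occur in a sentence.
graph = defaultdict(set)
for s in sentences:
    for a, b in combinations(s.split(), 2):
        graph[a].add(b)
        graph[b].add(a)

# Starting from a query word, its neighbourhood suggests the topic area.
print(sorted(graph["pancake"]))  # → ['eggs', 'flour', 'milk']
```

Even in this tiny graph, ‘pancake’ lands in a different dense area than ‘fox’, which is the intuition behind navigating a text network rather than matching bags of words.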
So, starting with the information contained in the query, Alexandria.Works lands somewhere in the network. From there, the software scouts the surrounding area blazingly fast for topics correlating with the user’s query, collecting along the way the information it needs to assemble a result set. At the end of this process the user obtains topics and documents relevant to the query. In the case of the city council decisions mentioned above, Alexandria.Works did find the relevant piece, even though it didn’t contain the word ‘drone’.
On top of that, Alexandria.Works also brings you to the pertinent sections in the text. Imagine that your search leads to one or more 100-page documents. Some standard search tools expect you to comb through each document for the relevant parts. Not with Alexandria.Works: we highlight the proper paragraphs.
Alexandria.Works has proven to make a big difference in the following use cases:
- Citizen Chat Bot: in the case of making the communal decisions accessible to citizens through a chatbot, a comparative test showed that Alexandria.Works outperformed a traditional taxonomy-based search engine with an accuracy of 94% versus 83%: three times fewer faulty answers
- Customer Service: a professional services organization compared Alexandria.Works with a handcrafted, in-house-built search tool; Alexandria.Works performed 50% better
- European Council: policy officers successfully tested Alexandria.Works to increase their efficiency when writing advice and reports
- Maintenance Management: Maintenance departments suffer severely from the brain drain caused by the Baby Boomer Generation leaving the company; we see it as collective memory loss; Alexandria.Works improves the collective memory of Maintenance Departments by making all unstructured maintenance information accessible to younger generations
- Knowledge Management: in general, Alexandria.Works helps organizations to start harvesting the knowledge that is already present in the vast set of documents in the content management systems without having to invest in structuring that information
Sounds too good to be true? The Proof of the Pudding is in the Eating
The best way to understand the power of Alexandria.Works is to see it with your own eyes, optionally using your own set of documents. You can do exactly that by scheduling a demo. Schedule your demo here!
You can also order a Proof-of-Concept, an extended tryout of Alexandria.Works, based on a larger set of your documents and executed by a project team of testers. Order your Proof-of-Concept now!