What is difficult about Intranet Search?
At one level, search may seem very simple – you give users a search box; look through the content on your intranet for matching items; and then present these as a list. Not only that, you can buy off-the-shelf search engine technology from various providers. Surely we can just buy the Google search service, point it at our intranet and the job is done?
In reality almost every aspect of the process is difficult, and in some ways creating a good intranet search is harder than creating a search on the internet.
Content Volume & Quality
On the internet, there is an abundance of content; much of it is in a readily accessible form (primarily HTML); content is linked together in a web of connections that indicates relevance; and a good proportion of the content is kept up-to-date and carefully looked after (companies are under competitive pressure to ensure their online presence is of good quality). The primary challenges on the internet are dealing with the sheer volume of material and indexing new material quickly.
In contrast, an intranet is usually relatively small, meaning there is not a wealth of content to play with. Content is often in an unhelpful format – particularly documents such as PDF, Word and Excel files. Content is often poorly linked together, if at all, leaving few clues as to the genuine significance of any of it. Finally, content is often out-of-date and very poorly managed: unlike their external content, the content that companies present to employees is under little pressure to be well maintained.
Socially-generated intranet content is often better, since it is more likely to be in HTML; is more likely to be interlinked; there is more of it; and it can be of higher quality – directly addressing employees’ real concerns.
However, even here there is an Achilles heel: whilst it is easy for employees to create content, there is typically no incentive (and no one feels sufficient ownership) to ever delete content. This means that socially-generated content is likely to fall in quality over time as the system gradually gets weighed down by out-of-date content.
These factors mean that intranet content is challenging as a base for building search.
Building an Index
In order to search the content on an intranet, you first need to create an ‘index’ – a database that contains details of all the content and is optimised for searching. When a search is run, software looks through this index to find matches, rather than looking through the intranet itself. However, building this index is challenging.
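As a rough illustration of what such an index looks like, the sketch below builds a tiny ‘inverted index’ in Python: a mapping from each word to the set of pages that contain it. The page ids and text are invented for the example, and a real search engine stores far more than this (word positions, field weights, metadata), so treat it as a minimal sketch rather than how any particular product works.

```python
import re
from collections import defaultdict

# Hypothetical intranet pages, invented for this example.
PAGES = {
    "hr/holiday-policy": "Holiday policy and how to book annual leave",
    "it/vpn-setup": "How to set up the VPN on a laptop",
    "news/2019-office-move": "Office move: what the new policy means for staff",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(PAGES)
print(sorted(search(index, "policy")))
print(sorted(search(index, "how to")))
```

The search never touches the pages themselves, only the index, which is why the index has to be rebuilt or updated whenever the content changes.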
Some of your intranet may be accessed via a crawler (a piece of software that follows links and automatically collects information about each page that it finds), but not all content will be reliably found like this.
Content like a directory of employees, or historical news articles, is probably best sourced directly from the database that it is held in, as there may be no links leading to the pages that they are displayed on. Content from a Wiki, or a collaboration platform may be best sourced from the underlying content management service, as this may have more information about the content than is contained in the HTML for each page.
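To see why a crawler alone is not enough, consider the following sketch. It runs a breadth-first crawl over a small, made-up link graph (the page names and links are assumptions for illustration only) and reports the pages that exist but are never reached because nothing links to them.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["news", "hr"],
    "news": ["news/latest"],
    "hr": ["home"],
    "news/latest": [],
    "news/archive-2015": ["home"],  # no page links to this one
    "directory/john-smith": [],     # only reachable through a lookup form
}

def crawl(links, start):
    """Breadth-first crawl: follow links from start and return pages found."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

found = crawl(LINKS, "home")
missed = sorted(set(LINKS) - found)
print(missed)
```

In this example the archived news story and the directory entry are both missed, which is the kind of content that would need to be pulled from the underlying database instead.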
Connecting to these different sources is complex and time-consuming; care must be taken to ensure that the data is refreshed regularly so that it does not become out of date; and server space must be found to host the index.
Searching and Ranking
Assuming that you can create a single index containing all of these disparate sources, you then need to be able to search through that content for matches and rank these so as to show the best content at the top of the results.
In theory, finding matches is easy enough, and is largely solved by using an off-the-shelf search provider. Results are returned if they contain all of the words typed in by the user, with some potential complications in the shape of stemming (looking for plurals and other variants of the search terms); and synonyms or spelling variants (e.g. looking for ‘organize’ as well as ‘organise’).
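The matching rule described above can be sketched in a few lines of Python. The stemmer here is deliberately crude (it only strips a plural ‘s’) and the variant table is a hypothetical two-entry example; off-the-shelf engines use far more thorough language analysis, so this is only meant to show the shape of the logic.

```python
import re

# Hypothetical variant table; real engines ship or let you configure these.
VARIANTS = {"organise": "organize", "organisation": "organization"}

def stem(word):
    """Crude stemmer for illustration: strip a plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(text):
    """Lowercase, map spelling variants to one form, then stem each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(VARIANTS.get(word, word)) for word in words}

def matches(query, document):
    """True if every normalized query term appears in the document."""
    terms = normalize(query)
    return bool(terms) and terms <= normalize(document)

print(matches("organise meetings", "How to organize a meeting"))
print(matches("organise meetings", "Meeting room booking"))
```

The first call matches despite the different spelling and the plural; the second does not, because only one of the two query terms is present.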
In contrast, ranking is very difficult. Whilst off-the-shelf providers will contain their own ranking solutions, there are several major challenges.
First, ranking typically needs to be configured – there are multiple options for how it can be set up and what values to use for various parameters. It is not at all clear what the best settings may be, or how to derive them.
Second, ranking is not ‘magic’ – it uses some fairly basic rules to help determine which documents are the most important. For example, items will be boosted if they contain the search terms multiple times, rather than just once (frequency); and they will rank higher if the search terms appear near to each other in the text (distance). Statistics such as the number of visitors that a page receives could also be used, if they are available. Famously, Google was founded on the concept of ‘PageRank’ – that a page was important if it was linked to by other pages, and that it was even more important if those linking pages were themselves high in incoming links.
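A minimal sketch of the frequency and distance rules is shown below. The scoring formula and the `proximity_weight` parameter are assumptions made up for this example, not the formula used by any particular search product; the point is that even a toy ranker has a tunable value whose ‘right’ setting is not obvious.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def frequency(tokens, terms):
    """Total number of times any query term occurs in the document."""
    return sum(tokens.count(term) for term in terms)

def min_span(tokens, terms):
    """Length in words of the shortest window containing every term.

    Returns None if any term is missing from the document.
    """
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in terms]
    best = None
    for start, (first, _) in enumerate(positions):
        seen = set()
        for pos, tok in positions[start:]:
            seen.add(tok)
            if len(seen) == len(terms):
                span = pos - first + 1
                if best is None or span < best:
                    best = span
                break
    return best

def score(text, query, proximity_weight=2.0):
    """Hypothetical score: term frequency plus a bonus for closeness."""
    tokens = tokenize(text)
    terms = set(tokenize(query))
    span = min_span(tokens, terms)
    if span is None:
        return 0.0
    return frequency(tokens, terms) + proximity_weight * len(terms) / span

doc_a = "Expenses policy: submit expenses claims within 30 days"
doc_b = "Our travel policy is long. See finance for expenses"
print(score(doc_a, "expenses policy"))
print(score(doc_b, "expenses policy"))
```

Here the first document scores higher because the terms occur more often and sit next to each other. Changing `proximity_weight` changes how much closeness matters relative to frequency, and nothing in the data says what the value ought to be.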
However, these metrics can prove challenging on an intranet – particularly for some content types. For example, intranet pages are often not well linked together, undermining the value of PageRank, while content like employee details is not amenable to metrics like word frequency and distance – a listing for ‘John Smith’ is unlikely to contain the words twice and it won’t mean anything if it does.
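The effect of poor linking on a link-based metric can be illustrated with a simplified PageRank calculation. This is a textbook-style power iteration with an assumed damping factor of 0.85, not Google’s production algorithm, and both link graphs are invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every
    link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for other in pages:
                    new[other] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical graphs: one well linked, one with no links at all.
WELL_LINKED = {
    "home": ["policy"],
    "news": ["policy", "home"],
    "blog": ["policy"],
    "policy": ["home"],
}
UNLINKED = {"home": [], "news": [], "blog": [], "policy": []}

linked_rank = pagerank(WELL_LINKED)
unlinked_rank = pagerank(UNLINKED)
print(max(linked_rank, key=linked_rank.get))
print(unlinked_rank)
```

In the well-linked graph the heavily referenced page stands out. In the unlinked graph every page ends up with the same score, so the metric says nothing about which page matters.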
The third major challenge is that it is extremely difficult (if not impossible) to compare realistically the rank of different types of content. For example, if someone searches for a person’s name, how can you possibly determine whether the best result is the directory entry for that person, a news story about them, or a blog published by them? It depends on what the user is looking for, and they haven’t specified that in the search term.
Note that this isn’t a problem of a lack of capability in the search engine, or of poor quality data in the index. The problem is that the query simply doesn’t contain the information needed to make a determination. We could make a judgement call in tweaking the ranking – perhaps deciding that searching for contact details is one of the main use cases for the search, and therefore always ranking directory results higher. However, what if there are 50 directory results for that name? They will block out all other result types.
In practice, creating connections to all of the possible intranet data sources; determining an effective update schedule; and maintaining an index may be prohibitively complex or expensive – especially given that intranet searches are often poorly funded.
A cheaper option is to avoid creating a central index and develop a federated search instead. In this approach a central service receives the user’s query and then distributes it to existing search capabilities in each of the content areas. The results from these services are then combined together to create a single results page for the user.
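The federated approach can be sketched as follows. The three backend functions are hypothetical stand-ins for the existing search capabilities in each content area (in practice these would be calls to separate services), and the round-robin interleaving used to combine the lists is just one simple assumption about how results might be merged.

```python
from itertools import zip_longest

# Hypothetical stand-ins for the existing per-area search services.
def search_directory(query):
    people = ["John Smith (Finance)", "Joan Smith (Legal)"]
    return [p for p in people if query.lower() in p.lower()]

def search_news(query):
    stories = ["John Smith joins the board", "Canteen menu update"]
    return [s for s in stories if query.lower() in s.lower()]

def search_blogs(query):
    posts = ["Budget tips, by John Smith"]
    return [p for p in posts if query.lower() in p.lower()]

BACKENDS = {
    "directory": search_directory,
    "news": search_news,
    "blogs": search_blogs,
}

def federated_search(query, backends):
    """Send the query to every backend and interleave the result lists."""
    per_source = []
    for name, backend in backends.items():
        try:
            hits = backend(query)
        except Exception:
            hits = []  # one failing service should not break the page
        per_source.append([(name, hit) for hit in hits])
    merged = []
    # Take the first hit from each source, then the second, and so on.
    for row in zip_longest(*per_source):
        merged.extend(item for item in row if item is not None)
    return merged

for source, title in federated_search("smith", BACKENDS):
    print(source, "|", title)
```

Note that the central service never sees the underlying content: it only sees whatever each backend chooses to return, in whatever order that backend chose.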
This approach has the advantage of significantly reducing cost and complexity, but introduces additional challenges. The quality of the overall result is now dependent upon the behaviour of a number of independent search capabilities, which probably use different technologies, and vary in their quality and performance.
Rather than having access to the underlying content and determining the final ranking from that, we are now in the position of receiving only a limited set of information from each search service and having to make a determination based on that. There is no reliable way to decide what the scores from each service really mean or how they can be compared. This makes it even less likely that we can effectively combine these various sources into a single satisfying result.
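The score problem is easy to demonstrate. In the sketch below, two hypothetical services return scores on unrelated scales (the titles and numbers are invented). Sorting by raw score lets one service dominate; rescaling each list to the range 0 to 1 (min-max normalisation, one common heuristic) lines the scales up, but it still says nothing about whether the top result of one service is genuinely as good as the top result of another.

```python
# Hypothetical results: (title, score) as reported by each service.
DIRECTORY_HITS = [("John Smith (Finance)", 0.92), ("John Smithson (IT)", 0.41)]
NEWS_HITS = [("John Smith joins the board", 17.3), ("Smith Street office closes", 3.2)]

def merge_by_score(*result_lists):
    """Combine result lists and sort by score, highest first."""
    combined = [hit for hits in result_lists for hit in hits]
    return sorted(combined, key=lambda hit: hit[1], reverse=True)

def min_max_normalize(hits):
    """Rescale one service's scores to the range 0..1."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    if high == low:
        return [(title, 1.0) for title, _ in hits]
    return [(title, (score - low) / (high - low)) for title, score in hits]

raw = merge_by_score(DIRECTORY_HITS, NEWS_HITS)
scaled = merge_by_score(
    min_max_normalize(DIRECTORY_HITS),
    min_max_normalize(NEWS_HITS),
)
print([title for title, _ in raw])
print([title for title, _ in scaled])
```

With raw scores, both news stories outrank every directory entry simply because the news service uses bigger numbers. After rescaling, each service’s best hit ties at the top, which is tidier but is an artefact of the arithmetic rather than evidence about relevance.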
We have been conditioned by the success of internet searches like Google into the belief that search is a solved problem – that it is easy to create high-quality search results; that search services can be bought off-the-shelf; and that search can be deployed easily. In reality, search remains complex; it is challenging to implement successfully; and there are particular reasons why intranet searches are hard to do well. If we want to create a good solution, then we must be prepared to invest in it.