Monthly Archives: August 2017

autocomplete

Make your own search engine with elasticsearch

In this article you can see how to use elasticsearch to create a fast search engine capable of deep text search, working with terrabytes of data.

We are going to build a search engine based on the living people category of wikipedia, store the data in elasticsearch, test the speed and relevance of our queries and also create an autocomplete suggestion query.

Pre-requisites

You already have elasticsearch and kibana installed.

Install Pywikibot

Pywikibot enables you to easily download the contents of wikipedia articles.  If you have access to a different source of data, then you can use that instead.

Instructions to install pywikibot are here https://www.mediawiki.org/wiki/Manual:Pywikibot/Installation

Configure pywikibot to use wikipedia.

This is done by running the setup script

python pwb.py generate_user_files

The script is interactive and enables you to define the type of wiki you want to access.  In our case, choose wikipedia.

Install Python Libraries
pip install elasticsearch

 

Create a Mapping in Elasticsearch

The mapping tells elasticsearch what sort of data is being stored in each field, and how it should be indexed.

The following command can be pasted directly into  the Kibana Dev Tools console

This creates a mapping for document type “wiki_page” in the index “wikipeople” with four text categories (full url, title, categories,text) and one special field called suggest which will be used for autocomplete function (more on that later).  Note also that we have specified that the text field uses an english language analyser.   (as opposed to French,Spanish or any other language).

 

Create Pywikibot script

In the directory where you installed Pywikibot, you will find a subdirectory “/core/scripts”

In the scripts directory create a new script called wikipeopleloader.py

You can then get pywikibot to run your script using the following command (from ../pywikibot/core directory)

python pwb.py wikipeopleloader.py

The output from the screen will reveal any errors, if all is going well, you should see how the script downloads pages from wikipedia and loads them into elasticsearch.   The speed of download will depend on your machine, in my case one or two pages per second.  For testing you can abort the script (ctrl Z) after a minute or so.

Elasticsearch Search engine Query

Below is an example elasticsearch query and the beginning of the response.

The “source” part of the command specifies that we exclude the text of the page to keep the size of the response down.

The query searches for the terms american football and bearcats in the title, category and body of the text.  However it gives greater weight to the score if these terms are found in the category and title (as determined by the values “boost” in the search query).

The highlight part of the command also returns the detail of the where the search term has been found.   This can be seen in the part of the response labeled “highlight”.  This makes it very easy to display the context of the search term to the user to enable them to see whether they are interested in the results.

 

Autocomplete suggestions Using ElasticSearch and Jquery

In our mapping we created a special field called “suggest” based on the page title.  This enables us to display an “autocomplete” suggester as the user types into the search box.  Autocomplete queries are optimized to provide very quick responses.   A sample query and response would be as follows:

The query returns suggestions where the title starts with the letters we have introduced in our query.    This would enable us to create autocomplete funcionality with jquery or similar.