Categorization and tagging articles with Machine Learning

Several customers have approached us with requirements for categorizing and tagging articles, in order to provide consistent, automated tags for news or knowledge base articles.

Automated article tagging has many advantages:

  • Immediate tagging (no waiting)
  • Time savings
  • Consistent tagging from a defined set of tags
  • Human checks can be incorporated to improve the quality of tagging

The approach consists of “teaching” the machine learning algorithm by manually tagging a number of articles with a category.  This requires a minimum of one article for each tag.  A data pipeline can then be created which automatically tags any new article received with a number of article tags and relevance scores.  As the process continues, new articles added to the process are incorporated into the machine learning model and help improve the quality of the results.
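As an illustration of the principle (not the production pipeline we use), a tiny bag-of-words nearest-centroid tagger can be trained from one example article per tag and return relevance scores for new articles. The tags and texts here are invented:

```python
from collections import Counter
import math

def vectorize(text):
    # Simple bag-of-words representation
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two word-count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(tagged_articles):
    # One manually tagged article per tag is the minimum training set;
    # further articles for the same tag are merged into its centroid.
    centroids = {}
    for tag, text in tagged_articles:
        centroids.setdefault(tag, Counter()).update(vectorize(text))
    return centroids

def tag_article(text, centroids):
    # Return (tag, relevance score) pairs, best first
    v = vectorize(text)
    scores = [(tag, cosine(v, c)) for tag, c in centroids.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)

centroids = train([
    ("sport",   "the team won the football match"),
    ("finance", "the bank raised interest rates"),
])
tags = tag_article("interest rates fell at the central bank", centroids)
```

In a real pipeline the training set grows as new human-checked articles arrive, which is what improves the results over time.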

At smart factory we have created such a process for a number of customers, including news agencies and information brokers.


Product Recommendation with machine learning using ElasticSearch

Elasticsearch provides us with a powerful machine-learning-based tool for product recommendation in e-commerce.

What do we know about our customers?

It is relatively easy to know which products our customers have either bought or clicked on in the past.  Using Elasticsearch we can leverage this data to recommend to these users other products which have been of interest to other users who showed an interest in the same products.

For example, if our user has clicked on a book about medieval French history, then it seems obvious that we can show the user the most popular books in the category of medieval French history.  However, simply repeating products in the same category may become tedious, and we may miss many interesting opportunities to offer users products across categories.  For example, if the user buys a camera, we might want to offer them a book on photography.

Machine Learning Product Recommendation

Elasticsearch provides us the possibility to recommend products to users based on what other users who bought the same products as them have purchased.

Term aggregation

For example, we have an index which contains all of our users, with all of the products they have purchased, as shown in the mapping below.
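A minimal version of such a mapping, pasted into the Kibana console, might look like this (index, type and field names are assumptions):

```
PUT /users
{
  "mappings": {
    "user": {
      "properties": {
        "user_id":   { "type": "keyword" },
        "purchased": { "type": "keyword" }
      }
    }
  }
}
```

Each user document carries the list of product ids the user has bought in the purchased field.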

If a user has bought a polaroid camera, then we can look up the most popular products purchased by other users who bought the same polaroid camera.  This list is likely to include products which are directly related to a polaroid camera (such as polaroid film) and indirectly related (books on photography).
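That lookup can be sketched as a terms aggregation: filter on users who bought the camera, then aggregate over everything those users purchased (index, field and product names are illustrative):

```
GET /users/_search
{
  "size": 0,
  "query": {
    "term": { "purchased": "polaroid-camera" }
  },
  "aggs": {
    "also_bought": {
      "terms": { "field": "purchased", "size": 10 }
    }
  }
}
```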

Significant Term Aggregation

A term aggregation may indicate products which are popular with polaroid users, but some of these may be completely unrelated to polaroid cameras (e.g. mobile telephones), simply because they are popular with everyone (including people who buy polaroid cameras).  If we want to avoid this, we can use the significant terms aggregation, which returns products that are significantly more popular with polaroid camera buyers than with our customer set as a whole.

The example below shows a significant terms aggregation.
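A sketch of such a query (index, field and product names are illustrative):

```
GET /users/_search
{
  "size": 0,
  "query": {
    "term": { "purchased": "polaroid-camera" }
  },
  "aggs": {
    "recommendations": {
      "significant_terms": { "field": "purchased", "size": 10 }
    }
  }
}
```

The response ranks products by how much more common they are among polaroid buyers than in the index as a whole, rather than by raw popularity.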



This approach is interesting because as our volume of data increases, the quality of our recommendations improves:  as time goes by, we will learn more about our specific users and  at the same time grow our database of user preferences.






Make your own search engine with elasticsearch

In this article you will see how to use Elasticsearch to create a fast search engine capable of deep text search, working with terabytes of data.

We are going to build a search engine based on the living people category of wikipedia, store the data in elasticsearch, test the speed and relevance of our queries and also create an autocomplete suggestion query.


This tutorial assumes you already have Elasticsearch and Kibana installed.

Install Pywikibot

Pywikibot enables you to easily download the contents of wikipedia articles.  If you have access to a different source of data, then you can use that instead.

Instructions to install pywikibot are here

Configure pywikibot to use wikipedia.

This is done by running the setup script

python generate_user_files.py

The script is interactive and enables you to define the type of wiki you want to access.  In our case, choose wikipedia.

Install Python Libraries
pip install elasticsearch


Create a Mapping in Elasticsearch

The mapping tells elasticsearch what sort of data is being stored in each field, and how it should be indexed.

The following command can be pasted directly into  the Kibana Dev Tools console
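A reconstruction of such a mapping might look as follows (exact field names are assumptions; the suggest field uses the completion type):

```
PUT /wikipeople
{
  "mappings": {
    "wiki_page": {
      "properties": {
        "full_url":   { "type": "text" },
        "title":      { "type": "text" },
        "categories": { "type": "text" },
        "text":       { "type": "text", "analyzer": "english" },
        "suggest":    { "type": "completion" }
      }
    }
  }
}
```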

This creates a mapping for the document type “wiki_page” in the index “wikipeople”, with four text fields (full url, title, categories, text) and one special field called suggest, which will be used for the autocomplete function (more on that later).  Note also that we have specified that the text field uses the English language analyser (as opposed to French, Spanish or any other language).


Create Pywikibot script

In the directory where you installed Pywikibot, you will find a subdirectory “/core/scripts”

In the scripts directory create a new script called

You can then get pywikibot to run your script using the following command (from ../pywikibot/core directory)
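The script is built around converting each wiki page into a document matching the mapping described earlier. A minimal sketch of that conversion (the field names and script name are assumptions; the pywikibot and elasticsearch wiring is shown in comments):

```python
# Hypothetical core of the indexing script (e.g. scripts/load_wikipeople.py).

def build_doc(title, full_url, categories, text):
    """Build one Elasticsearch document from a Wikipedia page."""
    return {
        "full_url": full_url,
        "title": title,
        "categories": categories,
        "text": text,
        "suggest": title,  # feeds the autocomplete (completion) field
    }

# With the real libraries the loop would look roughly like this:
#   import pywikibot
#   from pywikibot import pagegenerators
#   from elasticsearch import Elasticsearch
#
#   es = Elasticsearch()
#   site = pywikibot.Site("en", "wikipedia")
#   cat = pywikibot.Category(site, "Category:Living people")
#   for page in pagegenerators.CategorizedPageGenerator(cat):
#       doc = build_doc(page.title(), page.full_url(),
#                       [c.title() for c in page.categories()], page.text)
#       es.index(index="wikipeople", doc_type="wiki_page", body=doc)

doc = build_doc("Jane Doe", "https://en.wikipedia.org/wiki/Jane_Doe",
                ["Living people"], "Jane Doe is a fictional example.")
```

Assuming the script is saved in the scripts directory, it is typically run through the pwb.py wrapper, e.g. python pwb.py load_wikipeople.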


The output on the screen will reveal any errors.  If all is going well, you should see the script download pages from wikipedia and load them into Elasticsearch.  The speed of download will depend on your machine; in my case it was one or two pages per second.  For testing, you can abort the script (Ctrl-C) after a minute or so.

Elasticsearch Search engine Query

Below is an example elasticsearch query and the beginning of the response.
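A query of this kind might look as follows (field names and boost values are assumptions):

```
GET /wikipeople/_search
{
  "_source": {
    "excludes": ["text"]
  },
  "query": {
    "multi_match": {
      "query": "american football bearcats",
      "fields": ["title^3", "categories^2", "text"]
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "categories": {},
      "text": {}
    }
  }
}
```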

The “_source” part of the command specifies that we exclude the text of the page, to keep the size of the response down.

The query searches for the terms american football and bearcats in the title, category and body of the text.  However, it gives greater weight to the score if these terms are found in the category or title (as determined by the “boost” values in the search query).

The highlight part of the command also returns the detail of where the search term was found.  This can be seen in the part of the response labelled “highlight”.  This makes it very easy to display the context of the search term to users, so they can see whether they are interested in the results.


Autocomplete Suggestions Using ElasticSearch and jQuery

In our mapping we created a special field called “suggest” based on the page title.  This enables us to display an “autocomplete” suggester as the user types into the search box.  Autocomplete queries are optimized to provide very quick responses.   A sample query and response would be as follows:
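A sketch of such a query (the prefix and suggester name are illustrative, and the exact syntax varies between Elasticsearch versions):

```
GET /wikipeople/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "mich",
      "completion": { "field": "suggest" }
    }
  }
}
```

The response contains the matching titles under the title_suggest key, ready to be displayed in a dropdown.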

The query returns suggestions where the title starts with the letters introduced in the query.  This enables us to create autocomplete functionality with jQuery or similar.






MQTT and Kibana – Open source Graphs and Analysis for IoT

Following my previous article on how to interface MQTT with ElasticSearch, here, I am belatedly following up with an article on how you can use Kibana to graph the data.


You should have run through my tutorial on MQTT with Elasticsearch, so that you actually have some data to look at.

Installing Kibana

To avoid compatibility issues, you should ensure that your version of Kibana is compatible with your Elasticsearch installation.  The easiest way to ensure this is to update both.  I won’t repeat the installation instructions, which are available on the Kibana web site.

Kibana and ElasticSearch are two parts of a single product offer, so there is very little difficulty in getting them to work together.

MQTT data format for Kibana

Kibana is ideal for working with time series data. The only tricky thing I found was getting Kibana to interpret time data as a time, rather than as a string or a number.  For this you need to create a mapping for your Elasticsearch index; in other words, you tell Elasticsearch that the data you are sending is to be stored and interpreted as a time rather than as a string or integer.

mappingJson={"mappings": {
  "json": {
    "properties": {
      "timestamp": {
        "type": "date"
      },
      "dataFloat": {
        "type": "float"
      }
    }
  }
}}
The above mapping tells Elasticsearch to expect documents with a timestamp and a float value called dataFloat.  Most importantly, it causes Elasticsearch to interpret the timestamp field as a time, rather than storing it as a string.
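A document matching this mapping could be built and indexed as follows (a sketch; the index and type names are assumptions):

```python
from datetime import datetime, timezone

# ISO-8601 strings are parsed by Elasticsearch's "date" type by default
doc = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "dataFloat": 21.5,
}

# Indexing it with the Python client would look like:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   es.index(index="my-index", doc_type="json", body=doc)
```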

Analysing MQTT Data with Kibana

Once you have got Elasticsearch to interpret your data as a timestamp, you can take advantage of all of the functionality of Kibana that comes out of the box, including counts, averages, derivatives and many others.


The setup we described is great for prototyping in a closed environment, but as we developed the project we found ourselves hampered by the lack of security features in Kibana.  It is possible to provide basic login functionality using NGINX, but we could not find an easy way to restrict access to data by account (this is a paid feature in Elastic/Kibana).  For this reason we have started to use Grafana with InfluxDB as an alternative.

Zibawa Open Source Project

Zibawa is a project which brings together a number of open source tools to produce a secure IoT system, fully open source from device to dashboard.  The project includes a device manager, device and user security management (LDAP), queue management and monitoring (RabbitMQ), big data storage and API (InfluxDB), and dashboards (Grafana).


More information

Zibawa Open IoT project source code








How to automate and track email campaigns on google analytics using thunderbird

In this article we will show you how to create a personalized email campaign and a tracker on Google Analytics, so that you can see who has clicked on the links in your campaign.  We will use the open source mail client Thunderbird, a plugin called Mail Merge, and Google Analytics.


This assumes you have already set up Google Analytics on your web site.

Install thunderbird

Install mail merge plug in for thunderbird

From Thunderbird, go to Tools > Add-ons > Extensions, then search for the Mail Merge plugin.

After installing, restart Thunderbird.

Create a csv file of your contact list

You can create a csv file using Excel or LibreOffice Calc, using the “save as csv” option.

Each column will be some data you want to include in the marketing mail.

Each row will be transformed into a single email.

The CONTACT_ID should be a code number which we will use to identify the customer; it will appear in Google Analytics.

Take care to remember the exact syntax used in the first row; these column names will be our PLACEHOLDERS below.
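A minimal example of such a file (the column names are illustrative, except CONTACT_ID, which is used below):

```
EMAIL,FIRSTNAME,CONTACT_ID
jane@example.com,Jane,1001
john@example.com,John,1002
```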



Create a mail in thunderbird

Use the format {{<place holder>}}, where <place holder> is the column header of the data in the csv file you want to merge into your mails.  The text must match exactly.

In my example I am using {{EMAIL}} for the user’s email address.



We strongly recommend you include text or a link to enable people to UNSUBSCRIBE from your mailing list.

Add a custom link with google analytics tracking code to your email

Now edit the links to include the tracking code as follows:

The format is a standard Google Analytics campaign link, for example (the domain and parameter values here are placeholders):

http://www.yoursite.com/landing-page?utm_source=mail&utm_medium=email&utm_campaign=my_campaign&utm_term=my_keyword&utm_content={{CONTACT_ID}}

Replace the landing page, campaign name and keyword with your own values.  If you have two or more links in your email, then by using two different keywords you can tell which link the user has clicked.

The CONTACT_ID placeholder will pull in the CONTACT_ID column that you put in your csv file earlier, just as we did for the email address, so do not change this part.

Note! The CONTACT_ID must NOT be anything that can directly identify a user (such as an email address or name), since that contravenes privacy laws in many countries.


Create the mails using mail merge

When you have finished editing, open the Mail Merge dialog from Thunderbird’s File menu and fill in the fields as follows:



Source: csv

Deliver mode: Send Later

csv: select the csv file you created earlier

Character set, field delimiter and text delimiter should be the same as you used when saving the csv file (recommended: UTF-8, Tab, ")

Leave the rest of the values blank.

Click OK, and one mail will be created in your OUTBOX for each line in your csv file.

Check the contents of your marketing campaign

Before sending, make sure that the mails have been created as you expected in your OUTBOX.  Check several mails, paying attention to how the placeholders and links have been built.

Check the links by clicking on them yourself.


Send your campaign mails

Right-click on the Outbox and choose “Send Unsent Messages”.

Check the results in google analytics

You will probably need to wait at least 24 hours before any results show up.  You can see them in your Google Analytics account, under Analytics > Acquisition > Campaigns.




Here you will see the campaign properties we included in the link, recorded each time a link is clicked.  If you want to know who the user is, you need to add a secondary dimension, which you will find under Advertising > Ad Content or Advertising > Keyword.

This way we know not only how many users clicked on our campaign mail, but also who they are, so we can follow up with those customers.



Introducing the Open Source IoT stack for Industrie 4.0

The open source IoT stack is a set of open source software which can be used to develop and scale IoT in a business environment.  It is particularly focused towards manufacturing organizations.

Why Open Source?



Storing IoT data using open source. MQTT and ElasticSearch – Tutorial

Why ElasticSearch?

  • It’s open source
  • It’s hugely scalable
  • It’s ideal for time series data

It is part of the Elastic Stack, which can provide functionality for the following:

  • Graphs (Kibana)
  • Analytics (Kibana)
  • Alarms

What is Covered in This article

We are going to set up a single Elasticsearch node on a Linux Ubuntu 16.04 server and use it to collect data published on a Mosquitto MQTT server.  (This assumes you already have your MQTT server up and running.)

For full information and documentation: the IoT open source stack project is now called Zibawa and has a project page of its own, where you will find source code, documentation and case studies.

Installing ElasticSearch

Create a new directory myElasticSearch

mkdir myElasticSearch
cd myElasticSearch

Download the Elasticsearch tar (version 2.4.1 at the time of writing):

curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.4.1/elasticsearch-2.4.1.tar.gz

Then extract it as follows :

tar -xvf elasticsearch-2.4.1.tar.gz

It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:

cd elasticsearch-2.4.1/bin

And now we are ready to start our node and single cluster:

./elasticsearch

To store data we can use the command

curl -XPOST 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "Jane Doe"
}'

To read the same data we can use

curl -XGET 'localhost:9200/customer/external/1?pretty'

If you can see the data you created, then Elasticsearch is up and running!

Install the Python Client for elasticsearch

pip install elasticsearch

Install the PAHO mqtt client on the server

pip install paho-mqtt

Create a Python MQTT client script to store the MQTT data in Elasticsearch

Use the script, which uses both the Paho MQTT and Elasticsearch Python libraries.  You will need to modify the lines at the top depending upon the port and IP address of your MQTT installation.
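The heart of such a script is converting each incoming MQTT message into an Elasticsearch document. A minimal sketch (the topic and field names are assumptions; the paho-mqtt and elasticsearch wiring is shown in comments):

```python
import json
from datetime import datetime, timezone

def message_to_doc(topic, payload):
    """Convert an MQTT message (JSON payload) into an Elasticsearch document."""
    data = json.loads(payload)
    return {
        "topic": topic,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "value": float(data["value"]),
    }

# Wiring with the real libraries would look roughly like:
#   import paho.mqtt.client as mqtt
#   from elasticsearch import Elasticsearch
#
#   es = Elasticsearch()
#   def on_message(client, userdata, msg):
#       es.index(index="my-index", doc_type="json",
#                body=message_to_doc(msg.topic, msg.payload))
#   client = mqtt.Client()
#   client.on_message = on_message
#   client.connect("localhost", 1883)
#   client.loop_forever()

doc = message_to_doc("sensors/temperature", '{"value": 21.5}')
```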

You can download the file from

Or if you have GIT installed use:

git clone

The script should be installed into a directory on the same server as you have ElasticSearch running.

Run the Python MQTT client we just downloaded


To view the data we just created on elasticsearch

curl 'localhost:9200/my-index/_search?q=*&pretty'

We are now storing our MQTT data in elasticsearch!
In the next few days I will publish how to view MQTT data in Kibana where we will make graphs, and analyse the MQTT data.

Further Information


Zibawa – Open source from device to Dashboard.  Project, applications, documentation and source code.


Running as a service on Linux: I didn’t use this, but probably should have!


ElasticSearch Python Client


Agile Digital Transformation – Low risk, high ROI

An inspirational video from CEO Peter Schroer of ARAS, a manufacturer of Product Lifecycle Management (PLM) software. Schroer advocates an “agile” approach to factory digital transformation, the same as that used in software development. Strategically it is a mistake to draw up a large, complex project, since during its execution many things will change, invalidating many of our original plans:

  • Business objectives change
  • Changes in the business environment (economic downturn or upturn)
  • Changes in company ownership (mergers/acquisitions)
  • Changes in customer requirements
  • Changes in legal requirements to be met
  • New technology appears

If we limit ourselves to simple, achievable tasks which have a direct and immediate impact on the business, delivered over a number of “sprints” (where a sprint is measured in weeks, not months), then we have a far greater probability of success.

That doesn’t mean we no longer have a long-term vision or road map of where we are going; rather, it means that we do not devote a large amount of resource to detailed long-term planning and execution, or to paralysis by analysis.

This approach has important consequences:

We must accept that a single software system or platform is not going to be able to continually adapt to our changing business requirements. This means that communication between systems, and the use of open source software, is an essential part of the strategy to avoid lock-in to a single system.

Large upfront license payments are rarely compatible with this philosophy, for the same reason. Software as a service, which enables us to pay as a function of the number of users or volume of transactions, is much more compatible with this approach.

We must be prepared to test, put a foot in the water, test, then build on what works, and abandon what fails.  As Schroer says, “Do something we know, make it work, then go onto the next thing”.

You will find the video on agile plm implementation here.


Build a community around your product

If digital transformation is about finding new ways to engage your customers, then there are few better ways than to build a customer portal.  Software companies have for years published a knowledge base and user community forum, but it is only more recently that manufacturers have started to follow the same trend.


Build your own customer service portal


A customer service portal will enable your staff to set up a repository of information to help users install, use and service their products, using software similar to that used in Wikipedia.  Together with a user forum, this will enable your customers to ask questions and receive answers in a web format which lets other users see the answers, independent of whether your customer support staff are in the office.

Furthermore, these tools can (optionally) allow people outside your organization to contribute to the knowledge base about your products, adding their own comments and participating in the discussion.  Building a community around your products is a great way of increasing customer and user loyalty, improving your customer service, and increasing the visibility of your products on the web, since Google and most other search engines rank web sites based on the volume of useful content they contain.

Our service includes:

  • Setup and configuration of knowledge base and user community forum.
  • User training
  • Telephone help and support (12 months)