ICT BI-Analysis

When working in a research-focused project, everybody knows how to analyze data in systems widely used in the scientific community. In this section we want to present a different approach to data analyss which might help to vulgarize results from research-related projects.

We all do know the problem:

projects with a heavy research focus do very interesting work, but often the results are made available only to a very small group of people. Those people will be in research themselves or at least be able to read scientific publications of international standard. Even if there are people specifically present in the project for distribution and dissemination, more often than not they do not get the results in nicely looking, yet simple grahics, which can be understood by the target public.

How to best distribute insight to the target public - variety is king.

As a consequence there are scientific publications which do of course have their own value and then a truck load of target public information which is more or less directly understandable and hence useful to those concerned. In our project we have taken multiple approaches to ake sure the results are disseminated adequately, using trainings, broadcasting though different channels and village-based knowledge structures. This is obviously very useful and a good approach to follow.

Yet there is the nagging feeling that it would be good if not-so-data-analysis-fluent people could retrieve insight from data themselves and distribute it with the support of simple graphics. From the world of traditional companies a technology called BI (Business Intelligence) trickles down trying to fill this void. In fact also in companies there is the problem that much data is available, but one cannot wait for complex analysis to be run by data specialists. BI claims to give the "ordinary user" the power to do straightforward data analysis themselves and to present it in single graphics or even dashboards composed of many graphics and indicators.

What is BI?

BI - huh?

Simply put BI is in the context of this project a kind of data modeling and processing standard which allows tools adhering to it to access data without much programming need. Once the data have been read as a model, they can be analysed in a multidimensional data set using methods which are well known to both spreadsheet users as well as data analysts.

There are many BI tools from many different vendors. While we have many years of experience with BI systems of vendors like SAP, IBM and Oracle, we have settled in this projects onto PowerBI from microsoft. This is mainly related to the fact that attractive license conditions are available if you want to make only a small scale use of the project. Also the interface makes you think a lot of Excel, which will help with adapting to the new tool. Just to be very clear - we are not advocating any vendor here, we are just reporting about the choices we made and why we made them. Any decent BI vendor will do as well!

What is a BI model?

Forget about multidimensionallity - reality is much simpler

BI models are traditionally built on top of DWHs (Data Warehouses or Marts). While in the DWH you concentrate on clean data from different sources and an efficient (i.e. relational) way to store data, a BI model puts the focus onto simple dataswer structures which can be kept and processed in memory very efficiently. That makes sense, since BI tools aspire to answer your questions quickly and interactively. But BI models can be built also without any underlying DWH. We will look into this possibility.

The core data are contained in what is called a fact table. Fact tables are pretty trivial objects: for each data block there is one row and in this row there there are columns which are either a link to a so-called dimension or a number, which is often called a measure. Let us look at the following, pretty trivial fact table.

Example of a fact table
Farmer ID	Province	District	Income Onsite	Income Offsite	Profit Onsite	Shares Machinery	Milk Production
17	P_A	D_1	200	150	100	yes	yes

A very simple fact table.

This table will in general have many lines, in our example one line for each questionnaire answered by a farmer. BI does not care about line numbers, so there is no need for any. The order of the lines is in general also of no importance.

In the first column we do have the farmer ID. It could also just be a line number, if you want to be able to reference the line (BI does not need it, as pointed out above). iI you do not want or need to identify your farmers, you could also just leave the column away,

Dimensions and hierarchies

The next two columns are references to dimensions. There is the Province dimension and the District dimension. Both will - not surprisingly - hold a list of provinces respectively districs. In the columns there will be either the name of a province/ district, or a Province_ID/ District_ID which references the elements. Taking provinces and districts together makes sense, since then you can build a hierarchy. A hierarchy is just what it claims to be: a dependency between levels. In our case there are Districts belonging to Provinces (which could belong to countries, if we wanted to add one more level). The nice thing about hierarchies is that you can get aggregate numbers about different hierarchy levels.

Executing operations on the data set

What a BI system now does - to put it very simply - is to do selection of data using the dimension colums and calculations using the measure columns. If you select P_C as a province, the BI system will just answer questions about this province. If you ask it to add up all onsite income for this province, it will add up the Income Onsite column for all farmers living in the P_C province. Of course instead of adding up you could also have asked for the average, or the top and bottom 5%, or whatever question you are interested in. You could also find out, if farms sharing machinery are more profitable by selecting "yes" and then "no" for the Shares Machinery column.

Of course you are right to argue that the comparisons made here are not so useful. Maybe tops and flops should be calculated based on the farm surface. After all it is not fair to compare a small with a big farm. BI tools give you the possibility to do such calculations. So if you add the surface of a farm into the fact table, you can create a calculated column, which would then contain "Profit per Hectar" and consitute a much better column for top/ flop analysis.

Interested in a real-world example?

If you are interested into seeing what can be the result of such analysis, once you put more data into your fact tables and use more dimensions, you can have a look into the data from our project here. These are, please keep that in mind, very simple, straigthforward charts which do not have the complexity of work done in the research context. But this is what BI is good at - feeling your way around your data set, interactively getting a feel for it and then digging deeper.