SQL / BI Case

Let's imagine a company whose employees or contractors have a daily cost, one that for various reasons varies over the course of the day. In practice, this information is stored in an application database and must be transferred to a data warehouse, both to keep a history and to give the company a clear and simple overview of the situation.

The exercise consists of retrieving and storing this information, then setting up a dashboard with "Power BI". An additional reason to go over the methods in detail is that I realized quite a lot of "Data Engineering" professionals would have solved this in a way that is certainly functional, but relies on an external script ("PHP", "Python", etc.) launched by an OS-dependent "Cron" task, storing the information line by line and day by day... However, it is fundamental to be practical and efficient, limiting both resource usage and execution time, each of which has significant impacts, hence my proposal for a much more direct and efficient alternative!

Let's get down to business!


1. The data source

As can be seen in the following image, the source is a database containing the names of people along with an amount that can vary during the day, usually every 6 hours.
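For the rest of the walkthrough, let's assume the remote source table looks roughly like this; the table name, column names, and sample values are assumptions made for illustration, based on the columns used later, not the company's actual schema:

```sql
-- Hypothetical structure of the remote source table
CREATE TABLE Employees (
   ID INT PRIMARY KEY,
   LastName VARCHAR(255),
   FirstName VARCHAR(255),
   Salary DECIMAL(20, 2)   -- daily cost, updated roughly every 6 hours
);

INSERT INTO Employees (ID, LastName, FirstName, Salary)
VALUES (1, 'Doe', 'John', 320.00),
       (2, 'Smith', 'Anna', 410.50);
```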


2. Creation of the "Data Warehouse"

To keep and use the data, the human resources manager has a small local machine he would like to put to work. Since this hardware runs on Linux, I will install a database using "MariaDB", similar to "MySQL", generic, and a way to stay purely open source and lightweight. Installation of the necessary packages:

sudo apt-get update && sudo apt-get upgrade
sudo apt-get install mariadb-server -y
sudo apt-get install mariadb-plugin-connect -y

As you may have seen in the previous point, I also installed the "MariaDB CONNECT" plugin, which then has to be activated via a simple SQL command:

INSTALL SONAME 'ha_connect';

Once the system is in place, I create a database and a dedicated table using a built-in history mechanism, "system versioning", which keeps track of the changes made to records over time so that previous values can be retrieved, storing them in a specific way. The same feature is available under "Microsoft SQL Server" via "temporal tables". (Info link).

CREATE DATABASE IF NOT EXISTS Salary_Tracking;
USE Salary_Tracking;

CREATE TABLE Salary_History (
   ID INT PRIMARY KEY,
   LastName VARCHAR(255),
   FirstName VARCHAR(255),
   Salary DECIMAL(20, 2),
   start_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   end_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
   PERIOD FOR SYSTEM_TIME(start_timestamp, end_timestamp)
) WITH SYSTEM VERSIONING;


3. Gathering the remote data

A script in a third-party programming language, tied to a scheduled task that executes it, generating a good dose of code, hoping that everything works? No thanks! I prefer to use a single system for the whole process, so I stay inside the chosen database engine!

To do this, a single statement creates a separate local table thanks to the "CONNECT" plugin, which connects to a remote table in order to take over its structure and content (the columns can be modified if necessary). From then on, any selection query made on this local table transparently returns up-to-date data from the remote server.

CREATE TABLE `Remote_source`
engine=CONNECT table_type=MYSQL
connection='mysql://user:pass@host/database/table';

Finally, all that's left to do is to insert this remote data into the history table, by performing an insertion based on a selection from the previously linked table:

INSERT INTO `Salary_History` (ID, LastName, FirstName, Salary)
SELECT ID, LastName, FirstName, Salary FROM `Remote_source`
ON DUPLICATE KEY UPDATE Salary = `Remote_source`.`Salary`;


4. Updating data

Of course, running the insertion query manually is not sustainable in the long run. Therefore, I will use another capability of the database engine: scheduled events, which can be defined as needed; here, an execution every six hours, starting at one minute past midnight on a given date, in order to retrieve and store the data that has been modified in the remote table.

CREATE EVENT IF NOT EXISTS `Update_History`
    ON SCHEDULE EVERY 6 HOUR STARTS '2020-03-01 00:01:00'
    DO
        INSERT INTO `Salary_History` (ID, LastName, FirstName, Salary)
        SELECT ID, LastName, FirstName, Salary FROM `Remote_source`
        ON DUPLICATE KEY UPDATE Salary = `Remote_source`.`Salary`;
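Should the schedule need to change later, the event can be adjusted, paused, or removed without recreating it; a quick sketch using the event name defined above:

```sql
-- Switch the refresh to every 12 hours
ALTER EVENT `Update_History` ON SCHEDULE EVERY 12 HOUR;

-- Temporarily suspend it without losing its definition
ALTER EVENT `Update_History` DISABLE;

-- Remove it entirely
DROP EVENT IF EXISTS `Update_History`;
```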

Once this recurring task is defined, don't forget to enable the event scheduler, and to check that it is running and that the event is in place:

SET GLOBAL event_scheduler = ON; # Start the event scheduler
SHOW EVENTS;                     # List the scheduled events and their definitions
SHOW PROCESSLIST;                # Check that the scheduler thread is running

A SQL query, SELECT * FROM Salary_History;, lets you consult each person's data as of the present moment:

While another, SELECT * FROM Salary_History FOR SYSTEM_TIME ALL;, lets you view the full history:
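System versioning also makes point-in-time queries possible: you can ask for the table's state as it stood at any given moment (the timestamp below is purely illustrative):

```sql
-- State of the table as it was at noon on March 1st, 2020
SELECT * FROM Salary_History
FOR SYSTEM_TIME AS OF TIMESTAMP '2020-03-01 12:00:00';
```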


5. Making the history available

Already thinking about the next step under "Power BI" for the desired visualizations, the complete, raw historical data as it currently stands is not presentable enough and requires some improvements. Rather than doing this in a third-party tool, it seems useful to me to perform one last task in SQL, to keep maximizing performance and avoid further manipulations, which will be very simple here. To deliver this data, I will ask the database management system (DBMS) to create a view containing my complete dataset with: the day of the year, the date, and an average salary for the day.

CREATE VIEW `Salary_View` AS
    SELECT ID, LastName, FirstName,
           DAYOFYEAR(start_timestamp) AS `DayOfYear`,
           DATE(start_timestamp) AS `DayDate`,
           AVG(Salary) AS `DayAverageSalary`
    FROM `Salary_History` FOR SYSTEM_TIME ALL
    GROUP BY ID, LastName, FirstName, DAYOFYEAR(start_timestamp), DATE(start_timestamp);
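Before handing the view over to the visualization tool, a quick query on the database side shows the shape of the dataset it will receive, one row per person per day:

```sql
-- Daily average cost per person, most recent days first
SELECT * FROM Salary_View
ORDER BY DayDate DESC, LastName;
```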


6. Viewable data

Let's move on to the visualization tool that will allow the user to consult his data easily. I won't describe the whole process of creating a data source that queries the database directly (hence the view defined in the previous point), nor the verification of the dataset or the way each element is built, but I will let you discover the expected final result.