Populating Data Warehouse Via ETL - MifosBI_1.2 Onwards

General

The Mifos data warehouse is populated by an "ETL" program.  Although it can be run manually, it is usually scheduled to run each day just after midnight.  This schedule isn't mandatory: you could skip weekends and holidays, or run it only once a week, depending on how up to date you need the data warehouse to be.

On Linux, the ETL would be scheduled via cron.  On Windows, via the Task Scheduler.
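
For example, a crontab entry along these lines would run the incremental ETL (described below) at five past midnight every day.  This is only a sketch: the repository path, PDI location and log file are illustrations, not part of the standard install.

# Illustrative crontab entry: run the incremental ETL at 00:05 every day
# (the /opt/mifos-bi path, PDI location and log file are examples only)
5 0 * * * bash /opt/mifos-bi/ETL/MifosDataWarehouseETL/incremental.sh /opt/pentaho/data-integration/ >> /var/log/mifos_etl.log 2>&1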

There are about twelve standard Mifos reports that rely on the Mifos data warehouse, e.g. MFI Progress, Loan Officer Detailed, Loan Officer Summary and Due vs Collected.

When entering parameters to run these reports, you will see a note saying 'Data for this report last updated on <date>'.

There are two ETL options available.

Full ETL

This option rebuilds the Mifos data warehouse from scratch each run.  It is fine for test situations and for small MFIs.  As a rule of thumb, if an MFI has about 10,000 clients and has been running on Mifos for 2 years, the Full ETL should take between 2 and 3 hours.  The ETL should run on the same machine as the MySQL database.  If it doesn't, the additional network traffic will slow down the process dramatically... by up to 4 times in the case of using Amazon RDS.

It will move all the data, including today's, into the data warehouse.

Run in Linux using:

bash "full path down to"/ETL/MifosDataWarehouseETL/full_reload.sh DW_DB_Name PDI_HOME MYSQL_USER_DETAILS

Where
DW_DB_Name is the name of your Mifos data warehouse database
PDI_HOME is the path of your PDI installation, e.g. /opt/pentaho/data-integration/
MYSQL_USER_DETAILS is the MySQL username and, if required, password, e.g. -uroot -prootpwd OR -uroot
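
For example, for a data warehouse called mifos_dw, PDI installed in /opt/pentaho/data-integration/ and a root user with password rootpwd (the repository path shown is just an illustration):

bash /opt/mifos-bi/ETL/MifosDataWarehouseETL/full_reload.sh mifos_dw /opt/pentaho/data-integration/ -uroot -prootpwd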

Run in Windows using:

full_load.bat DW_DB_NAME

Where
DW_DB_NAME is the name of your Mifos data warehouse database

You will need to create the bat file yourself.  Here is an example of the contents.

REM Paths to the PDI installation and to the MifosDataWarehouseETL scripts
SET PDIPATH=C:\Users\John\Desktop\PentahoPlatform\pdi-ce-4.0.0-stable\data-integration\
SET DWPATH=C:\dev\businessIntelligenceRepo\bi\ETL\MifosDataWarehouseETL\
REM Recreate the data warehouse database (%1 is the database name passed to the bat file)
echo DROP DATABASE IF EXISTS %1; > dropdb.sql
echo CREATE DATABASE %1;  > createdb.sql
mysql -uroot -pmysql < dropdb.sql
mysql -uroot -pmysql < createdb.sql
REM Create the data warehouse schema, stored procedures and PPI reference data
mysql -uroot -pmysql %1 < %DWPATH%load_mifos_datawarehouse.sql
mysql -uroot -pmysql %1 < %DWPATH%load_mifos_datawarehouse_stored_procedures.sql
mysql -uroot -pmysql %1 < %DWPATH%load_ppi_poverty_lines.sql
mysql -uroot -pmysql --default-character-set=utf8 %1 < %DWPATH%load_dw_ppi_survey.sql
REM Run the full-load Kettle job and capture its log
%PDIPATH%kitchen.bat /file:%DWPATH%DataWarehouseInitialLoad.kjb /level:Basic > transformationLog_%1.log
EXIT

Incremental ETL

This will populate the data warehouse with all data up to the end of yesterday.  It only adds data that has been changed or added since the last run.  

The incremental ETL is limited to data 'up to the end of yesterday' because Mifos auditing is generally at the day level.  However, the reports based on the data warehouse shouldn't require 'right now' data.  (Note: the default value for date parameters may be later than the "last updated on" date stated on the parameter page; in this case you'll need to pick dates no later than the "last updated on" date to ensure correct loan and savings balance data.)

Run in Linux using:

bash "full path down to"/ETL/MifosDataWarehouseETL/incremental.sh PDI_HOME

where PDI_HOME is the path of your PDI installation, e.g. /opt/pentaho/data-integration/.

Note that no database name or user details are required.  These get picked up from the definition of DestinationDB in data-integration/simple-jndi/jdbc.properties.
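
A DestinationDB entry in data-integration/simple-jndi/jdbc.properties typically looks something like the following.  This is only a sketch based on the usual PDI simple-jndi layout; the host, database name and credentials shown are assumptions to replace with your own.

DestinationDB/type=javax.sql.DataSource
DestinationDB/driver=com.mysql.jdbc.Driver
DestinationDB/url=jdbc:mysql://localhost:3306/mifos_dw
DestinationDB/user=root
DestinationDB/password=rootpwd

With that in place, a typical call (the repository path is just an illustration) would be:

bash /opt/mifos-bi/ETL/MifosDataWarehouseETL/incremental.sh /opt/pentaho/data-integration/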

Run in Windows using:

incremental_load.bat

You will need to create the bat file yourself.  Here is an example of the contents.

SET PDIPATH=C:\Users\John\Desktop\PentahoPlatform\pdi-ce-4.0.0-stable\data-integration\
SET DWPATH=C:\dev\businessIntelligenceRepo\bi\ETL\MifosDataWarehouseETL\

REM %1 is an optional "up to" date (yyyy-mm-dd) forwarded to the Kettle job - see the date parameter note below
%PDIPATH%kitchen.bat /file:%DWPATH%DataWarehouseIncrementalLoad.kjb %1  /level:Basic > transformationLog_incremental.log

EXIT

It is possible to pass a date parameter (yyyy-mm-dd) to the incremental ETL.  This is an "up to" date, e.g. 2010-01-01 would build the data warehouse up to 1st Jan 2010.

However, this shouldn't be done in normal usage. The parameter is used for testing and, potentially, when building up the data warehouse for a large MFI (i.e. when tweaking and handholding the ETL).
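
For example, on Windows the example .bat file above forwards its first argument (%1) to kitchen.bat, so a test run up to 1st Jan 2010 could be started with:

incremental_load.bat 2010-01-01

Whether, and in what position, the Linux incremental.sh accepts the date depends on the script, so check it before relying on this.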

Mifos 2.2 (Application) Release Changes

The Incremental ETL doesn't derive its full benefit until the database changes slated for Mifos 2.2 are implemented.  However, these changes can be implemented on top of Mifos 2.1; see Mifos 2.2 Potentially Large DB Changes and PreUpgrade Option.

No Built-in Recovery Mechanism Yet for Incremental ETL

If the incremental ETL fails, you need to get to a clean starting point before running it again.  Currently, there is no out-of-the-box way to do this.  However, there are a number of ways to approach it depending on the size of your data.

  1. Start Again
    Just delete the data warehouse and it will be rebuilt from scratch up to yesterday.  This is the easiest option for smaller MFIs where the 'full' ETL doesn't take too long to run.
  2. Restore
    Any backup taken after the last successful run (the data warehouse table stg_etl_run_history will show when that was) will do.  A backup step can be scheduled to run after each successful ETL.  Whether the restore is done from a mysqldump backup or by copying files (or some other way) will depend on how you back up MySQL data.  This is slightly more admin work but will be faster than option 1 above (especially if you can just copy files).
  3. Customized Recovery (probably fastest but more admin work)
    Add a job to run after each successful ETL to
    • Back up the dim_* and dw_* tables plus stg_etl_run_history and ppi_category_likelihood_bands (with a 'drop table' option if not just copying files, so that the restore recreates them).  Ignore all other stg_* tables as they get truncated during ETL.  A sample backup command is sketched after the example delete script below.
      Then to recover
    • Restore the backed-up tables
    • Run a SQL script to delete data in 'transaction' tables (hist_*, and fact_* except for fact_ppi_survey_results, which is recreated each time) that was created after the stg_etl_run_history.etl_complete_to_date of the last successful ETL.  hist_* tables have as_of_date; fact_* tables have created_date_key, except for fact_client_attendances which has a meeting_date_key.  An example partial delete script is:
/*note that as_of_date is in yyyy-mm-dd format whereas 'key' dates are in yyyymmdd format*/
delete from hist_savings_balances where as_of_date > '2011-08-01';
delete from fact_loan_disbursals where created_date_key > '20110801' ;
delete from fact_client_attendances where meeting_date_key > '20110801';
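
As an illustration of the backup step in option 3, something along the following lines could be scheduled after each successful ETL.  This is a sketch only: the database name (mifos_dw), credentials and output file are assumptions, and the table name patterns should be adjusted to your schema.

# Sketch: dump only the tables needed for customized recovery
# (mifos_dw, -uroot -prootpwd and the output file name are assumptions)
TABLES=$(mysql -uroot -prootpwd -N -e "SELECT table_name FROM information_schema.tables
  WHERE table_schema='mifos_dw'
    AND (table_name LIKE 'dim%' OR table_name LIKE 'dw%'
         OR table_name IN ('stg_etl_run_history','ppi_category_likelihood_bands'))")
# --add-drop-table so that the restore recreates the tables, as described above
mysqldump -uroot -prootpwd --add-drop-table mifos_dw $TABLES > dw_recovery_backup.sql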

Switching between the Options

The "Full ETL" option will always wipe out current information.  So, nothing needs to be done to switch from Incremental to Full.

To switch from Full to Incremental, first drop the Mifos data warehouse (usually called mifos_dw):

drop database mifos_dw;

The first run of the Incremental will build all the data up to yesterday.

Potential Customising For Larger MFIs

There are tweaks and trade-offs that can be made when running the ETL that may be relevant for larger MFIs (50K+ clients).  These are best discussed on a case-by-case basis via the mifos-users list.

Troubleshooting