Scalability and Architecture Discussion

The Problem

As of 1 January 2010, GK is running Mifos v1.4 and has about 370K clients. In its current configuration, Mifos reaches its limit at about 300K clients. Having exceeded that number, Mifos is under a great deal of pressure at GK. GK is expected to reach 450K clients in March 2010 and 1M around March 2011.

The following problem areas have been identified:

Scaling

  • The current Mifos configuration at GK (hardware/software) doesn't scale beyond ~300,000 clients, but we need to scale to 1M by March 2011.

Clean Design

  • It's too hard to modify Mifos, so we can't add features fast enough or at a low enough cost.

What we know

GK will grow to 1M clients around March 2011 (a 2-3 times increase). To achieve this, the number of branches is expected to increase from the current 160 to 200-300 (up to 2 times).
John W: I thought there were more than 160 now... though the GK 2009 db figure for Office is only 107
The GK database growth figures from 2008 to 2009 show which areas are growing:
2009 versus 2008 growth figures:

======================  ======  ======  ========
Table Name              2009    2008    % Growth
======================  ======  ======  ========
financial_trxn          115M    64M     77.33
account_trxn            52M     27M     88.10
loan_trxn_detail        45M     26.4M   72.15
fee_trxn_detail         9.7M    9M      10.64
customer_trxn_detail    7M      2M      264.77
savings_trxn_detail     0       0       0
customer_schedule       42.8M   20M     117.49
loan_schedule           42.7M   21M     105.05
saving_schedule         100     0       10
customer_fee_schedule   0       0       0
loan_fee_schedule       4       4       0
customer                346K    194K    78.13
account_payment         36M     17M     110.32
account                 2M      933K    115.16
loan_account            1.6M    748K    122.26
customer_account        346K    189K    82.96
savings_account         2       1       100.00
======================  ======  ======  ========

It looks like the customer base is increasing and most usage is around loan accounts (creation, disbursement and scheduled repayments), which results in tracking of financial transactions for the General Ledger.

  • 50% of the data in the Mifos database is related to GL transactions
  • 20% of the data in the Mifos database is related to schedules (customer or loan)

Outstanding Questions

  1. Where are the bottlenecks in the system currently?
     a. Is a bottleneck going to happen with the number of incoming HTTP requests to Tomcat?
     b. Is it on the database?
     c. Is it the application processing on the server?
  2. What is the current level of usage/load?
  3. What are the most common requests (e.g. collection sheet)?
     a. What functionality is used most by GK branches?
     b. How can we measure this?
     c. We need to know things like requests per day from branches (could possibly use logs).
     d. We need to find out GK's peak usage and see how many and what requests these are.
  4. Why does GK need to restart Tomcat/the application?
     a. How many times a day/week do they do this?
     b. Is it after a particular kind of activity?
     c. We should probably monitor Tomcat throughout its lifecycle.
  5. During what timeframe is the application disabled to allow batch jobs to run?
  6. Are the infrastructure details mentioned here still accurate?
  7. What load and data size can one box handle (standalone Tomcat application server with the database on its own server)?
     a. Use performance testing against production-like data with production-like load to see when things start to deteriorate.

Deployment Architecture

see infrastructure details

GK's current architecture is:

  • Tomcat running on its own server (application server)
  • Batch jobs running on the application server
  • MySQL 5 running on its own server (database server)
  • Reporting running against the production database (application server, database server)

How do we proceed?

We need to identify just where the scaling and design problems exist.

Metrics:

Simple metrics need to be put in place on the GK production server(s) that show transaction and reporting response times throughout the day. Batch figures are already available. These figures can be used to confirm the impact of changes.

Indicators:

Although many areas of Mifos would and will benefit from scalability improvements, the performance lab assessment will center on the collection sheet process which is very heavily used at GK.

It would be helpful to get accurate detail on the most used functionality. It is more than likely 'Collection Sheet' but it would be useful to see what other product features are heavily used. We could then inspect and test these areas to check for any possible problems.

What we are currently thinking:

  1. Turn on the access log on Tomcat at GK to give us an idea of how the application is used there. (John also created a filter to refine these results, but it will have to be deployed on GK's app server.)
  2. Based on the results gathered from the access logs and customer feedback, identify areas of the application to focus on. Then performance test the version of Mifos that is currently at GK:
     a. Create a test environment that replicates GK's hardware/software configuration.
     b. Create a dataset (or use an existing one) that is similar to GK's in size and data distribution.
     c. Consider using Glassbox within the performance lab.
     d. Create JMeter test scripts for the identified problem areas to load test the application.

Use data returned from JMeter, Access Logs and Glassbox to generate metrics and benchmarks and to identify areas of the application to approach.
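
As a rough sketch of how the access logs could be turned into usage figures, the Python snippet below counts requests per path and per hour of day. It assumes Tomcat's "common" access log pattern (%h %l %u %t "%r" %s %b). The sample lines and URL paths are made up for illustration and are not taken from GK's logs.

```python
import re
from collections import Counter

# Matches Tomcat's "common" access log pattern: %h %l %u %t "%r" %s %b
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)

def summarise(lines):
    """Return (requests per path, requests per hour of day)."""
    by_path, by_hour = Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed pattern
        # Drop the query string so the same action is counted under one key.
        path = match.group("path").split("?", 1)[0]
        # Timestamp looks like 11/Jan/2010:09:15:02 +0530; the hour is the
        # second colon-separated field.
        hour = int(match.group("time").split(":")[1])
        by_path[path] += 1
        by_hour[hour] += 1
    return by_path, by_hour

# Hypothetical sample lines; the real paths at GK may differ.
SAMPLE_LINES = [
    '10.0.0.5 - - [11/Jan/2010:09:15:02 +0530] "GET /mifos/collectionsheet.do?method=load HTTP/1.1" 200 5120',
    '10.0.0.7 - - [11/Jan/2010:09:47:30 +0530] "POST /mifos/collectionsheet.do HTTP/1.1" 200 734',
    '10.0.0.5 - - [11/Jan/2010:14:03:11 +0530] "GET /mifos/loanaccount.do?id=42 HTTP/1.1" 200 2048',
    'not an access log line',
]

by_path, by_hour = summarise(SAMPLE_LINES)
print(by_path.most_common())
print(sorted(by_hour.items()))
```

Run against a full day of logs, the per-hour counts would give a first view of peak usage, and the per-path counts would show which features are used most.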

  • Repeat all of the above steps for the latest version of trunk. This will help us assess the changes currently incorporated into 'trunk'. These changes are expected to increase throughput by 33-66%, bringing Mifos capacity to somewhere within the 400K-500K client range (see the note below).
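
The 400K-500K range is consistent with simple arithmetic on the current ~300K ceiling, if client capacity is assumed to scale linearly with throughput. That linearity is an assumption for illustration, not a measured result:

```python
def projected_capacity(current_capacity, throughput_gain):
    """Assumption: client capacity scales linearly with throughput."""
    return current_capacity * (1 + throughput_gain)

CURRENT_CAPACITY = 300_000  # approximate ceiling of the current configuration

for gain in (0.33, 0.66):
    clients = projected_capacity(CURRENT_CAPACITY, gain)
    print(f"{gain:.0%} more throughput -> about {clients:,.0f} clients")
```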

NOTE: changes currently incorporated into 'trunk'

These are changes made by John around the collection sheet save, plus some prefetching into the Hibernate session cache. There have been a number of performance-related changes to the collection sheet save, but two are significant.

  1. The use of 'bag' instead of 'set' in the Hibernate configuration for a few tables that are only added to during the process. The account_payment table is an example of this. Previously, using 'set', when adding an account_payment, Hibernate retrieved all account_payments for an account... now it doesn't.
  2. The prefetching tries to get as much of the customer, account, schedule, fees and other info as possible upfront in a couple of queries, to avoid lots of database requests.
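
To illustrate the prefetching idea outside of Hibernate, the sketch below uses Python's built-in sqlite3 module with a made-up two-table schema (not the Mifos schema). It compares one query per customer with a single upfront query. It only demonstrates the reduction in database round trips; the actual Mifos change lives in its Hibernate-based Java code.

```python
import sqlite3

# Illustrative in-memory schema; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (id INTEGER PRIMARY KEY, customer_id INTEGER, balance INTEGER);
""")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(i, f"client-{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO account VALUES (?, ?, ?)",
                 [(i, i, 100 * i) for i in range(1, 6)])

stats = {"queries": 0}

def run(sql, params=()):
    """Execute a query and count it as one database round trip."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()

def load_one_by_one(customer_ids):
    # One query per customer: round trips grow with the size of the group.
    return {cid: run("SELECT balance FROM account WHERE customer_id = ?", (cid,))
            for cid in customer_ids}

def load_prefetched(customer_ids):
    # A single query fetches every account upfront; rows are grouped in memory.
    marks = ",".join("?" * len(customer_ids))
    rows = run("SELECT customer_id, balance FROM account "
               f"WHERE customer_id IN ({marks})", tuple(customer_ids))
    result = {cid: [] for cid in customer_ids}
    for cid, balance in rows:
        result[cid].append((balance,))
    return result

ids = [1, 2, 3, 4, 5]
one_by_one = load_one_by_one(ids)
queries_one_by_one = stats["queries"]

stats["queries"] = 0
prefetched = load_prefetched(ids)
queries_prefetched = stats["queries"]

print(queries_one_by_one, "queries one by one versus", queries_prefetched, "prefetched")
```

Both functions return the same data; only the number of round trips differs.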

Quick Wins

Metrics and Indicators

  1. Turn on Access Logs
  2. Use Glassbox

Hardware

Sungard have already run tests against some hardware changes that make sense to apply right away as a 'buffer'.

In load testing of collection sheets, results showed a 2-3 fold throughput improvement from using a dual-processor database server and doubling memory to 8 GB. If this type of hardware upgrade were introduced at GK now, it should alleviate some of the GK performance problems until the number of clients reaches around 500-700K.

Reporting

Moving reporting back onto the 'mirror' machine makes obvious sense (if reporting is using up a lot of resources).

Compression

Turn on compression using Tomcat's proprietary approach or using a Servlet Filter (more portable, but that doesn't really matter for Mifos).

This will result in a better end user experience, especially for Mifos users with poor internet connectivity (which is probably most of the user base).
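
As a rough indication of why compression helps on slow links, the sketch below gzips a synthetic, repetitive HTML table using Python's standard gzip module. The page is a stand-in for a large form such as a collection sheet and is not real Mifos output, so actual savings will depend on the real pages.

```python
import gzip

def build_sample_page(rows=200):
    """Build a synthetic HTML table; a stand-in for a large form page."""
    body = "".join(
        f"<tr><td>client-{i}</td><td><input name='amount_{i}' value='0'/></td></tr>"
        for i in range(rows)
    )
    return f"<html><body><table>{body}</table></body></html>".encode("utf-8")

page = build_sample_page()
compressed = gzip.compress(page)
print(f"original: {len(page)} bytes, gzipped: {len(compressed)} bytes")
```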

Batch Jobs

  1. ApplyCustomerFeeTask
  2. GenerateMeetingsForCustomerAndSavingsTask

GenerateMeetingsForCustomerAndSavingsTask in particular is taking a long time to finish (sometimes up to 7 hours), and as GK scales it may no longer be able to finish outside business hours. A volunteer has already offered a quick solution for this batch job, recognising that it spends a large amount of its time in holiday retrieval. We propose implementing this as a quick win by refactoring the code responsible for calling HolidayUtils; see MeetingBO.getAllDates(int o).
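
One common way to cut repeated retrieval cost is to load the holidays once and answer every later lookup from memory. The Python sketch below shows that general idea only; it may not match the volunteer's patch, and it is not the Mifos Java code. The class, the loader function and the "move to the next day" holiday rule are all hypothetical.

```python
from datetime import date, timedelta

class HolidayCalendar:
    """Loads holidays once and answers lookups from memory."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical expensive retrieval function
        self._holidays = None   # filled on first use
        self.loads = 0          # how many times the loader actually ran

    def is_holiday(self, day):
        if self._holidays is None:
            self._holidays = frozenset(self._loader())
            self.loads += 1
        return day in self._holidays

def weekly_meeting_dates(start, occurrences, calendar):
    """Generate weekly dates, moving a meeting forward a day while it falls
    on a holiday (an assumed rule, chosen only for illustration)."""
    dates = []
    for week in range(occurrences):
        day = start + timedelta(weeks=week)
        while calendar.is_holiday(day):
            day += timedelta(days=1)
        dates.append(day)
    return dates

def load_holidays():
    # Stand-in for a slow database query.
    return [date(2010, 1, 8)]

calendar = HolidayCalendar(load_holidays)
meetings = weekly_meeting_dates(date(2010, 1, 1), 3, calendar)
print(meetings, "holiday loads:", calendar.loads)
```

However many dates are generated, the loader runs once per calendar instance rather than once per date.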

Proposed Work

Phase 1

Time: Feb 1 - April 30

Business Goal: Cater for 500K Clients (Is this correct?)

  • Begin investigating the proposal to RedesignAccountsSchedulesAndMeetings. The investigation is very likely to result in a list of incremental changes (probably looking at customer_schedule removal early on). Before taking on the risk of change, we would simulate and verify the effect of these changes in the performance testing lab (set up by Jeff and Sungard).

  • Begin investigating the proposal to Decouple General Ledger (GL) from Mifos. Again, if there are some incremental changes that would achieve quick wins we would do them first in this phase and we would simulate in the performance testing lab prior to development.

Phase 2

Time: May 1 - Jul 31

Business Goal: Cater for 1 to 1.5M Clients

Phase 3

Time: Aug 1 - Oct 31

Business Goal: Cater for 1.5M+ Clients