Disaster Recovery / Contingency Planning and Testing

Problems occur in production, and you must be prepared for them. Review this checklist and make sure you have considered, DOCUMENTED, and TESTED your response to each problem that may occur.

Testing the Process

Testing should include:

  • Letting the database backup scripts run and then performing a full restore. Note how long the process takes and which steps could be improved.
  • Verifying that the OS backup can be restored.
  • Verifying all phone numbers on contact lists.
  • Noting the time elapsed from the start of the outage to full recovery.
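
One way to capture the timings mentioned above is to run each step of a restore drill through a small wrapper that records how long it took. The following is a minimal sketch, not a prescribed tool: the step labels and the placeholder commands in the example are hypothetical and would be replaced by your own documented restore commands.

```python
import subprocess
import sys
import time
from datetime import timedelta


def time_step(label, command):
    """Run one drill step and return (label, seconds elapsed, exit code)."""
    started = time.monotonic()
    result = subprocess.run(command)
    elapsed = time.monotonic() - started
    return label, elapsed, result.returncode


def run_drill(steps):
    """Run the steps in order and stop at the first one that fails."""
    results = []
    for label, command in steps:
        outcome = time_step(label, command)
        results.append(outcome)
        if outcome[2] != 0:
            break
    return results


def format_report(results):
    """Build a plain-text report with per-step and total durations."""
    lines = []
    for label, seconds, code in results:
        status = "ok" if code == 0 else f"FAILED (exit {code})"
        lines.append(f"{label}: {timedelta(seconds=round(seconds))} {status}")
    total = sum(seconds for _, seconds, _ in results)
    lines.append(f"total: {timedelta(seconds=round(total))}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical placeholder steps: substitute the real restore commands
    # from your own recovery documentation.
    placeholder = [sys.executable, "-c", "print('placeholder step')"]
    drill_steps = [
        ("restore OS backup", placeholder),
        ("restore database backup", placeholder),
        ("run verification test plan", placeholder),
    ]
    print(format_report(run_drill(drill_steps)))
```

Keeping the report from each drill alongside the recovery documentation makes it easier to see whether recovery time is improving from one test to the next.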

Documenting the Process

  • Compile a list of emergency contact information, including vendors and contacts at the financial institution. (Who should be called in the event of an outage?)
  • Instructions for restoring a system should be detailed and clear. Write them so that a new user could follow them.
  • Make sure a procedure exists for updating the disaster recovery documentation as scripts are added or removed (a monthly email reminder to make these updates is often a good idea).
  • Good versus bad examples <we will document some examples of good versus bad documentation here in the future>

Problems and Responses

Types of problems that can occur, and how to prepare for them:

  • Natural disasters (e.g. flooding of main office or branch offices), fires, and political emergencies
    • OFF-SITE DATA STORAGE: make sure data is stored at a location other than the head office, for BOTH the database and the OS. <link>
    • Well-documented procedures for recovery steps <link>
    • Contact list <link to template>
  • Security breach of Mifos server and/or act of sabotage by staff
    • What are the processes for immediately changing passwords? Are they documented?
    • What needs to be evaluated at your organization after a breach (for example, checking accounts and evaluating the database)?

  • Failure or loss of Mifos server, database and/or server disk storage
    • Make sure scripts are running for database backups.
    • Make sure scripts are running for OS backups.
    • Make sure backup and recovery procedures are DOCUMENTED IN DETAIL, including:
      • Restoring the OS
      • Restoring the database
      • Mifos configuration settings
      • Any custom scripts that may be running
      • Verification test plan (10-12 trials to run to confirm that the system is functioning properly and is stable)
    • Make sure all the scripts are stored in a documented location and include instructions for recreating the production setup.

  • Loss of Internet access or power at the main office or a branch office
    • What procedures need to be in place for working around the problem at a branch? Document and test.
    • What procedures need to be in place at the head office in the event of an extended power outage?

  • Loss of key staff members
    • Staff turnover is a fact of running an organization. Make sure all procedures, instructions, and important details are documented and available to newer staff members who may have to troubleshoot a problem or carry out a recovery.
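
For the item above about making sure the database and OS backup scripts are running, one simple check is to confirm that each backup directory contains a recent, non-empty file. The sketch below assumes the backup scripts write files into a known directory; the directory paths, file patterns, and age limits in the example are hypothetical and should be replaced with the values from your own setup.

```python
import time
from pathlib import Path


def newest_backup(directory, pattern):
    """Return the most recently modified file matching pattern, or None."""
    files = [p for p in Path(directory).glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def check_backup(directory, pattern, max_age_hours, now=None):
    """Return (ok, message) describing the newest backup in directory."""
    now = time.time() if now is None else now
    latest = newest_backup(directory, pattern)
    if latest is None:
        return False, f"no files matching {pattern!r} in {directory}"
    info = latest.stat()
    age_hours = (now - info.st_mtime) / 3600
    if info.st_size == 0:
        return False, f"{latest.name} is empty"
    if age_hours > max_age_hours:
        return False, f"{latest.name} is {age_hours:.1f} hours old"
    return True, f"{latest.name} is {age_hours:.1f} hours old"


if __name__ == "__main__":
    # Hypothetical locations and limits: adjust to match your backup scripts.
    expected = [
        ("/var/backups/mifos/db", "*.sql.gz", 26),
        ("/var/backups/mifos/os", "*.tar.gz", 24 * 8),
    ]
    for directory, pattern, max_age in expected:
        ok, message = check_backup(directory, pattern, max_age)
        print(("OK      " if ok else "PROBLEM ") + message)
```

A check like this only shows that a backup file was written recently. It does not replace a full test restore, which is the only way to confirm that the backup is usable.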

Documentation

All system administration and maintenance processes should be well-documented.