Data Migration Import Type Design Decision
Data Migration Import Type Design Decision
To facilitate the migration of legacy data into Mifos, this project will propose an import data format for importing data into Mifos. The first step is to figure out what type of format to use. The following import types are being considered for use in this project: Excel, CSV, XML, SQL.
Some factors to consider when choosing the type of data to import are:
Control over format – does Mifos have control over the content, structure and size of this data type or can a third party make changes that affect it?
Self documenting – can the content of the data type be understood by examining the data.
Richness of expression --can all the data relationships be easily expressed.
Enforces data consistency – does the data type allow for enforcement of data consistency by simply having import data that is a valid instance of the data format.
Availability of tools for generating and manipulating format – are tools for manipulating the import format widely available for both importing the data and generating migration data in the import format.
Time and effort to implement import – how much time and effort would be required to implement data import functionality using this format.
Time and effort to generate data – how much time and effort would be required to for a Mifos specialist to generate import data using this format.
Import Type Discussion
csv-table:: Summary of Import Type Pros and Cons:header: " ", "CSV", "SQL", "Excel", "XML"
:widths: 15, 4, 4, 4, 4
"Control", "``", "/", "``", "``" "Self-Documenting", "``", "/", "/", "``" "Expressiveness", "``", "``", "``", "``" "Data Consistency", "``", "``", "``", "``" "Tools", "``", "``", "``", "``" "Import time and effort", "``", "``", "/", "``" "Generation time and effort", "/", "/", "/", "/"
XML
Pros:
Defining a custom XML format gives Mifos complete control over the format.
A well written XML document is self-documenting.
An XML document based on XML-schema or RelaxNG is very expressive in terms of data relationships and structures.
An XML document based on XML-schema or RelaxNG allows many data constraints to be enforced.
Many tools and libraries are available for reading, generating and manipulating XML data.
Since constraints can be well expressed and many tools are available, development time of the import functionality will be less than with other data types.
Cons:
Since an XML can express many constraints and contains more structure, producing a valid XML document could take more effort than with a format like CSV. However no matter what the format, in the end the data still needs to be consistent, so an XML document might just force the issue sooner rather than doing multiple passes to make it right with another less constrained format like CSV.
Excel
Current versions of Excel save spreadsheet data in XML format, so the Excel and XML data types are closely related. The difference with relying on the Excel XML schema is that the format is predefined and oriented towards saving spreadsheet data.
Pros:
Since the Excel data type is XML it is self documenting to an extent, but it is geared towards the spreadsheet model that it was written for.
Since Excel is XML, the broad set of XML tools available can be used with it.
Since Excel is XML, XML tools and libraries can be used to read it and there are tools and libraries specific to reading Excel files.
Getting data into the columnar and less constrained format of Excel might be easier for Mifos specialists (though again even if the format enforces fewer constraints the data still must be consistent).
Cons:
The schema used to define the XML format for Excel is controlled by Microsoft and geared towards generic spreadsheet data rather than Mifos data.
Since the format is limited by how Excel saves data and oriented towards columnar spreadsheet data , expressing various relationships would be more difficult than in generic XML.
Since less consistency checking is possible in Excel XML as opposed to generic XML, more time and effort would be required to implement data import than for generic XML.
CSV (comma separated value)
Pros:
The CSV data type is probably the simplest form. All things being equal, a simple format is a good thing and there are many tools and libraries that support reading and writing CSV files.
A CSV file would give Mifos complete control over the format, since we would define the meaning of the file contents.
CSV is a simple format to understand and generate so it might be easier for Mifos specialists to work with than XML or SQL.
Cons:
The CSV format it is not self documenting. CSV files usually contain only data and although they could contain descriptive headers, it is difficult to enforce that CSV data generators create the correct headers.
CSV is not very expressive. Generally fixed columns are used, and if not, then the data becomes more difficult to understand and more error prone to interpret.
CSV files do not have a built in means to insure consistency. Consistency checking must be done after reading in the data.
Reading CSV files is easy, but implementing the required consistency checks would be time consuming.
SQL
There is a question as to what is meant by "SQL" as a data type for migration data. The could mean either data residing in a database instance or data in SQL dump files. Migration work that has been done uses a running MySQL database to hold the import data which is migrated from the import tables to the Mifos tables.
Pros:
SQL is a mature and expressive language.
SQL provides precise consistency checks and data typing.
A well commented schema could provide reasonable documentation, a SQL dump file would is somewhat self documenting.
There are plenty of tools available for manipulating and accessing SQL data.
Cons:
If SQL dump files are used, then we don't have control over the format. If a SQL instance is used, we still don't control the format, but the question is less relevant.
In order to express all the constraints we might want in SQL, we would need to generate a schema that would be similar in many ways to the existing Mifos schema. The time and effort required to construct this and the code required to read and migrate the data would be significant.
Since SQL is more structured and can express many constraints it would likely require more effort to work with than CSV or Excel, though it would retain the benefit of having less chance of import errors after data has successfully been loaded.
Import Type Conclusions
Using an XML document as the import data type with a custom schema defined for Mifos is the most promising approach for data migration and will be used as the basis for this project.
A Mifos XML data import schema should be able to be defined such that a Mifos specialist who generates an XML import file that conforms to the Mifos import schema can be confident that the data can be imported with few if any remaining problems to address.
The wide range of tool support for XML means that XML data import should be efficient to implement in Mifos and that a Mifos specialist will have a range of tools to choose from that can help in generating data in the XML format. A well defined Mifos XML schema should serve as a reference document for understanding what the requirements and constraints are for migrating data into Mifos. It could also serve in the future as a data exchange format for both import and export.
A concern related to choosing XML as an import type is that since it is very structured, it potentially requires more effort on the part of the Mifos specialist to generate it. However because XML is so widely used, tools should be available to help a Mifos specialist generate the XML. As part of this project the Kettle ETL tool being used by MFTech will be examined to understand how it could potentially help a Mifos specialist generate data import XML. Also, this added structure should help focus a specialist on what transformations are required to migrate data into Mifos and help make clear up front what effort will be required for data migration.
XML can be a verbose format, so data import files for big data sets could be quite large. Given that hard disk space is relatively inexpensive now, this should not be an issue. However, it would be worth checking on the size of existing MFI databases and doing a rough calculation on the expected size of XML import files to confirm this.