Collection sheet entry log data mining

This is a historical record of findings as well as the original brainstorm on what data should be mined from collection sheet entry logs.

Findings

Cards 1862 and 2086 are closed.
The cseDataMiner produces some generic aggregate statistics and time-of-day vs. CSE submit time data.
- 5% of submissions encounter database errors and roll back (previously a silent error)
Van analyzed the time-of-day vs. CSE submit time CSV data in this spreadsheet and found that CSE submit times increase dramatically even while the number of submits stays constant, hovering around 100 submits per 15-minute increment.
Initially, it was thought that getting everything into a database first and then running SQL queries would be the best analysis path, but this may violate YAGNI. The Java code could instead be simplified to emit CSV, which could then be analyzed further in Excel.
Quantizing (say, providing a CSV of time-of-day vs. submit times in 15-minute chunks) could be done in the Java code.
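
The quantizing idea could look roughly like the sketch below. It is not the cseDataMiner code: the class name SubmitTimeQuantizer, the CSV column names, and the assumption that each submit arrives as a minute-of-day plus a duration in milliseconds are all made up for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: groups submit durations into 15-minute time-of-day buckets. */
final class SubmitTimeQuantizer {
    static final int BUCKET_MINUTES = 15;

    /** Running totals for one bucket. */
    static final class Bucket {
        int count;
        long totalMillis;

        double meanMillis() {
            return count == 0 ? 0.0 : (double) totalMillis / count;
        }
    }

    // Keyed by the bucket's starting minute of day; TreeMap keeps rows in time order.
    private final Map<Integer, Bucket> buckets = new TreeMap<>();

    /** Records one submit. minuteOfDay is 0..1439 (assumed already parsed from the log). */
    void add(int minuteOfDay, long durationMillis) {
        int bucketStart = (minuteOfDay / BUCKET_MINUTES) * BUCKET_MINUTES;
        Bucket b = buckets.computeIfAbsent(bucketStart, k -> new Bucket());
        b.count++;
        b.totalMillis += durationMillis;
    }

    /** Renders one CSV row per non-empty bucket: start time, submit count, mean duration. */
    String toCsv() {
        StringBuilder sb = new StringBuilder("bucket_start,submits,mean_submit_ms\n");
        for (Map.Entry<Integer, Bucket> e : buckets.entrySet()) {
            int start = e.getKey();
            Bucket b = e.getValue();
            sb.append(String.format(Locale.ROOT, "%02d:%02d,%d,%.1f\n",
                    start / 60, start % 60, b.count, b.meanMillis()));
        }
        return sb.toString();
    }
}
```

Emitting both the submit count and the mean duration per bucket keeps the CSV directly usable for the kind of comparison described above (submit times vs. submit volume).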

Original brainstorm (for historical reference)

time-of-day vs. time to save collection sheet data
time-of-day vs. average time between page hits
apparent resubmits - same officeID, centerID, and attendance data
failed submits - "after saveData()" not seen for a particular sessionID/branchID/centerID
count of (previously) silent errors during callees of BulkEntryAction.saveData()
- for instance, "unable to save loan account(s). personnel ID=" ...
submit times: min, max, mean, median, standard deviation
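
As an illustration of the submit-time statistics item, here is a minimal sketch. The class name SubmitTimeStats and the use of milliseconds are assumptions, and the standard deviation shown is the population form (divide by n); the notes do not say which form was intended.

```java
import java.util.Arrays;

/** Hypothetical summary of submit durations: min, max, mean, median, standard deviation. */
final class SubmitTimeStats {
    final long min;
    final long max;
    final double mean;
    final double median;
    final double stdDev;

    private SubmitTimeStats(long min, long max, double mean, double median, double stdDev) {
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.median = median;
        this.stdDev = stdDev;
    }

    /** Computes all five statistics from raw durations; the input array is not modified. */
    static SubmitTimeStats of(long[] millis) {
        if (millis.length == 0) {
            throw new IllegalArgumentException("no submit times");
        }
        long[] sorted = millis.clone();
        Arrays.sort(sorted);
        int n = sorted.length;

        double sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDeviations = 0;
        for (long v : sorted) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        // Population standard deviation (assumption; use n - 1 for the sample form).
        double stdDev = Math.sqrt(squaredDeviations / n);

        // Median: middle element, or the average of the two middle elements.
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        return new SubmitTimeStats(sorted[0], sorted[n - 1], mean, median, stdDev);
    }
}
```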

same sessionID, different userID
flow through process (load → get → preview → create) doesn't progress forward in time
any correlations in errors vs. branch/user/office/center
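
The "same sessionID, different userID" check could be sketched as below. The class name SessionUserCheck is hypothetical, and the sketch assumes the session and user IDs have already been extracted from each log line as strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical check: flags sessions that appear with more than one user ID. */
final class SessionUserCheck {
    private final Map<String, Set<String>> usersBySession = new HashMap<>();

    /** Records one (sessionID, userID) pair taken from a log line. */
    void record(String sessionId, String userId) {
        usersBySession.computeIfAbsent(sessionId, k -> new TreeSet<>()).add(userId);
    }

    /** Returns the session IDs seen with two or more distinct user IDs, sorted. */
    List<String> suspiciousSessions() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : usersBySession.entrySet()) {
            if (e.getValue().size() > 1) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```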

dev environment: Java, Maven, Eclipse
start with a single POJO with a main() method
executing
- java -jar cse_spelunker.jar DIR LOGFILES...
  - first arg: directory to store report file
  - remaining args: log files
assume unlimited memory
- if this becomes a problem, consider
  - cache or database as intermediate storage
  - replace simple data structures with ones backed by a disk cache or an embedded database
  - key sessions by an int rather than the session string
  - use a strict multi-phase approach, e.g. first read in matched log lines, then analyze, then report results; phases pass intermediate forms to one another (a pipeline)
slurp in all CSE-like log lines
- but count errors
while slurping, build data structures for analysis phase
- main data structure is an array of arrays
- also count previously-silent errors
parse dates to date objects
- Joda-Time needed/easier?
- ignore timezone, initially
after last log line, run analyses that require all data
- like what?
reporting
- spit out CSV at least, perhaps simple graphs
  - Open Flash Chart?
- HTML report (all files written to single directory)
  - FreeMarker templates?
unit tests
- none identified yet
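
The outline above (a single class with a main() method, DIR LOGFILES... arguments, slurping into memory, counting errors, CSV output) might start out something like the sketch below. The marker strings used to recognize CSE-like lines, the report file name report.csv, and the CSV columns are assumptions; the real log format is not described in these notes, and the analysis phase is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical starting point: slurp phase and report phase only. */
final class CseSpelunker {
    // Assumptions: substrings used to recognize CSE-like lines and silent errors.
    static final String CSE_MARKER = "BulkEntryAction";
    static final String ERROR_MARKER = "unable to save";

    // Everything is held in memory ("assume unlimited memory").
    final List<String> matchedLines = new ArrayList<>();
    int errorCount;

    /** Slurp phase: keep CSE-like lines and count previously-silent errors while reading. */
    void slurp(Iterable<String> lines) {
        for (String line : lines) {
            boolean isError = line.contains(ERROR_MARKER);
            if (isError) {
                errorCount++;
            }
            if (isError || line.contains(CSE_MARKER)) {
                matchedLines.add(line);
            }
        }
    }

    /** Report phase: a one-row CSV summary. */
    String reportCsv() {
        return "matched_lines,silent_errors\n" + matchedLines.size() + "," + errorCount + "\n";
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java -jar cse_spelunker.jar DIR LOGFILES...");
            System.exit(1);
        }
        // First arg: directory to store the report file; remaining args: log files.
        Path reportDir = Paths.get(args[0]);
        Files.createDirectories(reportDir);

        CseSpelunker spelunker = new CseSpelunker();
        for (int i = 1; i < args.length; i++) {
            spelunker.slurp(Files.readAllLines(Paths.get(args[i]), StandardCharsets.UTF_8));
        }
        Files.write(reportDir.resolve("report.csv"),
                spelunker.reportCsv().getBytes(StandardCharsets.UTF_8));
    }
}
```

Keeping slurp() independent of file I/O means it can be fed a plain list of strings, which may make a first unit test easy to write.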