CSE Data Mining
Collection sheet entry log data mining
This is a historical record of findings as well as the original brainstorm on what data should be mined from collection sheet entry logs.
Findings
The cseDataMiner outputs some generic aggregate statistics and time-of-day vs. CSE submit time data.
5% of submissions encounter database errors and roll back (previously a silent error).
Van analyzed the time-of-day vs. CSE submit time CSV data in this spreadsheet, finding that CSE submit times increase dramatically even as the number of submits remains constant, hovering around 100 submits per 15-minute increment.
Initially, it was thought that getting everything into a database first and then running SQL queries would be the best analysis path, but this may violate YAGNI. The Java code could instead be simplified to emit CSV, which could then be crunched further in Excel.
Quantizing (say, providing a CSV of time-of-day vs. submit times in 15-minute chunks) could be done in the Java code.
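As a rough sketch of the quantizing idea above, the following groups submits into 15-minute chunks by time of day and emits one CSV row per non-empty chunk. The class name, method names, and CSV column names are hypothetical, and java.time (Java 8+) is assumed here for brevity; this is not the cseDataMiner's actual code.

```java
import java.time.LocalTime;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: not part of the original cseDataMiner.
class SubmitTimeQuantizer {
    static final int BUCKET_MINUTES = 15;

    /** Returns the start of the 15-minute bucket containing the given time of day. */
    static LocalTime bucketStart(LocalTime t) {
        int minuteOfDay = t.getHour() * 60 + t.getMinute();
        int start = (minuteOfDay / BUCKET_MINUTES) * BUCKET_MINUTES;
        return LocalTime.of(start / 60, start % 60);
    }

    /**
     * Builds CSV text with one row per non-empty bucket:
     * bucket start, number of submits, mean submit time in milliseconds.
     * The two lists are parallel: durationsMs.get(i) belongs to times.get(i).
     */
    static String toCsv(List<LocalTime> times, List<Long> durationsMs) {
        // TreeMap keeps buckets sorted by time of day; value = {count, total ms}.
        Map<LocalTime, long[]> buckets = new TreeMap<>();
        for (int i = 0; i < times.size(); i++) {
            long[] acc = buckets.computeIfAbsent(bucketStart(times.get(i)), k -> new long[2]);
            acc[0]++;
            acc[1] += durationsMs.get(i);
        }
        StringBuilder csv = new StringBuilder("bucket_start,submits,mean_submit_ms\n");
        for (Map.Entry<LocalTime, long[]> e : buckets.entrySet()) {
            long[] acc = e.getValue();
            csv.append(e.getKey()).append(',')
               .append(acc[0]).append(',')
               .append(acc[1] / acc[0]).append('\n'); // integer mean, truncated
        }
        return csv.toString();
    }
}
```

Emitting the submit count alongside the mean keeps both halves of the finding (steady submit volume, rising submit time) visible in the same spreadsheet.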
Brainstorm
(for historical reference)
Useful stats we'd like gathered
time-of-day vs. time to save collection sheet data
time-of-day vs. average time between page hits
apparent resubmits - same officeID, centerID, and attendance data
failed submits - "after saveData()" not seen for a particular sessionID/branchID/centerID
count of (previously) silent errors during callees of BulkEntryAction.saveData()
for instance, "unable to save loan account(s). personnel ID=" ...
submit times: min, max, mean, median, standard deviation
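The submit time summary (min, max, mean, median, standard deviation) could be computed along these lines. This is only a sketch: the class name is hypothetical, times are assumed to be in milliseconds, and the population form of the standard deviation is an arbitrary choice, since the notes do not say which form was wanted.

```java
import java.util.Arrays;

// Hypothetical summary holder; names and units (ms) are assumptions.
class SubmitTimeStats {
    final long min;
    final long max;
    final double mean;
    final double median;
    final double stdDev;

    SubmitTimeStats(long[] submitTimesMs) {
        if (submitTimesMs.length == 0) {
            throw new IllegalArgumentException("no submit times");
        }
        // Sort a copy so the caller's array is left untouched.
        long[] sorted = submitTimesMs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;

        min = sorted[0];
        max = sorted[n - 1];

        double sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        mean = sum / n;

        // Middle element for odd n, average of the two middle elements for even n.
        median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        double squaredDiffs = 0;
        for (long v : sorted) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        // Population standard deviation (divides by n, not n - 1).
        stdDev = Math.sqrt(squaredDiffs / n);
    }
}
```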
Possibly useful stats we'd like gathered
total centers modified
incomplete entries (not every page was hit)
submits per sessionID
Error conditions to capture when parsing logs
same sessionID, different userID
flow through process (load → get → preview → create) doesn't progress forward in time
any correlations in errors vs. branch/user/office/center
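The first two error conditions above could be detected while walking parsed log events in log order, roughly as sketched below. The Event class, its field names, and the step names as strings are all assumptions made for illustration; the real log format is not described here. The correlation analysis is not covered by this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical checks; the Event shape is an assumption, not the real log format.
class LogSanityChecks {
    static final class Event {
        final String sessionId;
        final String userId;
        final String step;       // expected to be one of FLOW
        final long timestampMs;

        Event(String sessionId, String userId, String step, long timestampMs) {
            this.sessionId = sessionId;
            this.userId = userId;
            this.step = step;
            this.timestampMs = timestampMs;
        }
    }

    // Expected order of steps within one collection sheet entry.
    static final List<String> FLOW = Arrays.asList("load", "get", "preview", "create");

    /** Returns a human-readable description of each suspicious event. */
    static List<String> findProblems(List<Event> eventsInLogOrder) {
        List<String> problems = new ArrayList<>();
        Map<String, String> userBySession = new HashMap<>();
        Map<String, Event> lastBySession = new HashMap<>();

        for (Event e : eventsInLogOrder) {
            // Condition 1: same sessionID, different userID.
            String knownUser = userBySession.putIfAbsent(e.sessionId, e.userId);
            if (knownUser != null && !knownUser.equals(e.userId)) {
                problems.add("session " + e.sessionId + ": userID changed from "
                        + knownUser + " to " + e.userId);
            }

            // Condition 2: a later step in the flow carries an earlier timestamp
            // than the step logged just before it in the same session.
            Event prev = lastBySession.put(e.sessionId, e);
            if (prev != null
                    && FLOW.indexOf(e.step) > FLOW.indexOf(prev.step)
                    && e.timestampMs < prev.timestampMs) {
                problems.add("session " + e.sessionId + ": " + e.step
                        + " is timestamped before " + prev.step);
            }
        }
        return problems;
    }
}
```

Returning descriptions rather than throwing lets the parser keep going and simply count or report these conditions at the end.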
Development approach
dev environment: Java, Maven, Eclipse
start with a single POJO with a main() method
executing
java -jar cse_spelunker.jar DIR LOGFILES...
first arg: directory to store report file
remaining args: log files
assume unlimited memory
if this becomes a problem, consider
cache or database as intermediate storage
replace simple data structures with ones backed by disk cache or in an embedded database
key sessions by an int rather than the session string
use a strict multiple-phase approach, e.g.: first slurp in matched log lines, then analyze, then report results; phases hand intermediate forms to each other (a pipeline)
slurp in all CSE-like log lines
but count errors
while slurping, build data structures for analysis phase
main data structure is an array of arrays
also count previously-silent errors
parse dates to date objects
Joda-Time needed/easier?
ignore timezone, initially
after last log line, run analyses that require all data
like what?
reporting
spit out CSV at least, perhaps simple graphs
Open Flash Chart?
HTML report (all files written to single directory)
FreeMarker templates?
unit tests
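Taken together, the approach above (single class with a main(), DIR then LOGFILES as arguments, slurp then report) might start out something like the sketch below. Everything named here is hypothetical: the class name, the report file name summary.csv, and especially the test for what counts as a "CSE-like" line, which is assumed to be any line mentioning BulkEntryAction. Only the error message fragment is taken from the notes above. The analysis phase is left out entirely.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical starting point; not the actual cse_spelunker code.
class CseSpelunkerSketch {
    /** Intermediate form handed from the slurp phase to later phases. */
    static final class Slurped {
        final List<String> cseLines = new ArrayList<>();
        int silentErrors;
    }

    // Assumption: a line is "CSE-like" if it mentions BulkEntryAction.
    static boolean isCseLine(String line) {
        return line.contains("BulkEntryAction");
    }

    // Message fragment taken from the notes; the exact match rule is an assumption.
    static boolean isSilentError(String line) {
        return line.contains("unable to save loan account(s)");
    }

    /** Phase 1: read every log file, keep CSE-like lines, count silent errors. */
    static Slurped slurp(List<Path> logFiles) throws IOException {
        Slurped result = new Slurped();
        for (Path file : logFiles) {
            // "Assume unlimited memory": each file is read fully into memory.
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (isSilentError(line)) {
                    result.silentErrors++;
                }
                if (isCseLine(line)) {
                    result.cseLines.add(line);
                }
            }
        }
        return result;
    }

    /** Final phase: write a small CSV summary into the report directory. */
    static Path report(Slurped slurped, Path reportDir) throws IOException {
        Files.createDirectories(reportDir);
        Path out = reportDir.resolve("summary.csv");
        List<String> rows = Arrays.asList(
                "metric,value",
                "cse_lines," + slurped.cseLines.size(),
                "silent_errors," + slurped.silentErrors);
        Files.write(out, rows, StandardCharsets.UTF_8);
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java -jar cse_spelunker.jar DIR LOGFILES...");
            System.exit(1);
        }
        Path reportDir = Paths.get(args[0]);      // first arg: report directory
        List<Path> logFiles = new ArrayList<>();  // remaining args: log files
        for (int i = 1; i < args.length; i++) {
            logFiles.add(Paths.get(args[i]));
        }
        report(slurp(logFiles), reportDir);
    }
}
```

Keeping each phase as a static method that takes and returns a plain intermediate object makes the phases easy to unit test separately, and leaves room to swap the in-memory lists for disk-backed storage later if memory becomes a problem.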