CSE Data Mining

Collection sheet entry log data mining

This page is a historical record of findings, along with the original brainstorm on what data should be mined from collection sheet entry (CSE) logs.

Findings

  • Cards 1862 and 2086 were closed.
  • The cseDataMiner spits out some generic aggregated statistics, plus time-of-day vs. CSE submit time data.
    • 5% of submissions encounter database errors and roll back (previously a silent error)
  • Van analyzed the time-of-day vs. CSE submit time CSV data in this spreadsheet, finding that CSE submit times increase dramatically even as the number of submits stays roughly constant, hovering around 100 submits per 15-minute increment.
  • Initially, it was thought that loading everything into a database and then running SQL queries would be the best analysis path, but this may violate YAGNI; the Java code could be simplified to spit out CSV, which could then be crunched much further in Excel.
  • Quantizing (say, providing a CSV of time-of-day vs. submit times in 15-minute chunks) could be done in the Java code, as sketched below.
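
A minimal sketch of what that quantizing might look like in the Java code, assuming submit timestamps and durations have already been extracted from the logs (the SubmitTimeQuantizer name and the CSV columns are hypothetical):

    import java.io.PrintWriter;
    import java.util.Calendar;
    import java.util.Date;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    /** Buckets CSE submits into 15-minute time-of-day chunks (hypothetical sketch). */
    public class SubmitTimeQuantizer {
        private static final int BUCKET_MINUTES = 15;

        // bucket index (0..95) -> { sum of submit millis, submit count }
        private final SortedMap<Integer, long[]> buckets = new TreeMap<Integer, long[]>();

        public void add(Date submitTimestamp, long submitMillis) {
            Calendar cal = Calendar.getInstance();
            cal.setTime(submitTimestamp);
            int minuteOfDay = cal.get(Calendar.HOUR_OF_DAY) * 60 + cal.get(Calendar.MINUTE);
            int bucket = minuteOfDay / BUCKET_MINUTES;
            long[] totals = buckets.get(bucket);
            if (totals == null) {
                totals = new long[2];
                buckets.put(bucket, totals);
            }
            totals[0] += submitMillis; // running sum of submit times
            totals[1]++;               // number of submits in this chunk
        }

        /** One CSV row per 15-minute chunk: start minute, submit count, mean submit millis. */
        public void writeCsv(PrintWriter out) {
            out.println("minuteOfDay,submits,meanSubmitMillis");
            for (Map.Entry<Integer, long[]> e : buckets.entrySet()) {
                long[] t = e.getValue();
                out.println((e.getKey() * BUCKET_MINUTES) + "," + t[1] + "," + (t[0] / t[1]));
            }
        }
    }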

Brainstorm

(for historical reference)

Useful stats we'd like gathered

  • time-of-day vs. time to save collection sheet data
  • time-of-day vs. average time between page hits
  • apparent resubmits - same officeID, centerID, and attendance data
  • failed submits - "after saveData()" not seen for a particular sessionID/branchID/centerID
  • count of (previously) silent errors in callees of BulkEntryAction.saveData()
    • for instance, "unable to save loan account(s). personnel ID=" ...
  • submit times: min, max, mean, median, standard deviation (see the sketch after this list)
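
Those summary statistics need no library support; a sketch along these lines would cover them, assuming submit times have been collected as raw milliseconds (the SubmitTimeStats name is hypothetical):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    /** Min/max/mean/median/standard deviation over submit times in millis (hypothetical sketch). */
    public final class SubmitTimeStats {
        public static String summarize(List<Long> submitMillis) {
            if (submitMillis.isEmpty()) {
                return "no submits";
            }
            List<Long> sorted = new ArrayList<Long>(submitMillis);
            Collections.sort(sorted);
            long min = sorted.get(0);
            long max = sorted.get(sorted.size() - 1);
            double mean = 0;
            for (long t : sorted) {
                mean += t;
            }
            mean /= sorted.size();
            // median: middle element, or the average of the two middle elements
            int mid = sorted.size() / 2;
            double median = (sorted.size() % 2 == 1)
                    ? sorted.get(mid)
                    : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
            // population standard deviation
            double sumSq = 0;
            for (long t : sorted) {
                sumSq += (t - mean) * (t - mean);
            }
            double stdDev = Math.sqrt(sumSq / sorted.size());
            return String.format("min=%d max=%d mean=%.1f median=%.1f stddev=%.1f",
                    min, max, mean, median, stdDev);
        }
    }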

Possibly useful stats we'd like gathered

  • total centers modified
  • incomplete entries (not every page was hit)
  • submits per sessionID

Error conditions to capture when parsing logs

  • same sessionID, different userID
  • flow through the process (load → get → preview → create) doesn't progress forward in time (sketch after this list)
  • any correlations in errors vs. branch/user/office/center
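
The flow check above amounts to verifying that each step's timestamp is no earlier than its predecessor's; a minimal sketch, assuming the per-session step timestamps have already been collected in flow order:

    import java.util.Date;
    import java.util.List;

    /** Flags sessions whose load -> get -> preview -> create steps run backwards in time
        (hypothetical sketch). */
    public final class FlowOrderCheck {
        /** @param stepTimestamps timestamps of the load, get, preview, and create steps, in flow order */
        public static boolean progressesForwardInTime(List<Date> stepTimestamps) {
            for (int i = 1; i < stepTimestamps.size(); i++) {
                if (stepTimestamps.get(i).before(stepTimestamps.get(i - 1))) {
                    return false; // a later step was logged earlier than its predecessor
                }
            }
            return true;
        }
    }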

Development approach

  • dev environment: Java, Maven, Eclipse
  • start with a single POJO with a main() method (skeleton sketched after this list)
  • executing
    • java -jar cse_spelunker.jar DIR LOGFILES...
      • first arg: directory to store report file
      • remaining args: log files
  • assume unlimited memory
    • if this becomes a problem, consider
      • cache or database as intermediate storage
      • replace simple data structures with ones backed by a disk cache or an embedded database
      • key sessions by an int rather than the session string
      • use a strict multiple-phase approach, e.g., first suck in matched log lines, then analyze, then report results; phases communicate via intermediate forms (a pipeline)
  • slurp in all CSE-like log lines (line parsing sketched after this list)
    • but count errors
  • while slurping, build data structures for analysis phase
    • main data structure is an array of arrays
    • also count previously-silent errors
  • parse dates to Date objects
    • is Joda Time needed/easier?
    • ignore timezone, initially
  • after the last log line, run analyses that require all the data
    • like what?
  • reporting
    • spit out CSV at least, perhaps simple graphs (CSV writing sketched after this list)
      • Open Flash Chart?
    • HTML report (all files written to a single directory)
      • FreeMarker templates?
  • unit tests
    • none identified yet
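
To make the plan above concrete, the single POJO with a main() method might start out like this skeleton (the CseSpelunker class name is hypothetical; only the argument handling follows the notes above):

    import java.io.File;

    /** Single POJO with a main() method, per the development approach (hypothetical skeleton). */
    public class CseSpelunker {
        public static void main(String[] args) {
            if (args.length < 2) {
                System.err.println("usage: java -jar cse_spelunker.jar DIR LOGFILES...");
                System.exit(1);
            }
            File reportDir = new File(args[0]); // first arg: directory to store report file
            for (int i = 1; i < args.length; i++) {
                File logFile = new File(args[i]); // remaining args: log files
                // slurp phase: match CSE-like lines from logFile (parser sketched below)
            }
            // then: analyses that need all the data, then reporting into reportDir
        }
    }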
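
For the slurping and date-parsing steps, a line parser along these lines would pull the timestamp off each line and count lines that fail to match; the log4j-style timestamp layout assumed here is a guess that would need checking against real logs. It sticks with plain SimpleDateFormat and ignores timezone, leaving the Joda Time question open:

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Pulls the timestamp off a CSE-like log line and counts lines that fail to parse
        (hypothetical sketch; the timestamp layout must be confirmed against real logs). */
    public class CseLogLineParser {
        // assumed log4j-style prefix: "2009-06-15 14:32:01,123 ..."
        private static final Pattern TIMESTAMP =
                Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),\\d{3}");

        // plain SimpleDateFormat for now; revisit Joda Time if this gets hairier.
        // Timezone is ignored, per the notes above.
        private final SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        private int unparseableLines;

        /** @return the line's timestamp, or null (and a bumped error count) if it can't be parsed */
        public Date parseTimestamp(String logLine) {
            Matcher m = TIMESTAMP.matcher(logLine);
            if (m.lookingAt()) {
                try {
                    return format.parse(m.group(1));
                } catch (ParseException e) {
                    // fall through to the error count
                }
            }
            unparseableLines++;
            return null;
        }

        public int getUnparseableLineCount() {
            return unparseableLines;
        }
    }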
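
For reporting, spitting out CSV could be as simple as this sketch, with every report file landing in the single report directory (the CsvReportWriter name and its API are hypothetical):

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    /** Writes one CSV report file into the report directory (hypothetical sketch). */
    public class CsvReportWriter {
        private final File reportDir;

        public CsvReportWriter(File reportDir) {
            this.reportDir = reportDir;
        }

        /** @param rows each row already comma-joined; the header goes out first */
        public void write(String fileName, String header, Iterable<String> rows) throws IOException {
            PrintWriter out = new PrintWriter(new FileWriter(new File(reportDir, fileName)));
            try {
                out.println(header);
                for (String row : rows) {
                    out.println(row);
                }
            } finally {
                out.close(); // all report files land in the single report directory
            }
        }
    }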