CopyCat — xiaohe(Carol)yang

S3 Workload Simulation Model - CopyCat

The data consists of S3 audit log files (5 GB to 30 GB each, about 300 files in total), from which a parser extracts the useful fields. The parser processes each 3 GB chunk in 2 minutes. The CTGAN and FP-Growth algorithms are used to explore the audit log data: CTGAN generates synthetic data from the audit log dataset, and FP-Growth mines the frequent S3 operation sequences within the audit logs.
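To make the idea of mining frequent operation sequences concrete, here is a minimal sketch. It is not FP-Growth: it counts contiguous operation n-grams by brute force, a simpler stand-in that yields the same kind of output (operation patterns with their support). The function name, the operation names, and the sample sequences are hypothetical and do not come from the project's data.

```python
from collections import Counter


def frequent_op_sequences(sequences, length=2, min_support=2):
    """Return contiguous operation n-grams that occur in at least
    `min_support` of the given per-object sequences.

    Simple counting stand-in for frequent-pattern mining; not FP-Growth.
    """
    support = Counter()
    for ops in sequences:
        # A set ensures each n-gram is counted once per object sequence.
        seen = {tuple(ops[i:i + length]) for i in range(len(ops) - length + 1)}
        support.update(seen)
    return {gram: n for gram, n in support.items() if n >= min_support}


# Hypothetical per-object operation sequences (illustrative only).
sequences = [
    ["PUT", "GET", "GET", "DELETE"],
    ["PUT", "GET", "DELETE"],
    ["PUT", "HEAD", "GET"],
]
frequent = frequent_op_sequences(sequences, length=2, min_support=2)
```

With this sample input, only the pairs PUT then GET and GET then DELETE reach the support threshold of two sequences.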

Statistical S3 workload simulation model: it takes audit logs as input, groups the entries by object, and collects the sequence of operations on each object in a cache. Once the cache limit is reached, the sequences found are pushed to a second storage structure that records the frequency of each sequence along with statistics on inter-arrival time and object size. These statistics are used to ensure that the sequences grouped together are similar.
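The cache-and-flush flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the CopyCat implementation: the class and field names are hypothetical, the cache limit is taken to be a number of distinct objects, the whole cache is flushed once the limit is reached, and one object size is recorded per sequence.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SequenceStats:
    count: int = 0                                       # frequency of the sequence
    inter_arrivals: list = field(default_factory=list)   # gaps between operations
    sizes: list = field(default_factory=list)            # one object size per sequence


class SequenceModel:
    def __init__(self, cache_limit):
        self.cache_limit = cache_limit            # assumed: max distinct objects cached
        self.cache = {}                           # object key -> [(timestamp, op, size)]
        self.store = defaultdict(SequenceStats)   # op tuple -> aggregated stats

    def observe(self, key, timestamp, op, size):
        """Record one audit log entry, grouped by object key."""
        self.cache.setdefault(key, []).append((timestamp, op, size))
        if len(self.cache) >= self.cache_limit:
            self.flush_all()

    def flush_all(self):
        """Push every cached sequence into the frequency/statistics store."""
        for events in list(self.cache.values()):
            ops = tuple(op for _, op, _ in events)
            times = [t for t, _, _ in events]
            stats = self.store[ops]
            stats.count += 1
            stats.inter_arrivals.extend(b - a for a, b in zip(times, times[1:]))
            stats.sizes.append(events[0][2])  # assumption: size of first event
        self.cache.clear()


# Hypothetical entries: (object key, timestamp in seconds, operation, size in bytes).
entries = [
    ("a", 0.0, "PUT", 100),
    ("b", 1.0, "PUT", 200),
    ("a", 2.0, "GET", 100),
    ("b", 4.0, "GET", 200),
    ("c", 5.0, "PUT", 50),   # third object reaches the limit and triggers a flush
]
model = SequenceModel(cache_limit=3)
for entry in entries:
    model.observe(*entry)
model.flush_all()  # flush whatever remains at end of input
```

Keying the store by the operation tuple means identical sequences from different objects share one entry, so their frequency and their inter-arrival and size samples accumulate together and can later be summarized or compared.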