background VMs could increase/decrease activity at periodic intervals and
appear periodic. Figure 6 shows, for the whole platform, the
percentage of total (left), first-party (middle), and third-party (right) core
hours in each class. The “Unknown” class represents the VMs that do not last 3
successive days. Reviewing the classification per subscription again
demonstrates that most subscriptions behave consistently, so focusing on
subscriptions should increase prediction accuracy.
VM inter-arrival times:
Figure 7 portrays the arrival time series at hourly granularity.
Arrivals are strongly diurnal, with lower load on weekends, regardless of
the type of workload. For resource management, this means that the VM
scheduler must be optimized for high throughput.
To create these predictions, Resource Central (RC) is introduced
as a system for ingesting VM telemetry, learning from past VM behaviors,
producing models that can foresee these behaviors, and executing the models
(i.e., giving forecasts) when client systems request them. Although we
concentrate on predicting VM behaviors, RC is general and can be used for
learning/predicting server behaviors too, such as hardware failures.
3.1. RC use-cases:
Smart VM scheduling:
Before choosing servers to run a set of new VMs, the VM scheduler
can contact RC for predictions of the VMs’ expected resource utilizations, which
enables the scheduler to select servers so as to balance the disk IOPS load.
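As a rough illustration of this use-case, the sketch below places each new VM on the server with the lowest predicted disk IOPS load. The `predict_iops` callable is a hypothetical stand-in for a query to RC, and the `Server` class and the greedy least-loaded policy are assumptions made for illustration; the source does not describe the scheduler's actual placement rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Server:
    name: str
    predicted_iops: float = 0.0  # sum of predicted IOPS of the VMs placed here


def place_vms(
    vm_ids: List[str],
    servers: List[Server],
    predict_iops: Callable[[str], float],  # hypothetical stand-in for an RC query
) -> Dict[str, str]:
    """Greedily place each VM on the server with the lowest predicted IOPS load."""
    placement: Dict[str, str] = {}
    # Handle the heaviest predicted VMs first so they spread across servers.
    for vm in sorted(vm_ids, key=predict_iops, reverse=True):
        target = min(servers, key=lambda s: s.predicted_iops)
        target.predicted_iops += predict_iops(vm)
        placement[vm] = target.name
    return placement


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    fake_predictions = {"vm-a": 900.0, "vm-b": 500.0, "vm-c": 400.0}
    servers = [Server("s1"), Server("s2")]
    print(place_vms(list(fake_predictions), servers, fake_predictions.__getitem__))
```

A real scheduler would weigh several resources and constraints at once; the point here is only that the prediction is obtained before the placement decision is made.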
Smart cluster selection:
The cluster selection system can query RC for a prediction of
maximum deployment size, so that it can select a cluster that will
likely have enough resources.
Smart power oversubscription and capping:
When the power draw is about to exceed a circuit
breaker limit, the system can query RC for predictions of VM workload
interactivity, so that interactive and delay-insensitive workloads are isolated
in different sets of servers.
Scheduling server maintenance:
When a server starts to misbehave, the system can query RC for
the expected lifetime of the VMs running on the server, which determines
when maintenance can be scheduled and whether VMs need to be live-migrated.
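One way such a decision could look is sketched below. It assumes that RC's lifetime prediction is available as a numeric estimate of remaining hours per VM and that the operator sets a maximum deferral window (`max_wait_hours`); both are hypothetical, as is the function itself. The source does not specify how predictions are turned into a maintenance plan.

```python
from typing import Dict, List, Tuple


def plan_maintenance(
    predicted_remaining_hours: Dict[str, float],  # VM id -> predicted remaining lifetime
    max_wait_hours: float,  # hypothetical limit on how long maintenance may be deferred
) -> Tuple[float, List[str]]:
    """Return (hours to wait before maintenance, VMs to live-migrate)."""
    # VMs expected to finish within the window can be allowed to complete.
    finishing = [h for h in predicted_remaining_hours.values() if h <= max_wait_hours]
    # VMs expected to outlive the window have to be moved off the server.
    to_migrate = sorted(
        vm for vm, h in predicted_remaining_hours.items() if h > max_wait_hours
    )
    # Wait until the last short-lived VM is predicted to end (0 if there are none).
    wait = max(finishing, default=0.0)
    return wait, to_migrate


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    plan = plan_maintenance({"vm-1": 2.0, "vm-2": 5.0, "vm-3": 240.0}, max_wait_hours=24.0)
    print(plan)
```

With these made-up inputs, the two short-lived VMs are left to finish and only the long-lived one is marked for live migration.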
Recommending VM and deployment sizes:
Using RC predictions of workload class and resource utilization,
the service could recommend deployments where VMs predicted to be
delay-insensitive would be more tightly sized than interactive VMs.
3.2. RC Design:
3.2.1. Design Principles:
1) For performance and availability, RC should be an independent
and general system.
2) For maintainability, it should be simple and rely on
existing well-supported infrastructure.
3) For usability, it should require minimal changes to the systems
that use it, and provide an interface that is general enough for many
Figure 8 outlines the RC architecture. RC has offline and online
components. The offline workflow comprises several tasks: data extraction,
cleanup, aggregation, feature data generation, training, validation, and machine
learning (ML) model generation. The online part of RC uses a single, general,
and thread-safe client dynamically linked library (DLL), within which the ML
models execute to produce predictions. This DLL is the only view of RC for all
The client (e.g., the VM scheduler) calls the DLL, passing as input the
model name and information about the VM(s) for which it wants predictions. We
refer to this information as the client inputs to the model; examples are
subscription id, VM type and size, and deployment size. In addition to the client
inputs, a model may require historical feature data as additional inputs. For
example, the lifetime model would also require information on historical lifetimes
for the same subscription.
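The call pattern described above can be sketched as follows. This is a minimal illustration, not RC's actual interface: the `PredictionClient` class, its `predict` method, the input key names, and the toy lifetime rule are all hypothetical, and the sketch does not address thread safety or how feature data reaches the client.

```python
from typing import Any, Callable, Dict, Mapping

# A model maps (client inputs, historical feature data) to a prediction.
Model = Callable[[Mapping[str, Any], Mapping[str, Any]], Any]


class PredictionClient:
    """Hypothetical stand-in for the RC client library."""

    def __init__(self, models: Dict[str, Model], feature_data: Dict[str, Dict[str, Any]]):
        self._models = models
        self._feature_data = feature_data  # historical features keyed by subscription id

    def predict(self, model_name: str, client_inputs: Mapping[str, Any]) -> Any:
        model = self._models[model_name]
        # Look up historical feature data for the caller's subscription, if any.
        history = self._feature_data.get(client_inputs["subscription_id"], {})
        return model(client_inputs, history)


def toy_lifetime_model(inputs: Mapping[str, Any], history: Mapping[str, Any]) -> Any:
    """Toy rule, not a trained model: average of the subscription's past lifetimes."""
    past = history.get("past_lifetimes_hours", [])
    return sum(past) / len(past) if past else None


if __name__ == "__main__":
    client = PredictionClient(
        models={"lifetime": toy_lifetime_model},
        feature_data={"sub-1": {"past_lifetimes_hours": [2.0, 4.0]}},  # made-up data
    )
    inputs = {"subscription_id": "sub-1", "vm_type": "IaaS", "vm_size": "small",
              "deployment_size": 3}
    print(client.predict("lifetime", inputs))
```

The caller supplies only the model name and the client inputs; the historical feature data is joined in behind the interface, which is what keeps the changes required in client systems small.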