In modern cloud operations, multiple services run continuously on many different platforms, across a broad spectrum of hardware, network, and software environments. Broadly, they can be summarised as having the following properties,

  • Multiple interacting applications running on multiple heterogeneous platforms.
    • A range of versions of the same application running live at any one time.
    • A number of different applications acting together to provide an end-user service.

Service robustness and health are handled by an overarching management framework, which reacts to non-performant infrastructure and application code by routing, scaling, and re-provisioning resources in a predetermined manner to maintain or recover platform health and performance.

Fragmentation and the Polylithic Service

In recent years, the architecture of applications and services has moved significantly towards much smaller, and much more isolated, microservices. Whereas a service might previously have comprised a handful of applications working cooperatively, it is now not uncommon for an end-user service to be provided by hundreds or even thousands of active applications. This situation is expected to continue and become ever more pronounced, both within current service frameworks and in the integration of ‘Internet of Things’ applications into cloud infrastructures.

In addition, newer deployment pipeline methods and philosophies deliver new code from development to live production systems in an automated fashion, at a frequency now measured in minutes and hours rather than weeks and months. This leads inevitably to,

  • Increased heterogeneity within even a single defined application, with many subtly different versions live at any one time.

  • Exponential growth in interactions between applications that need to interact, synchronously or asynchronously.

  • Exponential variation in the interaction combinations present in the running codebase of communicating applications.

  • Infrastructure and applications generating hundreds of thousands of log events a second, and a similar number of metric events.
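One way to get a feel for the growth described above is to count possibilities under deliberately simple assumptions. The sketch below is illustrative only: it assumes any pair of applications may interact, and that every application in a call chain has the same number of live versions. The function names `pairwise_links` and `version_combinations` are hypothetical. Under these assumptions the number of possible application pairs grows quadratically with the number of applications, while the number of version combinations along a call chain grows exponentially with the length of the chain.

```python
from math import comb


def pairwise_links(n_apps: int) -> int:
    """Upper bound on distinct application pairs that could interact."""
    return comb(n_apps, 2)


def version_combinations(versions_per_app: int, chain_length: int) -> int:
    """Distinct version combinations along a call chain.

    Assumes each of the `chain_length` applications in the chain has
    `versions_per_app` versions live at the same time.
    """
    return versions_per_app ** chain_length


if __name__ == "__main__":
    for n in (5, 100, 1000):
        print(f"{n} applications: up to {pairwise_links(n)} interacting pairs")
    for k in (5, 10):
        print(f"chain of {k}, 3 versions each: {version_combinations(3, k)} combinations")
```

Real services sit well below these upper bounds, since most applications talk to only a few neighbours, but the counts indicate why exhaustive testing of every interaction combination quickly becomes impractical.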

Complexity and Automation

With the exponential growth in the number of applications and interactions, and the by-design ephemeral nature of modern cloud components, the need to find better ways of automating manual intervention in managing applications is ever greater.

Currently, automation and orchestration techniques at scale are dominated by management frameworks. Management-layer processes monitor resources and running processes and react in defined ways, such as scaling out for performance, migrating resources to utilise capacity, and starting or terminating applications based on performance metrics.
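The kind of predetermined reaction described above can be sketched as a small rule table. This is a minimal illustration rather than the behaviour of any particular management framework: the `Metrics` fields, the `decide` function, the action names, and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical snapshot a management layer might collect for one application."""
    cpu_utilisation: float  # 0.0 to 1.0, averaged across replicas
    healthy: bool           # result of a health check
    replicas: int           # number of running instances


def decide(m: Metrics,
           scale_out_above: float = 0.8,
           scale_in_below: float = 0.2,
           min_replicas: int = 1) -> str:
    """Return a predetermined action for the observed metrics.

    The thresholds are illustrative defaults, not recommended values.
    """
    if not m.healthy:
        return "restart"        # recover an unhealthy application
    if m.cpu_utilisation > scale_out_above:
        return "scale_out"      # add capacity for performance
    if m.cpu_utilisation < scale_in_below and m.replicas > min_replicas:
        return "scale_in"       # release unused capacity
    return "no_op"


if __name__ == "__main__":
    print(decide(Metrics(cpu_utilisation=0.93, healthy=True, replicas=4)))
```

Every rule here lives in the management layer and sees only generic metrics, which is the arrangement whose limits at scale are discussed in the next section.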

Failures of Management Architectures at Scale

The rapid growth in scale puts increasing pressure on management frameworks and monitoring systems. Substantial problems arise when operating at this level of diversity and number.

  • There are so many heterogeneous iterations of a single application that coordinating communications between each group of applications in a service is fraught with error and subtle bugs. These are often impossible to identify, due both to the sheer quantity of real-time logs and to the complexity of chained asynchronous communications between many interacting microservice components.

  • There are so many metrics and logs that automated reaction systems struggle to cope. The overwhelming amount of performance, error, and security data becomes difficult to interpret and extract meaning from in real time.

  • Ever-increasing amounts of network traffic within the cloud environment that relate purely to orchestration and monitoring.

  • A substantial increase in false-positive and false-negative triggers in alerting systems.

  • Service domain-specific knowledge required to understand an application’s performance is lost in abstraction as systems scale to handle the mass of information.

Embedding Intelligence in the Service Agent

The aim of my future research work is to identify means of moving the knowledge, intelligence, lifecycle, and inter-microservice interaction co-ordination from the management framework into the microservice itself. To do this, a level of learning and evolving behaviour must be embedded inside the microservice agent. Key questions that need to be answered include,

  • How much can each component know about its own health and lifecycle, and how can this knowledge be used in self-healing, replication, and termination?

  • Can we embed enough intelligence in each component to allow it to meaningfully adapt and respond to its environment?

  • With each service communicating internally with hundreds of other components, and externally with thousands of other services, how can the complexity generated by multitudes of asynchronous and synchronous communications be understood and managed automatically by each agent?

  • How can service agents sense and understand their virtual hardware and resource neighbourhood in a meaningful and actionable way?

  • Similarly, how can agents interact with that neighbourhood so that they can maintain their own health, and also the more abstract health of the encompassing service they combine to provide?

  • How do we design agents to self-organise with any other necessary service agents, forming groupings of micro applications that can create a larger, emergent service meeting a requested need?