What are the key concepts that define what we mean by a sustainable service on AWS?

  • All required data is persisted.
  • Containers can be restarted at any time without losing any essential data and without any disruption to the service.
  • Instances can be restarted at any time without losing any essential data and without any disruption to the service.
  • The infrastructure can be recreated at any time with automatic data recovery.
  • All required logs can be viewed without logging in to the server.
  • All logs and metrics that indicate a user-affecting service problem trigger an alert.
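
The alerting point above can be sketched in a few lines of Python. This is only an illustration: the `Signal` class, its fields, and the destination names `page`, `dashboard` and `record` are hypothetical and not part of any AWS API. It assumes that signals which do not affect users are kept as performance data rather than raised as critical alerts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """A log pattern or metric reading (hypothetical shape, for illustration)."""
    name: str
    user_affecting: bool  # would a breach mean users see errors or downtime?
    breached: bool        # is the signal currently past its threshold?


def route(signal: Signal) -> str:
    """Decide what happens to a signal: 'page', 'dashboard' or 'record'."""
    if not signal.breached:
        # Nothing is wrong; the data is still shipped and stored.
        return "record"
    # Only user-affecting problems raise a critical alert;
    # everything else is treated as performance data.
    return "page" if signal.user_affecting else "dashboard"


signals = [
    Signal("http_5xx_rate", user_affecting=True, breached=True),
    Signal("cpu_utilisation", user_affecting=False, breached=True),
    Signal("healthy_host_count", user_affecting=True, breached=False),
]
routes = {s.name: route(s) for s in signals}
```

The decision of which signals count as user-affecting is made once, when the signal is defined, so nobody has to make that judgment in the middle of an incident.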

Reliable by Design

By designing for failure and persisting data, we can provide a sustainable service that looks after itself.

  • All containers must set themselves up on startup, loading any persistent data required.
  • All instances must set themselves up on startup, installing and configuring applications, containers, and persistent data as required.
  • Data held on containers or instances is considered ephemeral. Persisted data can only be relied upon if it is stored in a reliable multi-zone database such as RDS, or on S3.
  • Redundant access can be maintained by load balancing instances through an ELB. The minimum requirements are:
    • 2 or more online instances in different availability zones
    • Proper health checks on the ELB that confirm the instances are truly available
    • The auto-scaling group uses the ELB health check
  • Logging and monitoring data is held completely separately from the service
    • All required logs and metrics are shipped to this location
    • No SSH access is required for additional information
    • All logs and metrics that indicate a service problem are monitored and alerted on
    • Non-essential logs or metrics are treated as performance data and do not generate critical alerts
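
The minimum ELB requirements listed above can be checked mechanically. The following Python sketch assumes its input is a dictionary shaped like one entry of the `AutoScalingGroups` list returned by boto3's `describe_auto_scaling_groups`, using the `HealthCheckType`, `Instances`, `AvailabilityZone`, `LifecycleState` and `HealthStatus` fields. The function name and the example data are hypothetical. It does not judge whether the ELB health check itself is meaningful; that still needs a human review.

```python
from typing import Dict, List


def check_minimum_redundancy(group: Dict) -> List[str]:
    """Return a list of problems; an empty list means the group passes."""
    problems = []

    # Count only instances that are in service and reported healthy.
    online = [
        i for i in group.get("Instances", [])
        if i.get("LifecycleState") == "InService"
        and i.get("HealthStatus") == "Healthy"
    ]
    if len(online) < 2:
        problems.append("fewer than 2 online instances")

    # Online instances must be spread over at least two availability zones.
    zones = {i.get("AvailabilityZone") for i in online}
    if len(zones) < 2:
        problems.append("online instances are not in 2 or more availability zones")

    # The auto-scaling group must act on the ELB health check.
    if group.get("HealthCheckType") != "ELB":
        problems.append("auto-scaling group does not use the ELB health check")

    return problems


# Hypothetical example: two healthy instances, but both in the same zone,
# and the group still uses the default EC2 health check.
example_group = {
    "HealthCheckType": "EC2",
    "Instances": [
        {"AvailabilityZone": "eu-west-1a", "LifecycleState": "InService", "HealthStatus": "Healthy"},
        {"AvailabilityZone": "eu-west-1a", "LifecycleState": "InService", "HealthStatus": "Healthy"},
    ],
}
problems = check_minimum_redundancy(example_group)
```

Here `problems` holds two entries, one for the single availability zone and one for the health check type. Against live infrastructure, the same function could be run over each group in the boto3 response.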