theodesp / stable-systems-checklist

Stable Systems Checklist

Below is an opinionated list of attributes and policies that need to be met in order to establish a stable software system.

Preparation

  • Developers are in control of the Software and they own the code.
  • Work in small units every time: fix, deploy, develop.
  • Keep each project team small: no more than 6 people.
  • Every project has a win condition.
  • Show the feasibility of the project by working on an initial seed for no longer than 3-4 days.
  • Gamble any new technology sensibly.
  • Define the scope of the project. Make sure it is not too broad.
  • Define feature extensions of each project.
  • Run experiments and flag them as analysis or pre-planning work for the real project.
  • Be on guard and look for bad APIs.
  • Data should be clean and not garbage.

Process & People

  • Pick the parts of Agile/XP/SCRUM/Kanban that work for the team and kill the rest.
  • Prefer asynchronous communication.
  • Know how different people on your team like to work with the code base.

System Planning

  • The system is built for production.
  • Design a shared-nothing architecture.
  • You build your system as a 12 factor app.
    • Use revision control with many deploys.
    • Declare dependencies with package managers.
    • Store configuration in the environment.
    • Track backend services as resources.
    • Use separate build and run stages.
    • Execute the app as one or more stateless processes.
    • Export services via port binding.
    • Scale out via the process model. Never daemonize or write PID files. Use process managers.
    • Processes shut down gracefully when they receive a SIGTERM. They have a fast startup and graceful shutdown.
    • Keep development, staging, and production as similar as possible. Tools such as Vagrant allow developers to run local environments.
    • Treat logs as event streams.
    • Run admin/management tasks as one-off processes. For example, Django manage.py commands.
  • The system is a set of modules with loose coupling.
  • Modules communicate loosely via a protocol.
  • Design protocols for future extension. Design each module for independence. Design each module so it could be ripped out and placed in another system and still work.
  • Avoid deep dependency hierarchies.
  • Avoid intermediaries that parse and interpret data.
  • Have a supervision/restart strategy.
  • Prefer ratcheting methods via idempotence.
  • Use a unique ID on all messages. You can then always retry a message after a timeout and be sure it won’t be rerun by the receiving system, provided the receiver keeps a log of what it has already done.
  • UNIX principle: each tool does one thing well.
  • Define the capacity of the system up front.
  • Decouple your SLA.
  • Put limits into other application-level protocols. HTTP, RPC, etc.
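
The unique-ID point above can be illustrated with a small sketch. It assumes a hypothetical in-memory `Receiver` that keeps a log of the message IDs it has already processed; a real system would persist that log. The class, method names, and the "balance" example are assumptions made for illustration only.

```python
import uuid


def new_message_id():
    """Generate a unique ID for a message (hypothetical helper)."""
    return str(uuid.uuid4())


class Receiver:
    """Toy receiver that applies each message at most once."""

    def __init__(self):
        self.balance = 0
        # Log of completed work: message ID -> result returned for it.
        self._done = {}

    def handle(self, message_id, amount):
        # A retried message is recognized by its ID and not applied again.
        if message_id in self._done:
            return self._done[message_id]
        self.balance += amount
        self._done[message_id] = self.balance
        return self.balance


receiver = Receiver()
msg_id = new_message_id()
receiver.handle(msg_id, 10)
# The sender timed out and retries with the same ID: no double effect.
receiver.handle(msg_id, 10)
```

Because the receiver returns the stored result for a known ID, the sender can retry freely after a timeout without knowing whether the first attempt succeeded.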

Setup

  • First you build an empty project.
  • Add this empty project to continuous integration.
  • Deploy the empty project into staging.
  • Once this works, you start building the application.
  • Preconfigure your systems so you need no external dependencies when deploying.
  • The same artifact is deployed to staging and production. It picks up a context from the environment and this context configures it.
  • Don’t use advanced technology too early on.
  • Lock dependencies to specific tags/versions.
  • Make upgrading dependencies a decision on your part.
  • Vendor everything.
  • Make a production deploy take less than 1 minute from button-push-to-operational-on-the-first-instance.
  • Build a default library you include in every application you write.
  • Let every application use the same library.
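
As a rough sketch of "the same artifact picks up a context from the environment", the following reads settings from environment variables. The variable names (`DATABASE_URL`, `LOG_LEVEL`, `REQUEST_TIMEOUT_SECONDS`) and the `Settings` type are assumptions chosen for the example, not part of the checklist.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    request_timeout_seconds: float


def load_settings(environ=os.environ):
    """Build settings from the environment; the artifact itself is unchanged."""
    return Settings(
        # Required: fail fast with KeyError if the context is missing.
        database_url=environ["DATABASE_URL"],
        # Optional values fall back to defaults.
        log_level=environ.get("LOG_LEVEL", "INFO"),
        request_timeout_seconds=float(environ.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )


# The same code yields different behavior in staging and production,
# depending only on the environment it is started in.
staging = load_settings({"DATABASE_URL": "postgres://staging-db/app"})
```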

Development

  • Correctness is more important than fast.
  • Elegant is more important than fast.
  • Code Quality is more important than fast.
  • Fast is not really important.
  • Build your system to collect metrics about itself as it runs.
  • Ship metrics to a central point for further analysis.
  • Use unit tests, property-based tests, type systems, static analysis, and profiling.
  • No vendor lock-in.
  • Use proven synchronization primitives.
  • No code formatting disputes.
  • Use load regulation at the border of the system.
  • Use a retry policy for failed requests. Consider delayed retries with exponential backoff.
  • Use a timeout policy for slow requests.
  • Use circuit breakers to break cascading dependency failure.
  • Use bulkheads to partition systems. Protect critical clients by giving them their own pool to call. Virtual servers provide an excellent mechanism for implementing bulkheads. At a smaller scale, bind processes to CPUs.
  • Try to utilize Soft/Weak references in order to minimize memory footprint.
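
The retry policy with exponential backoff mentioned above could look roughly like the sketch below. The function name, default delays, and the choice of which exceptions count as retryable are assumptions; production code would usually also add random jitter to the delay.

```python
import time


def retry(operation, attempts=4, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying on selected errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))


# Usage: an operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
```

Passing `sleep` as a parameter keeps the policy testable without waiting for real delays.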

Picking a database

  • Pick PostgreSQL as the default. If you need MongoDB-like functionality, create a jsonb column.
  • Export to Elasticsearch from PostgreSQL.
  • Use PgBouncer.
  • Isolate complex transactional interactions to a few parts of the store.
  • Look for idempotent ratcheting methods as an alternative.

Picking a programming language

  • Avoid the monoculture.
  • Know the weaknesses of a language.
  • The deployment tooling must be in place before use.
  • Use make. Use the same make targets for all projects in the organization.

Picking Architecture

  • Use REST.
  • Use REST specification formats such as OpenAPI or RAML.

Configuration

  • Secure defaults.
  • Persistent data lives outside of the artifact path, on a dedicated disk with dedicated quota.
  • Log rotation.
  • The artifact path is not writable by the application.
  • Use different credentials in production and staging.
  • Deny developers’ laptops easy access to the production environment.
  • Avoid the temptation of too early etcd/Consul/Chubby setups.

Operations

  • Optimize for sleep. The system must avoid waking people up in the middle of the night at all costs.
  • The system must be able to gracefully degrade.
  • The system runs out of monit, supervise, upstart, systemd, rcNG, SMF, or the like.
  • The application must gracefully stop and start if given the command to do so.
  • Every log file is shipped and indexed outside of the system. Every interesting metric too.
  • Don’t leave log files on production systems. Copy them to a staging area for analysis.
  • The only way to make changes to a production host is to redeploy.
  • Make it easy to roll back and downgrade a deployment.
  • In a production system you must be able to query its state in an ad-hoc fashion.
  • If you enable ad-hoc query and tracing on the system and then disable it again, there must be no segfaults, no kernel crashes and no long-term impact.
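
A minimal sketch of an application that stops gracefully when told to, assuming a worker loop that can finish its current unit of work before exiting. The names `request_stop`, `install_handlers`, and `run` are hypothetical; the process supervisor (systemd, monit, and so on) is expected to send SIGTERM.

```python
import signal
import threading

# Set when a stop has been requested; checked between units of work.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    stop_requested.set()


def install_handlers():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run(do_work, poll_seconds=0.1):
    """Run do_work() repeatedly until a stop is requested.

    The current unit of work always finishes before the loop exits.
    Returns the number of completed units.
    """
    completed = 0
    while not stop_requested.is_set():
        do_work()
        completed += 1
        # Wait between units, but wake up immediately on a stop request.
        stop_requested.wait(poll_seconds)
    return completed
```

A typical entry point would call `install_handlers()` and then `run(...)`, followed by any cleanup such as flushing logs or closing connections.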

Site Reliability

  • Hire Coders only.
  • Have an SLA for your service.
  • Measure performance based on your SLA.
  • Share 5% Operations work with developers.
  • Do postmortems after each event and focus only on processes, not people.

Tools

Debugging

  • DTrace
  • GDB

Cloud Storage

Security

Database

  • Use encryption for sensitive data.
  • All backups are stored encrypted as well.
  • Use minimal privilege for the database access user account.
  • Store and distribute secrets using a key store designed for the purpose.
  • Don’t hard code secrets in your applications.
  • Only use SQL prepared statements.
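
To illustrate why prepared (parameterized) statements matter, here is a small sketch using Python's built-in `sqlite3` module as a stand-in database. The `users` table and the helper names are assumptions for the example; the same idea applies to PostgreSQL drivers, which use their own placeholder syntax.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with one user (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
    conn.commit()
    return conn


def find_user(conn, email):
    # The value is passed separately from the SQL text, so it is always
    # treated as data and never parsed as part of the query.
    cursor = conn.execute("SELECT id, email FROM users WHERE email = ?", (email,))
    return cursor.fetchone()


conn = make_demo_db()
found = find_user(conn, "a@example.com")
# A classic injection attempt is just an unknown email address here.
not_found = find_user(conn, "' OR '1'='1")
```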

Development

  • Use vulnerability scanners for every version pushed to production.
  • Apply memory leak analyzers to your production runtime binaries.
  • Use race condition detection in your runtime binaries.
  • Acquire and investigate any vendor libraries for surprises and failure modes.

Authentication

  • Ensure all passwords are hashed using appropriate crypto such as bcrypt. Use secure random bytes.
  • Apply password rules that encourage users to have long, random passwords.
  • Use multi-factor authentication for your logins to all your service providers.

Denial of Service Protection

  • At a minimum, have rate limiters on your slower API paths and authentication related APIs like login and token generation routines.
  • Use CAPTCHA in front end.
  • Enforce sanity limits on the size and structure of user submitted data and requests.
  • Use a global caching proxy service like CloudFlare.
  • No single points of failure. Have redundancy on machines.
  • Use Bulkhead server partitioning. In essence, assign limited resources to specific (groups of) clients, applications, operations, client endpoints, and so on.
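
One common way to build the rate limiters mentioned above is a token bucket. The sketch below is a single-process, in-memory version with hypothetical names; a real deployment would typically keep one bucket per client or per API path and often share state across instances.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill according to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        # Caller should reject the request, for example with HTTP 429.
        return False


# Usage: at most 5 login attempts in a burst, refilled at 1 per second.
login_limiter = TokenBucket(rate=1, capacity=5)
allowed = login_limiter.allow()
```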

Web Traffic

  • Use the strict-transport-security header to force HTTPS on all requests.
  • Cookies must be httpOnly and secure and be scoped by path and domain.
  • Use Content Security Policy without allowing unsafe-* backdoors.
  • Use CSP Subresource Integrity for CDN content.
  • Use X-Frame-Options, X-XSS-Protection headers in client responses.
  • Use CSRF tokens in all forms.
  • Use the SameSite cookie attribute, which mitigates CSRF in newer browsers.
  • Keep as little in the session state as possible.
  • Use a robots.txt file to keep legitimate bots away.
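
A rough sketch of how some of the headers and cookie attributes above might be produced, using only the Python standard library. The header values (`max-age`, the CSP policy, `DENY`) and the function names are illustrative assumptions; each application needs to choose its own policy.

```python
from http.cookies import SimpleCookie


def security_headers():
    """Example response headers; the values are placeholders to adapt."""
    return {
        "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
        "Content-Security-Policy": "default-src 'self'",
        "X-Frame-Options": "DENY",
    }


def session_cookie(value, domain, path="/"):
    """Build a Set-Cookie value scoped by domain and path."""
    cookie = SimpleCookie()
    cookie["session"] = value
    morsel = cookie["session"]
    morsel["httponly"] = True      # not readable from JavaScript
    morsel["secure"] = True        # only sent over HTTPS
    morsel["samesite"] = "Strict"  # not sent on cross-site requests
    morsel["domain"] = domain
    morsel["path"] = path
    return morsel.OutputString()


header_value = session_cookie("abc123", "example.com")
```

The `samesite` attribute is supported by `http.cookies` from Python 3.8 onward.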

APIs

  • No resources are enumerable in your public API.
  • All users are fully authenticated and authorized appropriately when using your API.
  • Use canary checks in APIs to detect illegal or abnormal requests that indicate attacks.

Validation and Encoding

  • Do client-side input validation for quick user feedback, and validate again on the server.
  • Escape text before showing.
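
Escaping text before showing it can be as small as the sketch below, which uses `html.escape` from the Python standard library. The `render_comment` helper is hypothetical; template engines usually do this automatically when auto-escaping is enabled.

```python
import html


def render_comment(text):
    """Wrap user-supplied text in a paragraph, escaping HTML metacharacters."""
    # html.escape converts &, <, > and, by default, both quote characters.
    return "<p>" + html.escape(text) + "</p>"


safe = render_comment("<script>alert(1)</script>")
# safe == "<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>"
```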

Cloud Configuration

  • Ensure all services have minimum ports open.
  • Host backend database and services on private VPCs that are not visible on any public network.
  • Isolate logical services in separate VPCs and peer VPCs to provide inter-service communication.
  • Ensure all services only accept data from a minimal set of IP addresses.
  • Restrict outgoing IP and port traffic to minimize APTs and “botification”.
  • No root credentials.
  • Use minimal access privilege for all ops and developer staff.
  • Regularly rotate passwords and access keys according to a schedule.

Infrastructure

  • Ensure you can do upgrades without downtime. Automated.
  • Create all infrastructure using a tool such as Terraform, and not via the cloud console. Have zero tolerance for any resource created in the cloud by hand.
  • Use centralized logging for all services. You should never need SSH to access or retrieve logs.
  • Don’t SSH into services except for one-off diagnosis. Using SSH regularly typically means you have not automated an important task.
  • Don’t keep port 22 open on any AWS service groups on a permanent basis. If you must use SSH, only use public key authentication and not passwords.
  • Create immutable hosts instead of long-lived servers that you patch and upgrade.
  • Protect infrastructure secrets with Centralized secret management tools like Vault or Keywhiz.

Operation

  • Power off unused services and servers.
  • Have a practiced security incident plan.

Test

  • Do Penetration Testing.
  • Do fuzz testing.
  • Everything is Auditable.
  • Identify whatever your most expensive transactions are, and double or triple the proportion of those transactions to see how your system handles stress.
  • Do Stress Tests.
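
As a very small illustration of fuzz testing, the sketch below throws random byte strings at a function and records every input that triggers an unexpected exception. The `fuzz` harness and the deliberately buggy `parse_record` target are hypothetical; dedicated fuzzers are coverage-guided and far more effective than this.

```python
import random


def parse_record(data):
    """Toy parser: the first byte is a length, the rest is the payload."""
    length = data[0]  # deliberate bug: IndexError on empty input
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated record")
    return payload


def fuzz(target, runs=200, max_len=32, seed=0, allowed=(ValueError,)):
    """Return the inputs for which target() raised an unexpected exception."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    inputs = [b""]  # always include the empty input as an edge case
    for _ in range(runs):
        size = rng.randrange(1, max_len + 1)
        inputs.append(bytes(rng.randrange(256) for _ in range(size)))
    failures = []
    for data in inputs:
        try:
            target(data)
        except allowed:
            pass  # documented, expected rejection of bad input
        except Exception:
            failures.append(data)
    return failures


crashing_inputs = fuzz(parse_record)
```

Here the harness reports the empty input, which the parser rejects with an `IndexError` instead of the documented `ValueError`.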

Security tools

Auditing

Encryption

References

License

CC0

To the extent possible under law, Theo Despoudis has waived all copyright and related or neighboring rights to this work.
