Site Reliability Engineering
Awesome Site Reliability Engineering ¶
A curated list of Site Reliability and Production Engineering resources.
What is Site Reliability Engineering?¶
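> “Fundamentally, it's what happens when you ask a software engineer to design an operations function.” - Ben Treynor Sloss, Vice President of Engineering at Google and founder of Google SRE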
Contributing¶
Please take a look at the contribution guidelines first. Contributions are always welcome!
Culture¶
- What is Site Reliability Engineering?
- Keys To SRE by Ben Treynor
- Google SRE Resources
- Notes from Production Engineering by Pedro Canahuati
- PostOps: Recovery from Operations
- Love DevOps? Wait 'till you meet SRE [Video]
- How Google Does Planet-Scale Engineering for Planet-Scale Infra
- Site Reliability Engineering at Facebook
- A History of Site Reliability Engineering at Uber
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox
- Site Reliability Engineers — Keeping Google up and running 24/7
- Site Reliability Engineering at Salesforce
- From SysAdmin to Netflix SRE - video and slides
- SRE@Google: Thousands of DevOps Since 2004
- Transactional System Administration Is Killing Us and Must be Stopped
- A hierarchy of SRE needs
- PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
- SRE: An incomplete guide to cultural Narnia - [Video]
- Putting Together Great SRE Teams
- Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air
- Toil: A Word Every Engineer Should Know
- Engineering Reliability into Web Sites: Google SRE
- DEVOPS & SRE AMA - Building High Performance Organizations
- John Allspaw's AMA on Incident Analysis and Postmortems
- Site Reliability Engineering by Paul Newson - Part 1 & Part 2
- How SysAdmins Devalue Themselves
- The Softer Side of DevOps
- SRE, noun. See also: confidence, trust.
- Site Reliability Engineering with Stephen Weinberg
- We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything!
- We are the Google Site Reliability Engineering team. Ask us Anything!
- The Ops Identity Crisis
- The Irreproducibility Of Bugs In Large-Scale Production Systems
- SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
- Microservices, DevOps and Production Complexity
- Introducing Google Customer Reliability Engineering
- Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)
- The difference between Site Reliability Engineering, System Administration, and DevOps
- SRE in the Small and in the Large
- SBSRE Meetup: Different SRE roles and challenges (Netflix)
- Panel: Who/What Is SRE?
- Hope Is Not a Strategy
- Tenets of SRE
- Site Reliability Engineering Demystified
- Is Site Reliability Engineering the True ‘Ops’ in DevOps?
- SRE vs. DevOps vs. Cloud Native: The Server Cage Match
- SRE: What’s The Big Idea?
- Building the SRE Culture at LinkedIn
- Podcast #111 – SRE: Occasionally Maintaining Infrastructure That You Hate
- Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
- Why should your app get SRE support? - CRE life lessons
- How SREs find the landmines in a service - CRE life lessons
- Making the most of an SRE service takeover - CRE life lessons
- The Cloudcast #301: SRE and Infrastructure Operations (Podcast)
- The SRE model
- Onboarding New Site Reliability Engineers
- Building Blocks for Site Reliability At Google
- Beyond Google SRE: What is Site Reliability Engineering like at Medium?
- Intelligent Site Reliability Engineering – A Machine Learning Perspective
- A crash course in LinkedIn's global site operations
- Google’s Site Reliability Engineering with Todd Underwood
- What is Site Reliability Engineering? (VMware)
- A Gentle Introduction to SRE
- Understanding Site Reliability Engineering through Movies and Books
- GOTO 2017 • Site Reliability Engineering at Google • Christof Leng
- The Makeup of a Successful Geographically Distributed SRE Team - Part 1 & Part 2
- Tech Leadership in SRE
- The Azure Podcast: Episode 227 - Azure SRE
- The human scalability of "DevOps"
- Podcast: Site Reliability Management with Mike Hiraga
- How a cat inspired system reliability at Knowlarity
- Getting Started with Site Reliability Engineering
- "Practical Applications of the Dickerson Pyramid" by Nat Welch
- LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations
- Interview with Betsy Beyer, Stephen Thorne of Google
- Less Risk Through Greater Humanity - Dave Rensin
- Getting Started with SRE - Stephen Thorne, Google
- Building Successful SRE in Large Enterprises
- Solving Reliability Fears with Site Reliability Engineering
- SRE vs. DevOps: competing standards or close friends?
- How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams
- Reliability Engineering – The Essential Discipline for Complex Systems
- The Modern Site Reliability Workbench on Top of OCI
- SRE in the Third Age
- About SRE and how (not) to apply it
- Transitioning a typical engineering ops team into an SRE powerhouse
- Making a Lion Bulletproof: SRE in Banking
- Identifying and tracking toil using SRE principles
- From Ops to SRE: Evolution of the OpenShift Dedicated Team
- Meeting reliability challenges with SRE principles
- A quick introduction to SRE principles
- The SRE I Aspire to Be
- SRE Cultural Values
- Are we there yet? Thoughts on assessing an SRE team’s maturity
- What SREs have to do with project-based services?
- Making operational work more visible
- SRE vs. DevOps: What’s the Difference Between Them?
Education¶
- Panel: Educating SRE
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- New to an SRE team?
- The Systems Engineering Side of Site Reliability Engineering
- Graduating from Bootcamp and interested in becoming a Site Reliability Engineer?
- So you want to be a Site Reliability Engineer?
- Spiraling Ops Debt & the SRE Coding Imperative
- So you want to be an SRE?
- What is the role of a Site Reliability Engineer?
- Lynda.com: DevOps Foundations: Site Reliability Engineering
- Incident Management Training: Wheel of Misfortune
- Site Unreliability Engineering [Video series]
- The Ultimate Guide to Structuring a 90-Day Onboarding Plan
- SRE fundamentals: SLIs, SLAs and SLOs
- How to Get Into SRE
- Do you have an SRE team yet? How to start and assess your journey
- How SRE teams are organized, and how to get started
- Why SRE Documents Matter
- How to get started with site reliability engineering (SRE)
- Duties of a Site Reliability Engineering Manager
- Designing distributed systems using NALSD flashcards
- Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program
- SRE Classroom: Distributed PubSub workshop
- School of SRE: Curriculum for onboarding non-traditional hires and new grads
Books¶
- Practical Linux Infrastructure
- Site Reliability Engineering: How Google Runs Production Systems
- The Site Reliability Workbook: Practical Ways to Implement SRE
- Observability Engineering: Achieving Production Excellence
- The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
- Web Operations - Keeping the Data On Time
- The Checklist Manifesto: How to Get Things Right
- Microservices in Production - Standard Principles and Requirements
- Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization
- Systems Performance: Enterprise and the Cloud [Sample chapter titled CPUs]
- Monitoring Distributed Systems: Case Studies from Google's SRE Teams
- The Human Side of Postmortems: Managing Stress and Cognitive Biases
- Chaos Engineering: Building Confidence in System Behavior through Experiment
- Post-Incident Reviews: Learning from Failure for Improved Incident Responses
- Antifragile Systems and Teams
- How to Monitor the SRE Golden Signals (E-Book)
- Incident Management for Operations
- Real-World SRE
- Seeking SRE
- What is SRE?
- Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications
- Building Secure and Reliable Systems
- Chaos Engineering: Crash test your applications
- 97 Things Every SRE Should Know
- Four Steps to Creating Effective Game Day Tests
- The Linux Programming Interface
Hiring¶
- SRE Hiring
- Hiring SREs at LinkedIn
- Hiring Site Reliability Engineers
- Hiring your first SRE
- Growing the Site Reliability Team at LinkedIn: Hiring is Hard
- Engineering Manager - Site Reliability Engineering Interview Preparation
Reliability¶
- The Realities of the Job of Delivering Reliability
- Fail at Scale by Ben Maurer
- Embracing Failure: Fault-Injection and Service Reliability
- 10 Years of Crashing Google
- How we break things at Twitter: failure testing
- Reliable Cron across the Planet
- Push our limits - reliability testing at Twitter
- The Verification of a Distributed System by Caitie McCaffrey
- Weathering the Unexpected
- SRE Hour: Tech Talks by Box & Yelp
- Simplicity: A Prerequisite for Reliability
- The Two Sides to Google Infrastructure for Everyone Else
- How Embracing Continuous Release Reduced Change Complexity
- Making "Push On Green" a Reality
- BeyondCorp: A New Approach to Enterprise Security
- Brainstorming Failure by Jeff Smith
- The Ripple Effect Of Outages And Downtime Cannot Be Underestimated
- The infrastructure behind Twitter: efficiency and optimization
- Dickerson's Hierarchy of Reliability
- The Morning Paper on Operability
- Production is all that matters
- Using load shedding to survive a success disaster - CRE life lessons
- How to avoid a self-inflicted DDoS Attack - CRE life lessons
- Don't gamble when it comes to reliability
- Resilience Engineering: Learning to Embrace Failure
- The Infrastructure Behind Twitter: Scale
- Scaling Reliability at Twitter: So You Want to Add a 9
- Principles Of Chaos Engineering
- Chaos Engineering
- Available...or not? That is the question - CRE life lessons
- How Google Backs Up The Internet Along With Exabytes Of Other Data
- Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements
- The Production Environment at Google - Part 1 & Part 2
- Reliable releases and rollbacks - CRE life lessons
- How release canaries can save your bacon - CRE life lessons
- Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
- Every Day Is Monday in Operations
- Under the Hood: Ensuring Site Reliability
- Designing reliable systems with cloud infrastructure (Google Cloud Next '17)
- A Google SRE explores GitHub reliability with BigQuery
- Know thy enemy: how to prioritize and communicate risks - CRE life lessons
- Chaos Engineering resources
- CRE life lessons: What is a dark launch, and what does it do for me?
- Why you should pick strong consistency, whenever possible
- The Network is Reliable
- Are You Load Balancing Wrong?
- How production engineers support global events on Facebook
- Google: A Collection Of Best Practices For Production Services
- Canary Analysis Service
- Tips for High Availability
- Progressive Service Architecture At Auth0
- Google Cloud Production Guideline
- production readiness
- Trust By Design: The Fusion of Operational Maturity and Risk Modeling
- Top Seven Myths of Robust Systems
- Taming chaos: Preparing for your next incident
- PID Loops and the Art of Keeping Systems Stable
- Are you ready for production? - Slides
- Production Checklist for Web Apps on Kubernetes
- Finding a problem at the bottom of the Google stack
- How maintenance windows affect your error budget
- The Production Readiness Spectrum
- How we’re building a production readiness review process at Grafana Labs
- Resiliency Planning for High-Traffic Events
- Using Fault Injection Testing to Improve DoorDash Reliability
Monitoring & Observability & Alerting¶
- A Working Theory-of-Monitoring
- The Evolution of Monitoring Systems at Google - Tony Rippy
- Monitoring without Infrastructure @ Airbnb
- Monitoring distributed systems
- Observability at Uber Engineering: Past, Present, Future
- The 4 Golden Signals of API Health and Performance in Cloud-Native Applications
- My Philosophy on Alerting by Rob Ewaschuk
- Time To Detect - Netflix
- Why Percentiles Don’t Work the Way you Think
- Building Twitter’s Next-Gen Alerting System
- Instrumentation: Worst case performance matters
- Instrumentation: What does 'uptime' mean?
- Incidents + Outages at CircleCI: Our Playbook and What We’ve Learned
- An introduction to monitoring and alerting with timeseries at scale, with Prometheus
- Detecting outliers and anomalies in realtime at Datadog
- How to Monitor the SRE Golden Signals
- Monitoring in a DevOps World
- Monitoring Your Monitoring’s Monitoring
- Observability: the new wave or buzzword?
- Monitoring Isn't Observability
- Monitoring in the time of Cloud Native
- Principles of Monitoring Microservices
- The Many Ways Your Monitoring Is Lying to You
- GitOps Part 3 - Observability
- Want to Debug Latency?
- Debugging Latency in Go 1.11
- Alerting on SLOs like Pros
- Applied Alerting Philosophy
- Observations on Observability
- Deploys: It's Not Actually About Fridays
- Site Reliability Engineering Best Practices for Data Pipelines
- Elastic Observability in SRE and Incident Response
- Error Budget Policy - Part 1 - Adoption at Expedia Group
- Error Budget Policy - Part 2 - Practices at Expedia Group
On-Call¶
- Being an On-Call Engineer: A Google SRE Perspective
- Inside Atlassian: how our site reliability engineers do incident management
- Inside Atlassian: how IT & SRE use ChatOps to run incident management
- Incident Response at Heroku
- Who's On Call?
- SysAdvent - Day 6 - No More On-Call Martyrs
- On Being On Call
- The On-Call Handbook
- Incident management at Google — adventures in SRE-land
- Run Book / Operations Manual template
- Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
- Project STAR*: Streamlining Our On-Call Process
- SRE@Xero: Managing Incidents Part I
- SRE@Xero: Managing Incidents Part II
- How To Establish a High Severity Incident Management Program
- How Your Systems Keep Running Day After Day - John Allspaw
- On-call doesn’t have to suck
- Why, as a Netflix infrastructure manager, am I on call?
- Oncall and Sustainable Software Development
- On Call Rotations: How Best to Wake Devs Up in the Middle of the Night
- Understanding The Role Of The Incident Manager On-Call (IMOC)
- 3 Ways to Minimize the Impact of High Severity Incidents
- Advice to Management Teams While Enrolling Changes to On-Call Systems
- Moving Past Shallow Incident Data
- Sustainable On-Call
- dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident
- Incident Management at Netflix Velocity
- Incidents, fixes, and the day after
- 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
- Checklists: a stupidly simple but valuable operational gift
- How to write a status page update
- Atlassian Incident Handbook
- PagerDuty Incident Response Handbook
- Avoiding Burnout for SREs
- Better On-Call the SRE way
- Managing Incidents at Monzo
- Making On-Call Not Suck
- How we (Monzo) respond to incidents
- How we’ve evolved on-call at Monzo
- Code Yellow: When Operations Isn’t Perfect
- MTTR is dead, long live CIRT
- Extended Dreyfus Model for Incident Lifecycles
- Inhumanity of Root Cause Analysis
- Incident insights from NASA, NTSB, and the CDC
- How to avoid On-Call Burnout the SRE Way
- My week shadowing a GitLab Site Reliability Engineer
- How our production team runs the weekly on-call handover
- Writing Runbook Documentation When You’re An SRE
- Incident response, programs and you(r startup)
- An Incident Command Training Handbook
- Shrinking the time to mitigate production incidents
- Incident writeup as sociological storytelling
- Elephant in the Blameless War Room: Accountability
- Naming names in incident writeups
- Building On-Call Culture at GitHub
Post-Mortem¶
- A collection of post-mortems
- Collection of Kubernetes Failure Stories
- Blameless PostMortems and a Just Culture
- A Tale of Postmortems
- Building a Blameless Post-Mortem Culture with Jason Hand
- The infinite hows
- Failure is Always An Option: How a Blameless Culture Leads to Better Results
- SysAdvent - Day 1 - Why You Need a Postmortem Process
- Etsy’s Debriefing Facilitation Guide for Blameless Postmortems
- Writing Your First Postmortem
- How to Write Great Outage Post-Mortems
- A collection of postmortem templates
- Embracing Feedback
- Postmortem Action Items: Plan the Work and Work the Plan
- Social Issues In Postmortems
- Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant
- Postmortem culture: how you can learn from failure
- re:Work - Postmortem discussion template
- Post-mortems to the rescue
- Why Every Company Can Benefit from a Blameless Culture
- "It's dead, Jim": How we write an incident postmortem
- Our incident postmortem template
- Learn out of mistakes. Postmortems to the rescue.
- Improving Postmortem Practices with Veteran Google SRE, Steve McGhee
- Inhumanity of Root Cause Analysis
Capacity Planning¶
- Capacity Planning
- SouthBay SRE: Cloud Capacity Planning
- Intent-based Capacity Planning and Autoscaling with Kubernetes
- How do you do Capacity Planning
- How Back Market SREs prepared for Black Friday
Service Level Agreement¶
- If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues
- Service Level Agreements in the Cloud: Who cares?
- SysAdvent - Day 20 - How to set and monitor SLAs
- SLOs, SLIs, SLAs, oh my - CRE life lessons
- Service Levels and Error Budgets
- (Un)Reliability Budgets - Finding Balance between Innovation and Reliability
- The Calculus of Service Availability
- Availability Calculator: Calculate how much downtime should be permitted in your SLA
- Standardize cloud SLA availability with numerical performance data
- Best practices to develop SLAs for cloud computing
- A Practical Guide to SLAs
- Building good SLOs - CRE life lessons
- No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
- Consequences of SLO violations — CRE life lessons
- Service Level Objectives in Practice
- SRE Consensus Building
- An example escalation policy — CRE life lessons
- Error Budget Calculator
- Understanding error budget overspend - part one - CRE life lessons
- Good housekeeping for error budgets - part two - CRE life lessons
- SRE fundamentals: SLIs, SLAs and SLOs
- SLOs & You: A Guide To Service Level Objectives
- Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment
- Nines are Not Enough: Meaningful Metrics for Clouds
- How many nines is my storage system?
- Don't follow the sun.
- The Tyranny of the SLA
- Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter
- DevOpsDays Chicago 2019 - The Art of SLOs
- The Art of SLOs Workshop Materials
- How to Include Latency in SLO-Based Alerting
- Succeeding With Service Level Objectives
- Putting customers first with SLIs and SLOs
- SRE Leadership: Have Tiered SLAs
- How SLOs Enable Fast, Reliable Application Delivery
- The Tail at Scale
- The Tail at Scale Revisited
- Defining SLOs for services with dependencies
- Service Level Disagreements
- How We Use Sloth to do SLO Monitoring and Alerting with Prometheus
- SLI Deep Dive
- Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox
- SLO tracker
- SLO Alerting for Mortals
- SRE methods and climate change
- What made SLOs so messy (and what we can do about it)
- SLICK: Adopting SLOs for improved reliability
- Calculating composite SLA
- Best practices for setting SLOs and SLIs for modern, complex systems
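
Several entries in this section (the availability calculator, the error budget calculator, and the composite SLA article) rest on the same arithmetic. As a rough sketch that is not taken from any of the linked resources, the Python below shows how an availability target translates into allowed downtime, and how the availabilities of services that must all be up multiply together. The function names, the 30-day default window, and the independence assumption are illustrative choices, not something the list prescribes.

```python
from datetime import timedelta
from math import prod


def allowed_downtime(availability_pct: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime budget implied by an availability target.

    availability_pct is a percentage such as 99.9; window is the
    measurement period (30 days here is an assumed default).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return window * (1 - availability_pct / 100)


def composite_availability(*availability_pcts: float) -> float:
    """Return the availability (in percent) of a serial chain of services.

    Assumes every service must be up and that failures are independent,
    which real systems frequently violate.
    """
    return prod(p / 100 for p in availability_pcts) * 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days allows {allowed_downtime(target)} of downtime")
    chained = composite_availability(99.9, 99.9)
    print(f"two 99.9% services in series give about {chained:.4f}%")
```

The error budget is simply the complement of the target, so tightening a target by one "nine" shrinks the budget by a factor of ten.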
Performance¶
- Performance Checklists for SREs
- South Bay SRE Meetup - Netflix Cloud Performance Team
- Software Performance Analysis Guided By SLOs
- A framework for pragmatic performance engineering
Programming¶
- Go Language for Ops and Site Reliability Engineering
- Go for SREs using Python
- Operability in Go
- Go Reliability and Durability at Dropbox
Misc Articles¶
- What is SRE (Site Reliability Engineering)?
- Here’s How Google Makes Sure It (Almost) Never Goes Down
- Are site reliability engineers the next data scientists?
- Site Reliability Engineers: "solving the most interesting problems"
- Site Reliability Engineers: the "world’s most intense pit crew"
- Site reliability engineering kicks rote tasks out of IT ops
- Notes on Site Reliability Engineering
- Adventures in SRE-land: Welcome to Google Mission Control
- Book Review: Site Reliability Engineering - How Google Runs Production Systems
- Site Reliability Engineers: “We solve cooler problems”
- SREcon17: Brave new world of site reliability engineering
- Open AWS guide
- Commentary on Site Reliability Engineering
- Site Reliability Engineering: 4 Things to Know
- Looking for SRE Success? Then Find the Intrapreneurs!
- What Team Structure is Right for DevOps to Flourish?
- Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency
- Building blameless working environment
- SRE Adoption Report
- SREs: The Happiest – and Highest Paid – in the Industry
- The Role of Site Reliability Engineering, Today and Tomorrow
- SRE as a Lifestyle Choice
- SRECon EMEA 2019 Recap
- Life of an SRE at Google - JC van Winkel
- Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa - Case study: Halodoc's adaptation of SRE principles for native mobile apps
- SRE Best Practices by InfraCloud
Real-time Messaging¶
- #sre channel at Hangops Slack - General discussion of Site Reliability Engineering.
- #incident_response channel at Hangops Slack - Discussion of incident response.
- USENIX SREcon Slack
Blogs¶
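- Brendan Gregg's Blog - Highly technical blog posts about system internals, performance, and SRE.
- Everything Sysadmin - Tom Limoncelli's blog posts about SysAdmin/DevOps/SRE.
- High Scalability - Technical blog posts about system architecture.
- rachelbythebay - Technical blog posts.
- Susan J. Fowler - Various blog posts about SRE, software engineering, and microservices.
- SysAdvent - One article for each day of December, ending on the 25th article.
- Stephen Thorne's Blog - Blog posts about SRE.
- Increment - A digital magazine about how teams build and operate software systems at scale.
- GopherSRE - Blog posts about Go and SRE.
- Cindy Sridharan - Blog posts about distributed systems and how to manage them.
- Blameless Blog - Blog posts about SRE culture and practices.
- Resilience Roundup - Weekly analysis of resilience engineering and human factors research, tailored for software systems.
- Squadcast Blog - Blog posts about SRE best practices, reliability, on-call, and incident management.
- FireHydrant Blog - Posts about complex systems, incident response, and SRE best practices.
- Rootly Blog - Incident management best practices and guides.
- incident.io Blog - Guides, advice, and resources about incident management and response.
- Logit.io Blog - Resources about log management, SRE, and DevOps.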
Newsletters¶
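- DevOpsLinks - A weekly newsletter with SRE, SysAdmin, and DevOps news, tools, tutorials, and opinions.
- KubeWeekly - A weekly newsletter about Kubernetes, curated by Bob Killen, Chris Short, Craig Box, Kim McMahon, and Michael Hausenblas.
- SRE Weekly - A weekly newsletter about site reliability.
- O’Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
- ChaosEngineering.news - A Chaos Engineering newsletter. All things Chaos Engineering, delivered straight to your inbox!
- Monitoring Weekly - What's new in monitoring? Curated monitoring articles delivered to your inbox every week.
- Observability news - Updates around observability (o11y), with a special focus on open source.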
Conferences & Meetups¶
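- SRECon Conferences - The official SRE conferences.
- LISA Conferences - Major conferences on SysAdmin/DevOps/SRE.
- SRE Tech Talks - SRE talks hosted by Google.
- South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A group for individuals who deal with the reliability challenges of web-scale systems.
- San Francisco Reliability Engineering - A group of people passionate about reliable, high-performance software systems.
- Site Reliability Engineering Munich, Germany - An SRE meetup in the greater area of the Oktoberfest city.
- ADDO - All Day DevOps - A fully online and free 24-hour conference.
- Site Reliability Engineering Paris, France - An SRE meetup in the City of Light.
- Site Reliability Engineering India - An SRE meetup in India.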
Twitter¶
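- Google SRE Twitter Account - Google's SRE Twitter account.
- SREBook - The official Twitter account of the Site Reliability Engineering book.
- SREcon - The official Twitter account of SREcon.
- SREWorkbook - The official Twitter account of The Site Reliability Workbook.
- The SRE Dev - SRE-related posts from dev.to.
- Twitter SRE - The official Twitter account of the Twitter SRE team.
- Twitter SRE Weekly - The official Twitter account of the SRE Weekly newsletter.
- USENIX Association - The official USENIX Twitter account.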
SRE Tools¶
- Awesome SRE Tools - A curated list of Site Reliability and Production Engineering tools.
- List of Continuous Integration services
- SRE cheat sheet - A cheat sheet of Site Reliability Engineering principles and numbers.