Services Engineering Reading List
A reading list for services engineering, with a focus on cloud infrastructure services.
Suggestions are welcome.
Papers
- Fault Injection in Production (Allspaw)
- Making Reliable Distributed Systems in the Presence of Software Errors (Armstrong)
- Highly Available Transactions: Virtues and Limitations (Bailis et al.)
- The Incident Command System (Bigley and Roberts)
- Bigtable: a Distributed Storage System for Structured Data (Chang et al.)
- Spanner: Google’s Globally-Distributed Database (Corbett et al.)
- Dynamo: Amazon’s Highly Available Key-Value Store (DeCandia et al.)
- MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat)
- The Google File System (Ghemawat et al.)
- On Designing and Deploying Internet Scale Services (Hamilton)
- Kafka: A Distributed Messaging System for Log Processing (Kreps et al.)
- Weathering the Unexpected (Krishnan)
- The Unified Logging Infrastructure for Data Analytics at Twitter (Lee et al.)
- Automatic Management of Partitioned, Replicated Search Services (Leibert et al.)
- Learning to Embrace Failure (Limoncelli et al.)
- Scaling Big Data Mining Infrastructure: The Twitter Experience (Lin and Ryaboy)
- Out of the Tar Pit (Moseley and Marks)
- The Log-Structured Merge-Tree (O’Neil et al.)
- In Search of an Understandable Consensus Algorithm (Ongaro and Ousterhout)
- Fallacies of Distributed Computing Explained (Rotem-Gal-Oz)
- F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business (Shute et al.)
- Dapper, A Large Scale Distributed Systems Tracing Infrastructure (Sigelman et al.)
- Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al.)
- The Human Side of Postmortems (Zwieback)
- Crew Resource Management: a Positive Change for the Fire Service
Posts
- Resilience Engineering: Part I, Part II (Allspaw)
- Systems Engineering: a Great Definition (Allspaw)
- Chaos Monkey Released Into The Wild (Bennett and Tseitlin)
- Some Rules for Engineering and Operations (Black)
- Service Level Disagreements Part I, Part II (Black)
- Incuriosity Will Kill Your Infrastructure (Crayford)
- My Philosophy on Alerting (Ewaschuk)
- You Can’t Sacrifice Partition Tolerance (Hale)
- Customer Trust (Hamilton)
- Observations on Errors, Corrections, & Trust of Dependent Systems (Hamilton)
- Game Day Exercises at Stripe: Learning from kill -9 (Hedlund)
- Life Beyond Distributed Transactions: An Apostate’s Opinion (Helland)
- Notes on Distributed Systems for Young Bloods (Hodges)
- The Network is Reliable (Kingsbury)
- The Trouble with Clocks (Kingsbury)
- Call Me Maybe: Final Thoughts (Kingsbury)
- Getting Real About Distributed Systems Reliability (Kreps)
- The Log: What every software engineer should know about real-time data’s unifying abstraction (Kreps)
- Incident Response at Heroku (McGranaghan)
- On HTTP Load Testing (Nottingham)
- Observability at Twitter (Watson)
- Stevey’s Google Platforms Rant (Yegge)
Presentations
- Design, Lessons, and Advice from Building Distributed Systems at Google (Dean)
- Service Design Best Practices (Hamilton)
Books
- The Field Guide To Understanding Human Error (Dekker)
- Agile Retrospectives: Making Good Teams Great (Derby et al.)
- Better: A Surgeon’s Notes on Performance (Gawande)
- The Checklist Manifesto: How to Get Things Right (Gawande)
- High Performance Browser Networking (Grigorik)
- Resilience Engineering in Practice (Hollnagel et al.)
- Effective Monitoring and Alerting (Ligus)
- Release It!: Design and Deploy Production-Ready Software (Nygard)
- The Challenger Launch Decision (Vaughan)
- Managing the Unexpected (Weick and Sutcliffe)