Services Engineering Reading List

A reading list for services engineering, with a focus on cloud infrastructure services.

Suggestions are welcome.

Papers

Fault Injection in Production (Allspaw)
Making Reliable Distributed Systems in the Presence of Software Errors (Armstrong)
Highly Available Transactions: Virtues and Limitations (Bailis et al.)
The Incident Command System (Bigley and Roberts)
Bigtable: a Distributed Storage System for Structured Data (Chang et al.)
Spanner: Google’s Globally-Distributed Database (Corbett et al.)
Dynamo: Amazon’s Highly Available Key-Value Store (DeCandia et al.)
MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat)
The Google File System (Ghemawat et al.)
On Designing and Deploying Internet Scale Services (Hamilton)
Kafka: A Distributed Messaging System for Log Processing (Kreps et al.)
Weathering the Unexpected (Krishnan)
The Unified Logging Infrastructure for Data Analytics at Twitter (Lee et al.)
Automatic Management of Partitioned, Replicated Search Services (Leibert et al.)
Learning to Embrace Failure (Limoncelli et al.)
Scaling Big Data Mining Infrastructure: The Twitter Experience (Lin and Ryaboy)
Out of the Tar Pit (Moseley and Marks)
The Log-Structured Merge-Tree (O’Neil et al.)
In Search of an Understandable Consensus Algorithm (Ongaro and Ousterhout)
Fallacies of Distributed Computing Explained (Rotem-Gal-Oz)
F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business (Shute et al.)
Dapper, A Large Scale Distributed Systems Tracing Infrastructure (Sigelman et al.)
Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al.)
The Human Side of Postmortems (Zwieback)
Crew Resource Management: a Positive Change for the Fire Service

Posts

Resilience Engineering: Part I, Part II (Allspaw)
Systems Engineering: a Great Definition (Allspaw)
Chaos Monkey Released Into The Wild (Bennett and Tseitlin)
Some Rules for Engineering and Operations (Black)
Service Level Disagreements Part I, Part II (Black)
Incuriosity Will Kill Your Infrastructure (Crayford)
My Philosophy on Alerting (Ewaschuk)
You Can’t Sacrifice Partition Tolerance (Hale)
Customer Trust (Hamilton)
Observations on Errors, Corrections, & Trust of Dependent Systems (Hamilton)
Game Day Exercises at Stripe: Learning from kill -9 (Hedlund)
Life Beyond Distributed Transactions: An Apostate’s Opinion (Helland)
Notes on Distributed Systems for Young Bloods (Hodges)
The Network is Reliable (Kingsbury)
The Trouble with Clocks (Kingsbury)
Call Me Maybe: Final Thoughts (Kingsbury)
Getting Real About Distributed Systems Reliability (Kreps)
The Log: What every software engineer should know about real-time data's unifying abstraction (Kreps)
Incident Response at Heroku (McGranaghan)
On HTTP Load Testing (Nottingham)
Observability at Twitter (Watson)
Stevey’s Google Platforms Rant (Yegge)

Presentations

Design, Lessons, and Advice from Building Distributed Systems at Google (Dean)
Service Design Best Practices (Hamilton)

Books

The Field Guide To Understanding Human Error (Dekker)
Agile Retrospectives: Making Good Teams Great (Derby et al.)
Better: A Surgeon’s Notes on Performance (Gawande)
The Checklist Manifesto: How to Get Things Right (Gawande)
High Performance Browser Networking (Grigorik)
Resilience Engineering in Practice (Hollnagel et al.)
Effective Monitoring and Alerting (Ligus)
Release It!: Design and Deploy Production-Ready Software (Nygard)
The Challenger Launch Decision (Vaughan)
Managing the Unexpected (Weick and Sutcliffe)

Research Groups

Conferences