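I took a few notes while studying this talk (see the slide deck for details):
1. Google's new data center is in The Dalles, Oregon, on the Columbia River. http://en.wikipedia.org/wiki/The_Dalles,_Oregon
2. Google's racks are dense: 40-80 servers per rack. Per-server hardware: DRAM: 16GB, 100ns, 20GB/s; Disk: 2TB, 10ms, 200MB/s.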
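3. A rack of 80 servers in total: DRAM: 1TB, 300us, 100MB/s; Disk: 160TB, 11ms, 100MB/s. 30 racks combined: DRAM: 30TB, 500us, 10MB/s; Disk: 4.80PB, 12ms, 10MB/s.
p.7 The point here is about scaling within the existing architecture: as total capacity grows, latency rises and transfer bandwidth drops sharply.
4. Local DRAM has the lowest latency, the smallest capacity, and the highest bandwidth; Datacenter Disk has the highest latency, the largest capacity, and the lowest bandwidth. p.8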
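5. Everything will fail; what matters is how you deal with it. Software must be fault tolerant! In annual statistics, 1-5% of disks die and 2-4% of servers crash.
6. A typical first year for a new cluster: 0.5 overheating events (one every two years), 1 PDU failure (power loss), 1 rack move, 1 network rewiring, 20 rack failures, 5 racks going wonky (packet loss), 8 network maintenances, 12 router reloads, 3 router failures, several 30-second DNS blips, 1000 individual machine failures, thousands of hard drive failures... p.10
7. Problems that long-distance links may run into: wild dogs, sharks, dead horses, drunken hunters...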
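8. Each Google cluster typically consists of about 1000 servers. The core services are GFS and the cluster scheduling system, and a cluster runs 100 to 1000 active jobs at the same time. p.16
9. GFS usage at Google: 200+ clusters, most with over a thousand servers; multiple pools holding thousands of clients; 4PB+ total filesystem capacity; 40GB/s read/write load. These figures were recorded in the current environment of frequent hardware failures.
10. A protocol description language is essential. Google has used the Protocol Buffers language since 2000: self-describing, multi-language support, 200+MB/s encode/decode throughput, and many other great features. http://code.google.com/p/protobuf/
11. Basic numbers you must know to design efficient systems: L1 cache reference 0.5 ns ==> ... ==> Send packet CA->Netherlands->CA 150,000,000 ns p.24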
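12. Know Your Basic Building Blocks: understand not just the interfaces but also the internal implementations. If you don't know how something actually works, you can't estimate the cost of each step.
13. Compression/encoding requires trading off space against encode/decode speed and so on: Zippy: encode@300 MB/s, decode@600MB/s, 2-4X compression; gzip: encode@25MB/s, decode@200MB/s, 4-6X compression
14. Important not to try to be all things to all people: the system gets more complex, and trying to satisfy everyone compromises other clients. p.31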
15. Make your apps do something reasonable even if not all is right – Better to give users limited functionality than an error page
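
The storage-hierarchy figures in notes 2 and 3 lend themselves to back-of-the-envelope estimates. The sketch below is my own illustration, not something from the slides: it assumes the simplest possible cost model (one latency hit plus size divided by bandwidth), treats GB and MB as decimal units, and uses hypothetical names (`TIERS`, `read_time`).

```python
# (latency in seconds, bandwidth in bytes/second), copied from notes 2 and 3.
# Decimal units (1 MB = 10**6 bytes) are an assumption.
TIERS = {
    "local DRAM":   (100e-9, 20e9),
    "local disk":   (10e-3, 200e6),
    "rack DRAM":    (300e-6, 100e6),
    "rack disk":    (11e-3, 100e6),
    "cluster DRAM": (500e-6, 10e6),
    "cluster disk": (12e-3, 10e6),
}


def read_time(tier, nbytes):
    """Estimated seconds to read nbytes: one latency hit plus transfer time."""
    latency, bandwidth = TIERS[tier]
    return latency + nbytes / bandwidth


if __name__ == "__main__":
    # Compare reading 1 MB from each level of the hierarchy.
    for tier in TIERS:
        print(f"{tier:13s} {read_time(tier, 1_000_000) * 1e3:10.3f} ms")
```

Under this model, reading 1 MB is dominated by bandwidth rather than latency at every tier except local disk, which is the kind of thing note 12 suggests working out before designing a system.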
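
The compression trade-off in note 13 can be made concrete with a small calculation. This is a rough sketch under assumptions of my own: the compression ratios are the midpoints of the quoted ranges, encode/decode speeds are taken to be in MB of uncompressed data per second, and the three steps (encode, send, decode) run one after another with no pipelining. `CODECS`, `transfer_seconds`, and `best_codec` are hypothetical names.

```python
# name: (encode MB/s, decode MB/s, assumed compression ratio)
CODECS = {
    "none":  (None, None, 1.0),
    "zippy": (300.0, 600.0, 3.0),  # assumed midpoint of 2-4X
    "gzip":  (25.0, 200.0, 5.0),   # assumed midpoint of 4-6X
}


def transfer_seconds(codec, size_mb, link_mb_per_s):
    """Encode + send + decode time for size_mb of data, steps run sequentially."""
    enc, dec, ratio = CODECS[codec]
    cpu = 0.0 if enc is None else size_mb / enc + size_mb / dec
    wire = (size_mb / ratio) / link_mb_per_s
    return cpu + wire


def best_codec(size_mb, link_mb_per_s):
    """Codec with the lowest total time under the model above."""
    return min(CODECS, key=lambda c: transfer_seconds(c, size_mb, link_mb_per_s))


if __name__ == "__main__":
    for link in (1, 10, 1000):
        print(f"{link:5d} MB/s link -> {best_codec(1000, link)}")
```

With these assumed ratios, gzip only pays off on very slow links, Zippy wins in the middle, and on a fast enough link skipping compression is quickest.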
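
Note 15 (limited functionality beats an error page) might look like this in practice. This is only a minimal sketch of the idea, not how Google does it: `search_all` is a hypothetical function, and each "shard" is assumed to be a plain callable that returns a list of results or raises on failure.

```python
def search_all(shards, query):
    """Query every shard and return (results, degraded).

    A failing shard is skipped instead of failing the whole request;
    degraded is True when at least one shard could not answer.
    """
    results, failed = [], 0
    for shard in shards:
        try:
            results.extend(shard(query))
        except Exception:
            failed += 1  # tolerate the failure and keep going
    if shards and failed == len(shards):
        # Nothing useful to show: only now surface an error.
        raise RuntimeError("all shards failed")
    return results, failed > 0
```

The caller can use the `degraded` flag to tell users that results may be incomplete rather than showing nothing at all.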
Keynote #3: Jeffrey Dean, Google [Slides] "Large-Scale Distributed Systems at Google: Current Systems and Future Directions"
As part of implementing the many products and services offered by Google, we have built a collection of systems and tools that simplify the storing and processing of large-scale data sets, and the construction of heavily-used public services based on these data sets. These systems are intended to work well in Google's computational environment, which consists of large numbers of commodity machines connected by commodity networking hardware. Our systems handle issues like storage reliability and availability in the face of machine failures, and our processing tools make it relatively easy to write robust computations that run reliably and efficiently on thousands of machines. In this talk I'll highlight some of the systems we have built, and discuss some challenges and future directions for new systems.
Jeff Dean joined Google in 1999 and is currently a Google Fellow in Google's Systems Infrastructure Group. He has co-designed/implemented five generations of Google's crawling, indexing, and query serving systems, and co-designed/implemented major pieces of Google's initial advertising and AdSense for Content systems. He is also a co-designer and co-implementor of Google's distributed computing infrastructure, including the MapReduce and BigTable systems, has worked on system software for statistical machine translation, and has implemented a variety of internal and external developer tools.


