[1-] Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
OSDI 2006
Abstract Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
1 Introduction
The organization of this paper:
1) Section 2 describes the data model in more detail.
2) Section 3 provides an overview of the client API.
3) Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends.
4) Section 5 describes the fundamentals of the Bigtable implementation.
5) Section 6 describes some of the refinements that we made to improve Bigtable's performance.
6) Section 7 provides measurements of Bigtable's performance.
7) Section 8 describes several examples of how Bigtable is used at Google.
8) Section 9 discusses some lessons we learned in designing and supporting Bigtable.
9) Section 10 describes related work.
10) Finally, Section 11 presents our conclusions.
2 Data Model
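The paper describes the data model as a sparse, distributed, persistent multidimensional sorted map indexed by (row key, column key, timestamp), mapping to an uninterpreted string. A minimal in-memory sketch of that map (the class and method names here are mine, not Bigtable's actual API):

```python
# Toy sketch of Bigtable's (row, column, timestamp) -> string map.
# Illustrative only; not the real client API.

class ToyBigtable:
    def __init__(self):
        # {row_key: {column_key: [(timestamp, value), ...]}}, newest first
        self.rows = {}

    def put(self, row, column, timestamp, value):
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((timestamp, value))
        cells.sort(reverse=True)  # Bigtable stores cell versions newest-first

    def get(self, row, column, max_versions=1):
        # Return up to max_versions cells, most recent timestamp first.
        return self.rows.get(row, {}).get(column, [])[:max_versions]

# Row keys in the paper's running example are reversed domain names,
# so pages from the same site cluster together in sorted row order.
t = ToyBigtable()
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "contents:", 5, "<html>v5</html>")
print(t.get("com.cnn.www", "contents:"))  # newest version only
```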
3 API
4 Building Blocks
GFS: the distributed Google File System
5 Implementation
The Bigtable implementation has three major components: -- a library that is linked into every client,
-- one master server,
-- many tablet servers.
5.1 Tablet Location
Bigtable uses a three-level hierarchy analogous to that of a B+-tree to store tablet location information (Figure 4).
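The three levels are: a Chubby file naming the root METADATA tablet, the root tablet mapping row ranges to second-level METADATA tablets, and those tablets mapping ranges to user tablets. A sketch of the lookup, treating each tablet as a sorted list of (end_row, location) pairs; all server names here are hypothetical:

```python
# Sketch of the three-level tablet-location lookup. Each tablet owns
# the row range up to and including its end_row, so a binary search
# over sorted end rows finds the owning tablet.
from bisect import bisect_left

def find_tablet(tablets, row):
    """tablets: sorted list of (end_row, location) pairs."""
    ends = [end for end, _ in tablets]
    return tablets[bisect_left(ends, row)][1]

# Level 1: the Chubby file points at the root METADATA tablet.
root_location = "metadata-server-0"          # hypothetical name
# Level 2: the root tablet maps ranges to other METADATA tablets.
root_tablet = [("m", "metadata-server-1"), ("\xff", "metadata-server-2")]
# Level 3: a METADATA tablet maps ranges to user tablet servers.
meta_tablets = {"metadata-server-2": [("q", "tablet-server-7"),
                                      ("\xff", "tablet-server-9")]}

meta_loc = find_tablet(root_tablet, "user-row-x")
user_loc = find_tablet(meta_tablets[meta_loc], "user-row-x")
print(user_loc)
```

In the real system clients cache these locations and fall back up the hierarchy only on a cache miss, so most reads need no extra round trips.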
5.2 Tablet Assignment
5.3 Tablet Serving
The persistent state of a tablet is stored in GFS, as illustrated in Figure 5.
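Section 5.3's read/write path can be sketched as follows: a write goes to a commit log (for durability) and an in-memory memtable; a read merges the memtable with the immutable on-disk SSTables, newest data winning. This is a simplified single-version sketch with invented names, not Bigtable's code:

```python
# Toy tablet: commit log + memtable + immutable SSTables.
# In the real system the log and SSTables live in GFS.

class ToyTablet:
    def __init__(self):
        self.commit_log = []   # redo records, written before applying
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # frozen immutable files, newest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for sst in self.sstables:             # then newest SSTable first
            if key in sst:
                return sst[key]
        return None

    def minor_compaction(self):
        # Freeze the memtable into a new SSTable and reset the log.
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}
        self.commit_log = []

t = ToyTablet()
t.write("k1", "v1")
t.minor_compaction()
t.write("k1", "v2")
print(t.read("k1"))   # memtable shadows the older SSTable value
```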
[2-] Pregel: A System for Large-Scale Graph Processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski
SIGMOD’10, June 6–11, 2010, Indianapolis, Indiana, USA.
ABSTRACT Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs (in some cases billions of vertices, trillions of edges) poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
Keywords: Distributed computing, graph algorithms
1. INTRODUCTION
The rest of the paper is structured as follows.
1) Section 2 describes the model.
2) Section 3 describes its expression as a C++ API.
3) Section 4 discusses implementation issues, including performance and fault tolerance.
4) Section 5 presents several applications of this model to graph algorithm problems.
5) Section 6 presents performance results.
6) Finally, Section 7 discusses related work and future directions.
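The vertex-centric superstep model from the abstract can be sketched as a loop: in each superstep a vertex reads messages from the previous superstep, updates its state, and sends messages along outgoing edges; computation halts when no messages are in flight. This toy propagates the maximum vertex value, a standard Pregel illustration; the function and names are mine, not the C++ API of Section 3:

```python
# Toy Pregel-style superstep loop: propagate the maximum value.

def pregel_max(graph, values):
    """graph: {vertex: [out-neighbors]}; values: {vertex: initial value}."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to its out-neighbors.
    inbox = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            inbox[w].append(values[v])
    # Later supersteps: update from messages, send only on change,
    # otherwise stay silent (the vertex has effectively voted to halt).
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for w in graph[v]:
                    outbox[w].append(values[v])
        inbox = outbox
    return values

g = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pregel_max(g, {"a": 3, "b": 6, "c": 1}))  # every vertex ends at 6
```

The "implied synchronicity" the abstract mentions is visible here: all messages produced in one superstep are delivered together at the start of the next, which makes the algorithm easy to reason about.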
[3-] PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
VLDB '08
ABSTRACT We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
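The "novel per-record consistency guarantees" are PNUTS's per-record timeline consistency: all updates to a record are serialized by one per-record master, so every replica moves through the same ordered sequence of versions, possibly lagging behind. A sketch with invented class names, not the PNUTS API:

```python
# Toy per-record timeline consistency: one master orders all writes to a
# record; replicas may be stale but only ever see versions on the timeline.

class Record:
    def __init__(self):
        self.versions = []          # ordered timeline of (version, value)

    def master_write(self, value):
        v = len(self.versions) + 1  # the master assigns the next version
        self.versions.append((v, value))
        return v

class Replica:
    def __init__(self, record):
        self.record = record
        self.applied = 0            # how far along the timeline we are

    def replicate_up_to(self, version):
        self.applied = max(self.applied,
                           min(version, len(self.record.versions)))

    def read_any(self):
        """May return a stale value, but always one from the timeline."""
        return self.record.versions[self.applied - 1] if self.applied else None

rec = Record()
rec.master_write("v1")
v2 = rec.master_write("v2")
lagging = Replica(rec)
lagging.replicate_up_to(1)
print(lagging.read_any())   # stale but valid: version 1
```

This is weaker than serializable transactions but stronger than pure eventual consistency: a replica can lag, but it can never see versions out of order or a value that was never written.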
1. INTRODUCTION
web applications
The foremost requirements of a web application:
-- Scalability.
-- Response Time and Geographic Scope.
-- High Availability and Fault Tolerance.
ABSTRACT: In this short paper we describe the data that Yahoo! handles, the current trends in Web applications, and the many challenges that this poses for Yahoo! Research. These challenges have led to the development of new data systems and novel data mining techniques.
Problem:
Storing and managing this ocean of data poses several important challenges.
Four major interconnected trends:
-- The emergence of structure
-- Design and dynamics of social systems
-- The Web as a delivery channel
-- The Web as wisdom (Web mining)
Data Platforms:
Dynamo
Two ongoing research projects: Sherpa/PNUTS and PIG.
DASFAA 2010, Part I, LNCS 5981, p. 2, 2010. (A one-page piece, abstract only.)
Abstract.
In this talk, I will present an overview of cloud computing at Yahoo!, in particular, the data management aspects. I will discuss two major systems in use at Yahoo!: the Hadoop map-reduce system and the PNUTS/Sherpa storage system, in the broader context of offline and online data management in a cloud setting.
Hadoop is a well known open source implementation of a distributed file system with a map-reduce interface. Yahoo! has been a major contributor to this open source effort, and Hadoop is widely used internally. Given that the map-reduce paradigm is widely known, I will cover it briefly and focus on describing how Hadoop is used at Yahoo!. I will also discuss our approach to open source software, with Hadoop as an example.
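The map-reduce paradigm the talk refers to can be sketched in plain Python: map emits (key, value) pairs, the framework groups pairs by key, and reduce folds each group. Word count is the canonical example; in Hadoop the same logic would be written as Mapper and Reducer classes, with the shuffle done by the framework:

```python
# Minimal map/shuffle/reduce sketch of the map-reduce paradigm
# (word count). Illustrative only; Hadoop runs these phases
# distributed across a cluster, with the shuffle over the network.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)          # emit one pair per occurrence

def shuffle(pairs):
    groups = defaultdict(list)       # group values by key
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
```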
Yahoo! has also developed a data serving storage system called Sherpa (sometimes referred to as PNUTS) to support data-backed web applications. These applications have stringent availability, performance and partition tolerance requirements that are difficult, sometimes even impossible, to meet using conventional database management systems. On the other hand, they typically are able to trade off consistency to achieve their goals. This has led to the development of specialized key-value stores, which are now used widely in virtually every large-scale web service.
Since most web services also require capabilities such as indexing, we are witnessing an evolution of data serving stores as systems builders seek to balance these trade-offs. In addition to presenting PNUTS/Sherpa, I will survey some of the solutions that have been developed, including Amazon's S3 and SimpleDB, Microsoft's Azure, Google's Megastore, the open source systems Cassandra and HBase, and Yahoo!'s PNUTS, and discuss the challenges in building such systems as "cloud services", providing elastic data serving capacity to developers, along with appropriately balanced consistency, availability, performance and partition tolerance.