google yahoo
Contents
1 Google
[1] Bigtable: A Distributed Storage System for Structured Data
[2] Pregel: A System for Large-Scale Graph Processing
[3] PageRank vs HITS
2 Yahoo
[1] PNUTS: Yahoo!’s Hosted Data Serving Platform
[2] Data Challenges at Yahoo!
[3] Cloud Data Management @ Yahoo!
3 Microsoft
[1] Scale-Out Beyond Map-Reduce
 
 
1 Google's "troika" of core systems (三驾马车)
  1. Bigtable
  2. GFS
  3. MapReduce

[1] Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
OSDI 2006
 
Abstract
 Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
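
To make the abstract's data model concrete: the paper defines Bigtable as a sparse, distributed, persistent multi-dimensional sorted map indexed by (row key, column key, timestamp). Below is a toy in-memory Python sketch of that map; my own illustration, not Google's API, though the example row and columns follow the paper's Webtable example:

```python
from collections import defaultdict
import time

class ToyBigtable:
    """(row, column, timestamp) -> value, with rows kept sorted.
    Columns are named 'family:qualifier', as in the paper."""
    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        cells = self.rows[row][column]
        cells.append((ts, value))
        cells.sort(reverse=True)          # newest version first

    def get(self, row, column, ts=None):
        """Return the newest value at or before ts (default: latest)."""
        for cell_ts, value in self.rows[row][column]:
            if ts is None or cell_ts <= ts:
                return value
        return None

    def scan(self, start_row, end_row):
        """Rows are kept in lexicographic order, enabling range scans."""
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                yield row, self.rows[row]

t = ToyBigtable()
t.put("com.cnn.www", "contents:", "<html>...</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", "CNN")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))   # -> CNN
```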

1 Introduction
 
 the organization of this paper:
 1) Section 2 describes the data model in more detail.
 2) Section 3 provides an overview of the client API.
 3) Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends.
 4) Section 5 describes the fundamentals of the Bigtable implementation.
 5) Section 6 describes some of the refinements that we made to improve Bigtable's performance.
 6) Section 7 provides measurements of Bigtable's performance.
 7) Section 8 describes several examples of how Bigtable is used at Google.
 8) Section 9 discusses some lessons we learned in designing and supporting Bigtable.
 9) Finally, Section 10 describes related work.
 10) Section 11 presents our conclusions.
 
2 Data Model
 
3 API
4 Building Blocks
 GFS: the distributed Google File System
5 Implementation
 three major components:
 -- a library that is linked into every client,
 -- one master server,
 -- many tablet servers.
5.1 Tablet Location
 a three-level hierarchy analogous to that of a B+-tree to store tablet location information (Figure 4).
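
Concretely, per the paper: a file in Chubby points to the root tablet, the root tablet indexes the METADATA tablets, and each METADATA tablet indexes user tablets, so a cold lookup costs three network round-trips while clients cache locations aggressively. A hedged sketch of the client-side resolution (all names here are hypothetical, not Bigtable's actual interfaces):

```python
def locate_tablet(row_key, table, cache, chubby, lookup):
    """Resolve which tablet server holds `row_key` of `table`.
    `lookup(tablet, key)` reads one location row from a tablet;
    worst (cold-cache) case is three network round trips."""
    if (table, row_key) in cache:                   # common case: cached
        return cache[(table, row_key)]
    root = chubby.read("/bigtable/root-location")            # level 1: Chubby
    meta_tablet = lookup(root, f"{table}:{row_key}")         # level 2: METADATA
    user_tablet = lookup(meta_tablet, f"{table}:{row_key}")  # level 3: user tablet
    cache[(table, row_key)] = user_tablet
    return user_tablet
```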
       
 
5.2 Tablet Assignment
5.3 Tablet Serving
 The persistent state of a tablet is stored in GFS, as illustrated in Figure 5.
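
Per the paper, an update is first appended to a commit log in GFS and then inserted into an in-memory sorted buffer, the memtable; reads see a merged view of the memtable and the immutable on-disk SSTables, and a minor compaction freezes the memtable into a new SSTable (Section 5.4). A minimal sketch with my own simplifications:

```python
class ToyTablet:
    """Write path: commit log + in-memory memtable.
    Read path: merged view over memtable and immutable SSTables."""
    def __init__(self):
        self.log = []           # redo records; lives in GFS in Bigtable
        self.memtable = {}      # recently committed writes, in memory
        self.sstables = []      # immutable sorted maps, newest first

    def write(self, key, value):
        self.log.append((key, value))    # durability before visibility
        self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:         # newer data shadows older
            return self.memtable[key]
        for sst in self.sstables:
            if key in sst:
                return sst[key]
        return None

    def minor_compaction(self):
        """Freeze the memtable into a new SSTable: frees memory and
        shrinks the commit-log replay window on tablet recovery."""
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}
```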
     
5.4 Compactions
 
6 Refinements
 
7 Performance Evaluation
8 Real Applications
 
9 Lessons
 several interesting lessons:
 
10 Related Work
11 Conclusions
 
 
 
 
 
[2] Pregel: A System for Large-Scale Graph Processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski
SIGMOD’10, June 6–11, 2010, Indianapolis, Indiana, USA.
 
ABSTRACT
 Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs (in some cases billions of vertices, trillions of edges) poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
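
To make the vertex-centric model concrete, here is a minimal Python sketch of a Pregel-style superstep loop propagating the maximum vertex value through a graph (the paper's canonical small example); the vertex class and single-machine scheduler are my own simplification, not Pregel's actual C++ API:

```python
class MaxValueVertex:
    """Pregel-style vertex: propagate the maximum value in the graph."""
    def __init__(self, vid, value, out_edges):
        self.id, self.value, self.out_edges = vid, value, out_edges
        self.active = True

    def compute(self, superstep, messages, outbox):
        changed = False
        for m in messages:
            if m > self.value:
                self.value = m
                changed = True
        if superstep == 0 or changed:
            for dst in self.out_edges:          # SendMessageTo(dst, value)
                outbox.append((dst, self.value))
        else:
            self.active = False                 # VoteToHalt()

def run(vertices):
    """Synchronous supersteps: deliver messages, run compute, repeat
    until every vertex has halted and no messages are in flight."""
    inbox = {v.id: [] for v in vertices}
    superstep = 0
    while any(v.active or inbox[v.id] for v in vertices):
        outbox = []
        for v in vertices:
            msgs, inbox[v.id] = inbox[v.id], []
            if msgs:
                v.active = True                 # a message reactivates a vertex
            if v.active:
                v.compute(superstep, msgs, outbox)
        for dst, val in outbox:
            inbox[dst].append(val)
        superstep += 1
    return {v.id: v.value for v in vertices}

verts = [MaxValueVertex(0, 3, [1]), MaxValueVertex(1, 6, [0, 2]),
         MaxValueVertex(2, 2, [1])]
print(run(verts))   # all vertices converge: {0: 6, 1: 6, 2: 6}
```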
 
Keywords: Distributed computing, graph algorithms
 

1. INTRODUCTION
 
 The rest of the paper is structured as follows.
 1) Section 2 describes the model.
 2) Section 3 describes its expression as a C++ API.
 3) Section 4 discusses implementation issues, including performance and fault tolerance.
 4) Section 5 presents several applications of this model to graph algorithm problems.
 5) Section 6 presents performance results.
 6) Finally, Section 7 discusses related work and future directions.

2. MODEL OF COMPUTATION
 
 
[3] PageRank vs HITS
 
 
See also: "PageRank 彻底解说" (a detailed explanation of PageRank, Chinese edition), http://www.kreny.com/pagerank_cn.htm, and the Wikipedia entry on PageRank.
 
$$\mathrm{PageRank}(p_i) = \frac{1-q}{N} + q \sum_{p_j \in M(p_i)} \frac{\mathrm{PageRank}(p_j)}{L(p_j)}$$

where $q$ is the damping factor, $N$ is the total number of pages, $M(p_i)$ is the set of pages linking to $p_i$, and $L(p_j)$ is the number of outbound links on page $p_j$.
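For the "vs HITS" half of the comparison (a standard formulation, added here for contrast): HITS assigns each page a hub score $h$ and an authority score $a$ and iterates the mutually reinforcing updates

$$a \leftarrow G^{\mathsf{T}} h, \qquad h \leftarrow G a,$$

normalizing both vectors after each step, where $G_{ij} = 1$ if page $i$ links to page $j$. Unlike PageRank, HITS scores are computed per query over a topic subgraph rather than once over the whole web.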
 
 
Notation: $G$ is the n-by-n connectivity matrix of the link graph, and $A$ is the transition probability matrix of the Markov chain derived from $G$.
 
MATLAB implementation: 12 pagerank.pdf
NetLogo Models Library: PageRank
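
A minimal Python/NumPy power-iteration sketch of the formula above, using the connectivity-matrix convention just described; the variable names and the toy example are my own:

```python
import numpy as np

def pagerank(G, q=0.85, tol=1e-8, max_iter=100):
    """Power iteration on the n-by-n connectivity matrix G,
    where G[i, j] = 1 if page j links to page i (column-oriented)."""
    n = G.shape[0]
    out_degree = G.sum(axis=0)                  # L(p_j) for each page j
    # Column j of A: out-links of page j, each with probability 1/L(p_j);
    # a dangling page (no out-links) is treated as linking everywhere.
    with np.errstate(divide="ignore", invalid="ignore"):
        A = np.where(out_degree > 0, G / out_degree, 1.0 / n)
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x_new = (1 - q) / n + q * (A @ x)       # the formula above
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Toy web: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1
G = np.array([[0., 0., 1.],
              [1., 0., 1.],
              [0., 1., 0.]])
print(pagerank(G))   # page 1, with two in-links, ranks highest
```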
 
 
 
2 Yahoo
[1] PNUTS: Yahoo!’s Hosted Data Serving Platform  
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
VLDB ‘08
 
ABSTRACT
 We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
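
PNUTS's per-record guarantee is timeline consistency: all replicas of a given record apply updates in one order, chosen by a per-record master, and the API exposes points on that timeline (the paper's calls are read-any, read-critical(required_version), read-latest, write, and test-and-set-write(required_version)). A toy sketch of the versioned test-and-set pattern; my own code, not the PNUTS client:

```python
class ToyTimelineRecord:
    """Per-record timeline: a single master orders all writes to this
    record, so versions form one monotonically increasing sequence."""
    def __init__(self, value):
        self.version = 1
        self.value = value

    def read_any(self):
        return self.version, self.value          # possibly stale at a replica

    def test_and_set_write(self, new_value, required_version):
        """Commit only if nobody wrote since `required_version` was read."""
        if self.version != required_version:
            return False                          # lost the race; caller retries
        self.value = new_value
        self.version += 1
        return True

rec = ToyTimelineRecord({"user": "alice", "status": "hi"})
v, data = rec.read_any()
data = dict(data, status="at the beach")
assert rec.test_and_set_write(data, required_version=v)
```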

1. INTRODUCTION
  web applications
 
 The foremost requirements of a web application:
 -- Scalability.
 -- Response Time and Geographic Scope.
 -- High Availability and Fault Tolerance.
 

PNUTS Yahoo!’s Hosted Data Serving Platform.pdf

[2] Data Challenges at Yahoo!, 2008

ABSTRACT:
   In this short paper we describe the data that Yahoo! handles, the current trends in Web applications, and the many challenges that this poses for Yahoo! Research. These challenges have led to the development of new data systems and novel data mining techniques.
Problem:
   Storing and managing this ocean of data poses several important challenges.
Four major interconnected trends:
   The emergence of structure
   Design and dynamics of social systems
   The Web as a delivery channel
   The Web as wisdom (Web mining)
Data Platforms:  
    Dynamo
Two ongoing research projects
    Sherpa and PNUTS
    PIG
Personal comment:
    Problems posed from an industry perspective are all the more worth reading!
 
[3] Cloud Data Management @ Yahoo!
Raghu Ramakrishnan
Yahoo! Research, USA
ramakris@yahoo-inc.com
DASFAA 2010, Part I, LNCS 5981, p. 2, 2010. (A one-page paper: abstract only.)

Abstract.
   In this talk, I will present an overview of cloud computing at Yahoo!, in particular, the data management aspects. I will discuss two major systems in use at Yahoo!: the Hadoop map-reduce system and the PNUTS/Sherpa storage system, in the broader context of offline and online data management in a cloud setting.

   Hadoop is a well known open source implementation of a distributed file system with a map-reduce interface. Yahoo! has been a major contributor to this open source effort, and Hadoop is widely used internally. Given that the map-reduce paradigm is widely known, I will cover it briefly and focus on describing how Hadoop is used at Yahoo!. I will also discuss our approach to open source software, with Hadoop as an example.
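
For readers to whom the paradigm is less familiar: a map function emits intermediate (key, value) pairs, the framework groups them by key, and a reduce function aggregates each group. A local word-count sketch in Python (illustrative only; Hadoop's real API is Java):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts for one word."""
    yield word, sum(counts)

def run_local(docs):
    groups = defaultdict(list)                 # the "shuffle": group by key
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(kv for key, values in groups.items()
                for kv in reduce_fn(key, values))

print(run_local({"d1": "big data big table", "d2": "big graph"}))
# -> {'big': 3, 'data': 1, 'table': 1, 'graph': 1}
```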

   Yahoo! has also developed a data serving storage system called Sherpa (sometimes referred to as PNUTS) to support data-backed web applications. These applications have stringent availability, performance and partition tolerance requirements that are difficult, sometimes even impossible, to meet using conventional database management systems. On the other hand, they typically are able to trade off consistency to achieve their goals. This has led to the development of specialized key-value stores, which are now used widely in virtually every large-scale web service.

   Since most web services also require capabilities such as indexing, we are witnessing an evolution of data serving stores as systems builders seek to balance these trade-offs. In addition to presenting PNUTS/Sherpa, I will survey some of the solutions that have been developed, including Amazon’s S3 and SimpleDB, Microsoft’s Azure, Google’s Megastore, the open source systems Cassandra and HBase, and Yahoo!’s PNUTS, and discuss the challenges in building such systems as “cloud services”, providing elastic data serving capacity to developers, along with appropriately balanced consistency, availability, performance and partition tolerance.
 
 
 
3 Microsoft
[1] Scale-Out Beyond Map-Reduce
Raghu Ramakrishnan, CISL Team Members
KDD’13, August 11–14, 2013, Chicago, Illinois, USA.
 
 
Keywords
Analytics, Big Data, data science, Hadoop, YARN, REEF, MapReduce, SQL, Machine Learning, scale-out.

1. ABSTRACT
 
   scale-out architectures

   Hadoop
   Hive: a data warehouse tool built on top of Hadoop (Apache)
   Pig: a SQL-like scripting layer over the MapReduce framework

   YARN: Apache Hadoop NextGen MapReduce
   Apache Mesos: a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks
 
 
   

