[1-] Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
OSDI 2006
Abstract Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
1 Introduction
The organization of this paper:
1) Section 2 describes the data model in more detail.
2) Section 3 provides an overview of the client API.
3) Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends.
4) Section 5 describes the fundamentals of the Bigtable implementation.
5) Section 6 describes some of the refinements that we made to improve Bigtable's performance.
6) Section 7 provides measurements of Bigtable's performance.
7) Section 8 describes several examples of how Bigtable is used at Google.
8) Section 9 discusses some lessons we learned in designing and supporting Bigtable.
9) Section 10 describes related work.
10) Finally, Section 11 presents our conclusions.
2 Data Model
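The paper describes the data model as a sparse, distributed, persistent multidimensional sorted map indexed by (row key, column key, timestamp), mapping to an uninterpreted string. A minimal in-memory sketch of that map (the class and method names here are mine, not Bigtable's actual API):

```python
# Toy sketch of Bigtable's (row, column, timestamp) -> string map.
# Illustrative only; not the real client API.

class ToyBigtable:
    def __init__(self):
        # {row_key: {column_key: [(timestamp, value), ...]}}, newest first
        self.rows = {}

    def put(self, row, column, timestamp, value):
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((timestamp, value))
        cells.sort(reverse=True)  # Bigtable stores cell versions newest-first

    def get(self, row, column, max_versions=1):
        # Return up to max_versions cells, most recent timestamp first.
        return self.rows.get(row, {}).get(column, [])[:max_versions]

# Row keys in the paper's running example are reversed domain names,
# so pages from the same site cluster together in sorted row order.
t = ToyBigtable()
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "contents:", 5, "<html>v5</html>")
print(t.get("com.cnn.www", "contents:"))  # newest version only
```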
3 API
4 Building Blocks
GFS: the distributed Google File System
5 Implementation
The Bigtable implementation has three major components: -- a library that is linked into every client,
-- one master server,
-- many tablet servers.
5.1 Tablet Location
Bigtable uses a three-level hierarchy analogous to that of a B+-tree to store tablet location information (Figure 4).
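The three levels are: a Chubby file naming the root METADATA tablet, the root tablet mapping row ranges to second-level METADATA tablets, and those tablets mapping ranges to user tablets. A sketch of the lookup, treating each tablet as a sorted list of (end_row, location) pairs; all server names here are hypothetical:

```python
# Sketch of the three-level tablet-location lookup. Each tablet owns
# the row range up to and including its end_row, so a binary search
# over sorted end rows finds the owning tablet.
from bisect import bisect_left

def find_tablet(tablets, row):
    """tablets: sorted list of (end_row, location) pairs."""
    ends = [end for end, _ in tablets]
    return tablets[bisect_left(ends, row)][1]

# Level 1: the Chubby file points at the root METADATA tablet.
root_location = "metadata-server-0"          # hypothetical name
# Level 2: the root tablet maps ranges to other METADATA tablets.
root_tablet = [("m", "metadata-server-1"), ("\xff", "metadata-server-2")]
# Level 3: a METADATA tablet maps ranges to user tablet servers.
meta_tablets = {"metadata-server-2": [("q", "tablet-server-7"),
                                      ("\xff", "tablet-server-9")]}

meta_loc = find_tablet(root_tablet, "user-row-x")
user_loc = find_tablet(meta_tablets[meta_loc], "user-row-x")
print(user_loc)
```

In the real system clients cache these locations and fall back up the hierarchy only on a cache miss, so most reads need no extra round trips.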
5.2 Tablet Assignment
5.3 Tablet Serving
The persistent state of a tablet is stored in GFS, as illustrated in Figure 5.
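Section 5.3's read/write path can be sketched as follows: a write goes to a commit log (for durability) and an in-memory memtable; a read merges the memtable with the immutable on-disk SSTables, newest data winning. This is a simplified single-version sketch with invented names, not Bigtable's code:

```python
# Toy tablet: commit log + memtable + immutable SSTables.
# In the real system the log and SSTables live in GFS.

class ToyTablet:
    def __init__(self):
        self.commit_log = []   # redo records, written before applying
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # frozen immutable files, newest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for sst in self.sstables:             # then newest SSTable first
            if key in sst:
                return sst[key]
        return None

    def minor_compaction(self):
        # Freeze the memtable into a new SSTable and reset the log.
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}
        self.commit_log = []

t = ToyTablet()
t.write("k1", "v1")
t.minor_compaction()
t.write("k1", "v2")
print(t.read("k1"))   # memtable shadows the older SSTable value
```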
[2-] Pregel: A System for Large-Scale Graph Processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski
SIGMOD’10, June 6–11, 2010, Indianapolis, Indiana, USA.
ABSTRACT Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs (in some cases billions of vertices, trillions of edges) poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
Keywords: Distributed computing, graph algorithms
1. INTRODUCTION
The rest of the paper is structured as follows.
1) Section 2 describes the model.
2) Section 3 describes its expression as a C++ API.
3) Section 4 discusses implementation issues, including performance and fault tolerance.
4) Section 5 presents several applications of this model to graph algorithm problems.
5) Section 6 presents performance results.
6) Finally, Section 7 discusses related work and future directions.
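The vertex-centric superstep model from the abstract can be sketched as a loop: in each superstep a vertex reads messages from the previous superstep, updates its state, and sends messages along outgoing edges; computation halts when no messages are in flight. This toy propagates the maximum vertex value, a standard Pregel illustration; the function and names are mine, not the C++ API of Section 3:

```python
# Toy Pregel-style superstep loop: propagate the maximum value.

def pregel_max(graph, values):
    """graph: {vertex: [out-neighbors]}; values: {vertex: initial value}."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to its out-neighbors.
    inbox = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            inbox[w].append(values[v])
    # Later supersteps: update from messages, send only on change,
    # otherwise stay silent (the vertex has effectively voted to halt).
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for w in graph[v]:
                    outbox[w].append(values[v])
        inbox = outbox
    return values

g = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pregel_max(g, {"a": 3, "b": 6, "c": 1}))  # every vertex ends at 6
```

The "implied synchronicity" the abstract mentions is visible here: all messages produced in one superstep are delivered together at the start of the next, which makes the algorithm easy to reason about.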
[3-] PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
VLDB '08
ABSTRACT We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
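The "novel per-record consistency guarantees" are PNUTS's per-record timeline consistency: all updates to a record are serialized by one per-record master, so every replica moves through the same ordered sequence of versions, possibly lagging behind. A sketch with invented class names, not the PNUTS API:

```python
# Toy per-record timeline consistency: one master orders all writes to a
# record; replicas may be stale but only ever see versions on the timeline.

class Record:
    def __init__(self):
        self.versions = []          # ordered timeline of (version, value)

    def master_write(self, value):
        v = len(self.versions) + 1  # the master assigns the next version
        self.versions.append((v, value))
        return v

class Replica:
    def __init__(self, record):
        self.record = record
        self.applied = 0            # how far along the timeline we are

    def replicate_up_to(self, version):
        self.applied = max(self.applied,
                           min(version, len(self.record.versions)))

    def read_any(self):
        """May return a stale value, but always one from the timeline."""
        return self.record.versions[self.applied - 1] if self.applied else None

rec = Record()
rec.master_write("v1")
v2 = rec.master_write("v2")
lagging = Replica(rec)
lagging.replicate_up_to(1)
print(lagging.read_any())   # stale but valid: version 1
```

This is weaker than serializable transactions but stronger than pure eventual consistency: a replica can lag, but it can never see versions out of order or a value that was never written.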
1. INTRODUCTION
web applications
The foremost requirements of a web application:
-- Scalability.
-- Response Time and Geographic Scope.
-- High Availability and Fault Tolerance.
ABSTRACT: In this short paper we describe the data that Yahoo! handles, the current trends in Web applications, and the many challenges that this poses for Yahoo! Research. These challenges have led to the development of new data systems and novel data mining techniques.
Problem:
Storing and managing this ocean of data poses several important challenges.
Four major interconnected trends:
-- The emergence of structure
-- Design and dynamics of social systems
-- The Web as a delivery channel
-- The Web as wisdom (Web mining)
Data Platforms:
Dynamo
Two ongoing research projects: Sherpa/PNUTS and PIG.
DASFAA 2010, Part I, LNCS 5981, p. 2, 2010. (A one-page piece, abstract only.)
Abstract.
In this talk, I will present an overview of cloud computing at Yahoo!, in particular, the data management aspects. I will discuss two major systems in use at Yahoo!: the Hadoop map-reduce system and the PNUTS/Sherpa storage system, in the broader context of offline and online data management in a cloud setting.
Hadoop is a well known open source implementation of a distributed file system with a map-reduce interface. Yahoo! has been a major contributor to this open source effort, and Hadoop is widely used internally. Given that the map-reduce paradigm is widely known, I will cover it briefly and focus on describing how Hadoop is used at Yahoo!. I will also discuss our approach to open source software, with Hadoop as an example.
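The map-reduce paradigm the talk refers to can be sketched in plain Python: map emits (key, value) pairs, the framework groups pairs by key, and reduce folds each group. Word count is the canonical example; in Hadoop the same logic would be written as Mapper and Reducer classes, with the shuffle done by the framework:

```python
# Minimal map/shuffle/reduce sketch of the map-reduce paradigm
# (word count). Illustrative only; Hadoop runs these phases
# distributed across a cluster, with the shuffle over the network.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)          # emit one pair per occurrence

def shuffle(pairs):
    groups = defaultdict(list)       # group values by key
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
```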
Yahoo! has also developed a data serving storage system called Sherpa (sometimes referred to as PNUTS) to support data-backed web applications. These applications have stringent availability, performance and partition tolerance requirements that are difficult, sometimes even impossible, to meet using conventional database management systems. On the other hand, they typically are able to trade off consistency to achieve their goals. This has led to the development of specialized key-value stores, which are now used widely in virtually every large-scale web service.
Since most web services also require capabilities such as indexing, we are witnessing an evolution of data serving stores as systems builders seek to balance these trade-offs. In addition to presenting PNUTS/Sherpa, I will survey some of the solutions that have been developed, including Amazon's S3 and SimpleDB, Microsoft's Azure, Google's Megastore, the open source systems Cassandra and HBase, and Yahoo!'s PNUTS, and discuss the challenges in building such systems as "cloud services", providing elastic data serving capacity to developers, along with appropriately balanced consistency, availability, performance and partition tolerance.