
OrientX 3.5, a Native XML Database Management System

5314 reads | 2009-6-25 21:01 | Personal category: Databases and Knowledge Bases | System category: Research Notes

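The open-source spirit means opening everything up and being glad to share. In practice there is little worth keeping secret, and the work gets better when everyone helps you study it.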

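There are many projects of this kind abroad, and some are covered on this blog. First, one thing is certain: a research group like this benefits both teaching and research. However, too much of the system is borrowed (the open-source license issues must be handled properly), the development tools chosen do not lend themselves to open-source extension, and Linux is not yet supported.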

Background

The increasing volume of XML data calls for efficient storage and retrieval. A straightforward solution is to map XML data into relational tables and XML queries into SQL. This mapping incurs a heavy cost, and some semantics are lost. A native solution, by contrast, has the benefit that there is no need to map your XML data to some other data structure. You just store the data as XML and retrieve it as XML. This is especially valuable when you have very complex XML structures that would be difficult or impossible to map to a more structured database.

OrientX is an open source system. Currently, we provide users with three ways to use the OrientX system:

Development package. In this way, all sources in OrientX are compiled into static libraries and composed into a development package. Based on the package, users can develop their own applications, but they cannot modify any of the implementation techniques in OrientX. (Open source in a diluted form.)
Standalone application. In this way, users can manage and retrieve their XML data through a set of commands in a console application.
Client/Server application. In this way, OrientX provides a friendly GUI client. All operations from different clients are submitted to the OrientX Server, which runs continuously in the background like a DBMS server. (So it is an XML DBMS; whether native or hybrid, it belongs to the broad category of DBMS.)

Acknowledgements

Thanks to Apache xercesc. Apache xercesc is used to parse XML and schema files when they are imported into the OrientX system.
Thanks to Parser Generator from Bumble-Bee Software. It is used to generate the lexical and syntactic analyzers for XML Query.
Thanks to XMark. XMark data is used as test cases for OrientX testing. Some XMark data is also used as example data for the OrientX demo.
Why is there no mention of Berkeley DB or Berkeley DB XML?!

Xerces Java Parser

Berkeley DB (plenty of documentation here)

XMark — An XML Benchmark Project
-------------------------------------------------------------------------------------------------------------------------------------------------------

Berkeley DB XML Reference Guide, source: http://www.oracle.com/technology/documentation/berkeley-db/xml/ref_xml/xml/arch.html

Architecture

Berkeley DB XML is implemented as a C++ library on top of Berkeley DB. BDB XML is distributed as a shared library that is embedded into the client application. The BDB XML library exposes APIs that enable C++ and Java applications to interact with the XML data containers. Figure 1 illustrates the Berkeley DB XML system architecture.

BDB XML uses Berkeley DB for data storage and transaction management. Client applications can also store data directly to a Berkeley DB database. Although BDB XML hides much of the internal use of Berkeley DB, some understanding of the underlying Berkeley DB API is required, as some BDB XML API methods accept Berkeley DB object handles as parameters. In particular, transactional applications need to fully understand the Berkeley DB database management interfaces for operations such as backup and restore, archiving, database recovery, etc.

The BDB XML library comprises several main components: document storage, XML indexing and index management, query optimization, and query execution.

----------------------------------------------------------------------------------------------------------------------------------------------------

ARCHITECTURE OF ORIENTX

OrientX adopts a client-server architecture. The client provides graphical interfaces for users to manage and retrieve data. The server provides an API for accessing the database. Communication between them is implemented with sockets. The overall architecture of OrientX is shown in Figure 1. We briefly introduce some modules here, and the more important modules are covered in the following sections.

Figure 1. Architecture of OrientX

File Manager: The underlying file manager communicates with the file system to create, delete, open, and close data files, in units of a fixed size such as 8 MB.

Storage Manager: The storage manager manages the storage space of the file in units of a physical page, which is set to 8 KB. Its main tasks include allocating and freeing physical pages, creating and deleting datasets, etc.
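
To make the page bookkeeping concrete, here is a minimal sketch of page allocation on top of 8 MB file units. Only the 8 KB and 8 MB sizes come from the text; the class name PageAllocator, the free list, and the rule of growing the file one unit at a time are my own assumptions, not OrientX's actual code.

```python
PAGE_SIZE = 8 * 1024            # 8 KB physical page, as stated in the text
FILE_UNIT = 8 * 1024 * 1024     # 8 MB file unit, as stated in the text
PAGES_PER_UNIT = FILE_UNIT // PAGE_SIZE


class PageAllocator:
    """Hypothetical free-list page allocator (illustration only)."""

    def __init__(self):
        self.total_pages = 0    # pages currently backed by the data file
        self.next_unused = 0    # first page number never handed out
        self.free_pages = []    # freed page numbers available for reuse

    def allocate(self):
        # Prefer a previously freed page over growing the file.
        if self.free_pages:
            return self.free_pages.pop()
        if self.next_unused == self.total_pages:
            # Assumed behavior: ask the file manager for one more unit.
            self.total_pages += PAGES_PER_UNIT
        page_no = self.next_unused
        self.next_unused += 1
        return page_no

    def free(self, page_no):
        self.free_pages.append(page_no)

    @staticmethod
    def offset(page_no):
        # Byte offset of the page inside the data file.
        return page_no * PAGE_SIZE
```

With these sizes, one file unit holds 1024 pages, so the file only needs to grow once per 1024 fresh allocations.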

Buffer Manager: The buffer mechanism has two layers: the lower layer is the page buffer, and the higher layer is the record buffer. As in an RDBMS, the page buffer manager manages physical pages with the LRU (Least Recently Used) policy. Unlike in an RDBMS, a record in OrientX is a tree structure that must be generated from the byte stream, which costs some CPU time. The record buffer caches these tree structures to reduce that generation time. Another main goal of the record buffer is to let OrientX query large documents: through the record buffer, documents can be read in pieces (records), and unoccupied records can be freed to accommodate new ones. In the OrientX system, the record buffer is called treefrog, meaning that the current cursor can jump from record to record on the XML tree.
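
The lower layer can be pictured with a small LRU page cache. This is only a sketch of the general LRU technique: the class LRUPageBuffer and the load_page callback (standing in for a disk read) are hypothetical names, and nothing here reflects how OrientX implements it.

```python
from collections import OrderedDict


class LRUPageBuffer:
    """Hypothetical LRU page cache; load_page stands in for a disk read."""

    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page
        self.pages = OrderedDict()   # page_no -> bytes, least recent first

    def get(self, page_no):
        if page_no in self.pages:
            # Hit: mark the page as most recently used.
            self.pages.move_to_end(page_no)
            return self.pages[page_no]
        # Miss: read the page, then evict the least recently used one
        # if the buffer is over capacity.
        data = self.load_page(page_no)
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)
        return data
```

A record buffer could sit on top of this in the same style, caching decoded tree records instead of raw pages.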

Access Manager: The access manager provides a uniform access interface to data manager, index manager, and schema manager. Details of the buffer manager and storage manager are hidden.

Data Manager: The data manager provides functions for importing, exporting, retrieving the root of a document, etc. It formats a record (an in-memory object) into, and back from, a byte stream.
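
As a rough illustration of turning a tree record into a byte stream and back, here is a sketch with a made-up encoding: each node is written as length-prefixed tag and text, followed by a child count and the children. The Node class and this layout are assumptions for illustration; the actual OrientX record format is not described in the text.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory tree record."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def _pack_str(s):
    # 4-byte little-endian length, then the UTF-8 bytes.
    raw = s.encode("utf-8")
    return struct.pack("<I", len(raw)) + raw


def encode(node):
    """Serialize a node and its subtree in pre-order."""
    out = _pack_str(node.tag) + _pack_str(node.text)
    out += struct.pack("<I", len(node.children))
    for child in node.children:
        out += encode(child)
    return out


def decode(buf, pos=0):
    """Rebuild a node from buf at pos; return (node, next position)."""
    def read_str(p):
        (n,) = struct.unpack_from("<I", buf, p)
        p += 4
        return buf[p:p + n].decode("utf-8"), p + n

    tag, pos = read_str(pos)
    text, pos = read_str(pos)
    (count,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    children = []
    for _ in range(count):
        child, pos = decode(buf, pos)
        children.append(child)
    return Node(tag, text, children), pos
```

The decode step is the "generating" cost that the record buffer is meant to avoid repeating.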

Schema Manager: A schema-independent system can import XML data without a schema. To accelerate query processing, however, such a system needs to extract the schema from the data, which may make the schema even larger and more complex than the data. Moreover, an extracted schema does not constrain the data, which limits its use in cases such as type checking for queries and updates. Like a traditional database, OrientX is schema-based. The schema strictly constrains the type and structure of the data, so data retrieval, updating, and storage are all guided by the schema. Schema information can be used in data layout, in the choice of index, in type checking, in user access control, and in query optimization. The schema in OrientX is consistent with the XML Schema standard. Schema information is stored as a special dataset in the database. Because a schema stored as a tree structure is itself semi-structured, it can restrict XML data without breaking the features of XML data. The schema manager provides a uniform interface for other modules to access the schema information.

Data Processor: The data processor includes the query evaluator and the data updater. The former will be described in Section 5; we briefly introduce the latter here. In an RDBMS, relationships between records are represented by foreign keys, and in an OODBMS, relationships between objects are represented by object containment. XML supports both: identity references and nested structure. OrientX maintains referential integrity during updates. When a complex element is deleted, all of its nested elements and values are removed. When an element referenced by other elements is deleted, the corresponding references are found through the value index and then deleted. Deleting a reference directly is also supported.
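
The deletion rules can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the Store class, modelling each reference as its own small node, and using a dictionary as the "value index" from a target element to the references that point at it.

```python
class Store:
    """Hypothetical in-memory model of elements and references."""

    def __init__(self):
        self.children = {}     # element id -> list of child ids
        self.parent = {}       # element id -> parent id (None for the root)
        self.ref_of = {}       # reference node id -> target element id
        self.value_index = {}  # target element id -> set of referrer ids

    def add(self, eid, parent=None, ref=None):
        self.children[eid] = []
        self.parent[eid] = parent
        if parent is not None:
            self.children[parent].append(eid)
        if ref is not None:
            self.ref_of[eid] = ref
            self.value_index.setdefault(ref, set()).add(eid)

    def delete(self, eid):
        # Collect the element, everything nested in it, and every
        # reference pointing at any of them (found via the index).
        doomed, stack = set(), [eid]
        while stack:
            cur = stack.pop()
            if cur in doomed or cur not in self.children:
                continue
            doomed.add(cur)
            stack.extend(self.children[cur])
            stack.extend(self.value_index.get(cur, ()))
        # Remove them and keep the index and parent lists consistent.
        for cur in doomed:
            self.value_index.pop(cur, None)
            target = self.ref_of.pop(cur, None)
            if target is not None and target in self.value_index:
                self.value_index[target].discard(cur)
            parent = self.parent.pop(cur)
            if parent is not None and parent not in doomed:
                self.children[parent].remove(cur)
            del self.children[cur]
        return doomed
```

Deleting a reference node directly goes through the same method and only removes that node and its index entry.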

In our storage prototype, elements are stored as variable-length records. Each record holds a pointer to its parent record or neighboring sibling record. Records may change address when their contents grow or shrink during update operations, which in turn changes the pointers. To reduce pointer modifications, we introduce the oid (object id). Each element has a unique id, and an oid table stores each oid with its corresponding storage address. A record stores the oids of its parent and children as pointers rather than their storage addresses. Therefore, if the storage address of a record changes because of an update, we only need to update the oid table.
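
A small sketch shows why the indirection helps. The names OidTable, Record, and add_record are hypothetical, and representing a storage address as a (page, slot) pair is my assumption; the text only says that an oid maps to a storage address.

```python
class OidTable:
    """Hypothetical oid -> storage address map."""

    def __init__(self):
        self._address = {}
        self._next_oid = 1

    def register(self, address):
        oid = self._next_oid
        self._next_oid += 1
        self._address[oid] = address
        return oid

    def resolve(self, oid):
        return self._address[oid]

    def relocate(self, oid, new_address):
        # The only write needed when a record moves on disk.
        self._address[oid] = new_address


class Record:
    """Record that points at relatives by oid, never by address."""

    def __init__(self, oid, parent_oid=None):
        self.oid = oid
        self.parent_oid = parent_oid
        self.child_oids = []


def add_record(table, records, address, parent=None):
    # address is an assumed (page, slot) pair.
    oid = table.register(address)
    rec = Record(oid, parent.oid if parent is not None else None)
    records[oid] = rec
    if parent is not None:
        parent.child_oids.append(oid)
    return rec
```

When a child record moves, the parent's child_oids list stays untouched and only one table entry changes, at the price of one extra lookup per pointer traversal.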

To reduce address changes for updated records, we set a preserve factor for each page to reserve space for records that grow during updates. We also provide a garbage collection mechanism for space reuse.
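
One way to read the preserve factor is as a per-page reserve that inserts may not touch but updates may. The sketch below assumes that reading; the Page class, the 0.2 default, and the byte counting are illustrative guesses, not values from OrientX.

```python
PAGE_SIZE = 8 * 1024   # 8 KB page, as stated earlier in the text


class Page:
    """Hypothetical page with space reserved for growing records."""

    def __init__(self, preserve_factor=0.2):
        self.used = 0
        # New records may fill the page only up to this limit.
        self.insert_limit = int(PAGE_SIZE * (1 - preserve_factor))

    def insert(self, size):
        if self.used + size > self.insert_limit:
            return False       # caller should try another page
        self.used += size
        return True

    def grow(self, extra):
        # An update may use the reserved space, up to the full page.
        if self.used + extra > PAGE_SIZE:
            return False       # record must move; its address changes
        self.used += extra
        return True
```

A larger factor wastes more space on pages that are never updated, so the right value depends on how update-heavy the workload is.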