16 Business Intelligence

This chapter describes some of the basic ideas in business intelligence.
This chapter contains the following topics:

16.1 Introduction to Data Warehousing and Business Intelligence

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

16.1.1 Characteristics of Data Warehousing

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.

16.1.2 Differences Between Data Warehouse and OLTP Systems

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:

Workload

Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

Data Modifications

A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

Schema Design

Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
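As an illustrative sketch (all table and column names here are hypothetical, not taken from this chapter's figures), a minimal star schema pairs one large fact table with small, denormalized dimension tables:

```sql
-- Dimension tables: descriptive, denormalized for query performance.
CREATE TABLE times (
  time_id  DATE PRIMARY KEY,
  month    NUMBER,
  year     NUMBER
);

CREATE TABLE products (
  product_id   NUMBER PRIMARY KEY,
  product_name VARCHAR2(50),
  category     VARCHAR2(30)
);

-- Fact table: a foreign key to each dimension plus additive measures.
CREATE TABLE sales (
  time_id    DATE   REFERENCES times,
  product_id NUMBER REFERENCES products,
  amount     NUMBER(10,2),
  quantity   NUMBER
);
```

An OLTP schema for the same business would instead normalize these attributes across many more tables.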

Typical Operations

A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
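The contrast can be made concrete with two hedged example queries (the sales and orders tables are hypothetical):

```sql
-- Warehouse-style query: scans and aggregates a month of detail rows.
SELECT SUM(s.amount)
FROM   sales s
WHERE  s.time_id BETWEEN DATE '2024-06-01' AND DATE '2024-06-30';

-- OLTP-style operation: fetches a handful of rows by key.
SELECT *
FROM   orders
WHERE  customer_id = :cust_id
AND    status = 'OPEN';
```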

Historical Data

Data warehouses usually store many months or years of data. This is to support historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.

16.1.3 Data Warehouse Architecture

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

Data Warehouse Architecture (Basic)

Figure 16-1 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.
Figure 16-1 Architecture of a Data Warehouse




Figure 16-1 shows a data warehouse with metadata, raw data, and summary data. Data going into the data warehouse comes from data sources (operational systems and flat files) and from users (doing analysis, reporting, and mining).
In Figure 16-1, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations. For example, a typical data warehouse query is to retrieve something like August sales.
Summaries in Oracle are called materialized views.
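As a hedged sketch of the idea (table and column names are illustrative), a monthly sales summary could be declared as a materialized view so that queries can use the precomputed totals instead of scanning the detail rows:

```sql
-- Precompute monthly totals; ENABLE QUERY REWRITE lets the optimizer
-- substitute this summary for matching queries against the detail table.
CREATE MATERIALIZED VIEW sales_by_month
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(time_id, 'MM') AS month,
       SUM(amount)          AS total_amount
FROM   sales
GROUP  BY TRUNC(time_id, 'MM');
```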

Data Warehouse Architecture (with a Staging Area)

In Figure 16-1, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 16-2 illustrates this typical architecture.

Figure 16-2 Architecture of a Data Warehouse with a Staging Area




Figure 16-2 shows a data warehouse with metadata, raw data, and summary data. Data going into the data warehouse comes from data sources (operational systems and flat files) and from users (doing analysis, reporting, and mining). The operational data from the data sources first goes to a staging area.

Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 16-2 is quite common, you might want to customize your warehouse's architecture for different groups within your organization.
Do this by adding data marts, which are systems designed for a particular line of business. Figure 16-3 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.
Figure 16-3 Architecture of a Data Warehouse with a Staging Area and Data Marts




Figure 16-3 shows a data warehouse with metadata, raw data, and summary data. Data going into the data warehouse comes from data sources (operational systems and flat files) and from users (doing analysis, reporting, and mining). The operational data from the data sources first goes to a staging area. The user data is separated into lines of business, such as sales and purchasing.

See Also:

Oracle Database Data Warehousing Guide


16.2 Overview of Extraction, Transformation, and Loading (ETL)

You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems form the IT backbone of any enterprise. Data has to be shared between applications or systems, which must be integrated so that at least two applications have the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what we now call ETL.
Data warehouse environments face the same challenge with the additional burden that they not only have to exchange but to integrate, rearrange and consolidate data over many systems, thereby providing a new unified information base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.
What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, so more data than necessary has to be extracted, and identification of the relevant data is done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary from days or hours down to minutes or near real-time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period of time.
After extracting data, it has to be physically transported to the target system or an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.
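A minimal sketch of this kind of in-flight transformation, assuming a hypothetical customers table reachable over a database link named source_db:

```sql
-- Transformation during transportation: the concatenation happens as the
-- rows are pulled across the link, with no intermediate file.
INSERT INTO customers_stage (customer_id, full_name)
SELECT customer_id,
       first_name || ' ' || last_name   -- columns combined in flight
FROM   customers@source_db;             -- remote source via a gateway/db link
```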
If any errors occur during loading, an error is logged and the operation can continue.

16.2.1 Transportable Tablespaces

Transportable tablespaces are the fastest way for moving large volumes of data between two Oracle databases. You can transport tablespaces between different computer architectures and operating systems.
Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or exported into files from the source database. Then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass the unload and reload steps.
Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data.
The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart.
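The typical flow might look like the following sketch (the tablespace name, directory object, and file paths are assumptions for illustration):

```sql
-- On the source: make the tablespace read-only so its data files are stable.
ALTER TABLESPACE sales_ts READ ONLY;

-- From the operating system, export only the metadata with Data Pump and
-- copy the data files to the target (commands shown as comments):
--   expdp system DIRECTORY=dp_dir DUMPFILE=sales_ts.dmp TRANSPORT_TABLESPACES=sales_ts
--   impdp system DIRECTORY=dp_dir DUMPFILE=sales_ts.dmp \
--         TRANSPORT_DATAFILES='/u01/oradata/sales01.dbf'

-- On the target, after the import, the tablespace can be reopened for writes.
ALTER TABLESPACE sales_ts READ WRITE;
```

The data files themselves are copied, never unloaded and reloaded, which is what makes this the fastest mechanism.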

16.2.2 Table Functions

Table functions provide support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios such as those mentioned earlier can be done without requiring the use of intermediate staging tables, which interrupt the data flow through the various transformation steps.
A table function is defined as a function that can produce a set of rows as output. Additionally, table functions can take a set of rows as input. Table functions extend database functionality by allowing:
  • Multiple rows to be returned from a function
  • Results of SQL subqueries (that select multiple rows) to be passed directly to functions
  • Functions to take cursors as input
  • Functions to be parallelized
  • Result sets to be returned incrementally, for further processing as soon as they are created (incremental pipelining)
Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
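A minimal pipelined table function in PL/SQL might look like this sketch (the types, names, and discount transformation are purely illustrative):

```sql
CREATE TYPE sale_row AS OBJECT (product_id NUMBER, amount NUMBER);
/
CREATE TYPE sale_tab AS TABLE OF sale_row;
/
-- PIPELINED: each row is emitted as soon as it is produced, so a consumer
-- can start working before the function has finished.
CREATE OR REPLACE FUNCTION discount_sales(src SYS_REFCURSOR)
  RETURN sale_tab PIPELINED
IS
  v_product NUMBER;
  v_amount  NUMBER;
BEGIN
  LOOP
    FETCH src INTO v_product, v_amount;
    EXIT WHEN src%NOTFOUND;
    PIPE ROW (sale_row(v_product, v_amount * 0.9));  -- transform in flight
  END LOOP;
  CLOSE src;
  RETURN;
END;
/
-- Consumed directly in SQL, with a cursor (a set of rows) as input:
SELECT *
FROM   TABLE(discount_sales(CURSOR(SELECT product_id, amount FROM sales)));
```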

16.2.3 External Tables

External tables let you use external data as a virtual table that can be queried and joined directly and in parallel without requiring the external data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data.
External tables enable the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading process without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further processing, such as comparison or transformation. For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table. Figure 16-4 illustrates a typical example of pipelining.
Figure 16-4 Pipelined Data Transformation




Figure 16-4 shows 4 dimension tables (products, customers, times, and channels) connected to one fact table, called sales.
The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE) are possible and no indexes can be created on them.
External tables are a complement to SQL*Loader and are especially useful for environments where the complete external source has to be joined with existing database objects and transformed in a complex manner, or where the external data volume is large and used only once. SQL*Loader, on the other hand, might still be the better choice for loading data where additional indexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations, or where the data is only partially used in further processing.
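Putting the pieces together, here is a hedged sketch of the external-table pattern (the directory object, file name, and columns are assumptions):

```sql
-- An external table over a flat file; the file is queried in place and is
-- never loaded into a staging table.
CREATE TABLE sales_ext (
  product_id NUMBER,
  amount     NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir          -- directory object pointing at the files
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales.csv')
);

-- Load and transform in one pipelined step: direct-path INSERT AS SELECT.
INSERT /*+ APPEND */ INTO sales (product_id, amount)
SELECT product_id,
       amount * 1.1                   -- illustrative transformation
FROM   sales_ext;
```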

16.2.4 Table Compression

You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for table compression is partitioned tables.
To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a slight cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
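A schematic example, with illustrative names, of declaring compression at creation time and when rebuilding an existing partition:

```sql
-- Compression declared on a partitioned fact table: read-mostly,
-- highly redundant data is the best candidate.
CREATE TABLE sales_history (
  time_id    DATE,
  product_id NUMBER,
  amount     NUMBER
)
COMPRESS
PARTITION BY RANGE (time_id) (
  PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
  PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01')
);

-- An existing partition can be compressed as part of a rebuild:
ALTER TABLE sales_history MOVE PARTITION p2023 COMPRESS;
```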

See Also:

"Table Compression"



16.2.5 Change Data Capture

Change Data Capture efficiently identifies and captures data that has been added to, updated, or removed from Oracle relational tables, and makes the change data available for use by applications.
Oftentimes, data warehousing involves the extraction and transportation of relational data from one or more source databases into the data warehouse for analysis. Change Data Capture quickly identifies and processes only the data that has changed, not entire tables, and makes the change data available for further use.
Change Data Capture does not depend on intermediate flat files to stage the data outside of the relational database. It captures the change data resulting from INSERT, UPDATE, and DELETE operations made to user tables. The change data is then stored in a database object called a change table, and the change data is made available to applications in a controlled way.

See Also:

Oracle Database Data Warehousing Guide


16.3 Overview of Materialized Views for Data Warehouses

One technique employed in data warehouses to improve performance is the creation of summaries. Summaries are special kinds of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. For example, you can create a table to contain the sums of sales by region and by product.
The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle using a schema object called a materialized view. Materialized views can perform a number of roles, such as improving query performance or providing replicated data.
Previously, organizations using summaries spent a significant amount of time and effort creating summaries manually, identifying which summaries to create, indexing the summaries, updating them, and advising their users on which ones to use. Summary management eased the workload of the database administrator and meant that the user no longer needed to be aware of the summaries that had been defined. The database administrator creates one or more materialized views, which are the equivalent of a summary. The end user queries the tables and views at the detail data level.
The query rewrite mechanism in the Oracle database server automatically rewrites the SQL query to use the summary tables. This mechanism reduces response time for returning results from the query. Materialized views within the data warehouse are transparent to the end user or to the database application.
Although materialized views are usually accessed through the query rewrite mechanism, an end user or database application can construct queries that directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because any change to the summaries will affect the queries that reference them.
To help you select from among the many possible materialized views in your schema, Oracle provides a collection of materialized view analysis and advisor functions and procedures in the DBMS_ADVISOR package. Collectively, these functions are called the SQL Access Advisor, and they are callable from any PL/SQL program. The SQL Access Advisor recommends materialized views from a hypothetical or user-defined workload or one obtained from the SQL cache. You can run the SQL Access Advisor from Oracle Enterprise Manager or by invoking the DBMS_ADVISOR package.
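As a hedged sketch, the QUICK_TUNE entry point of DBMS_ADVISOR can generate recommendations for a single statement (the task name and query are illustrative):

```sql
-- Ask the SQL Access Advisor to recommend access structures (such as
-- materialized views) for one workload statement.
BEGIN
  DBMS_ADVISOR.QUICK_TUNE(
    advisor_name => DBMS_ADVISOR.SQLACCESS_ADVISOR,
    task_name    => 'sales_mv_task',   -- hypothetical task name
    attr1        => 'SELECT TRUNC(time_id, ''MM''), SUM(amount)
                     FROM sales GROUP BY TRUNC(time_id, ''MM'')');
END;
/
```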
See Also:

Oracle Database Performance Tuning Guide for information about materialized views and the SQL Access Advisor

16.4 Overview of Bitmap Indexes in Data Warehousing

Bitmap indexes are widely used in data warehousing environments. These environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:
  • Reduced response time for large classes of ad hoc queries
  • Reduced storage requirements compared to other indexing techniques
  • Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory
  • Efficient maintenance during parallel DML and loads
Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table.
An index provides pointers to the rows in a table that contain a given key value. A regular index stores a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replaces a list of rowids.
Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap index provides the same functionality as a regular index. If the number of different key values is small, bitmap indexes save space.
Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. Rows that satisfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatically. A good candidate for a bitmap index would be a gender column due to the low number of possible values.
Parallel query and parallel DML work with bitmap indexes as they do with traditional indexes. Bitmap indexing also supports parallel index creation and concatenated indexes.
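A schematic example (table and column names assumed for illustration):

```sql
-- Bitmap indexes on low-cardinality columns.
CREATE BITMAP INDEX cust_gender_bix ON customers (gender);
CREATE BITMAP INDEX cust_region_bix ON customers (region);

-- For a multi-condition WHERE clause, the server can combine the two
-- bitmaps with a fast bit-wise AND before touching any table rows.
SELECT COUNT(*)
FROM   customers
WHERE  gender = 'F'
AND    region = 'WEST';
```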
See Also:

Oracle Database Data Warehousing Guide

16.5 Overview of Parallel Execution

When Oracle runs SQL statements in parallel, multiple processes work together simultaneously to run a single SQL statement. By dividing the work necessary to run a statement among multiple processes, Oracle can run the statement more quickly than if only a single process ran it. This is called parallel execution or parallel processing.
Parallel execution dramatically reduces response time for data-intensive operations on large databases typically associated with decision support systems (DSS) and data warehouses. Symmetric multiprocessing (SMP), clustered systems, and large-scale cluster systems gain the largest performance benefits from parallel execution because statement processing can be split up among many CPUs on a single Oracle system. You can also implement parallel execution on certain types of online transaction processing (OLTP) and hybrid systems.
Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. An example of this is when 12 processes handle 12 different months in a year instead of one process handling all 12 months by itself. The improvement in performance can be quite high.
Parallel execution helps systems scale in performance by making optimal use of hardware resources. If your system's CPUs and disk controllers are already heavily loaded, you need to alleviate the system's load or increase these hardware resources before using parallel execution to improve performance.
Some tasks are not well-suited for parallel execution. For example, many OLTP operations are relatively fast, completing in mere seconds or fractions of seconds, and the overhead of utilizing parallel execution would be large, relative to the overall execution time.
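A hedged sketch of the two common ways to request parallelism (the table name and degree of parallelism are illustrative):

```sql
-- Declare a default degree of parallelism (DOP) on the table itself:
ALTER TABLE sales PARALLEL 4;

-- Or request it per statement with a hint; a DOP of 4 is assumed here
-- purely for illustration.
SELECT /*+ PARALLEL(s, 4) */ SUM(s.amount)
FROM   sales s;
```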
See Also:

Oracle Database Data Warehousing Guide for specific information on tuning your parameter files and database to take full advantage of parallel execution

Oracle Database Data Warehousing Guide 了解如何调整参数文件及数据库,以便发挥并行执行的全部优势

How Parallel Execution Works

16.5.1 并行执行是如何工作的

113 When parallel execution is not used, a single server process performs all necessary processing for the sequential execution of a SQL statement. For example, to perform a full table scan (such as SELECT * FROM emp), one process performs the entire operation, as illustrated in Figure 16-5.
在未使用并行执行(parallel execution)时,由一个服务进程(server process)完成一个 SQL 语句顺序执行所需的全部处理工作。例如,进行全表扫描(如 SELECT * FROM emp)时,由一个进程完成全部操作,如 图 16-5 所示。
114 Figure 16-5 Serial Full Table Scan
图 16-5 串行全表扫描



116 Figure 16-6 illustrates several parallel execution servers performing a scan of the table emp. The table is divided dynamically (dynamic partitioning) into load units called granules and each granule is read by a single parallel execution server. The granules are generated by the coordinator. Each granule is a range of physical blocks of the table emp. The mapping of granules to execution servers is not static, but is determined at execution time. When an execution server finishes reading the rows of the table emp corresponding to a granule, it gets another granule from the coordinator if there are any granules remaining. This continues until all granules are exhausted, in other words, until the entire table emp has been read. The parallel execution servers send results back to the parallel execution coordinator, which assembles the pieces into the desired full table scan.
图 16-6 显示了多个并行执行服务器(parallel execution server)对表 emp 进行扫描。表被动态地划分(即动态分区(dynamic partitioning))为称作颗粒(granule)的负载单位(load unit),每个颗粒由一个并行执行服务器负责读取。颗粒是由并行执行协调器(coordinator)生成的。每个颗粒都是 emp 表的一组连续的物理数据块。颗粒与并行执行服务器之间的对应关系并不是静态的,而是在执行时决定的。当一个并行执行服务器完成了对 emp 表某一颗粒的读取后,若仍有颗粒剩余,它将从并行执行协调器获得下一个颗粒继续工作。这个过程将持续至所有颗粒全部处理完毕,即整个 emp 表被读取完毕。各并行执行服务器将结果返回给并行执行协调器,后者将所有部分结果汇总为全表扫描的结果。
117 Figure 16-6 Parallel Full Table Scan
图 16-6 并行全表扫描



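The coordinator/granule protocol just described can be sketched as follows (an illustrative Python simulation with threads and a shared queue; the granule size, block count, and names are invented, and Oracle's actual implementation differs):
上文描述的协调器/颗粒分配过程可用如下 Python 草图示意(仅为演示,采用线程与共享队列;颗粒大小,数据块数量及名称均为本例假设,Oracle 的实际实现与此不同):

```python
import queue
import threading

TABLE_BLOCKS = 100          # pretend emp occupies 100 physical blocks
GRANULE_SIZE = 16           # blocks per granule (an arbitrary choice here)

def make_granules():
    """Coordinator side: cut the table into contiguous block ranges."""
    q = queue.Queue()
    for start in range(0, TABLE_BLOCKS, GRANULE_SIZE):
        q.put((start, min(start + GRANULE_SIZE, TABLE_BLOCKS)))
    return q

def execution_server(granules, results):
    """Server side: keep taking granules until the coordinator runs out."""
    while True:
        try:
            start, end = granules.get_nowait()
        except queue.Empty:
            return                      # the entire table has been read
        results.append(end - start)     # "read" the blocks in this granule

granules = make_granules()
results = []
servers = [threading.Thread(target=execution_server,
                            args=(granules, results))
           for _ in range(4)]          # degree of parallelism 4
for s in servers:
    s.start()
for s in servers:
    s.join()
assert sum(results) == TABLE_BLOCKS    # every block was read exactly once
```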
119 Given a query plan for a SQL query, the parallel execution coordinator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators. The number of parallel execution servers assigned to a single operation is the degree of parallelism (DOP) for an operation. Multiple operations within the same SQL statement all have the same degree of parallelism.
对于一个 SQL 查询的执行计划(query plan),并行执行协调器将查询中的每个操作分解为多个并行部分,按查询中指定的顺序执行,最后将执行这些操作的并行执行服务器所产生的部分结果进行整合。为一个操作所分配的并行执行服务器的数量被称为此操作的并行度(degree of parallelism,DOP)。同一 SQL 语句内的所有操作的并行度均相同。
See Also:

Oracle Database Data Warehousing Guide for information on granules as well as how Oracle divides work and handles DOP in multiuser environments

Oracle Database Data Warehousing Guide 了解关于颗粒的信息,以及 Oracle 如何分割工作,如何在多用户环境中处理 DOP

Overview of Analytic SQL

16.6 分析型 SQL 概述

122 Oracle has introduced many SQL operations for performing analytic operations in the database. These operations include ranking, moving averages, cumulative sums, ratio-to-reports, and period-over-period comparisons. Although some of these calculations were previously possible using SQL, this syntax offers much better performance.
Oracle 提供了多种 SQL 功能,用于在数据库中进行分析操作。这些功能包括排名(ranking),移动平均值(moving average),累积求和(cumulative sum),占比(ratio-to-report),以及同期比较(period-over-period comparison)等。尽管有些计算以前的 SQL 也能实现,但新语法能够实现更好的性能。
123 This section discusses: 本节将讨论以下内容:

SQL for Aggregation

16.6.1 用于聚合运算的 SQL

125 Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your warehouse, Oracle provides extensions to the GROUP BY clause to make querying and reporting easier and faster. Some of these extensions enable you to:
  • Aggregate at increasing levels of aggregation, from the most detailed up to a grand total
  • Calculate all possible combinations of aggregations with a single statement
  • Generate the information needed in cross-tabulation reports with a single query
聚合运算(aggregation)是数据仓库系统的基本功能。为了提升数据仓库中聚合运算的性能,Oracle 对 GROUP BY 子句进行了扩展,使查询及报表生成更简单更迅速。这些扩展使用户可以进行以下操作:
  • 从最明细的数据逐层向上汇总,直至总计(grand total)
  • 在一个 SQL 语句中计算所有可能组合的汇总
  • 使用一个 SQL 语句获得交叉表(cross-tabulation reports)所需的数据
126 These extensions let you specify exactly the groupings of interest in the GROUP BY clause. This allows efficient analysis across multiple dimensions without performing a CUBE operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. CUBE, ROLLUP, and grouping sets produce a single result set that is equivalent to a UNION ALL of differently grouped rows.
这些扩展使用户可以在 GROUP BY 子句中精确指定其感兴趣的分组。这样,无需执行 CUBE 操作就能够高效地进行多维分析(analysis across multiple dimensions)。完整地计算一个立方体(cube)将带来很重的处理负载,因此以分组集(grouping sets)代替立方体能够显著地提升性能。CUBE,ROLLUP,及分组集均产生一个单一的结果集,其效果等价于使用 UNION ALL 将按不同方式分组的数据行合并。
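What ROLLUP-style aggregation computes can be sketched in plain Python (a conceptual illustration of the equivalence to a UNION ALL of differently grouped rows, not how the database implements it; the data is invented):
ROLLUP 式聚合所计算的内容可用纯 Python 示意(仅为概念演示,说明其结果等价于将按不同方式分组的数据行用 UNION ALL 合并,并非数据库的实现方式;数据为本例假设):

```python
from collections import defaultdict

# Hypothetical detail rows: (region, country, amount).
rows = [("Europe", "UK", 10), ("Europe", "FR", 20), ("Asia", "JP", 30)]

def rollup(rows):
    """GROUP BY ROLLUP(region, country): every input row contributes to
    three grouping levels, from most detailed up to the grand total.
    None plays the role of the SQL NULL in super-aggregate rows."""
    out = defaultdict(int)
    for region, country, amount in rows:
        out[(region, country)] += amount   # most detailed level
        out[(region, None)] += amount      # subtotal per region
        out[(None, None)] += amount        # grand total
    return dict(out)

result = rollup(rows)
assert result[("Europe", None)] == 30      # regional subtotal
assert result[(None, None)] == 60          # grand total
```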
127 To enhance performance, these extensions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance and scalability.
为提升性能,这些扩展可以被并行化:多个进程能够同时运行这些语句。这些功能使聚合计算更加高效,从而增强了数据库的性能及可伸缩性。
128 One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
多维分析(multidimensional analysis)是决策支持系统(decision support system)中的一个关键概念:其含义是以各种必要的维度(dimension)组合来观察一个企业。我们用维度一词泛指提出分析问题时所用的任何分类角度。常用的维度有时间(time),地域(geography),产品(product),部门(department),及渠道(distribution channel)等,但潜在的维度就像企业活动的种类一样层出不穷。与一组维度值(dimension value)相关的事件(event)或实体(entity)通常被称为事实(fact)。事实可以是销售数量或金额,利润,客户数量,产量等,任何值得跟踪的信息都可以作为事实。
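The dimension/fact vocabulary above can be pictured as plain data (a conceptual sketch, not a schema recommendation; the values are invented):
上述维度/事实的概念可用简单的数据结构来理解(仅为概念草图,并非模式设计建议;数据为本例假设):

```python
# Each fact value is addressed by one value from every dimension:
# (time, geography, product) -> sales in units.
facts = {
    ("1999", "US", "sedan"): 1200,
    ("1999", "JP", "sedan"): 800,
    ("2000", "US", "sedan"): 1500,
}

# A multidimensional question is then a filter plus an aggregate, e.g.
# "total sedan sales in 1999 across all geographies":
total_1999 = sum(v for (t, g, p), v in facts.items()
                 if t == "1999" and p == "sedan")
assert total_1999 == 2000
```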
129 Here are some examples of multidimensional requests:
  • Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
  • Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
  • List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
  • 显示所有产品在 1999 年及 2000 年的销售总额,并按地域维度逐层向上汇总(从州到国家再到地区)。
  • 创建交叉表,按南美洲各地区分析 1999 年及 2000 年企业运营的费用,并包括所有可能的小计(subtotal)。
  • 按 2000 年汽车产品的销售收入,列出亚洲区前 10 名销售代表,并对其佣金进行排名。
130 All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
以上请求均涉及多个维度。许多多维问题需要汇总数据,并对数据集进行比较,且这样的比较通常是跨时间,地域或预算进行的。
See Also:

Oracle Database Data Warehousing Guide

Oracle Database Data Warehousing Guide

SQL for Analysis

16.6.2 用于分析运算的 SQL

133 Oracle has advanced SQL analytical processing capabilities using a family of analytic SQL functions. These analytic functions enable you to calculate:
  • Rankings and percentiles
  • Moving window calculations
  • Lag/lead analysis
  • First/last analysis
  • Linear regression statistics
Oracle 内含有一套分析型 SQL 函数(analytic SQL function),提供了高级 SQL 分析处理能力。用户可以利用这套函数进行以下计算:
  • 排名(ranking)及百分位数(percentile)
  • 移动窗口计算(moving window calculation)
  • 滞后/提前分析(lag/lead analysis)
  • 首位/末位分析(first/last analysis)
  • 线性回归统计(linear regression statistic)
134 Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group.
排名函数包括累积分布(cumulative distribution),百分比排名(percent rank),及 N-tile 函数。移动窗口计算使用户可以进行移动及累积的汇总,例如求和及求平均值。滞后/提前分析能够在数据行之间建立直接引用(direct inter-row reference),用户可以借此计算相邻期间的变化(period-to-period changes)。首位/末位分析使用户能够获得一个有序分组中的第一个或最后一个值。
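Analytic functions of this style are now part of standard SQL; the sketch below uses Python's bundled SQLite (window functions require SQLite 3.25 or later) to show ranking, LAG-based period-to-period change, and a moving average. The table and data are invented for the example:
这类分析函数如今已属于标准 SQL;下面的草图使用 Python 自带的 SQLite(窗口函数需要 SQLite 3.25 及以上版本)演示排名,基于 LAG 的相邻期间变化,以及移动平均。表及数据均为本例假设:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (month INTEGER, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 100), (2, 120), (3, 90), (4, 150)])

rows = con.execute("""
    SELECT month,
           amount,
           RANK() OVER (ORDER BY amount DESC)              AS rnk,
           amount - LAG(amount) OVER (ORDER BY month)      AS change,
           AVG(amount) OVER (ORDER BY month
                             ROWS BETWEEN 1 PRECEDING
                                      AND CURRENT ROW)     AS moving_avg
    FROM sales
    ORDER BY month
""").fetchall()

# Each row: (month, amount, rank by amount, change vs. previous month,
# two-month moving average). The first month has no predecessor, so its
# LAG-based change is NULL (None).
for month, amount, rnk, change, moving_avg in rows:
    print(month, amount, rnk, change, moving_avg)
```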
135 Other features include the CASE expression. CASE expressions provide if-then logic useful in many situations.
Oracle 还支持 CASE 表达式。CASE 表达式提供了在许多场合下都十分有用的 if-then 逻辑。
136 To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity.
为提升性能,分析函数可以被并行化:多个进程能够同时运行这些语句。这些功能使计算更简单高效,从而增强了数据库的性能,可伸缩性及简洁性。
See Also:

Oracle Database Data Warehousing Guide

Oracle Database Data Warehousing Guide

SQL for Modeling

16.6.3 用于建模的 SQL

139 Oracle's MODEL clause brings a new level of power and flexibility to SQL calculations. With the MODEL clause, you can create a multidimensional array from query results and then apply formulas to this array to calculate new values. The formulas can range from basic arithmetic to simultaneous equations using recursion. For some applications, the MODEL clause can replace PC-based spreadsheets. Models in SQL leverage Oracle's strengths in scalability, manageability, collaboration, and security. The core query engine can work with unlimited quantities of data. By defining and executing models within the database, users avoid transferring large datasets to and from separate modeling environments. Models can be shared easily across workgroups, ensuring that calculations are consistent for all applications. Just as models can be shared, access can also be controlled precisely with Oracle's security features. With its rich functionality, the MODEL clause can enhance all types of applications.
Oracle 的 MODEL 子句为 SQL 计算带来了更强大更灵活的能力。利用 MODEL 子句,用户可以在查询结果的基础上创建一个多维数组(multidimensional array),再对此数组应用公式(formula)以计算新值。公式可以是基本的数学运算,也可以是使用递归的联立方程(simultaneous equations using recursion)。对于某些应用来说,MODEL 子句能够代替在 PC 上运行的电子表格软件。SQL 中的模型得以利用 Oracle 在可伸缩性,可管理性,协作能力及安全性方面的优势。核心查询引擎(core query engine)对其处理的数据量没有限制。通过在数据库内定义并执行模型,用户无需在数据库与独立的建模环境之间来回传输大量数据。模型可以方便地在工作组(workgroup)间共享,从而确保所有应用的计算结果保持一致。同样,对模型的访问也可以依靠 Oracle 的安全特性进行细致的控制。MODEL 子句丰富的功能有助于增强所有类型的应用系统。
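The multidimensional-array-plus-formulas idea of the MODEL clause can be loosely pictured in Python (a conceptual analogy only, not Oracle's MODEL syntax or semantics; the cells and rules are invented):
MODEL 子句“多维数组加公式”的思想可用 Python 粗略示意(仅为概念类比,并非 Oracle 的 MODEL 语法或语义;数组单元及规则均为本例假设):

```python
# Query results viewed as a multidimensional array: cells addressed by
# dimension values, here (product, year) -> sales.
cells = {
    ("widget", 2000): 100.0,
    ("widget", 2001): 110.0,
}

def apply_rules(cells):
    """Formulas over the array, in the spirit of MODEL rules such as
    sales['widget', 2002] = sales['widget', 2001] * 1.1."""
    cells[("widget", 2002)] = cells[("widget", 2001)] * 1.1
    # A later rule may reference the cell an earlier rule just produced.
    cells[("widget", 2003)] = (cells[("widget", 2002)]
                               + cells[("widget", 2000)])
    return cells

result = apply_rules(dict(cells))
assert abs(result[("widget", 2002)] - 121.0) < 1e-9
```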
See Also:

Oracle Database Data Warehousing Guide

Oracle Database Data Warehousing Guide

Overview of OLAP Capabilities

16.7 OLAP 功能概述

142 Oracle OLAP provides the query performance and calculation capability previously found only in multidimensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle OLAP is not a multidimensional database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLAP-enabled relational database. As a result, Oracle provides the benefits of a multidimensional database along with the scalability, accessibility, security, manageability, and high availability of the Oracle database. The Java OLAP API, which is specifically designed for internet-based analytical applications, offers productive data access.
Oracle OLAP 为 Oracle 关系型平台提供了以往只有多维数据库才具备的查询性能及计算能力。此外,Oracle OLAP 中提供了一套 Java OLAP API,适用于开发面向 internet 的分析应用程序。与其他将 OLAP 与 RDBMS 技术组合的方案不同,Oracle OLAP 并不是一个依靠桥接(bridge)将数据从关系型数据存储转移到多维数据存储的多维数据库,而是一个真正实现了 OLAP 功能的关系型数据库。因此,Oracle 在提供多维数据库优势的同时,还具备 Oracle 数据库的可伸缩性,可访问性(accessibility),安全性,可管理性,及高可用性。专为基于 internet 的分析应用程序设计的 Java OLAP API 提供了高效的数据访问能力。
See Also:

Oracle OLAP Application Developer's Guide

Oracle OLAP Application Developer's Guide

Benefits of OLAP and RDBMS Integration

16.7.1 OLAP 与 RDBMS 集成的优势

145 Basing an OLAP system directly on the Oracle database server offers the following benefits: 直接在 Oracle 数据库上构建 OLAP 系统具备以下优势:

Scalability 可伸缩性

147 There is tremendous growth along three dimensions of analytic applications: number of users, size of data, and complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone numbers as part of an application that is used to analyze customer turnover. This would require support for multi-million row dimension tables and very large volumes of fact data. Oracle can handle very large data sets using parallel execution and partitioning, as well as offering support for advanced hardware and clustering.
对一个分析应用系统(analytic application)来说,应当考虑三个方面的快速增长:用户数,数据量,及分析的复杂性。分析应用系统的用户在不断增长,他们需要访问更多的数据,以进行更复杂的分析及目标营销(target marketing)。例如,电信企业可能需要在客户维度中包含所有电话号码这样的明细数据,作为客户流失(customer turnover)分析应用的一部分。这就要求系统支持数百万行的维表以及非常大的事实数据量。Oracle 数据库具有并行执行(parallel execution)及分区(partitioning)能力,还能支持先进的硬件及集群,从而能够处理非常大的数据集。

Availability 可用性

149 Partitioning allows management of precise subsets of tables and indexes, so that management operations affect only small pieces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the time data is unavailable. Transportable tablespaces also support high availability. With transportable tablespaces, large data sets, including tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates.
采用分区技术(partitioning)后,用户可以对表及索引的精确子集进行管理,即管理操作只影响这些数据结构的一小部分。对表及索引进行分区后,数据管理的处理时间被缩短,从而使数据不可用的时间最小化。可移动表空间(transportable tablespace)技术也有助于提高可用性。采用可移动表空间技术,大数据集(包括表及索引)几乎无需任何处理即可添加到其他数据库中。这使得数据加载与更新可以极为迅速地完成。

Manageability 可管理性

151 Oracle lets you precisely control resource utilization. The Database Resource Manager, for example, provides a mechanism for allocating the resources of a data warehouse among different sets of end-users.
用户可以详细地控制 Oracle 中的资源使用。例如,Database Resource Manager(数据库资源管理器)能够为不同终端用户组分配数据仓库资源。
152 Another resource management facility is the progress monitor, which gives end users and administrators the status of long-running operations. Oracle maintains statistics describing the percent-complete of these operations. Oracle Enterprise Manager lets you view a bar-graph display of these operations showing what percent complete they are. Moreover, any other tool or any database administrator can also retrieve progress information directly from the Oracle data server using system views.
属于资源管理的功能还有进度监控(progress monitor),终端用户及管理员可以借此了解长时间运行操作(long-running operation)的状态。Oracle 能够自动维护描述这些操作完成百分比的统计信息。用户在 Oracle Enterprise Manager(Oracle 企业管理器)中可以以柱状图的形式查看这些操作的完成百分比。此外,其他工具或数据库管理员也可以通过系统视图直接从 Oracle 数据库服务器中获取进度信息。

Backup and Recovery 备份与恢复

154 Oracle provides a server-managed infrastructure for backup, restore, and recovery tasks that enables simpler, safer operations at terabyte scale. Some of the highlights are:
  • Details related to backup, restore, and recovery operations are maintained by the server in a recovery catalog and automatically used as part of these operations.
  • Backup and recovery operations are fully integrated with partitioning. Individual partitions, when placed in their own tablespaces, can be backed up and restored independently of the other partitions of a table.
  • Oracle includes support for incremental backup and recovery using Recovery Manager, enabling operations to be completed efficiently within times proportional to the amount of changes, rather than the overall size of the database.
Oracle 中提供了用于备份,复原及恢复的服务器端基础架构(server-managed infrastructure),使这些操作在大数据量环境下同样简单安全。这个基础架构的主要特征有:
  • 与备份,复原及恢复相关的详细信息由服务器管理,存储在恢复资料库(recovery catalog)中,在执行相应操作时被自动使用。
  • 备份恢复操作与分区技术(partitioning)紧密集成。如果每个分区拥有独立的表空间,那么这样的分区可以被独立地备份与复原。
  • Oracle 利用 Recovery Manager(恢复管理器)可以进行增量的备份与恢复,因此这些操作所需时间仅与数据变化量有关,而不受数据库总容量的影响。
See Also:

Oracle Database Backup and Recovery Quick Start Guide

Oracle Database Backup and Recovery Quick Start Guide

Security 安全性

157 The security features in Oracle have reached the highest levels of U.S. government certification for database trustworthiness. Oracle's fine grained access control enables cell-level security for OLAP users. Fine grained access control works with minimal burden on query processing, and it enables efficient centralized security management.
Oracle 的安全特性已经达到美国政府数据库可信度认证的最高等级。Oracle 细粒度访问控制功能可以为 OLAP 用户实现单元级(cell-level)的安全控制。细粒度访问控制对查询操作的性能影响很小,且能够实现高效的集中化安全管理。

Overview of Data Mining

16.8 数据挖掘概述

159 Oracle Data Mining (ODM) embeds data mining within the Oracle Database. The data never leaves the database: the data, data preparation, model building, and model scoring results all remain in the database. This enables Oracle to provide an infrastructure for application developers to integrate data mining seamlessly with database applications. Some typical examples of applications in which data mining is used are call centers, ATMs, ERM, and business planning applications.
Oracle Data Mining(ODM,Oracle 数据挖掘)组件将数据挖掘功能嵌入了 Oracle 数据库。任何数据都不必离开数据库,即分析所用的数据,数据准备(data preparation),建模(model building),模型评分结果(model scoring result)都保留在数据库中。应用开发者可以把 Oracle 作为基础架构,将数据库应用与数据挖掘无缝地集成。使用数据挖掘的典型应用有呼叫中心,ATM,ERM,及业务计划应用(business planning application)等。
160 By eliminating the need for extracting data into specialized tools and then importing the results back into the database, you can save significant amounts of time. In addition, by having the data and the data model in the same location (an Oracle database), there is no need to export the model as code.
由于无需将数据抽取到专门工具中,再将结果导入回数据库,用户可以节省大量时间。此外,由于数据与数据模型位于同一位置(Oracle 数据库),也就无需将模型导出为代码。
161 Data mining functions such as model building, testing, and scoring are provided through a Java API.
建模,测试,评分(scoring)等数据挖掘功能是通过 Java API 提供的。
162 Oracle Data Mining supports the following algorithms:
  • For classification, Naive Bayes, Adaptive Bayes Networks, and Support Vector Machines (SVM)
  • For regression, Support Vector Machines
  • For clustering, k-means and O-Cluster
  • For feature extraction, Non-Negative Matrix Factorization (NMF)
  • For sequence matching and annotation, BLAST
Oracle Data Mining 支持以下算法:
  • 分类(classification)算法,Naive Bayes,Adaptive Bayes Networks,及 Support Vector Machines(SVM)
  • 回归分析(regression)算法,Support Vector Machines
  • 聚类(clustering)算法,k-means 及 O-Cluster
  • 特征抽取算法(feature extraction),Non-Negative Matrix Factorization(NMF)
  • 序列比对与解释(sequence matching and annotation)算法,BLAST
163 ODM also includes several feature and performance enhancements.
ODM 中还提供了一些其他功能及性能增强特性。
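Naive Bayes, the first classification algorithm listed above, can be illustrated with a minimal pure-Python sketch (for intuition only: ODM exposes these algorithms through its Java API, and the toy churn data, smoothing scheme, and names here are invented):
上文列出的第一种分类算法 Naive Bayes 可用如下极简的纯 Python 草图说明(仅用于帮助理解:ODM 通过 Java API 提供这些算法,本例中的客户流失数据,平滑方式及名称均为假设):

```python
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (feature_tuple, label). Returns the counts needed
    for P(label) and P(feature_value | label)."""
    labels = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return labels, feature_counts

def classify(model, features):
    """Pick the label maximizing P(label) * prod P(value | label),
    with simple add-one smoothing for unseen feature values."""
    labels, feature_counts = model
    total = sum(labels.values())
    best, best_score = None, 0.0
    for label, count in labels.items():
        score = count / total
        for i, value in enumerate(features):
            c = feature_counts[(label, i)]
            score *= (c[value] + 1) / (sum(c.values()) + len(c) + 1)
        if score > best_score:
            best, best_score = label, score
    return best

# Toy churn data: (contract_type, usage) -> stays / leaves.
data = [(("monthly", "low"), "leaves"), (("monthly", "low"), "leaves"),
        (("yearly", "high"), "stays"), (("yearly", "low"), "stays")]
model = train(data)
assert classify(model, ("monthly", "low")) == "leaves"
```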

See Also:



A 翻译不确定的词汇(格式:黄色背景 )  


B 翻译不确定的Oracle/数据库词汇(格式:

[063] Transportable tablespaces
[068] Table functions
[068] pipelined
[069] incremental pipelining
[073] direct-path
[095] SQL Access Advisor
[133] Lag/lead analysis
[154] recovery catalog
[159] model scoring

C 翻译不确定的句子(格式:

[056] We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
[064] Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data.
[068] Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java.
[069] Returning result sets incrementally for further processing as soon as they are created. This is called incremental pipelining
[073] External tables enable the pipelining of the loading phase with the transformation phase.
[073] For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table.
[098] Efficient maintenance during parallel DML and loads
[103] Bitmap indexing also supports parallel create indexes and concatenated indexes.
[119] Given a query plan for a SQL query, the parallel execution coordinator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators.
[127] We use the term dimension to mean any category used in specifying questions.
[139] The core query engine can work with unlimited quantities of data.
[160] In addition, by having the data and the data model in the same location (an Oracle database), there is no need to export the model as code.

D 注释性的文字(格式:


E 未完成的链接


F Oracle学习问题(格式:
1、看看 incremental pipelining 是怎么回事?
[069] incremental pipelining

2、Figure 16-4 Pipelined Data Transformation
[074] 图错了?

[125] Aggregate at increasing levels of aggregation, from the most detailed up to a grand total

[125] Calculate all possible combinations of aggregations with a single statement

[130] Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.

[134] N-tiles

translator: zw1840@hotmail.com