16 Business Intelligence
001 |
This chapter describes some of the basic ideas in business intelligence. |
本章介绍业务智能(business intelligence)领域的基础概念。 |
002 |
This chapter contains the following topics:
|
本章包含以下主题: |
003 |
Introduction to Data Warehousing and Business Intelligence |
16.1 数据仓库及业务智能简介 |
004 |
A data warehouse is a relational database that is designed for query and
analysis rather than for transaction processing. It usually contains
historical data derived from transaction data, but it can include data
from other sources. It separates analysis workload from transaction
workload and enables an organization to consolidate data from several
sources. |
数据仓库是一个专为查询与分析(query and analysis)——而非事务处理(transaction processing)——而设计的关系型数据库。
数据仓库中通常包含来自事务处理系统的历史数据,还可能包含来自其他数据源的数据。数据仓库的作用是将数据分析负载与事务处理负载分离,同时还能将用户
拥有的多个数据源融合为一个整体。 |
005 |
In addition to a relational database, a data warehouse environment
includes an extraction, transportation, transformation, and loading
(ETL) solution, an online analytical processing (OLAP) engine, client
analysis tools, and other applications that manage the process of
gathering data and delivering it to business users. |
数据仓库系统中除了关系型数据库之外,还应包含数据抽取(extraction),传输(transportation),转换(transformation),加载(loading)方案(即
ETL),联机分析处理(online analytical
processing,OLAP)引擎,客户端分析工具,以及其他用于管理数据收集与数据发布过程的应用程序。 |
006 |
Characteristics of Data Warehousing |
16.1.1 数据仓库特性 |
007 | A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: | 在介绍数据仓库时,通常都会引用 William Inmon 所阐述的数据仓库应具备的特性: |
008 |
Subject Oriented |
16.1.1.1 面向主题的 |
009 |
Data warehouses are designed to help you analyze data. For example, to
learn more about your company's sales data, you can build a warehouse
that concentrates on sales. Using this warehouse, you can answer
questions like "Who was our best customer for this item last year?" This
ability to define a data warehouse by subject matter, sales in this
case, makes the data warehouse subject oriented. |
数据仓库的作用是帮助用户分析数据。例如,如果需要深入了解企业的销售数据,就可以建立一个以销售数据为中心的数据仓库。利用此数据仓库,分析人员能够回答类似“谁是去年购买某类产品最多的客户?”等问题。由于可以依据主题(在此例中为销售)来定义数据仓库,所以说数据仓库是面向主题的(Subject Oriented)。 |
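As an illustration, the question above could be answered with a simple aggregate query. This is only a sketch against a hypothetical `sales(cust_id, prod_id, sale_date, amount)` table — the names are not from this chapter:

```sql
-- Hypothetical schema: sales(cust_id, prod_id, sale_date, amount).
-- "Who was our best customer for this item last year?"
SELECT *
FROM  (SELECT cust_id, SUM(amount) AS total_amount
       FROM   sales
       WHERE  prod_id   =  :item_id
       AND    sale_date >= DATE '2003-01-01'
       AND    sale_date <  DATE '2004-01-01'
       GROUP  BY cust_id
       ORDER  BY total_amount DESC)
WHERE  ROWNUM = 1;   -- keep only the top-ranked customer
```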
010 |
Integrated |
16.1.1.2 统一的 |
011 |
Integration is closely related to subject orientation. Data warehouses
must put data from disparate sources into a consistent format. They must
resolve such problems as naming conflicts and inconsistencies among
units of measure. When they achieve this, they are said to be
integrated. |
数据仓库的统一性(integration)与其面向主题性(subject
orientation)是紧密相关的。数据仓库需要将来自不同数据源的数据整合为统一的形式。在向数据仓库中加载数据时,还要解决命名冲突,度量单位不一致等问题。当上述问题都解决后,数据仓库就可以
被看作是统一的(integrated)。 |
012 |
Nonvolatile |
16.1.1.3 永久的 |
013 |
Nonvolatile means that, once entered into the warehouse, data should not
change. This is logical because the purpose of a warehouse is to enable
you to analyze what has occurred. |
永久性(nonvolatile)是指数据进入数据仓库后就不应被修改或删除。永久性的意义在于,数据仓库能够供用户分析历史上发生了什么。 |
014 |
Time Variant |
16.1.1.4 历史的 |
015 |
In order to discover trends in business, analysts need large amounts of
data. This is very much in contrast to online transaction processing
(OLTP) systems, where performance requirements demand that historical
data be moved to an archive. A data warehouse's focus on change over
time is what is meant by the term time variant. |
为了发现业务中存在的趋势,分析用户需要使用大量数据。这与联机事务处理(online transaction
processing,OLTP)系统截然相反,OLTP
系统对性能要求较高,历史数据必须移动到归档中。由于数据仓库关注数据在时间上的变化,因此称其是历史的(time variant)。 |
016 |
Typically, data flows from one or more online transaction processing
(OLTP) databases into a data warehouse on a monthly, weekly, or daily
basis. The data is normally processed in a staging file before
being added to the data warehouse. Data warehouses commonly range in
size from tens of gigabytes to a few terabytes. Usually, the vast
majority of the data is stored in a few very large fact tables. |
在一个典型的数据仓库系统中,数据每月,每周或每日从一个或多个 OLTP 数据库汇入数据仓库。数据在加载到数据仓库之前通常会存储在中转文件(staging
file)内,并进行适当的处理。数据仓库的容量一般在数十 GB 至数 TB 之间。通常,绝大部分数据存储在少数几个很大的事实表(fact
table)中。 |
017 |
Differences Between Data Warehouse and OLTP Systems |
16.1.2 数据仓库系统与 OLTP 系统的区别 |
018 |
Data warehouses and OLTP systems have very different requirements. Here
are some examples of differences between typical data warehouses and
OLTP systems: |
数据仓库系统与 OLTP 系统的需求差别很大。以下介绍典型的数据仓库系统与 OLTP 系统的区别: |
019 |
Workload |
16.1.2.1 负载 |
020 |
Data warehouses are designed to accommodate ad hoc queries. You might
not know the workload of your data warehouse in advance, so a data
warehouse should be optimized to perform well for a wide variety of
possible query operations. |
设计数据仓库时要满足即席查询(ad hoc query)的需要。开发者可能无法预先确定数据仓库的负载,因此应对数据仓库进行优化,使其在各种可能的查询操作下都有良好表现。 |
021 |
OLTP systems support only predefined operations. Your applications might
be specifically tuned or designed to support only these operations. |
而 OLTP 系统只支持预定义的操作(predefined operation)。用户的应用系统可以进行有针对性的调优,或只提供特定的操作功能。 |
022 |
Data Modifications |
16.1.2.2 数据变更 |
023 |
A data warehouse is updated on a regular basis by the ETL process (run
nightly or weekly) using bulk data modification techniques. The end
users of a data warehouse do not directly update the data warehouse. |
数据仓库的数据由采用批量数据变更方式(bulk data modification technique)的 ETL 过程周期性地(每天夜间或每周)进行更新。数据仓库的终端用户(end
user)不会直接更新数据仓库中的数据。 |
024 |
In OLTP systems, end users routinely issue individual data modification
statements to the database. The OLTP database is always up to date, and
reflects the current state of each business transaction. |
而在 OLTP 系统中,终端用户会经常性地向数据库提交进行数据变更的语句。OLTP 数据库总是在不断地更新,以反映事务的最新状态。 |
025 |
Schema Design |
16.1.2.3 模式设计 |
026 |
Data warehouses often use denormalized or partially denormalized schemas
(such as a star schema) to optimize query performance. |
数据仓库通常采用反范式模式(denormalized schema)或部分反范式模式(partially denormalized
schema)(例如星型模式(star schema))来优化查询性能。 |
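A minimal star schema might look as follows; all table and column names are illustrative. Note how the dimension tables are deliberately denormalized (for example, `region` is stored directly in `customers` rather than in its own table):

```sql
-- Dimension tables: small, denormalized, one row per business entity.
CREATE TABLE customers (
  cust_id   NUMBER PRIMARY KEY,
  cust_name VARCHAR2(100),
  region    VARCHAR2(30)          -- denormalized: no separate region table
);
CREATE TABLE products (
  prod_id   NUMBER PRIMARY KEY,
  prod_name VARCHAR2(100),
  category  VARCHAR2(30)
);
-- Fact table: very large, keyed by the dimensions.
CREATE TABLE sales (
  cust_id   NUMBER REFERENCES customers,
  prod_id   NUMBER REFERENCES products,
  sale_date DATE,
  amount    NUMBER(12,2)
);
```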
027 |
OLTP systems often use fully normalized schemas to optimize
update/insert/delete performance, and to guarantee data consistency. |
而 OLTP 系统通常采用范式模式(normalized schema)以优化更新/插入/删除等操作的性能,并确保数据一致性。 |
028 |
Typical Operations |
16.1.2.4 典型操作 |
029 |
A typical data warehouse query scans thousands or millions of rows. For
example, "Find the total sales for all customers last month." |
一个典型的数据仓库查询所扫描数据行的数量级通常是数千,甚至是数百万。例如,“找出所有客户上个月的总销售量”。 |
030 |
A typical OLTP operation accesses only a handful of records. For
example, "Retrieve the current order for this customer." |
而一个典型的 OLTP 操作只需访问个别的记录。例如,“获得某客户的当前订单”。 |
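The contrast between the two workloads can be sketched with two queries against hypothetical tables:

```sql
-- Data warehouse style: scan and aggregate a large fraction of the table.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -1)
AND    sale_date <  TRUNC(SYSDATE, 'MM');

-- OLTP style: fetch a handful of rows through an index on the key.
SELECT order_id, status, order_total
FROM   orders
WHERE  cust_id = :cust_id
AND    status  = 'OPEN';
```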
031 |
Historical Data |
16.1.2.5 历史数据 |
032 |
Data warehouses usually store many months or years of data. This is to
support historical analysis. |
数据仓库内通常会存储数月甚至数年的数据。这些数据被用于历史分析。 |
033 |
OLTP systems usually store data from only a few weeks or months. The
OLTP system stores only historical data as needed to successfully meet
the requirements of the current transaction. |
而 OLTP 系统内通常只存储几周或几个月的数据。OLTP 系统只存储当前事务所需的历史数据。 |
034 |
Data Warehouse Architecture |
16.1.3 数据仓库体系结构 |
035 | Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are: | 数据仓库的体系结构是由用户的实际情况决定的。下面介绍三种常见的体系结构: |
036 |
Data Warehouse Architecture (Basic) |
16.1.3.1 数据仓库体系结构(基本型) |
037 |
Figure 16-1 shows a simple
architecture for a data warehouse. End users directly access data
derived from several source systems through the data warehouse. |
图 16-1
显示了一个最基本的数据仓库的体系结构。数据仓库的数据来自多个数据源,终端用户直接访问数据仓库。 |
038 |
Figure 16-1 Architecture of
a Data Warehouse |
图 16-1 数据仓库体系结构 |
040 |
|
图 16-1 显示了一个包含元数据,原始数据,及汇总数据的数据仓库。数据仓库存储来自多个数据源(操作型数据库及平面文件)的数据,并将数据提供给终端用户(用于分析,报表,及挖掘)。 |
041 |
In Figure 16-1, the metadata and
raw data of a traditional OLTP system is present, as is an additional
type of data, summary data. Summaries are very valuable in data
warehouses because they pre-compute long operations in advance. For
example, a typical data warehouse query is to retrieve something like
August sales. |
在
图 16-1
中,数据仓库内存储了来自传统 OLTP 系统的原始数据(raw data)及元数据(metadata),同时还有汇总数据(summary
data)。汇总数据在数据仓库中十分必要,因为其中预先计算了需要长时间执行才能得到结果的操作。例如,一个典型的数据仓库查询可能会获取八月份的总销售情况。 |
042 |
Summaries in Oracle are called materialized views. |
在 Oracle 中可以使用物化视图(materialized view)存储汇总数据。 |
043 |
Data Warehouse Architecture (with a Staging Area) |
16.1.3.2 数据仓库体系结构(使用中转区) |
044 |
In Figure 16-1, you need to clean
and process your operational data before putting it into the warehouse.
You can do this programmatically, although most data warehouses use a
staging area instead. A staging area simplifies building summaries and
general warehouse management. Figure
16-2 illustrates this typical architecture. |
在
图 16-1
所示的系统中,用户在将操作型数据(operational
data)放入数据仓库之前,首先要对数据进行清洗及其他处理。上述过程可以完全在数据抽取程序中完成,但大多数数据仓库系统都会利用中转区(staging
area)来进行此类操作。中转区能够简化计算汇总数据的过程,并使数据仓库更易于管理。图 16-2
显示了一个使用中转区的数据仓库系统。 |
045 |
Figure 16-2 Architecture of
a Data Warehouse with a Staging Area |
图 16-2 使用中转区的数据仓库体系结构 |
047 |
|
图 16-2 显示了一个包含元数据,原始数据,及汇总数据的数据仓库。数据仓库存储来自多个数据源(操作型数据库及平面文件)的数据,并将数据提供给终端用户(用于分析,报表,及挖掘)。来自数据源的操作型数据首先进入中转区。 |
048 |
Data Warehouse Architecture (with a Staging Area and Data Marts) |
16.1.3.3 数据仓库体系结构(使用中转区及数据集市) |
049 |
Although the architecture in Figure
16-2 is quite common, you might want to customize your warehouse's
architecture for different groups within your organization. |
图 16-2
所示的系统是较常见的数据仓库体系结构。但有时开发者还需要为组织内部的不同用户群定制特殊的数据仓库。 |
050 |
Do this by adding data marts, which are systems designed for a
particular line of business. Figure
16-3 illustrates an example where purchasing, sales, and inventories
are separated. In this example, a financial analyst might want to
analyze historical data for purchases and sales. |
这种定制是通过数据集市(data mart)实现的,数据集市是为不同业务而单独设计的数据仓库系统。在
图 16-3
所示的系统中,采购,销售,库存等系统是相互分离的。在这个例子中,专注于财务的分析用户只需访问采购及销售系统中的历史数据。 |
051 |
Figure 16-3 Architecture of
a Data Warehouse with a Staging Area and Data Marts |
图 16-3 使用中转区及数据集市的数据仓库体系结构 |
053 |
|
图 16-3 显示了一个包含元数据,原始数据,及汇总数据的数据仓库。数据仓库存储来自多个数据源(操作型数据库及平面文件)的数据,并将数据提供给终端用户(用于分析,报表,及挖掘)。来自数据源的操作型数据首先进入中转区。终端用户使用的数据根据业务进行分割,例如销售及采购。 |
054 |
See Also: |
另见: |
055 |
Overview of Extraction, Transformation, and Loading (ETL) |
16.2 抽取,转换,及加载(ETL)概述 |
056 |
You need to load your data warehouse regularly so that it can serve its
purpose of facilitating business analysis. To do this, data from one or
more operational systems needs to be extracted and copied into the
warehouse. The process of extracting data from source systems and
bringing it into the data warehouse is commonly called ETL, which
stands for extraction, transformation, and loading. The acronym ETL is
perhaps too simplistic, because it omits the transportation phase and
implies that each of the other phases of the process is distinct.
We
refer to the entire process, including data loading, as ETL. You should
understand that ETL refers to a broad process, and not three
well-defined steps. |
用户需要定期地向数据仓库中加载数据,才能使其发挥业务分析的作用。数据需要从一个或多个操作型系统(operational
system)中抽取并复制到数据仓库中。从源系统(source system)抽取数据再存储到数据仓库的过程被称为 ETL,即抽取(extraction),转换(transformation),及加载(loading)的首字母缩写。
事实上,用
ETL 描述此过程或许过于简化,因为其中没有体现出传输(transportation)阶段,并且暗示各个阶段是彼此分离的。本文将包括数据加载在内的整个过程统称为
ETL。用户应意识到 ETL 是一个广义的过程,而不只是 3 个严格定义的步骤。 |
057 |
The methodology and tasks of ETL have been well known for many years,
and are not necessarily unique to data warehouse environments: a wide
variety of proprietary applications and database systems are the IT
backbone of any enterprise. Data has to be shared between applications
or systems in an effort to integrate them, giving at least two applications
the same picture of the world. This data sharing was mostly addressed by
mechanisms similar to what we now call ETL. |
ETL 过程采用的方法论及其中包含的主要工作其实早已为开发者所熟知,且 ETL 并非只应用于数据仓库系统。一个企业的 IT
环境通常由多个数据库系统及定制的应用系统构成。这些系统之间应该可以共享数据,或进行相互的集成,不同系统中的数据应该能够反映相同的业务事实。上述的数据共享工作通常是由与
ETL 类似的机制实现的。 |
058 |
Data warehouse environments face the same challenge with the additional
burden that they not only have to exchange but to integrate, rearrange
and consolidate data over many systems, thereby providing a new unified
information base for business intelligence. Additionally, the data
volume in data warehouse environments tends to be very large. |
与数据共享工作相比,数据仓库系统除了需要交换数据,还需集成(integrate),重构(rearrange),融合(consolidate)来自多个源系统的数据,从而为业务智能提供一个统一的信息基础。此外,数据仓库系统中的数据量通常十分巨大。 |
059 |
What happens during the ETL process? During extraction, the desired data
is identified and extracted from many different sources, including
database systems and applications. Very often, it is not possible to
identify the specific subset of interest, therefore more data than
necessary has to be extracted, so the identification of the relevant
data will be done at a later point in time. Depending on the source
system's capabilities (for example, operating system resources), some
transformations may take place during this extraction process. The size
of the extracted data varies from hundreds of kilobytes up to gigabytes,
depending on the source system and the business situation. The same is
true for the time delta between two (logically) identical extractions:
the time span may vary between days/hours and minutes to near real-time.
Web server log files for example can easily become hundreds of megabytes
in a very short period of time. |
ETL
过程究竟包含哪些具体工作呢?在抽取阶段,需要从不同的数据源(数据库系统或应用程序)中定位并抽取分析所需的数据。但是准确地定位感兴趣的数据子集比较困难,因此在抽取
阶段通常会抽取多余的数据,定位有用数据的工作将在之后的步骤进行。有些转换工作也是在抽取阶段完成的,这取决于数据源系统的能力(例如,操作系统资源)。抽取的数据量是由源系统及业务需求决定的,可能只有数百 KB,也可能高达数
GB。两次(逻辑上)完全相同的抽取之间的间隔同样由源系统及业务需求决定,有可能为数天,数小时,数分钟,甚至接近实时。例如,Web
服务器的日志数据在短时间内就可能积累数百 MB。 |
060 |
After extracting data, it has to be physically transported to the target
system or an intermediate system for further processing. Depending on
the chosen way of transportation, some transformations can be done
during this process, too. For example, a SQL statement which directly
accesses a remote target through a gateway can concatenate two columns
as part of the SELECT statement. |
数据被抽取后,需要传输(physically transport)到目标系统(target
system)或中间系统(intermediate
system)中进行进一步处理。在传输的过程中也可以进行某些转换工作,这取决于用户选择的传输方式。例如,使用 SQL
语句通过网关(gateway)直接访问远程目标系统时,可以在 SELECT
语句中将两列数据合并。 |
061 |
If any errors occur during loading, an error is logged and the operation
can continue. |
如果在加载中发生错误,应该将错误记录到日志中,并使加载操作继续执行。 |
062 |
Transportable Tablespaces |
16.2.1 可移动表空间 |
063 |
Transportable tablespaces are the fastest way for moving large volumes
of data between two Oracle databases. You can transport tablespaces
between different computer architectures and operating systems. |
可移动表空间(transportable tablespace)是在两个 Oracle
数据库间移动大量数据时最迅速的方式。用户可以在操作系统或计算机体系结构不同的系统间移动表空间。 |
064 |
Previously, the most scalable data transportation mechanisms relied on
moving flat files containing raw data. These mechanisms required that
data be unloaded or exported into files from the source database. Then,
after transportation, these files were loaded or imported into the
target database. Transportable tablespaces entirely bypass the unload
and reload steps. |
在可移动表空间技术出现之前,将原始数据(raw data)存储在平面文件(flat
file)中再进行移动是最具伸缩性(scalable)的数据传输机制。这种方法需要首先将数据从源数据库中导出(unload or
export)到平面文件。经过传输(transportation)后,再将文件中的数据导入(load or
import)目标数据库。而使用可移动表空间可以完全跳过导入及导出的步骤。 |
065 |
Using transportable tablespaces, Oracle data files (containing table
data, indexes, and almost every other Oracle database object) can be
directly transported from one database to another. Furthermore, like
import and export, transportable tablespaces provide a mechanism for
transporting metadata in addition to transporting data. |
使用可移动表空间功能时,Oracle 数据文件(包含表数据,索引,及几乎所有类型的 Oracle
数据库对象)可以直接在数据库间传输。此外,可移动表空间功能除了能够进行数据传输,还能进行元数据传输,这同 import 及 export
的元数据传输功能类似。 |
066 |
The most common applications of transportable tablespaces in data
warehouses are in moving data from a staging database to a data
warehouse, or in moving data from a data warehouse to a data mart. |
在数据仓库系统中,可移动表空间功能最常见的应用是将数据从中转数据库(staging
database)移动到数据仓库,或将数据从数据仓库移动到数据集市。 |
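The transport itself follows a fixed sequence. The sketch below assumes a tablespace named `ts_sales` and a Data Pump directory object `dp_dir`; all names and paths are illustrative:

```sql
-- 1. On the source database, make the tablespace read-only:
ALTER TABLESPACE ts_sales READ ONLY;

-- 2. Export only its metadata (run from the operating system shell):
--      expdp system DIRECTORY=dp_dir DUMPFILE=ts_sales.dmp
--            TRANSPORT_TABLESPACES=ts_sales

-- 3. Copy the datafiles and the dump file to the target system, then
--    plug the tablespace in (again from the shell):
--      impdp system DIRECTORY=dp_dir DUMPFILE=ts_sales.dmp
--            TRANSPORT_DATAFILES='/u01/oradata/ts_sales01.dbf'

-- 4. On the target database, make the tablespace writable if needed:
ALTER TABLESPACE ts_sales READ WRITE;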
067 |
Table Functions |
16.2.2 表函数 |
068 |
Table functions provide the support for pipelined and parallel execution
of transformations implemented in PL/SQL, C, or Java. Scenarios as
mentioned earlier can be done without requiring the use of intermediate
staging tables, which interrupt the data flow through various
transformation steps. |
用户可以利用表函数(table
function)管道化(pipelined)及并行化(parallel)地执行数据转换(transformation),表函数可以由
PL/SQL,C,或 Java 实现。本章前面讲述的某些抽取案例中使用了起中间作用的中转表(staging
table),但这会使数据在各个转换步骤间出现停顿,而采用表函数则能够避免这种停顿。 |
069 |
A table function is defined as a function that can produce a set of rows
as output. Additionally, table functions can take a set of rows as
input. Table functions extend database functionality by allowing:
|
表函数指能够返回一个结果集的函数。此外,表函数也能够把结果集作为输入参数。表函数在以下几个方面扩展了数据库的能力:
|
070 |
Table functions can be defined in PL/SQL using a native PL/SQL
interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI). |
表函数可以使用本地 PL/SQL 接口(native PL/SQL interface)在 PL/SQL 中定义,也可以使用 Oracle
Data Cartridge Interface(ODCI)在 Java 或 C 中定义。 |
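A minimal pipelined table function, sketched in PL/SQL with illustrative names: it consumes a cursor of rows and emits transformed rows without an intermediate staging table.

```sql
CREATE TYPE num_row AS OBJECT (n NUMBER);
/
CREATE TYPE num_tab AS TABLE OF num_row;
/
CREATE OR REPLACE FUNCTION doubled (src SYS_REFCURSOR)
  RETURN num_tab PIPELINED
IS
  v NUMBER;
BEGIN
  LOOP
    FETCH src INTO v;
    EXIT WHEN src%NOTFOUND;
    PIPE ROW (num_row(v * 2));   -- transform each input row and stream it out
  END LOOP;
  RETURN;
END;
/
-- Rows flow straight from the query into the function and back out:
SELECT * FROM TABLE(doubled(CURSOR(SELECT amount FROM sales)));
```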
071 |
External Tables |
16.2.3 外部表 |
072 |
External tables let you use external data as a virtual table that can be
queried and joined directly and in parallel without requiring the
external data to be first loaded in the database. You can then use SQL,
PL/SQL, and Java to access the external data. |
外部表(external
table)功能使用户可以将外部数据虚拟为一个表,并可直接对此虚拟表进行并行地查询与联接操作,而无需将外部数据导入数据库。建立了外部表后,用户可以使用 SQL,PL/SQL,及 Java 访问外部数据。 |
073 |
External tables enable the pipelining of the loading phase with the
transformation phase. The transformation process can be merged with the
loading process without any interruption of the data streaming. It is no
longer necessary to stage the data inside the database for further
processing inside the database, such as comparison or transformation.
For example, the conversion functionality of a conventional load can be
used for a direct-path
INSERT AS SELECT
statement in conjunction with the
SELECT
from an external table. Figure 16-4
illustrates a typical example of pipelining. |
采用外部表后,加载及转换两个阶段即可实现管道化。转换过程及加载过程可以相互融合,其间的数据流无需中断。用户无需首先将数据暂存在数据库内就可直接对其进行处理(例如比较或转换)。例如,用户可以使用
direct-path 的 INSERT AS SELECT
语句,使用 SELECT
语句访问外部表,并在加载的同时实现转换。图 16-4
显示了一个管道化(pipelining)的典型例子。 |
074 |
Figure 16-4 Pipelined Data
Transformation |
图 16-4 管道化的数据转换 |
076 |
|
图 16-4 显示了四个维表(产品,客户,时间,及渠道)与名为销售的事实表关联。 |
077 |
The main difference between external tables and regular tables is that
externally organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE)
are possible and no indexes can be created on them. |
外部表与常规表的区别在于外部表是只读的。在外部表上不能执行 DML 操作(UPDATE/INSERT/DELETE),也不能在其上创建索引。 |
078 |
External tables are a complement to SQL*Loader and are especially useful
for environments where the complete external source has to be joined
with existing database objects and transformed in a complex manner, or
where the external data volume is large and used only once. SQL*Loader,
on the other hand, might still be the better choice for loading of data
where additional indexing of the staging table is necessary. This is
true for operations where the data is used in independent complex
transformations or the data is only partially used in further
processing. |
外部表可以作为 SQL*Loader
的补充,当存在以下情况,例如全部外部数据需要与已有数据库对象结合并进行复杂的转换,或外部数据的数据量很大但只需被使用一次,使用外部表就能很好地解决问题。如果必须在中转表上创建索引,那么还是应使用
SQL*Loader 加载数据。如果外部数据自身需要进行独立的复杂转换,或只有部分数据需要进行进一步处理,也应先将数据加载到数据库中。 |
079 |
Table Compression |
16.2.4 表压缩 |
080 |
You can save disk space by compressing heap-organized tables. A typical
type of heap-organized table you should consider for table compression
is partitioned tables. |
用户可以通过压缩堆表(heap-organized table)来节省磁盘空间。分区表(partitioned
table)就是一种典型的可以考虑采用表压缩(table compression)的堆表。 |
081 |
To reduce disk use and memory use (specifically, the buffer cache), you
can store tables and partitioned tables in a compressed format inside
the database. This often leads to a better scaleup for read-only
operations. Table compression can also speed up query execution. There
is, however, a slight cost in CPU overhead. |
为了减少磁盘使用及内存使用(尤其是数据库缓存(buffer cache)),用户可以将数据库内的表及分区表以压缩形式存储。这有助于提高只读操作的性能。表压缩能够提高查询执行的速度,但会略微增加
CPU 开销。 |
082 |
Table compression should be used with highly redundant data, such as
tables with many foreign keys. You should avoid compressing tables with
much update or other DML activity. Although compressed tables or
partitions are updatable, there is some overhead in updating these
tables, and high update activity may work against compression by causing
some space to be wasted. |
当数据表内存在大量冗余数据时适合使用表压缩,例如表内存储了大量外键时。用户应该避免对可能存在大量 DML
操作的表进行压缩。尽管被压缩的表或分区是可更新的,但是更新时存在额外开销,对更新操作较多的表进行压缩可能反而会导致空间浪费。 |
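Declaring compression is a single keyword at the table (or partition) level; a sketch for a range-partitioned fact table, with illustrative names:

```sql
CREATE TABLE sales_hist (
  sale_date DATE,
  cust_id   NUMBER,
  amount    NUMBER(12,2)
)
COMPRESS                                -- store all partitions compressed
PARTITION BY RANGE (sale_date) (
  PARTITION p2003 VALUES LESS THAN (DATE '2004-01-01'),
  PARTITION p2004 VALUES LESS THAN (DATE '2005-01-01')
);
```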
083 |
See Also: |
另见: |
084 |
Change Data Capture |
16.2.5 变化数据捕捉 |
085 |
Change Data Capture efficiently identifies and captures data that has
been added to, updated, or removed from Oracle relational tables, and
makes the change data available for use by applications. |
变化数据捕捉(Change Data Capture)功能能够识别并捕捉到 Oracle
关系数据表内新增,更新,及被删除的数据,并使这些数据可以被其他应用所使用。 |
086 |
Oftentimes, data warehousing involves the extraction and transportation
of relational data from one or more source databases into the data
warehouse for analysis. Change Data Capture quickly identifies and
processes only the data that has changed, not entire tables, and makes
the change data available for further use. |
通常,数据仓库需要从一个或多个源数据库中抽取数据,再传输到数据仓库中供分析使用。采用变化数据捕捉功能后,用户能够迅速地识别并处理修改过(而非整个表)的数据以供分析
使用。 |
087 |
Change Data Capture does not depend on intermediate flat files to stage
the data outside of the relational database. It captures the change data
resulting from INSERT,
UPDATE, and DELETE
operations made to user tables. The change data is then stored in a
database object called a change table, and the change data is made
available to applications in a controlled way. |
变化数据捕捉功能无需依赖关系数据库外部的中间文件来中转数据。此功能根据对数据表的 INSERT,UPDATE,及
DELETE 操作来捕捉变化的数据。变化数据将存储在被称为变化表(change
table)的数据库对象中,这些数据能够以受控的方式提供给有需要的应用程序。 |
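In the releases this chapter describes, a change table is created through the `DBMS_CDC_PUBLISH` package. The sketch below is illustrative only — schema, table, and change-set names are invented, and parameter details vary by release:

```sql
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'cdcadmin',        -- illustrative publisher schema
    change_table_name => 'sales_ct',
    change_set_name   => 'SYNC_SET',        -- synchronous change set
    source_schema     => 'sh',
    source_table      => 'sales',
    column_type_list  => 'cust_id NUMBER, amount NUMBER',
    capture_values    => 'both',            -- old and new values for UPDATEs
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'y',
    target_colmap     => 'y',
    options_string    => NULL);
END;
/
```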
088 |
See Also: |
另见: |
089 |
Overview of Materialized Views for Data Warehouses |
16.3 在数据仓库中使用物化视图 |
090 |
One technique employed in data warehouses to improve performance is the
creation of summaries. Summaries are special kinds of aggregate views
that improve query execution times by precalculating expensive joins and
aggregation operations prior to execution and storing the results in a
table in the database. For example, you can create a table to contain
the sums of sales by region and by product. |
在数据仓库系统中,可以通过创建汇总数据(summary)来提升性能。汇总数据是一种特殊的聚合视图(aggregate
view),通过在执行前预先计算开销较大的关联及聚合操作,并将结果存储在数据库的表中,从而缩短查询的执行时间。例如,用户可以创建一个表,保存按区域及产品汇总的销售数据。 |
091 |
The summaries or aggregates that are referred to in this book and in
literature on data warehousing are created in Oracle using a schema
object called a materialized view. Materialized views can perform
a number of roles, such as improving query performance or providing
replicated data. |
本书及数据仓库文献中所提到的汇总数据或聚合数据(aggregate),在 Oracle 中是通过物化视图(materialized view)
这种模式对象(schema
object)创建的。物化视图的用途有多种,例如提高查询性能或提供复制数据。 |
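For example, the sales-by-region summary mentioned above could be defined as a materialized view. A sketch against hypothetical `sales` and `customers` tables:

```sql
CREATE MATERIALIZED VIEW sales_by_region_mv
  BUILD IMMEDIATE            -- populate the summary now
  REFRESH ON DEMAND          -- refresh explicitly after each ETL run
  ENABLE QUERY REWRITE       -- let the optimizer substitute it into queries
AS
SELECT c.region, s.prod_id, SUM(s.amount) AS total_amount
FROM   sales s, customers c
WHERE  s.cust_id = c.cust_id
GROUP  BY c.region, s.prod_id;
```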
092 |
Previously, organizations using summaries spent a significant amount of
time and effort creating summaries manually, identifying which summaries
to create, indexing the summaries, updating them, and advising their
users on which ones to use. Summary management eased the workload of the
database administrator and meant that the user no longer needed to be
aware of the summaries that had been defined. The database administrator
creates one or more materialized views, which are the equivalent of a
summary. The end user queries the tables and views at the detail data
level. |
在物化视图出现之前,开发者若要使用汇总数据,需要花费大量时间进行手工管理,例如确定创建哪些汇总数据,对汇总数据建立索引,更新汇总数据,并将可用的汇总数据告知有需求的用户。而(采用物化视图的)汇总管理(summary
management)既减轻了数据库管理员的工作,又使用户无需关心与汇总数据有关的细节。管理员只需创建一个或多个起汇总数据作用的物化视图,而用户则直接在明细数据级查询表或视图。 |
093 |
The query rewrite mechanism in the Oracle database server automatically
rewrites the SQL query to use the summary tables. This mechanism reduces
response time for returning results from the query. Materialized views
within the data warehouse are transparent to the end user or to the
database application. |
Oracle 数据库的查询重写(query rewrite)机制能够自动地重写查询 SQL
以使用汇总表。此机制减少了查询返回结果的响应时间。数据仓库内的物化视图对终端用户及数据库应用程序是透明的。 |
094 |
Although materialized views are usually accessed through the query
rewrite mechanism, an end user or database application can construct
queries that directly access the summaries. However, serious
consideration should be given to whether users should be allowed to do
this because any change to the summaries will affect the queries that
reference them. |
物化视图通常是通过查询重写机制访问的,但终端用户及数据库应用程序也可以构造直接访问汇总数据的查询。但是管理员应该慎重考虑是否允许用户这样做,因为对汇总数据的任何修改都将影响引用它们的查询。 |
095 |
To help you select from among the many possible materialized views in
your schema, Oracle provides a collection of materialized view analysis
and advisor functions and procedures in the
DBMS_ADVISOR package. Collectively, these functions are called
the SQL Access Advisor, and they are callable from any PL/SQL program.
The SQL Access Advisor recommends materialized views from a hypothetical
or user-defined workload or one obtained from the SQL cache. You can run
the SQL Access Advisor from Oracle Enterprise Manager or by invoking the
DBMS_ADVISOR package. |
为了帮助用户选择可用的物化视图,Oracle 在
DBMS_ADVISOR 包中提供了一系列用于分析及建议物化视图的函数与过程。上述函数与过程被统称为 SQL Access
Advisor(SQL 访问建议器),可以在 PL/SQL 程序中调用。SQL Access Advisor
能够根据假设的负载,用户定义的负载,或从 SQL 缓存(SQL cache)中获得的负载来建议如何使用物化视图。用户可以在 Oracle
Enterprise Manager 里运行 SQL Access Advisor,也可以通过 DBMS_ADVISOR
包进行调用。 |
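For a single statement, the advisor can be driven through `DBMS_ADVISOR.QUICK_TUNE`; a sketch with an illustrative task name and query:

```sql
BEGIN
  DBMS_ADVISOR.QUICK_TUNE(
    DBMS_ADVISOR.SQLACCESS_ADVISOR,       -- which advisor to run
    'quick_sales_task',                   -- illustrative task name
    'SELECT c.region, SUM(s.amount)
       FROM sales s, customers c
      WHERE s.cust_id = c.cust_id
      GROUP BY c.region');                -- the statement to analyze
END;
/
-- Recommendations (e.g. materialized views to create) can then be
-- inspected in advisor views such as USER_ADVISOR_RECOMMENDATIONS.
```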
096 |
See Also: |
另见: |
097 |
Overview of Bitmap Indexes in Data Warehousing |
16.4 在数据仓库中使用位图索引 |
098 |
Bitmap indexes are widely used in data warehousing environments. The
environments typically have large amounts of data and ad hoc queries,
but a low level of concurrent DML transactions. For such applications,
bitmap indexing provides:
|
数据仓库应用(data warehousing application)的特点是数据量巨大,执行的多为即席查询(ad hoc
query),且并发 DML 事务较少。这种环境下使用位图索引(bitmap index)具备如下优势:
|
099 |
Fully indexing a large table with a traditional B-tree index can be
prohibitively expensive in terms of space because the indexes can be
several times larger than the data in the table. Bitmap indexes are
typically only a fraction of the size of the indexed data in the table. |
为一个大表建立传统的 B-tree index(平衡树索引)可能占用极大的存储空间,索引有可能比数据表还要大数倍。而一个位图索引(bitmap
index)所占的空间比被索引数据还要小得多。 |
100 |
An index provides pointers to the rows in a table that contain a given
key value. A regular index stores a list of rowids for each key
corresponding to the rows with that key value. In a bitmap index, a
bitmap for each key value replaces a list of rowids. |
索引的目标是为用户提供指向包含特定键值(key value)的数据行的指针。在常规的索引中,Oracle 将各行的键值及与此键值对应的一组
ROWID 存储在一起,从而实现了上述目标。而在位图索引(bitmap index)中,只需为每个键值存储一个位图(bitmap),而非一组
ROWID。 |
101 |
Each bit in the bitmap corresponds to a possible rowid, and if the bit
is set, it means that the row with the corresponding rowid contains the
key value. A mapping function converts the bit position to an actual
rowid, so that the bitmap index provides the same functionality as a
regular index. If the number of different key values is small, bitmap
indexes save space. |
位图(bitmap)中的每一位(bit)对应一个可能的 ROWID。如果某一位被置位(set),则表明与此位对应的 ROWID
所指向的行中包含此位图对应的键值(key value)。Oracle 通过一个映射函数(mapping function)将位的位置转化为实际的
ROWID,因此位图索引同样能实现常规索引的功能。当不同键值的数量较少时,位图索引能够节省存储空间。 |
102 |
Bitmap indexes are most effective for queries that contain multiple
conditions in the WHERE clause. Rows that
satisfy some, but not all, conditions are filtered out before the table
itself is accessed. This improves response time, often dramatically. A
good candidate for a bitmap index would be a gender column due to the
low number of possible values. |
如果一个查询的 WHERE
子句中包含多个条件,在查询中利用位图索引能够极大地提高效率。不满足全部条件的数据行可以在访问表数据之前被预先过滤掉。这通常会显著地缩短响应时间。
在类似于性别的数据列上适合创建位图索引,因为这样的列中可能的键值的数量较少。 |
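A sketch of the gender example above, together with a query whose predicates can be resolved by combining bitmaps before the table itself is touched (table and index names illustrative):

```sql
CREATE BITMAP INDEX cust_gender_bix ON customers (gender);
CREATE BITMAP INDEX cust_region_bix ON customers (region);

-- Both bitmaps are combined (ANDed) first; only matching rows are fetched.
SELECT COUNT(*)
FROM   customers
WHERE  gender = 'F'
AND    region = 'EAST';
```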
103 |
Parallel query and parallel DML work with bitmap indexes as they do with
traditional indexes. Bitmap indexing also supports parallel create
indexes and concatenated indexes. |
位图索引(bitmap index)如同常规索引一样,可以结合并行查询(parallel query)和并行 DML(parallel
DML)一起工作。Oracle 还支持并行地创建位图索引,以及创建复合位图索引。 |
104 |
See Also: |
另见: |
105 |
Overview of Parallel Execution |
16.5 并行执行概述 |
106 |
When Oracle runs SQL statements in parallel, multiple processes work
together simultaneously to run a single SQL statement. By dividing the
work necessary to run a statement among multiple processes, Oracle can
run the statement more quickly than if only a single process ran it.
This is called parallel execution or parallel processing. |
当 Oracle 并行地(parallel)执行 SQL 语句时,存在多个进程同时工作共同执行同一个 SQL
语句。将一个语句的工作负载分配给多个进程,与单进程相比,同样的语句能够被更快地执行。这被称为并行执行(parallel execution)或并行处理(parallel processing)。 |
107 |
Parallel execution dramatically reduces response time for data-intensive
operations on large databases typically associated with decision support
systems (DSS) and data warehouses. Symmetric multiprocessing (SMP),
clustered systems, and large-scale cluster systems gain the largest
performance benefits from parallel execution because statement
processing can be split up among many CPUs on a single Oracle system.
You can also implement parallel execution on certain types of online
transaction processing (OLTP) and hybrid systems. |
对于大型数据库中的大数据量操作,并行执行能够显著地缩短响应时间,这个特点在决策支持系统(decision support
systems,DSS)及数据仓库系统中表现得尤为突出。对称多处理系统(Symmetric multiprocessing,SMP),集群系统(clustered
system),及大型集群系统(large-scale clustered
system)能够从并行执行中获得最大的性能提升,因为在这样的系统平台上,一个 Oracle 系统的语句处理工作可以被分配到多个 CPU
上。用户也可以在某些联机事务处理系统(online transaction processing,OLTP)或混合系统(hybrid
system)上采用并行执行。 |
108 |
Parallelism is the idea of breaking down a task so that, instead
of one process doing all of the work in a query, many processes do part
of the work at the same time. An example of this is when 12 processes
handle 12 different months in a year instead of one process handling all
12 months by itself. The improvement in performance can be quite high. |
并行(parallelism)指将一个工作分解,多个进程同时执行查询工作的一部分,而非一个进程完成全部工作。例如,12
个进程分别处理 1 年 12 个月的数据,而非 1 个进程独自处理 12 个月的数据。并行带来的性能提升十分显著。 |
109 |
Parallel execution helps systems scale in performance by making optimal
use of hardware resources. If your system's CPUs and disk controllers
are already heavily loaded, you need to alleviate the system's load or
increase these hardware resources before using parallel execution to
improve performance. |
并行执行能够优化硬件资源的使用,从而使系统性能具备可伸缩性。如果系统的 CPU 或磁盘控制器(disk
controller)上的负载已经很重,用户必须首先减轻负载或增加硬件资源,才能使用并行执行来提升系统性能。 |
110 |
Some tasks are not well-suited for parallel execution. For example, many
OLTP operations are relatively fast, completing in mere seconds or
fractions of seconds, and the overhead of utilizing parallel execution
would be large, relative to the overall execution time. |
有些工作并不适合并行执行。例如,有些 OLTP
操作执行速度较快,能够在数秒甚至不到一秒的时间内完成,此时采用并行执行的额外开销与总执行时间相比就比较大。 |
111 |
See Also: |
另见: |
112 |
How Parallel Execution Works |
16.5.1 并行执行是如何工作的 |
113 |
When parallel execution is not used, a single server process performs
all necessary processing for the sequential execution of a SQL
statement. For example, to perform a full table scan (such as
SELECT * FROM emp), one process performs
the entire operation, as illustrated in
Figure 16-5. |
在未使用并行执行(parallel execution)时,只有一个服务进程(server process)完成了一个 SQL
语句的顺序执行所需的全部处理工作。例如,进行全表扫描(SQL 语句可能为
SELECT * FROM emp)时,一个进程完成了所有的工作,如
图 16-5 所示。 |
114 |
Figure 16-5 Serial Full
Table Scan |
图 16-5 串行全表扫描 |
116 |
Figure 16-6 illustrates several
parallel execution servers performing a scan of the table
emp. The table is divided dynamically (dynamic
partitioning) into load units called granules and each granule is
read by a single parallel execution server. The granules are generated
by the coordinator. Each granule is a range of physical blocks of the
table emp. The mapping of granules to
execution servers is not static, but is determined at execution time.
When an execution server finishes reading the rows of the table
emp corresponding to a granule, it gets
another granule from the coordinator if there are any granules
remaining. This continues until all granules are exhausted, in other
words, until the entire table emp has been
read. The parallel execution servers send results back to the parallel
execution coordinator, which assembles the pieces into the desired full
table scan. |
图 16-6 显示了多个并行执行服务器(parallel
execution server)对表 emp 进行扫描。表被动态地划分(即动态分区(dynamic
partitioning))为称作颗粒(granule)的负载单位(load
unit),每个颗粒由一个并行执行服务器负责读取。颗粒是由并行执行协调器(coordinator)划分的。每个颗粒都是
emp
表的一组物理数据块。颗粒与并行执行服务器之间的对应关系并不是静态的,而是在执行时决定的。当一个并行执行服务器完成了对
emp
表某一颗粒的读取后,能够从并行执行协调器获得剩余的颗粒继续工作。这个过程将持续直至所有颗粒全部完成,即整个
emp
表被读取完毕。每个并行执行服务器都将结果返回给并行执行协调器,后者将所有结果汇总为全表扫描的结果。 |
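上述并行全表扫描可以用一个示意性的查询来说明(假设存在 emp 表;PARALLEL 提示(hint)中的并行度 4 仅为示例):

```sql
-- 提示请求以并行度 4 扫描 emp 表:
-- 协调器将表动态划分为颗粒,由 4 个并行执行服务器分别读取,
-- 最终结果由协调器汇总返回
SELECT /*+ PARALLEL(emp, 4) */ *
FROM   emp;
```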
117 |
Figure 16-6 Parallel Full
Table Scan |
图 16-6 并行全表扫描 |
118 | ||
119 |
Given a query plan for a SQL query, the parallel execution coordinator
breaks down each operator in a SQL query into parallel pieces, runs them
in the right order as specified in the query, and then integrates the
partial results produced by the parallel execution servers executing the
operators. The number of parallel execution servers assigned to a single
operation is the degree of parallelism (DOP) for an operation. Multiple
operations within the same SQL statement all have the same degree of
parallelism. |
对于一个 SQL 查询的执行计划(query plan),并行执行协调器将查询中的每个操作分解为多个并行部分,按查询中指定的顺序执行,最后将执行这些操作的并行执行服务器所产生的部分结果进行整合。为一个操作所分配的并行执行服务器的数量被称为此操作的并行度(degree
of parallelism,DOP)。同一 SQL 语句内的所有操作的并行度均相同。 |
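并行度既可以在语句级通过提示指定,也可以在表级设置默认值。以下为示意性示例(emp 为假设的表名):

```sql
-- 将表的默认并行度设置为 4:
-- 此后针对该表的全表扫描等操作可自动以并行方式执行
ALTER TABLE emp PARALLEL 4;

-- 恢复为串行执行
ALTER TABLE emp NOPARALLEL;
```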
120 |
See Also: |
另见: |
121 |
Overview of Analytic SQL |
16.6 分析型 SQL 概述 |
122 |
Oracle has introduced many SQL operations for performing analytic
operations in the database. These operations include ranking, moving
averages, cumulative sums, ratio-to-reports, and period-over-period
comparisons. Although some of these calculations were previously
possible using SQL, this syntax offers much better performance. |
Oracle 提供了多种 SQL 功能,用于在数据库中进行分析操作。这些功能包括排名(ranking),移动平均值(moving
average),累积求和(cumulative
sum),百分比(ratio-to-report),以及同期比较(period-over-period
comparison)等。尽管有些计算以前的 SQL 也能实现,但新语法能够实现更好的性能。 |
123 | This section discusses: | 本节将讨论以下内容: |
124 |
SQL for Aggregation |
16.6.1 用于聚合运算的 SQL |
125 |
Aggregation is a fundamental part of data warehousing. To improve
aggregation performance in your warehouse, Oracle provides extensions to
the GROUP BY clause to make querying and
reporting easier and faster. Some of these extensions enable you to:
|
聚合运算(aggregation)是数据仓库系统的基本功能。为了提升数据仓库中聚合运算的性能,Oracle 对 GROUP BY
子句进行了扩展,使查询及报表生成更简单更迅速。这些扩展使用户可以进行以下操作:
|
126 |
These extension let you specify exactly the groupings of interest in the
GROUP BY clause. This allows efficient
analysis across multiple dimensions without performing a
CUBE operation.
Computing a full cube creates a heavy processing load, so replacing
cubes with grouping sets can significantly increase performance.
CUBE, ROLLUP,
and grouping sets produce a single result set that is equivalent to a
UNION ALL of differently grouped rows. |
这些扩展使用户可以在
GROUP BY 子句中精确指定其感兴趣的分组。这样无需执行
CUBE 操作就能够高效地进行多维分析(analysis across
multiple
dimensions)。完整计算一个立方体(cube)会带来很重的处理负载,因此以分组集(grouping sets)代替立方体能够显著地提升性能。
CUBE,ROLLUP,及分组集都会产生一个单一的结果集,其效果等同于用
UNION ALL 将按不同方式分组的数据行合并后的结果。 |
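以下是一个使用 GROUPING SETS 的示意性示例(假设存在包含 channel,region,amount 列的 sales 表):

```sql
-- 一条语句同时得到按 channel 汇总,按 region 汇总,
-- 以及总计()三组结果,而无需计算完整的立方体
SELECT channel, region, SUM(amount) AS total_amount
FROM   sales
GROUP  BY GROUPING SETS ((channel), (region), ());
```

其结果等同于三个分别按 channel,按 region,及不分组汇总的查询用 UNION ALL 合并,但只需扫描一次基表。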
127 |
To enhance performance, these extensions can be parallelized: multiple
processes can simultaneously run all of these statements. These
capabilities make aggregate calculations more efficient, thereby
enhancing database performance, and scalability. |
为了进一步提升系统性能,上述扩展操作可以被并行执行:即多个进程同时执行语句。这使聚合运算更为高效,也提高了数据库的性能及可伸缩性(scalability)。 |
128 |
One of the key concepts in decision support systems is multidimensional
analysis: examining the enterprise from all necessary combinations of
dimensions. We use the term
dimension to mean any category used
in specifying questions. Among the most commonly specified dimensions
are time, geography, product, department, and distribution channel, but
the potential dimensions are as endless as the varieties of enterprise
activity. The events or entities associated with a particular set of
dimension values are usually referred to as facts. The facts
might be sales in units or local currency, profits, customer counts,
production volumes, or anything else worth tracking. |
多维分析(multidimensional analysis)是决策支持系统(decision support
system)中的一个关键概念:其含义是以各种可能的维度(dimension)的组合来观察一个企业。此处使用维度这个术语来代指分析中所使用的不同角度。常用的维度有时间(time),地域(geography),产品(product),部门(department),及渠道(distribution
channel)等,但用户可以根据企业自身的活动特点总结出各种维度。与一组维度值(dimension
value)相关的事件(event)或实体(entity)通常被称为事实(fact)。常见的事实有销售数量及金额,利润,客户数量,产量等,任何值得跟踪的信息都可以作为事实。 |
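维度与事实的关系通常体现为星型模式(star schema)中事实表与维表的连接。以下为示意性查询(sales,times,products 均为假设的表名及结构):

```sql
-- 按年份与产品类别汇总销售额:
-- sales 为事实表,times 与 products 为维表
SELECT t.calendar_year,
       p.product_category,
       SUM(s.amount) AS total_sales
FROM   sales s
JOIN   times t    ON s.time_id    = t.time_id
JOIN   products p ON s.product_id = p.product_id
GROUP  BY t.calendar_year, p.product_category;
```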
129 |
Here are some examples of multidimensional requests:
|
以下是一些多维查询的例子:
|
130 |
All these requests involve multiple dimensions. Many multidimensional
questions require aggregated data and comparisons of data sets, often
across time, geography or budgets. |
上述查询都涉及多个维度。许多多维查询都需要汇总数据,并对不同数据集进行比较,而比较常常跨越时间,地域或预算(budget)等维度。 |
131 |
See Also: |
另见: |
132 |
SQL for Analysis |
16.6.2 用于分析运算的 SQL |
133 |
Oracle has advanced SQL analytical processing capabilities using a
family of analytic SQL functions. These analytic functions enable you to
calculate:
|
Oracle 内含有一套分析型 SQL 函数(analytic SQL function),提供了高级 SQL
分析处理能力。用户可以利用这套函数进行以下计算:
|
134 |
Ranking functions include cumulative distributions, percent rank, and
N-tiles. Moving window calculations allow you to find moving and
cumulative aggregations, such as sums and averages. Lag/lead analysis
enables direct inter-row references so you can calculate
period-to-period changes. First/last analysis enables you to find the
first or last value in an ordered group. |
排名函数包括累积分布(cumulative distribution),百分比排名(percent rank),以及 N-tile
函数。移动窗口计算使用户可以进行移动及累积的汇总,例如求和及求平均值。滞后/提前分析能够在数据行之间建立直接的关联(direct inter-row
reference),用户可以使用此功能计算期间变化(period-to-period change)。首位/末位分析使用户能够找出一个有序分组中的首个或末个值。 |
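以下示意性查询演示了其中几类分析函数(假设存在包含 month,amount 列的 sales 表):

```sql
SELECT month,
       amount,
       -- 排名
       RANK() OVER (ORDER BY amount DESC) AS sales_rank,
       -- 三个月移动平均(移动窗口计算)
       AVG(amount) OVER (ORDER BY month
                         ROWS BETWEEN 2 PRECEDING
                                  AND CURRENT ROW) AS moving_avg_3m,
       -- 滞后分析:与上一期比较,得到期间变化
       amount - LAG(amount, 1) OVER (ORDER BY month) AS change_from_prev
FROM   sales;
```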
135 |
Other features include the CASE expression.
CASE expressions provide if-then logic
useful in many situations. |
Oracle 还支持 CASE 表达式。CASE
表达式提供了在许多场景下都很有用的 if-then 逻辑。 |
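以下为 CASE 表达式的一个示意性用法(假设存在包含 salary 列的 emp 表):

```sql
-- 利用 if-then 逻辑将工资划分为区间,并统计各区间人数
-- (Oracle 要求在 GROUP BY 中重复该 CASE 表达式)
SELECT CASE
         WHEN salary < 3000                 THEN 'low'
         WHEN salary BETWEEN 3000 AND 7000  THEN 'medium'
         ELSE 'high'
       END      AS salary_band,
       COUNT(*) AS emp_count
FROM   emp
GROUP  BY CASE
            WHEN salary < 3000                THEN 'low'
            WHEN salary BETWEEN 3000 AND 7000 THEN 'medium'
            ELSE 'high'
          END;
```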
136 |
To enhance performance, analytic functions can be parallelized: multiple
processes can simultaneously run all of these statements. These
capabilities make calculations easier and more efficient, thereby
enhancing database performance, scalability, and simplicity. |
为了进一步提升系统性能,分析函数可以被并行执行:即多个进程同时执行语句。这使运算更为高效,也提高了数据库的性能,可伸缩性(scalability),及易用性。 |
137 |
See Also: |
另见: |
138 |
SQL for Modeling |
16.6.3 用于建模的 SQL |
139 |
Oracle's MODEL clause brings a new
level of power and flexibility to SQL calculations. With the
MODEL clause, you can create a
multidimensional array from query results and then apply formulas to
this array to calculate new values. The formulas can range from
basic arithmetic to simultaneous equations using recursion. For some
applications, the MODEL clause can
replace PC-based spreadsheets. Models in SQL leverage Oracle's
strengths in scalability, manageability, collaboration, and
security. The core query engine can work with unlimited quantities
of data. By defining and executing models within the database, users
avoid transferring large datasets to and from separate modeling
environments. Models can be shared easily across workgroups,
ensuring that calculations are consistent for all applications. Just
as models can be shared, access can also be controlled precisely
with Oracle's security features. With its rich functionality, the
MODEL clause can enhance all types of
applications. |
Oracle 的 MODEL 子句能使 SQL 的计算能力更强大更灵活。利用
MODEL 子句,用户可以在结果集的基础上创建一个多维数组(multidimensional
array),并可以使用此数据进行新的公式计算(formula)。公式可以为基本的数学运算,也可以是包含递归运算的联立方程(simultaneous
equations using recursion)。对于某些应用来说,MODEL
子句能够代替在 PC 上运行的电子表格软件。SQL 中的模型得益于 Oracle
在可伸缩性,可管理性,协作能力及安全性方面的优势。核心查询引擎(core query
engine)对其处理的数据量没有限制。通过在数据库内定义并执行模型,用户不必在独立的建模环境与数据库间来回传输大量数据。模型可以在工作组(workgroup)间轻松共享,从而确保所有应用的计算结果保持一致。同样,对模型的访问也可以依靠
Oracle 的安全特性进行细致的控制。MODEL
子句丰富的功能有助于增强所有类型的应用系统。 |
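以下是 MODEL 子句的一个示意性示例(sales_view 及其列均为假设):

```sql
-- 在查询结果上按 country 分区,以 year 为维度构造多维数组,
-- 并用公式由 2004 年数据推算 2005 年销售额(新行将被自动插入结果集)
SELECT country, year, sales
FROM   sales_view
MODEL
  PARTITION BY (country)
  DIMENSION BY (year)
  MEASURES (sales)
  RULES (
    -- 假设 2005 年销售额为 2004 年的 1.1 倍
    sales[2005] = sales[2004] * 1.1
  );
```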
140 |
See Also: |
另见: |
141 |
Overview of OLAP Capabilities |
16.7 OLAP 功能概述 |
142 |
Oracle OLAP provides the query performance and calculation capability
previously found only in multidimensional databases to Oracle's
relational platform. In addition, it provides a Java OLAP API that is
appropriate for the development of internet-ready analytical
applications. Unlike other combinations of OLAP and RDBMS technology,
Oracle OLAP is not a multidimensional database using bridges to move
data from the relational data store to a multidimensional data store.
Instead, it is truly an OLAP-enabled relational database. As a result,
Oracle provides the benefits of a multidimensional database along with
the scalability, accessibility, security, manageability, and high
availability of the Oracle database. The Java OLAP API, which is
specifically designed for internet-based analytical applications, offers
productive data access. |
Oracle OLAP 将以往只有多维数据库才具备的查询性能及计算能力带入了 Oracle 关系型平台。此外,Oracle
OLAP 中提供了一套 Java OLAP API,供用户开发适合 internet 环境的分析应用程序。与其他将 OLAP 与 RDBMS
整合的技术不同,Oracle OLAP 并不是一个依靠桥接(bridge)将数据从关系型数据存储转移到多维数据存储的多维数据库,而是一个真正实现了 OLAP
功能的关系型数据库。因此,Oracle 在提供多维数据库优势的同时,还具备 Oracle
数据库的可伸缩性,可访问性(accessibility),安全性,可管理性,及高可用性。专为基于 internet 的分析应用程序设计的
Java OLAP API 提供了高效的数据访问能力。 |
143 |
See Also: |
另见: |
144 |
Benefits of OLAP and RDBMS Integration |
16.7.1 OLAP 与 RDBMS 集成的优势 |
145 | Basing an OLAP system directly on the Oracle database server offers the following benefits: | 直接在 Oracle 数据库上构建 OLAP 系统具备以下优势: |
146 |
Scalability |
16.7.1.1 可伸缩性 |
147 |
There is tremendous growth along three dimensions of analytic
applications: number of users, size of data, and complexity of analyses.
There are more users of analytical applications, and they need access to
more data to perform more sophisticated analysis and target marketing.
For example, a telephone company might want a customer dimension to
include detail such as all telephone numbers as part of an application
that is used to analyze customer turnover. This would require support
for multi-million row dimension tables and very large volumes of fact
data. Oracle can handle very large data sets using parallel execution
and partitioning, as well as offering support for advanced hardware and
clustering. |
对一个分析应用系统(analytic
application)来说应该考虑三个方面的快速增长:用户数,数据量,及分析的复杂性。分析应用系统的用户会逐渐增长,用户将需要更多的数据以便进行更复杂的分析,或进行目标营销(target
marketing)。例如,电信企业可能需要在客户维度中包含所有电话号码等明细数据,以便在应用中分析客户流失(customer turnover)。此时,维表的数据量可能高达数百万行,事实数据量也会很大。Oracle
数据库具有并行执行(parallel
execution)及分区(partitioning)能力,还能支持先进的硬件及集群,从而实现对大数据量的处理。 |
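分区的一个示意性示例如下(sales 表及其列均为假设;按时间范围分区是数据仓库中的常见做法):

```sql
-- 按季度对事实表进行范围分区,
-- 使管理操作与查询可以只涉及相关的分区
CREATE TABLE sales (
  time_id    DATE,
  product_id NUMBER,
  amount     NUMBER
)
PARTITION BY RANGE (time_id) (
  PARTITION sales_q1 VALUES LESS THAN (DATE '2004-04-01'),
  PARTITION sales_q2 VALUES LESS THAN (DATE '2004-07-01'),
  PARTITION sales_q3 VALUES LESS THAN (DATE '2004-10-01'),
  PARTITION sales_q4 VALUES LESS THAN (DATE '2005-01-01')
);
```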
148 |
Availability |
16.7.1.2 可用性 |
149 |
Partitioning allows management of precise subsets of tables and indexes,
so that management operations affect only small pieces of these data
structures. By partitioning tables and indexes, data management
processing time is reduced, thus minimizing the time data is
unavailable. Transportable tablespaces also support high availability.
With transportable tablespaces, large data sets, including tables and
indexes, can be added with almost no processing to other databases. This
enables extremely rapid data loading and updates. |
采用分区技术(partitioning)后,用户可以对表及索引的精确子集进行管理,即管理操作只影响这些数据结构的一小部分。对表及索引进行分区后,数据管理的处理时间被缩短,从而使数据不可用的时间最小化。可移动表空间(transportable
tablespace)技术也有助于提高可用性。采用可移动表空间技术,大量数据(包括表及索引)几乎无需任何处理即可加入其他数据库。这使数据加载及更新操作变得极为迅速。 |
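可移动表空间的基本步骤可以示意如下(sales_ts 为假设的表空间名;导出及导入元数据需借助 Data Pump 等操作系统层面的工具完成):

```sql
-- 1. 在源数据库上将待移动的表空间置为只读
ALTER TABLESPACE sales_ts READ ONLY;

-- 2. 用 Data Pump 导出表空间元数据(操作系统命令,示意):
--      expdp ... transport_tablespaces=sales_ts
-- 3. 将数据文件与元数据复制到目标数据库并导入

-- 4. 在目标数据库上将表空间恢复为读写
ALTER TABLESPACE sales_ts READ WRITE;
```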
150 |
Manageability |
16.7.1.3 可管理性 |
151 |
Oracle lets you precisely control resource utilization. The Database
Resource Manager, for example, provides a mechanism for allocating the
resources of a data warehouse among different sets of end-users. |
用户可以详细地控制 Oracle 中的资源使用。例如,Database Resource
Manager(数据库资源管理器)能够为不同终端用户组分配数据仓库资源。 |
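利用 DBMS_RESOURCE_MANAGER 包可以为不同的终端用户组分配资源。以下为高度简化的示意(计划名,用户组名均为假设,具体参数随数据库版本而异):

```sql
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  -- 创建一个资源使用者组
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'REPORTING',
    comment        => 'ad hoc reporting users');
  -- 创建资源计划,并为该组分配 CPU 份额
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'DW_PLAN',
    comment => 'data warehouse plan');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DW_PLAN',
    group_or_subplan => 'REPORTING',
    comment          => 'limit reporting CPU',
    cpu_p1           => 30);
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
```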
152 |
Another resource management facility is the progress monitor, which
gives end users and administrators the status of long-running
operations. Oracle maintains statistics describing the percent-complete
of these operations. Oracle Enterprise Manager lets you view a bar-graph
display of these operations showing what percent complete they are.
Moreover, any other tool or any database administrator can also retrieve
progress information directly from the Oracle data server using system
views. |
属于资源管理的功能还有进展监控(progress monitor),终端用户及管理员可以查询长时间执行操作(long-running
operation)的状态。Oracle 能够自动维护描述操作完成百分比的统计信息。用户在 Oracle Enterprise Manager(Oracle
企业管理器)中可以以柱状图的形式查询这些操作的完成百分比。此外,其他工具或数据库管理员也可以通过系统视图直接从数据库服务器中获取相关信息。 |
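长时间执行操作的进展可以直接从 V$SESSION_LONGOPS 系统视图查询,例如:

```sql
-- 查询尚未完成的长时间操作及其完成百分比
SELECT sid,
       opname,
       sofar,
       totalwork,
       ROUND(sofar / totalwork * 100, 1) AS pct_complete
FROM   v$session_longops
WHERE  totalwork > 0
AND    sofar < totalwork;
```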
153 |
Backup and Recovery |
16.7.1.4 备份与恢复 |
154 |
Oracle provides a server-managed infrastructure for backup, restore, and
recovery tasks that enables simpler, safer operations at terabyte scale.
Some of the highlights are:
|
Oracle 中提供了用于备份,复原及恢复的服务器端基础架构(server-managed
infrastructure),使这些操作在大数据量环境下同样简单安全。这个基础架构的主要特征有:
|
155 |
See Also: |
另见: |
156 |
Security |
16.7.1.5 安全性 |
157 |
The security features in Oracle have reached the highest levels of U.S.
government certification for database trustworthiness. Oracle's fine
grained access control enables cell-level security for OLAP users. Fine
grained access control works with minimal burden on query processing,
and it enables efficient centralized security management. |
Oracle 的安全特性已经达到美国政府数据库可信度认证的最高等级。Oracle 的细粒度访问控制(fine grained access control)功能可以为 OLAP
用户实现单元格级(cell-level)的安全控制。细粒度访问控制对查询处理的性能影响很小,且能够实现高效的集中化安全管理。 |
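细粒度访问控制通常借助 DBMS_RLS 包实现:为表附加一个策略函数,由其在运行时为每个查询追加 WHERE 谓词。以下为示意(模式名,表名及策略函数 sales_policy_fn 均为假设,该函数需返回一个谓词字符串):

```sql
BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'DW',
    object_name     => 'SALES',
    policy_name     => 'sales_region_policy',
    function_schema => 'DW',
    policy_function => 'sales_policy_fn',  -- 假设的策略函数
    statement_types => 'SELECT');
END;
/
```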
158 |
Overview of Data Mining |
16.8 数据挖掘概述 |
159 |
Oracle Data Mining (ODM) embeds data mining within the Oracle Database.
The data never leaves the database — the data, data preparation, model
building, and model scoring results all remain in the database. This
enables Oracle to provide an infrastructure for application developers
to integrate data mining seamlessly with database applications. Typical
examples of applications in which data mining is used include call
centers, ATMs, ERM, and business planning applications. |
Oracle Data Mining(ODM,Oracle 数据挖掘)组件将数据挖掘功能嵌入了 Oracle
数据库。任何数据都不必离开数据库,即分析所用的数据,数据准备(data preparation),建模(model
building),模型评估结果(model scoring result)都可以保存在数据库中。应用开发者可以把 Oracle
作为基础架构,将数据挖掘与数据库应用无缝地集成。使用数据挖掘的典型应用有呼叫中心,ATM,ERM,及业务计划应用(business
planning application)等。 |
160 |
By eliminating the need for extracting data into specialized tools and
then importing the results back into the database, you can save
significant amounts of time. In addition, by having the data and the
data model in the same location (an Oracle database), there is no need
to export the model as code. |
由于无需将数据抽取到专业工具,也无需将结果返回数据库,这大大节省了数据挖掘的时间。此外,由于数据与数据模型存储在同一位置(Oracle
数据库),用户也不必将模型导出为代码。 |
161 |
Data mining functions such as model building, testing, and scoring are
provided through a Java API. |
建模,测试,评估等数据挖掘函数是通过 Java API 提供的。 |
162 |
Oracle Data Mining supports the following algorithms:
|
Oracle Data Mining 支持以下算法:
|
163 |
ODM also includes several feature and performance enhancements. |
ODM 中还提供了一些其他功能及性能增强特性。 |
164 |
|
|
[063] Transportable tablespaces
[068] Table functions
[068] pipelined
[069] incremental pipelining
[073] direct-path
[095] SQL Access Advisor
[133] Lag/lead analysis
[154] recovery catalog
[159] model scoring |
[056] We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
[064] Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data.
[068] Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java.
[069] Returning result sets incrementally for further processing as soon as they are created. This is called incremental pipelining.
[073] External tables enable the pipelining of the loading phase with the transformation phase.
[073] For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table.
[098] Efficient maintenance during parallel DML and loads.
[103] Bitmap indexing also supports parallel create indexes and concatenated indexes.
[119] Given a query plan for a SQL query, the parallel execution coordinator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators.
[127] We use the term dimension to mean any category used in specifying questions.
[139] The core query engine can work with unlimited quantities of data.
[160] In addition, by having the data and the data model in the same location (an Oracle database), there is no need to export the model as code. |
1、看看 incremental pipelining 是怎么回事? [069] incremental pipelining
2、Figure 16-4 Pipelined Data Transformation [074] 图错了?
3、怎么"逐层汇总"? [125] Aggregate at increasing levels of aggregation, from the most detailed up to a grand total
4、有什么"可能组合"? [125] Calculate all possible combinations of aggregations with a single statement
5、为什么分析角度有"budgets"? [130] Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
6、"N-tiles"是什么东西? [134] N-tiles |