" A Data Warehouse is a subject-oriented, integrated time-variant, non-volatile collection of data in support of management's decisionmaking process." (W. H. Inmon 1992).
Described in more detail:
- Organized around major subjects
- e.g. customers, patients, students, products
- Constructed by integrating multiple, heterogeneous data sources
- Important: quality of data, i.e. its correctness and consistency
- Consistent naming conventions, formats and encoding structures across multiple data sources
- All data values and their changes in time can be compared and analyzed
- can study trends and changes
- Data is stored in a warehouse
- Read-only, periodically refreshed
What is it?
A data warehouse (DW) provides information for analytical processing, decision making and data mining tools. A DW collects data from multiple heterogeneous operational source systems (OLTP - On-Line Transactional Processing) and stores summarized, integrated business data in a central repository used by analytical applications (OLAP - On-Line Analytical Processing) with different user requirements. The data area of a data warehouse usually stores the complete history of a business.
The common process for obtaining decision making information is based on using OLAP tools. These tools have their data source in the DW data area, whose records are updated by ETL (Extraction, Transformation and Loading) tools. The ETL processes are responsible for identifying and extracting the relevant data from the OLTP source systems, customizing and integrating this data into a common format, cleaning the data and conforming it into an adequate integrated format for updating the data area of the DW and, finally, loading the final formatted data into its database.
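The ETL flow described above can be sketched as three small functions. This is a minimal illustration, not a real ETL tool; the source records and field names (`cust_id`, `customer_id`) are made up to show the kind of heterogeneity the transform step has to resolve.

```python
# Minimal ETL sketch: extract a snapshot from heterogeneous sources,
# transform the records to one common format, and load them into the
# warehouse's data area. All names are illustrative.

def extract(sources):
    """Take a snapshot of the relevant records from each OLTP source."""
    return [record for source in sources for record in source]

def transform(records):
    """Clean and conform records to a common format: unify the differing
    customer-id attribute names and normalise name casing."""
    cleaned = []
    for r in records:
        cid = r.get("cust_id", r.get("customer_id"))  # differing attribute names
        if cid is None:
            continue  # drop records that cannot be conformed
        cleaned.append({"customer_id": cid, "name": r["name"].title()})
    return cleaned

def load(warehouse, records):
    """Append the conformed records to the warehouse's data area."""
    warehouse.extend(records)

# Two sources storing the same kind of data in different forms:
source_a = [{"cust_id": 1, "name": "alice"}]
source_b = [{"customer_id": 2, "name": "BOB"}]

warehouse = []
load(warehouse, transform(extract([source_a, source_b])))
print(warehouse)
# → [{'customer_id': 1, 'name': 'Alice'}, {'customer_id': 2, 'name': 'Bob'}]
```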
Data in a data warehouse is often a copy of the operational/original data, and may not be the most up-to-date version. Traditionally, it has been well accepted that data warehouse databases are updated periodically - typically on a daily, weekly, or even monthly basis - implying that the data is never fully up to date.
The data can also be query-driven (the data remains located at the sources, and the warehouse fetches it from each source on demand), but this has not caught on in industry. Advantages and disadvantages of these two types are described in a later section.
The architecture of a data warehouse is described in the picture below.
From this illustration we see that data is collected from a range of different sources and stored in a single data warehouse. The sources feed the warehouse with data periodically. For interoperability, an integration layer is usually applied.
Integration and Interoperability
To achieve integration and interoperability there are two ways of doing data management.
- Query-driven approach
- Warehouse approach
With the query-driven approach, the data is not actually stored at the warehouse itself, only at the sources. The data is collected and processed on the sources, controlled by a query API. This way the warehouse acts on the most recent data, and there is no redundant copying of data.
Disadvantages of this approach are the delay of accessing the sources for data, and that it is not efficient for frequent queries. Also, it has to compete with existing local processing at the sources. (This approach has not caught on in industry.)
The second approach is warehousing. Warehousing has the advantage of high performance, at the cost of inconsistent/outdated data. It will also operate when sources are down, giving an "always-up" service.
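The trade-off between the two approaches can be illustrated with a small sketch; the in-memory "sources" and filter queries below are purely hypothetical.

```python
# Contrast of the query-driven and warehouse approaches, using simple
# in-memory "sources" (all data is illustrative).

sources = {"s1": [1, 2], "s2": [3]}

def query_driven(q_filter):
    """Query-driven: fetch from every source at query time (always fresh,
    but pays the cost of contacting the sources on every query)."""
    return [x for rows in sources.values() for x in rows if q_filter(x)]

class Warehouse:
    """Warehouse approach: keep a local copy, refreshed periodically."""
    def __init__(self):
        self.data = []
    def refresh(self):
        self.data = [x for rows in sources.values() for x in rows]
    def query(self, q_filter):
        return [x for x in self.data if q_filter(x)]

wh = Warehouse()
wh.refresh()
sources["s1"].append(99)               # a source changes after the refresh

print(query_driven(lambda x: x > 1))   # → [2, 99, 3]  (fresh, but hits sources)
print(wh.query(lambda x: x > 1))       # → [2, 3]      (fast, but stale)
```

The warehouse answers from its local copy even if a source were unavailable, which is the "always-up" property mentioned above.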
Data Warehouse vs. Centralized Database System
Warehouse Models & Operators
We have the following (relevant) data models and operators.
- Stars & Snowflakes
- Slice & dice
- roll-up, drill down
Most relevant of these models is the cube model:
Cube Data Model
Data is represented as a multidimensional (2-n) cube where each dimension represents a table dimension. A two-dimensional cube is a standard table. An example of a 3D cube is a table of data changing over time, with time being the third dimension. The cube is then a series of 2D tables forming a 3D cube.
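One simple way to picture this model is a mapping from dimension coordinates to a measure value; the regions, products and months below are made-up sample data.

```python
# A tiny 3-dimensional cube as a dict keyed by (region, product, month);
# the dimension values and measures are illustrative.

cube = {
    ("North", "Phone",  "Jan"): 10,
    ("North", "Phone",  "Feb"): 12,
    ("North", "Laptop", "Jan"):  5,
    ("South", "Phone",  "Jan"):  7,
}

# Fixing one dimension value yields a 2D table - one of the series of
# 2D tables that together form the 3D cube:
jan_table = {(r, p): v for (r, p, m), v in cube.items() if m == "Jan"}
print(jan_table)
# → {('North', 'Phone'): 10, ('North', 'Laptop'): 5, ('South', 'Phone'): 7}
```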
- Rotate the cube
- Example: Change the perspective from "Region X Product" to "Region X Time"
- Cut through the cube, so that users can focus on some specific perspectives
- Example: only focus on a specific customer
- Get one cell from the cube (the smallest slice)
- Example: Get the production volume of Armonk, for CellPhone 1001, in January
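The three operations listed above can be sketched on the same dict-based cube representation; the function names and dimension values are illustrative, not a standard API.

```python
# Sketch of the three cube operations (rotate, cut through / slice,
# get one cell) on a dict-based cube. All data is illustrative.

cube = {
    ("North", "Phone",  "Jan"): 10,
    ("North", "Laptop", "Jan"):  5,
    ("South", "Phone",  "Jan"):  7,
    ("South", "Phone",  "Feb"):  9,
}

def rotate(cube):
    """Rotate: change perspective from (region, product, time)
    to (region, time, product)."""
    return {(r, t, p): v for (r, p, t), v in cube.items()}

def slice_(cube, month):
    """Cut through the cube along the time dimension, keeping a 2D table."""
    return {(r, p): v for (r, p, t), v in cube.items() if t == month}

def cell(cube, region, product, month):
    """Get one cell from the cube (the smallest slice)."""
    return cube[(region, product, month)]

print(cell(cube, "South", "Phone", "Feb"))  # → 9
```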
(picture of slice and dice)
The fact and dimension relations can be displayed in an entity-relationship diagram which looks like a star. The fact table in the middle is connected to a set of dimension tables.
A snowflake schema is a refinement of the star schema where some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
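A minimal star schema can be sketched in SQLite: one fact table in the middle referencing dimension tables. The table and column names (`fact_sale`, `dim_product`, `dim_time`) are illustrative, not a standard.

```python
# A minimal star schema: one fact table joined to two dimension tables.
# All table/column names and rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE fact_sale (
        product_id INTEGER REFERENCES dim_product(product_id),
        time_id    INTEGER REFERENCES dim_time(time_id),
        amt        REAL
    );
    INSERT INTO dim_product VALUES (1, 'Phone');
    INSERT INTO dim_time    VALUES (1, 'Jan');
    INSERT INTO fact_sale   VALUES (1, 1, 10.0), (1, 1, 5.0);
""")

# A typical OLAP-style query: join the fact table to its dimensions
# and aggregate the measure.
row = conn.execute("""
    SELECT p.name, t.month, SUM(f.amt)
    FROM fact_sale f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.name, t.month
""").fetchone()
print(row)  # → ('Phone', 'Jan', 15.0)
```

In a snowflake schema, `dim_product` would itself be split further, e.g. into product and category tables.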
Multi-dimensional databases generally have hierarchies or formula-based relationships of data within each dimension. Aggregation involves computing all of these data relationships for one or more dimensions, for example, adding up all Departments to get Total Division data.
It has been claimed that for complex queries OLAP Cubes can produce an answer in about 0.1% of the time for the same query on OLTP relational data. The single most important mechanism in OLAP, which allows it to achieve such performance, is the use of aggregations.
Many OLAP queries involve aggregation of the data in the fact table.
For example: Add up amounts for each day
SELECT sum(amt) FROM SALE
GROUP BY date
Aggregate operators are: sum, count, max, min, median, average.
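The aggregation query above can be run directly against a small SALE table in SQLite; the sample rows are made up.

```python
# Running the aggregation example (sum of amounts per day) against a
# tiny SALE table. The rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (date TEXT, amt REAL)")
conn.executemany("INSERT INTO SALE VALUES (?, ?)",
                 [("2024-01-01", 10.0),
                  ("2024-01-01",  5.0),
                  ("2024-01-02",  7.0)])

# Add up amounts for each day:
rows = conn.execute(
    "SELECT date, SUM(amt) FROM SALE GROUP BY date ORDER BY date").fetchall()
print(rows)  # → [('2024-01-01', 15.0), ('2024-01-02', 7.0)]
```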
Population of a data warehouse is done by collecting data from the different sources. This process is called ETL (Extract, Transform and Load). This process is described in more detail below.
The ETL process is divided into three steps: Extract, Transform and Load.
In this step of the ETL process, the warehouse obtains a snapshot of the source data.
The sources of a data warehouse generally store their data in different forms, and their metadata is heterogeneous. In this step of ETL, the data is converted to a single manageable data format used in the warehouse.
This step is responsible for:
- cleaning the data, removing errors and inconsistent data.
- All syntactics are corrected, e.g. attribute names that differ in name, but not in meaning: SSN vs. Ssnum.
- Attribute domains: e.g. Integer vs. String.
- Sorting data
In other words: this is where all the data is transformed into a representation that is interoperable with all the other data from the different sources.
Here the transformed data is placed into the warehouse, and indexes are created.
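The load step can be sketched in a few lines of SQLite: insert the transformed records, then build an index to speed up later OLAP queries. The table and index names are illustrative.

```python
# Load step sketch: place the transformed records into the warehouse
# table and create an index. All names and rows are illustrative.
import sqlite3

# Output of the (hypothetical) transform step:
transformed = [("2024-01-01", "Phone", 10.0),
               ("2024-01-02", "Phone",  7.0)]

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sale (date TEXT, product TEXT, amt REAL)")
wh.executemany("INSERT INTO sale VALUES (?, ?, ?)", transformed)
wh.execute("CREATE INDEX idx_sale_date ON sale (date)")  # index built after load
wh.commit()

print(wh.execute("SELECT COUNT(*) FROM sale").fetchone()[0])  # → 2
```

Building indexes after the bulk load (rather than before) is a common choice, since maintaining an index during a large load is more expensive than creating it once afterwards.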
PS: I believe the slides are wrong, and they should say "Steps in populating a data warehouse", not "steps in data reconciliation".
Why use reconciliation? When extracting data from a source and then transferring it, the nature of the data can change considerably.
Failures can be due to:
- Inconsistent or non coherent data from source
- Non-integrating data among different sources
- Unclean/ non-profiled data
- Un-handled exceptions
- Constraint violations
- Logical issues/ Inherent flaws in program
- Technical failures like loss of connectivity, loss over network, space issue etc.
Data reconciliation is often confused with the process of data quality testing. Even worse, sometimes the data reconciliation process is used to investigate and pinpoint the data issues. It should be limited to identifying, if at all, any issues with the data. The reconciliation process answers "what" is wrong, not "why".
The matter of data warehouse consistency only applies when the data is actually stored in the warehouse, not when the query-driven approach is used.
When source data changes, the data stored in the warehouse becomes inconsistent. Source changes must be detected, and the changes must be propagated in order to achieve consistency. This is usually done in three ways:
- Immediate: The view in the warehouse is updated after each base relation is updated.
- Deferred: Updates the view only when a query is issued against the view.
- Periodic: Updates at periodic intervals. The view then becomes temporarily inconsistent, but has low cost.
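The difference between immediate and deferred propagation can be sketched with a tiny materialized-view class; the single-list "source relation" is purely illustrative.

```python
# Sketch of immediate vs. deferred view maintenance over one source
# relation. All structures are illustrative.

class MaterializedView:
    def __init__(self, source, deferred=False):
        self.source = source
        self.deferred = deferred
        self.view = list(source)       # initial materialization

    def on_source_update(self, row):
        self.source.append(row)
        if not self.deferred:
            self.view.append(row)      # immediate: propagate right away

    def query(self):
        if self.deferred:
            self.view = list(self.source)  # deferred: refresh only on query
        return self.view

imm = MaterializedView([1])
imm.on_source_update(2)                # view updated at once
print(imm.query())  # → [1, 2]

dfr = MaterializedView([1], deferred=True)
dfr.on_source_update(2)                # view temporarily inconsistent...
print(dfr.query())                     # ...refreshed at query time → [1, 2]
```

Periodic maintenance works like the deferred case, except the refresh is triggered by a timer instead of by a query, so the view can stay inconsistent for a bounded interval at low cost.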
Warehouse Consistency levels:
- Convergence: after the last update and all activity has ceased, the view is consistent with the source.
- Weak consistency: Convergence + every warehouse state reflects a valid state at the sources.
- Strong consistency: convergence + every warehouse state reflects a set of valid source states reflecting the same globally serializable schedule, and the order of the warehouse states matches the order of the source states.
- Completeness: strong consistency + a complete order-preserving mapping between the states of the warehouse and the states of the sources.
Describe characteristics of a data warehouse.
Draw an architecture of a data warehouse and explain it.
Explain how a data warehouse is populated with data.
Describe also what a data warehouse is typically used for.
The Cube data model is often used in data warehouses, describe the cube model. Illustrate.
Describe also the operations (Cube-operations) that can be used for manipulating data in a warehouse.
Explain why data warehouses use a different data model compared to what is used in source systems.
Why is consistency a challenge in data warehouses?
Describe the characteristics of a data warehouse.
Explain the difference between a data warehouse and i) a centralized database system and ii) a multidatabase system.
Explain how a data warehouse is populated with information. Describe all operations in this process. Describe OLTP and OLAP and their applications.
Many OLAP queries include aggregation of data. Describe aggregation and give examples of aggregation operations.
How is consistency maintained in data warehouses, and what needs are there between the warehouse and the source systems?
How is interoperability achieved in data warehouses?