I need attached work done in exact 7 hours
Dont bid before you read the work.
Only oracle master needed who knows star schema.
My budget is fixed USD 40.
If you need more, dont bid else I will report you.
FIT5195 S1 2020 – Take Home Test (10%)
Due date: Week 6, Friday 1-May-2020, 11:55pm
This Take Home Test consists of two Case Studies. You are required to answer all questions.
Case Study 1: Government Hospital Data Warehouse (50 marks)
Currently, the Victorian government has a simple database system implemented to store the
collective data of all government-run hospitals. The database system includes information about their
services, staff, and patients. The simple database system consists of the following tables and
attributes:
Figure 1 : A simplified structure of the Government Hospital’s current E/R Diagram
Due to the nationwide impact of a new pandemic, the Premier of Victoria (i.e. the head of government
in the Australian state of Victoria) requires a data warehouse for analyzing trends and forecasting
purposes. The goal of the analyzing is to identify which frequently used services, popular hospital
locations, etc.
You are required to design a small Data Warehouse to keep track of the statistics. The Premier is
particularly interested in analyzing the total number (population quantity) of patients and total
service charged by services used, time periods, age groups, and hospital locations.
The Premier also wants to be able to drill down based on the cost of services range (low price < $20,
medium price between $20 and $50, and high price >$50).
Tasks:
[1]. Create the operational database based on the above E/R Diagram, and then populate with
sufficient numbers of records in each table. It is suggested that each table would have 5-20
records depending on the use of that table in the operational database. You need to create a script
file containing the CREATE TABLE and INSERT INTO commands.
You can follow the list of services from this link:
https://www.monash.edu/health/medical/services.
[2]. Then develop a GOVERNMENT HOSPITAL star schema. Identify the fact table, dimensions
and attributes required to support the schema. If the star schema consists of a Bridge Table, you
have
to also include the Weight Factor and List Aggregate.
The result of this task is a star schema diagram. You can use any drawing tool, such as
Lucidchart, to draw the star schema.
[3]. Validate your star schema using the Two-Column Table Methodology. You are required to
illustrate some two-column tables for this task based on your star schema design.
[4]. Write the SQL commands to create the fact and dimension tables. You need to create a script
file containing the appropriate SQL commands to create the fact and dimension tables.
[5]. Write the SQL commands to answer the following queries: (you need to make sure that there are
records in your fact and dimensions tables. For each of the following queries, write the SQL and
show the results):
a) Show the total number of patients making appointments during Winter.
b) Show the total service charged for each service cost type.
c) Show the total number of patients by each age group (infant <1, children <18, adult 18+,
senior 65+) in April 2020.
d) Show the total service charged for general medical consultations in each suburb.
Submission checklist for Case Study 1:
A. A pdf file or word document containing:
i. The SQL commands to create tables and insert records to the operational database –
Task 1
ii. The star schema diagram – Task 2
iii. The Two-Column Table Methodology illustration – Task 3
iv. The SQL commands to create the dimension and fact tables, as well as the contents of
these tables – Task 4
v. The SQL commands to answer the queries in Task 5 and the query results
B. The .sql files containing:
i. The SQL commands to create the operational database.
i
i. The SQL commands to create the data warehouse.
iii. The SQL commands to answer the queries in Task 5.
(Notes: The marker of this test will simply run this file. So you need to make sure that the
SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,
especially if you cut and paste from a Word file).
C. A folder:
i. Save all the files from [A] and the SQL files from [B] in one folder.
ii. Name the folder CaseStudy1_yourStudentID.
https://www.monash.edu/health/medical/services
Case Study 2: Accident Records Data Warehouse (50 marks)
Figure 2 : A simplified structure of the Accident Record’s current database system
The above figure displays how the Victorian Roads and Safety (VicR&S) Department stores accident
related data. When there are any road accidents, the details about the accidents such as the location,
date, time and event (e.g. what exactly happened during the accident) are recorded. For each accident,
the information (e.g. name, contact number, employment start date, current employment branch)
about one police officer who is in charge of overseeing the events of the accident (e.g finding out
who/what caused the accident) are stored. In addition, to identify a particular accident, each accident
is given a unique accident number.
On the other hand, one accident may be caused by several vehicles (e.g. a collision between 2
vehicles). All vehicles have owners who may or may not be the same as the vehicle driver during the
accident. Therefore, the VicR&S database stores information about vehicles involved in the accident
(i.e. vehicle number, vehicle model, vehicle make and vehicle manufacturer), the information about
the owners of the vehicles (i.e. owner name, owner address, owner contact number) and the
information about the drivers at the time of the accident (driver name, driver address, driver contact
number, driver licence no, and when the driver started to drive the vehicle). Furthermore, for each
accident, the severity of damage incurred by each vehicle is recorded.
Currently, to reduce the number of road accidents VicR&S department is interested in the following:
– The total number of accidents happening by different locations and by different lighting
periods (daytime: 6AM – 5:59PM and nighttime 6PM – 5:59AM).
– The total number of accidents by each vehicle model.
– The number of vehicles involved in every accident event on different locations.
– The number of accidents taken care of different police officer branches.
Tasks:
[1]. Develop an ACCIDENT RECORDS star schema. Identify the fact table, dimensions and
attributes required to support the schema. If the star schema consists of a Bridge Table, you have
to also include the Weight Factor and List Aggregate.
The result of this task is a star schema diagram. You can use any drawing tool, such as
Lucidchart, to draw the star schema.
[2]. Validate your star schema using the Two-Column Table Methodology. You are required to
illustrate some two-column tables for this task based on your star schema design.
[3]. Write the SQL commands to create the fact and dimension tables. You need to create a script
file containing the appropriate SQL commands to create the fact and dimension tables. The
operational tables are accessible from the ACCIDENT account. The result of this task is the SQL
commands. You will also need to show the contents of the tables that you have created.
[4]. Write the SQL commands to answer the following queries: (you need to make sure that there are
records in your fact and dimensions tables. For each of the following queries, write the SQL and
show the results):
a) Show the total number of accidents happening by different locations and by different
lighting periods (daytime: 6AM – 5:59PM and nighttime 6PM – 5:59AM).
b) Show the total number of accidents by each vehicle model.
c) Show the number of vehicles involved in every accident event on different locations.
d) Show the number of accidents taken care of by different police officer branches.
[5]. You need to come up with additional two more questions and answer these questions using the
SQL commands. Also explain the reason for why the management would like to have such
information.
Submission checklist for Case Study 2:
A. A pdf file or word document containing:
i. The star schema diagram – Task 1
ii. The Two-Column Table Methodology illustration – Task 2
iii. The SQL commands to create the dimension and fact tables, as well as the contents of
these tables – Task 3
iv. The SQL commands to answer the queries in Task 4 and the query results
v. The additional two questions with the SQL commands, query result, and your
explanation – Task 5
B. The .sql files containing:
i. The SQL commands to create the data warehouse.
ii. The SQL commands to answer the queries in Task 4 and Task 5.
(Notes: The marker of this test will simply run this file. So you need to make sure that the
SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,
especially if you cut and paste from a Word file).
C. A folder:
i. Save all the files from [A] and the SQL files from [B] in one folder.
ii. Name the folder CaseStudy2_yourStudentID.
Submission Method:
1. A zip file:
a. Save the CaseStudy1_yourStudentID and CaseStudy2_yourStudentID folders into one
folder named FIT5195_Test_yourname_studentID.
b. ZIP the folder FIT5195_Test_yourname_studentID.zip. This must be a ZIP file and
not other types of compressed folder. The zip file should contain the prescribed folder
structure as follows:
● FIT5195_Test_yourname_studentID/
○ CaseStudy1_yourStudentID/
○ Files for tasks
○ CaseStudy2_yourStudentID/
○ Files for tasks
2. Upload your zip file on Moodle by the due date: Friday 1-May-2020, 11:55pm (No late
submission is accepted).
END OF TAKE HOME TEST
Oracle/FIT5195-2-Star Schema
Week 2 – Star Schema
Semester 1, 2020
FIT5195 – Business Intelligence
and Data Warehousing
Developed by:
Agnes Haryanto
Agnes.Haryanto@monash.edu
MONASH
INFORMATION
TECHNOLOGY
Agenda
1. Notations and Processes
1. Star Schema Notation
2. E/R Diagram Notation
3. Transformation Process (Case Study)
2. Two-Column Table Methodology
Recall – The Big Picture
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if
you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
http://flux.qa/1AW6N8
Recall – Data Warehouse
▪ To address the drawback of operational database, and a need for decision-
making support data, data warehouse is needed.
▪ A data warehouse is a multi-dimensional view of databases, with
aggregates and pre-computed summaries.
➢ In many ways, it is basically doing aggregates in advance; that is exactly pre-
computation done at the design level, rather than at the query level.
Recall – Data Warehouse
Star Schema
▪ A Star Schema is a design representation of a multi-dimensional view. It is a
data modeling technique used to map multidimensional decision support
data into a relational database.
▪ The reason for the star schema’s development is that existing relational
modeling techniques: ER and normalization, did not yield a database
structure that served the advanced data analysis requirements well.
Star Schema Components
▪ There are Three main components of the Star Schema:
1. Facts
2. Dimensions
3. Attributes
Star Schema Components
1. Facts
Facts are numeric measurements (values) that represent a specific business aspect
or activity.
For example, sales figures are numeric measurements that represent product and/or
service sales.
2. Dimensions
Dimensions are qualifying characteristics that provide additional perspectives to a
given fact.
For example, sales might be viewed from specific dimension(s), such as sales
location, sales period, sales product, etc.
Star Schema Notation
▪ A Sales Star Schema
➢ Fact:
• Sales
➢ Dimensions:
• Time
• Product
• Branch
▪ Notation-wise, the Fact uses a bolder line, to differentiate between Fact
from Dimensions.
Star Schema Notation
▪ A Sales Star Schema
➢ Fact:
• Sales
➢ Dimensions:
• Time
• Product
• Branch
▪ The lines that represent a relationship
between the fact and dimensions can be
straight lines or bended lines.
Star Schema Notation
▪ Using the star schema notation, the
number of dimensions can be unlimited.
▪ If there is more dimensions, then we just
add more dimensions linked to the Fact.
Star Schema Components
3. Attributes
Each dimension table contains attributes.
For example:
Product dimension: Prod Type,
Description.
Location dimension: Region,
State,
City.
Time dimension: Year,
Month.
Star Schema Notation
Star Schema Notation
▪ Sales Star Schema
(b) Sales Star Schema complete with the Attributes(a) Outline star schema for Sales
E/R Diagram Notation
E/R Diagram Notation
(a) An Entity in E/R Diagram
(b) Relationships in E/R Diagram
E/R Diagram Notation
E/R Diagram Notation
Associative
Relationship
E/R Diagram Notation
Non-Associative
Relationship
Transformation Process
Transformation Process
Transformation Process
Case Study #1
Case Study #1: International College
The admission office handles enrolment, payment, and marketing campaigns to international
students, often through educational agents located overseas. This admission office has an
operational system that maintains all the details of international students enrolled in the College.
Payment details are also handled by this office. Basically, the operational system has the following
features:
▪ Every student details are kept in the database. This includes the courses that the students enroll.
▪ As the College is a multi-campus university, some courses are offered in a different campus. The
admission office handles international students of all campuses.
▪ Some international students coming to the College are handled by an educational agent. This is
particularly common for the first course that a student enrolls in. Subsequent courses are not
normally handled by an agent, because the students themselves deal directly with the College.
▪ International students pay tuition fees several times (normally once every semester) for each
course they are doing.
Case Study #1: International College
The College now requires a data warehouse for analysis
purposes. The analysis is needed for identifying at least the
following questions:
1. How many students come from certain countries?
2. What is the total income for certain postgraduate
courses?
3. How many students are handled by certain agents?
4. How the number of enrolment of courses fluctuates
across the year?
Case Study #1: International College
▪ College Star Schema
➢ Fact:
• Number of Students
• Total Income
➢ Dimensions:
• Country
• Agent
• Course
• Year
Transformation Process
Case Study #2
Case Study #2: Sales
▪ Suppose that we would like to analyze
Total Sales from various point of
views, such as Quarter, Branch, and
Product Category.
Case Study #2: Sales
▪ Sales Star Schema
➢ Fact:
• Total Sales
➢ Dimensions:
• Time
• Branch
• Product Category
Two Column Table
Methodology
Two Column Table Methodology
When creating a star schema, you need to
imagine that the data you want to analyse
consists of two columns.
The first column is the category (e.g. A, C,
D, E), and the second column is the
statistical numerical figure (e.g. B).
The second column (e.g. B) has to be
consistent throughout all the two-column
tables.
One Fact Measurement:
Two Column Table Methodology
▪ Case Study 1: Analysis of Accountants
Suppose the CPA organization would like to analyze its members (i.e.
accountants) in a particular city. Assume that the organization has the full
details of its members.
Education Number of Accountants
Diploma 84953
Bachelor 349203
Higher Degree 98943
Others 2322
Two Column Table Methodology
▪ We can also look at the figures from the gender point of view, like:
▪ Another way to analyze number of accountants is form the type of the accountant job
itself; something like:
▪ Note that the figures are fictitious, and the “Types” of Accountants (indicating different
roles of accountants) are also fictitious.
Gender Number of Accountants
Male 434322
Female 89932
Type Number of Accountants
Government 3843
Private Business 45303
Personal 45930
etc
etc
Two Column Table Methodology
▪ You can further identify other example to analyze number of accountants. In
the above three tables, the first angle to look at the number of accountants
is from the educational background, the last one is from the type of the
accountant itself, whether it is a private business accountant, etc.
▪ As you can see, the second column is CONSISTENTLY UNIFORM. In the
above example, it is number of accountants. The first column changes
depending on from which angle that you want to see.
Two Column Table Methodology
▪ Therefore, in this case study, the star schema could look like the following:
Two Column Table Methodology
The second column in the two-column
tables, which is the numerical fact
measurement (e.g. column B) can
actually be multiple columns (call them:
B1, B2, B3), as long as all of these
columns (e.g. B1, B2, B3) relate to all of
the categories (e.g. A, C, D, E).
Multiple Fact Measurements:
Two Column Table Methodology
▪ Case Study 2: Student Enrollment
The University Administrator(s) needs to keep track of the number of
enrollment for particular unit or campus and the students’ performance
each year in order to maintain the University performance. The head of
admin has assigned you the task of developing a small Data Warehouse in
which to keep track the enrollment and performance statistics.
Two Column Table Methodology
▪ For example:
▪ Another example could be something like this:
Subject Number of Students Total Score
Database 8 539
Java 5 327
SAP 1 63
Network 2 105
Semester Number of Students Total Score
One 9 618
Two 7 416
F1 F2
F1 F2
Two Column Table Methodology
▪ In analyzing number of students (apart from the subject and semester as
shown above), you could also see the number of student from another
angle, for example from the campus and grade:
Campus Number of Students Total Score
Main 9 658
City 5 271
DE 2 105
F1 F2
Two Column Table Methodology
▪ For example:
Grade Number of Students Total Score
HD 3 253
D 4 300
C 4 256
P 2 105
N 3 120
F1 F2
Two Column Table Methodology
▪ The first columns of the above examples are the dimensions, whereas
the other columns that contain the statistical/summarized/aggregated
values is the fact.
▪ In the above example, the fact is then STUDENT_ENROLLMENT_FACT,
and the dimensions are SUBJECT, SEMESTER, GRADE and CAMPUS.
Two Column Table Methodology
▪ The star schema for the STUDENT ENROLLMENT is shown as follows:
End of Week 2 Lecture
Oracle/FIT5195-3-Bridge Table
Week 3 – Bridge Tables
Semester 1, 2020
FIT5195 – Business Intelligence
and Data Warehousing
Developed by:
Agnes Haryanto
Agnes.Haryanto@monash.edu
Soon Lay-Ki
Soon.LayKi@monash.edu
MONASH
INFORMATION
TECHNOLOGY
Agenda
1. Bridge Tables
2. Temporary Tables
1. Temporary Dimension Tables
2. Temporary Tables in the Operational Database
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if
you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
http://flux.qa/1AW6N8
Bridge Tables
Bridge Tables
▪ A bridge table is a table that links between two dimensions; and only one
of these two dimensions are linked to the fact.
➢ As a result, the star schema becomes a snowflake schema.
Bridge Tables
▪ Two reasons on why a dimension cannot be connected directly to the Fact:
▪
a) The Fact table has a fact measure, and the dimension has a key identity. In
order to connect a dimension to the Fact, the dimension’s key identity must
contribute directly to the calculation of the fact measure. Unfortunately, this
cannot happen if the operational database does not have this data.
b) The operational database does not have this data if the relationship between
two entities in the operational database that hold the information about
dimension’s key identity and the intended fact measure is a many-many
relationship.
Bridge Tables
Case Study #1
Case Study #1 – A Product Sales Case Study
▪ A company management team would like to
analyze the statistics of its product sales
history. The analysis is needed to identify
popular products, suppliers supplying those
products, the best time to purchase more
stock, etc.
▪ A small data warehouse is to be built to keep
track of the statistics.
▪ The management is particularly interested in
analyzing the total sales (quantity * price) by
product, customer suburbs, sales time
periods (month and year), and supplier.
Case Study #1 – A Product Sales Case Study
▪ The management is particularly interested in
analyzing the total sales (quantity * price) by
product, customer suburbs, sales time
periods (month and year), and supplier.
Case Study #1 – A Product Sales Case Study
▪ The management is particularly interested in
analyzing the total sales (quantity * price) by
product, customer suburbs, sales time
periods (month and year), and supplier.
▪ Sales Star Schema
➢ Fact:
• Total Sales
➢ Dimensions:
• Product
• Customer locations/suburbs
• Time period
• Supplier
▪ Possible Two-Column Methodology Tables:
Case Study #1 – A Product Sales Case Study
ProductNo TotalSales
A1 $130,000
B2 $15,900
C3 $2,500,000
… …
TimeID TotalSales
201801 $25,000
201802 $4,700
201803 $3,500
… …
Suburb TotalSales
Caulfield $6,500
Chadstone $12,000
Clayton $1,800
… …
(a) Product point of view (b) Time point of view (c) Suburb point of view
Case Study #1 – A Product Sales Case Study
▪ Sales Star Schema
➢ Fact:
• Total Sales
➢ Dimensions:
• Product
• Customer locations/suburbs
• Time period
• Supplier
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
SupplierID TotalSales
S1 $77,000
S2 $5,700
S3 $12,500
… …
Supplier point of view
Case Study #1 – A Product Sales Case Study
SupplierID TotalSales
S1 $77,000
S2 $5,700
S3 $12,500
… …
Supplier point of view
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
SupplierID TotalSales
S1 $77,000
S2 $5,700
S3 $12,500
… …
Case Study #1 – A Product Sales Case Study
Case Study #1 – A Product Sales Case Study
Bridge Table
▪ To create Time Dimension:
– create table TimeDim as
select
distinct to_char(SalesDate, ’YYYYMM’) as TimeID,
to_char(SalesDate, ’YYYY’) as Year,
to_char(SalesDate, ’MM’) as Month
from Sales;
▪ To create Customer Location Dimension:
– create table CustLocDim as
select distinct Suburb, Postcode
from Customer;
Case Study #1 – A Product Sales Case Study
▪ To create Product Dimension:
– create table ProductDim as
select distinct ProductNo, ProductName
from Product;
▪ To create Bridge Table:
– create table ProductSupplierBridge as
select *
from StockSupplier;
▪ To create Supplier Dimension:
– create table SupplierDim as
select SupplierID, Name as SupplierName
from Supplier;
Case Study #1 – A Product Sales Case Study
▪ To create Fact Table:
– create table ProductSalesFact as
Select
to_char(S.SalesDate, ’YYYYMM’) as TimeID,
P.ProductNo,
C.Suburb,
sum(SI.QtySold*P.Price) as TotalSales
from Sales S, Product P, Customer C, SalesItem SI
where S.SalesNo = SI.SalesNo
and SI.ProductNo= P.ProductNo
and C.CustomerID = S.CustomerID
group by
to_char(S.SalesDate, ’YYYYMM’), P.ProductNo, C.Suburb;
Case Study #1 – A Product Sales Case Study
Bridge Tables
Case Study #2
Case Study #2 – A Truck Delivery Case Study
▪ A trucking company is responsible for picking up goods from warehouses of a retail chain
company, and delivering the goods to individual retail stores.
▪ A truck carry goods during a single trip, which is
identified by TripID, and delivers these goods to
multiple stores. Trucks have different capacities
for both the volumes they can hold and the
weights they can carry.
▪ At the moment, a truck makes several trips each
week. An operational database is being used to
keep track the deliveries, including the scheduling
of trucks, which provide timely deliveries to
stores.
Case Study #2 – A Truck Delivery Case Study
▪ A trip may pick up goods from many
warehouses
o i.e. a many-many relationship between
Warehouse and Trip
▪ A trip uses one truck only, and a truck may
have many trips in the history
o i.e. a many-1 relationship between Trip and
Truck
▪ A trip delivers goods (e.g. TVs, fridges, etc)
potentially to several stores
o a many-many relationship between Trip and
Store, which is represented by the Destination
table
▪ Sample data in the operational database:
Case Study #2 – A Truck Delivery Case Study
WarehouseID Location
W1 Warehouse1
W2 Warehouse1
W3 Warehouse1
… …
(d) Truck Table
(b) Trip Table (c) TripFrom Table
TripID Date TotalKm TruckID
Trip1 14-Apr-2018 370 Truck1
Trip2 14-Apr-2018 570 Truck2
Trip3 14-Apr-2018 250 Truck3
Trip4 15-Jul-2018 450 Truck1
… … … …
TripID WarehouseID
Trip1 W1
Trip1 W2
Trip1 W3
Trip2 W1
Trip2 W2
… …
TruckID VolCapacity WeightCategory CostPerKm
Truck1 250 Medium $1.20
Truck2 300 Medium $1.50
Truck3 100 Small $0.80
Truck4 550 Large $2.30
Truck5 650 Large $2.50
… … … …
StoreID StoreName Address
M1 MyStore City Melbourne
M2 MyStore Chaddy Chadstone
M3 MyStore HiPoint High Point
M4 MyStore Donc Doncaster
M5 MyStore North Northland
M6 MyStore South Southland
M7 MyStore East Eastland
M8 MyStore Knox Knox
… …
TripID StoreID
Trip1 M1
Trip1 M2
Trip1 M4
Trip1 M3
Trip1 M8
Trip2 M4
Trip2 M1
Trip2 M2
… …
(a) Warehouse Table
(e) Store Table (f) Destination Table
Case Study #2 – A Truck Delivery Case Study
▪ The management of this trucking company would like to analyze the deliver cost, based on
trucks, time period, and store.
Case Study #2 – A Truck Delivery Case Study
▪ Sales Star Schema
➢ Fact:
• Total Delivery Cost
(distance * cost per kilometre)
➢ Dimensions:
• Truck
• Time period
• Store
Case Study #2 – A Truck Delivery Case Study
▪ From the Truck point of view, Truck1 has two trips (e.g. Trip1 and Trip4), with the total kilometres of
820km (370km + 450km). The cost for Truck1 is $1.20. Hence, calculating the cost for Truck1 is
straightforward. Other trucks can be calculated this way.
▪ From the Period point of view, 14-Apr-2018 has three trips (e.g. Trip1,Trip2, and Trip3). Trip1 (370km) is
delivered by Truck1 which costs $1.20/km. Trip2 and Trip 3, on the same day, can be calculated the same
way. Hence, on 14-Apr-2018, the total cost can be calculated.
▪ From the Store point of view; The cost is calculated
based on Trip, but a trip delivers goods to many
stores. Therefore, the delivery cost for each store
cannot be calculated. The delivery cost is for the trip –
not for the store.
Case Study #2 – A Truck Delivery Case Study
Solution Model 1 – Using a Bridge Table
Case Study #2 – A Truck Delivery Case Study
Solution Model 2 – add a Weight Factor attribute
A weight factor is only needed if we want to estimate
the contribution that a dimension made to the fact
Case Study #2 – A Truck Delivery Case Study
StoreID StoreName Address
M1 MyStore City Melbourne
M2 MyStore Chaddy Chadstone
M3 MyStore HiPoint High Point
M4 MyStore Donc Doncaster
M5 MyStore North Northland
M6 MyStore South Southland
M7 MyStore East Eastland
M8 MyStore Knox Knox
… …
TripID StoreID
Trip1 M1
Trip1 M2
Trip1 M4
Trip1 M3
Trip1 M8
Trip2 M4
Trip2 M1
Trip2 M2
… …
(c) Store Table(b) Bridge Table
(a) Trip Dimension Table
▪ To create Trip Dimension:
– create table TripDim2 as
select T.TripID, T.TripDate, T.TotalKm,
1.0/count(*) as WeightFactor
from Trip T, Destination D
where T.TripID = D.TripID
group by T.tripid, T.tripdate, T.totalkm;
Case Study #2 – A Truck Delivery Case Study
Case Study #2 – A Truck Delivery Case Study
Solution Model 3 – a List Aggregate version
Case Study #2 – A Truck Delivery Case Study
TripID StoreID
Trip1 M1
Trip1 M2
Trip1 M4
Trip1 M3
Trip1 M8
Trip2 M4
Trip2 M1
Trip2 M2
… …
(b) Bridge Table
(a) Trip Dimension Table
listagg (Attr1, ‘_’) within group
(order by Attr1) as ColumnName
▪ To create Trip Dimension:
– create table TripDim3 as
select T.TripID, T.TripDate, T.TotalKm,
1.0/count(D.StoreID) as WeightFactor,
listagg (D.StoreID, ’_’) within group
(order by D.StoreID) as StoreGroupList
from Trip T, Destination D
where T.TripID = D.TripID
group by T.TripID, T.TripDate, T.TotalKm;
Case Study #2 – A Truck Delivery Case Study
▪ Joining based on the StoreGroupList attribute in the Trip dimension table
and the StoreID in the Store dimension table:
– select *
from TripDim3 T, StoreDim3 S
where T.StoreGroupList like ’%’||S.StoreID||’%’;
▪ Without the StoreGroupList attribute in the Trip dimension, we need to join
three tables:
– select *
from TripDim3 T, BridgeTable3 B, StoreDim3 S
where T.TripID = B.TripID
and B.StoreID = S.StoreID;
Case Study #2 – A Truck Delivery Case Study
Bridge Tables
Summary
Bridge Tables – Summary
▪ In principal, a Bridge Table is used:
▪
a) When it is impossible to have a dimension connected directed to the Fact table,
because simply there is no relationship between this dimension and the Fact table
(e.g. in the Product Sales case study, it is impossible to have a direct link from
SupplierDim to ProductSalesFact)
b) When an entity (which will become a dimension) has a many-many relationship with
another entity (dimension) in the E/R schema of the operational database (e.g.
Supplier and Stock has a many-many relationship).
c) When temporality aspect (data history) is maintained in the operational database
and the bridge table can be used to accommodate the dimension that has temporal
attributes (e.g. product supply history is maintained in the second snowflake schema
example).
Bridge Tables – Summary
▪ When a Bridge Table is used in the schema, there are two additional options:
▪
a) A Weight Factor is used to estimate the contribution of a dimension in the calculation
of the fact measure. Because this is only an estimate, a weight factor is option.
b) Every snowflake schema (whether it has Weight Factor or not) can be implemented
in two ways: a List Aggregate version, and a non-List Aggregate version.
Temporary Dimension Tables
43
1. Direct copying from a relational table in the operational database
2. Selecting certain attributes (or records) from a relational table in the
operational database
3. Creating the dimension table from scratch
– Insert records into the newly created dimension table
– Suitable for small dimension table
– Temporary dimension table is needed in circumstances where a new
dimension table cannot be created directly
Ways to Create Dimension Tables
44
Temporary Dimension Tables – Sales Case Study
An E/R Diagram for the
Sales Case Study
A Star Schema for the Sales Case Study
QuarterID – ‘YYYYQ’ is a unique
combination between year and quarter,
where Q is between 1 and 4
45
▪ TimeDIM has to be created manually
▪ Questions:
Can we insert the records into TimeDIM
one by one?
▪ If yes, how many records to be
inserted?
▪ If no, why?
Temporary Dimension Tables – Sales Case Study (cont.)
46
Solution for TimeDIM:
1. Create a temporary Time Dimension table that contains Month and Year
2. Add with a new attribute, called QuarterID.
3. Perform a series of updates to fill in the QuarterID column with the correct
values
4. Create the final TimeDIM table
Temporary Dimension Tables – Sales Case Study (cont.)
47
Temporary Dimension Tables – Sales Case Study (cont.)
1
2
3
Exercise:
Construct the SQL command for Step 4
48
Temporary fact table is required
because TimeDIM is not directly copied
from the operational database
Temporary Fact Tables – Sales Case Study
1
49
Temporary Fact Tables – Sales Case Study (cont.)
2
3
4
50
Temporary Tables in Operational Database
Total_Sales = SUM(TotalPrice)
An E/R Diagram for the
Sales Case Study
A Star Schema for the Sales Case Study
51
Temporary Tables in Operational Database – Case Study
An E/R Diagram for the
Sessional Contract Jobs
52
Temporary Tables in Operational Database – Case Study (cont.)
A Star Schema for the
Sessional Contract Jobs
Last Degree of the
Employee
53
▪ One employee may have multiple
degrees.
▪ Most of the work requires a Bachelor
degree, to allow current Master
students to work as a teaching
assistant or as a programmer
▪ Current PhD students who have a
Master degree often work as a
teaching assistant too
▪ BUT, only the last degree is required
in DigreeDIM table.
Temporary Tables in Operational Database – Case Study (cont.)
54
Temporary Tables – EmployeeTemp Table
Retrieves the last degree of the
employees
55
Sample Records for Employee and Emp_Degree Table
Employee
Emp_Degree
56
RANK() OVER Function
Retrieved for
EmployeeTemp
table.
57
Create Fact Table from EmployeeTemp Table
58
Temporary Dimension table
Needed when it is not possible to create a dimension table directly; need to
apply a series of additional steps when creating a dimension table through the
creation of an intermediate table
Temporary table in operational database
Tables in the operational database are not readily used to create the Fact table.
Need transformation phase, where an intermediate table, called the Temporary
Operational Database Table is created.
Summary
Oracle/FIT5195-4-Snowflake Schema
Week 4 – Snowflake Schema
Semester 1, 2020
FIT5195 – Business Intelligence
and Data Warehousing
Developed by:
Agnes Haryanto
Agnes.Haryanto@monash.edu
Soon Lay-Ki
Soon.LayKi@monash.edu
MONASH
INFORMATION
TECHNOLOGY
Agenda
1. Hierarchies
1. Hierarchy vs. Non-Hierarchy
2. Hierarchy versus Multiple Independent Dimensions
3. Linked Dimensions
4. Hierarchy Design Considerations
2. Determinant Dimensions
1. Determinant vs. Non-Determinant Dimensions
2. Determinant Dimension vs. Pivoted Fact Table
3. Non-Type as a Determinant Dimension
4. Multiple Relationship between a Dimension and the Fact
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if
you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
http://flux.qa/1AW6N8
Hierarchies
Hierarchies
▪ A Hierarchy is formed when a dimension is broken down to two or
more dimensions in a hierarchical manner.
➢ As a result, the star schema becomes a snowflake schema.
(a) A simple Star Schema
(b) A Snowflake with Hierarchy
Hierarchies
(a) A Snowflake with a Bridge Table (b) A Snowflake with Hierarchy
Hierarchies
1. Hierarchy versus Non-Hierarchy Dimensions
2. Hierarchy versus Multiple Independent Dimensions
3. Linked Dimensions
4. Hierarchy Design Considerations
1. Hierarchy vs. Non-Hierarchy
1. Hierarchy vs. Non-Hierarchy
1. Hierarchy vs. Non-Hierarchy
HierarchyNon-Hierarchy
1. Hierarchy vs. Non-Hierarchy
CityID City Name State CountryID
MEL Melbourne Victoria AU
SYD Sydney New South Wales AU
KUL Kuala Lumpur Selangor MA
JNB Johannesburg Johannesburg SA
CampusID CampusName Address Suburb Postcode CityID
CL Clayton Campus Wellington Rd Clayton 3800MEL
CA Caulfield Campus Dandenong Rd Caulfield East 3145MEL
PA Parkville Campus Royal Parade Parkville 3052MEL
SY Sydney Campus Opera Boulevard Sydney 2001SYD
MUM Malaysia Campus Jalan Lagoon Bandar Sunway 47500KUL
MSA South Africa Campus Peter St Johannesburg 1725JNB
(a) Campus Dimension Table
CountryID Country Name
AU Australia
MA Malaysia
SA South Africa
(b) City Dimension Table
(c) Country Dimension Table
1. Hierarchy vs. Non-Hierarchy
(a) Correct Hierarchy (b) A Wrong Hierarchy
1. Hierarchy vs. Non-Hierarchy
1. One table vs. many tables
The consequence is in the join query processing when producing reports.
2. Normalized vs. Un-normalized
With the hierarchy option, the tables are normalized, which follows the relational
model. In contrast, the non-hierarchy option, the table (e.g. the Campus
Dimension table) is unnormalized, which is basically 1NF with visible replication
on information, which is prone to anomalies (e.g. insert, update, and delete
anomalies).
3. Drilling down and rolling up
The hierarchy model is not about drilling down information exploration.
1. Hierarchy vs. Non-Hierarchy – Summary
▪ From the query point of view, both versions need two queries.
▪ From the query processing point of view, the non-hierarchy version uses
one join operation only, because it only needs to join the fact and one
dimension.
▪ From the conceptual point of view, the hierarchy model does not actually
offer a better roll up or drill down features.
2. Hierarchy vs. Multiple Independent Dimensions
2. Hierarchy vs. Multiple Independent Dimensions
(a) Separate Dimensions (b) Combined Dimensions
(c) Hierarchy Dimensions
2.1. Separate vs Combined Dimension
(a) Separate Dimensions (b) Combined Dimensions
2.1. Separate vs Combined Dimension
(a) Separate Dimension Model – Campus Dimension Table
(b) Separate Dimension Model –
Country Dimension Table
(c) Separate Dimension Model – Fact Table
2.1. Separate vs Combined Dimension
(d) Combined Dimension Model – Campus Dimension Table
(e) Combined Dimension Model
– Fact Table
2.1. Separate vs Combined Dimension
(c) Separate Dimension Model – Fact Table (e) Combined Dimension Model – Fact Table
2.1. Separate vs Combined Dimension
(a) Separate Dimension Model – Campus Dimension Table (b) Separate Dimension Model –
Country Dimension Table
(d) Combined Dimension Model – Campus Dimension Table
2.1. Separate vs Combined Dimension
(a) Separate Dimensions (b) Combined Dimensions
2.2. Combined Dimension vs. Hierarchy
(b) Combined Dimensions (c) Hierarchy Dimensions
2.2. Combined Dimension vs. Hierarchy
(a) Hierarchy Model – Campus Dimension Table
(c) Hierarchy Model – Fact Table
(b) Hierarchy Model – Country Dimension Table
2.2. Combined Dimension vs. Hierarchy
▪ Reasons on why the Hierarchy model is not ideal:
➢ An efficient query processing to query the fact and dimensions,
➢ The identifier of the Campus Dimension (e.g. CampusID) already covers the
child dimension (e.g. Country Dimension), and
➢ Campus and Country information is often regarded as one entity; at least in this
case study.
(b) Combined Dimensions (c) Hierarchy Dimensions
3. Linked Dimensions
3. Linked Dimensions
(a) Linked Dimension Model – Campus Dimension Table
(c) Linked Dimension Model – Fact Table
(b) Linked Dimension Model – Country Dimension Table
4. Hierarchy Design Considerations
Hierarchy – Summary
▪ A dimension hierarchy is connecting two or more dimensions in a hierarchical
manner, using a many-1 relationship.
➢ As a result, the dimensions in a hierarchy are normalized, in 3NF, using the
context of Relational Database Design.
▪ Compare and contrasts five different models:
1. Separate Dimension model
2. Combined Dimension model
3. Hierarchy model
4. Linked Dimension model
5. Hierarchy Design Considerations
Determinant Dimensions
31
Dimensions and Fact Tables – Recap
A Star Schema for the Sales Case Study
32
Example #1:
Querying Fact Table
33
Example #2:
Querying Fact Table (cont.)
34
A star schema may have a special dimension
all data retrieval from this star schema must use this dimension
else, the retrieved data will become meaningless.
Determinant Dimensions
Determinant
_DIM
Determinant_
Att
..
..
Special dimension
Key identifier
35
Introducing a Determinant Dimension – Petrol Station Case Study
Sample operational database of a company about petrol prices from all petrols
stations in Victoria:
36
The requirements for the data warehouse
are to answer the questions related to
(a) average petrol price,
(b) minimum petrol price, and
(c) maximum petrol price
Petrol Station – Star Schema Version-1
Chosen Dimensions:
(i) day of week,
(ii) suburb, and
(iii) petrol company.
Two-column Table for Petrol Prices based on Day of Week
Category Fact Measures
37
Minimum price for which petrol type?
Petrol Station – Star Schema Version-1 (cont.)
38
▪ Not all petrol stations sell all types of petrol, e.g. Premium 98.
▪ Calculating the average price for Premium 98 on Monday cannot be:
Total_Petrol_Price_P98 / Number_of_Petrol_Station
▪ The number of petrol stations is different for different petrol type!
Petrol Station – Star Schema Version-1 (cont.)
39
▪ Solution: Split the Number of Petrol Station column into six columns – one
for each petrol type.
▪ Question: How many columns in total?
Petrol Station – Star Schema Version-1 (cont.)
Two-column
Table for Petrol
Prices based
on Suburb
Two-column Table for Petrol Prices based on Company
40
Petrol Station – Star Schema Version-1 (cont.)
41
Petrol Station – Star Schema Version-2
42
Petrol Station – Star Schema Version-2
Min. price for
which petrol
type??
43
select DayofWeek,
Min_Petrol_Price
from PetrolFact
where DayofWeek = ’Monday’
and PetrolType = ’Unleaded’;
Petrol Station – Star Schema Version-2
select DayofWeek, PetrolType,
Min_Petrol_Price
from PetrolFact
where DayofWeek = ’Monday’;
44
Petrol Station – Star Schema Version-2
Determinant Dimension uses dotted box in the star schema.
The new star schema with a Determinant Dimension for Petrol Case Study
Determinant vs. Non-
Determinant
Dimensions
46
The Olympic Games Case Study
▪ The Olympic Games committee maintains an operational database that
stores all matches, games, as well as the medal winners of the Olympic
Games over the years
▪ Let’s build a data warehouse to analyse the medal counts, by each country,
sport, and at which Olympic Games
Determinant vs. Non-Determinant Dimensions
47
The Olympic Games Case Study – The Star Schemas
Version-1
Version-2
The questions are:
1. What is the difference
between these two
versions of star schema?
2. Is Medal Type Dimension
a Determinant
Dimension?
48
The Olympic Games Case Study – The Star Schema Version-1
Two-column tables
(pivot tables) for each
dimension category in
star schema version-1
49
The Olympic Games Case Study – The Star Schema Version-2
50
▪ Is MedalTypeDIM a determinant dimension?
– To retrieve the data from version-2 star schema, must we have the
information from Medal Type Dimension?
Determinant vs. Non-Determinant Dimensions
Version-2
COUNT = total medals from
other dimensions, without the
medal type dimension)
51
Fact Table for Version-1
The Olympic Games Case Study – The Star Schema Version-1 vs Version-2
Fact Table for Version-2
52
The Olympic Games Case Study – The Star Schema Version-1 vs Version-2
Differenc
es
Version 1 Version 2 (with
MedalTypeDIM)
Storage Lower with only three records Nine records
Modelling The schema looks more
complex and crowded; but the
storage cost is lower
More concise and compact with
less number of measures in the
fact, easier to understand
Query
Processin
g
Less join processing between
dimension tables and the fact
table.
Requires additional join with
MedalTypeDIM
Determinant
Dimensions vs.
Pivoted Fact Table
54
The PTE Academic Test Case Study
▪ PTE Test consists of four main components: Listening, Reading, Speaking, and
Writing
▪ There are many test venues (all major cities in Australia, and overseas as well), each
country may have a different price setting
▪ A student taking a PTE test will have one score for each of these four components,
as well as one overall score
▪ The score is a numerical score, ranging from 10 to 90
▪ The results will normally be released five days after the actual test date
▪ A student may take a PTE Academic Test several times, in order to improve the
score
▪ A student is identified by his/her RegistrationID, which stays the same student stays
the same
Determinant Dimensions vs. Pivoted Fact Table
55
ERD for PTE Academic Test
56
The PTE Academic Test Case Study – Sample Records for 8 Students
57
The PTE Academic Test Case Study – Sample Records for 8 Students
58
The PTE Academic Test Case Study – Star Schema with Determinant Dim.
59
To implement the star schema in SQL:
create table CountryVenueDim
as
select distinct C.CountryCode, C.CountryName, T.TestPrice
from Test_Venue T, Country C
where T.CountryCode = C.CountryCode;
create table CitizenshipDim
as
select distinct C.CountryCode as Citizenship, C.CountryName
from Student S, Country C
where S.Citizenship = C.CountryCode;
create table YearDim
as
select distinct to_char(TestDate, ’YYYY’) as Year from Test;
The PTE Academic Test – Star Schema with Determinant Dim. (cont.)
60
create table GradeDim (
Grade varchar2(3),
Description varchar2(20),
MinScore number,
MaxScore number
);
insert into GradeDim values (’4.5’, ’Functional’, 30, 35);
insert into GradeDim values (’5’, ’Vocational’, 36, 49);
insert into GradeDim values (’6’, ’Competent’, 50, 64);
insert into GradeDim values (’7’, ’Proficient’, 65, 78);
insert into GradeDim values (’8-9’, ’Superior’, 79, 90);
The PTE Academic Test – Star Schema with Determinant Dim. (cont.)
Superior
61
create table TestComponentDim (
TestComponent varchar2(20)
);
insert into TestComponentDim values
(’Listening’);
insert into TestComponentDim values
(’Reading’);
insert into TestComponentDim values
(’Writing’);
insert into TestComponentDim values
(’Speaking’);
insert into TestComponentDim values
(’Overall’);
The PTE Academic Test – Star Schema with Determinant Dim. (cont.)
Create temporary fact table:
create table TempFact
as
select TV.CountryCode, S.Citizenship,
to_char(T.TestDate, ’YYYY’) as Year,
TR.ListeningScore, TR.ReadingScore,
TR.WritingScore, TR.SpeakingScore,
TR.OverallScore, TR.RegistrationID
from Test_Venue TV, Test T, Student S,
Test_Result TR
where TV.VenueID = T.VenueID
and T.TestNo = TR.TestNo
and TR.RegistrationID = S.RegistrationID;
62
alter table TempFact
add (
GradeOverall varchar2(3),
GradeListening varchar2(3),
GradeReading varchar2(3),
GradeWriting varchar2(3),
GradeSpeaking varchar2(3)
);
The PTE Academic Test – Star Schema with Determinant Dim. (cont.)
update TempFact
set GradeOverall = (
case
when OverallScore >= 30 and
OverallScore <= 35 then ’4.5’ when OverallScore >= 36 and
OverallScore <= 49 then ’5’ when OverallScore >= 50 and
OverallScore <= 64 then ’6’ when OverallScore >= 65 and
OverallScore <= 78 then ’7’ when OverallScore >= 79 and
OverallScore <= 90 then ’8-9’ end); Repeat for GradeListening, GradeReading, GradeWriting, and GradeSpeaking Superior 63 ▪ The TempFact table has the correct grades for each student. ▪ The problem now is to breakdown the grades for a student into multiple test components – Create another temporary fact for each of the test components, eg. a temporary fact called ’OverallFact’ for the Overall Score, and so on so forth create table OverallFact as select CountryCode, Citizenship, Year, GradeOverall As Grade, ’Overall’ as TestComponent, count(RegistrationID) as Total_Students_Overall from TempFact group by CountryCode, Citizenship, Year, GradeOverall, ’Overall’; The PTE Academic Test - Star Schema with Determinant Dim. (cont.) Repeat for GradeListening, GradeReading, GradeWriting, and GradeSpeaking 64 ▪ There were eleven students taking the test which are now grouped into nine records in Table OverallFact The PTE Academic Test - Star Schema with Determinant Dim. (cont.) 65 ▪ There are also three other fact tables: ReadingFact, WritingFact, and SpeakingFact, with identical table structures ▪ These five fact tables can now be ’combined’ to form one final fact table using a union operator: create table FinalFact as select CountryCode, Citizenship, Year, Grade, TestComponent, Total_Students_Overall as Total_Students from OverallFact union select * from ListeningFact union select * from ReadingFact union select * from WritingFact union select * from SpeakingFact; The PTE Academic Test - Star Schema with Determinant Dim. (cont.) 66 ▪ The final fact table contains 45 records ▪ Conclusion: – When the input record in the operational database contains the scores of all the test components as one record, it will be challenging to break down the records into multiple records in the final fact table, because there is a Test Component dimension. The PTE Academic Test - Star Schema with Determinant Dim. (cont.) 67 The PTE Academic Test - Star Schema without Determinant Dim. ● Fact measure for all the test components ● This means that the fact table is a Pivoted Fact Table ○ Shifts the Test Component into the fact measure; ○ The Determinant Dimension - TestComponentDim is then removed from the star schema 68 ▪ How many records do we have in fact table with five fact measures? ▪ However, since our test result data is incomplete, we will have problems with the final fact, because in the non-determinant dimension version, we need to keep track the zeroes – Even if the data is the test result is a lot, but still does not cover all possible combinations from the dimension! The PTE Academic Test - Star Schema without Determinant Dim. (cont.) Answer: 60 records in the final fact table = 2 countries of test venues * 6 countries of citizenship * 1 year * 5 grades 69 ▪ ‘Superior’ grade is only available for Listening and Reading scores ▪ Using the normal fact table, there will not be entry in the fact table for Superior-Speaking, and for Superior-Writing; ▪ There will not be zero value in TotalStudents, where the Grade column is ’Superior’ and the Test Component column is either ’Speaking’ or ’Writing’ → less than 60 rows of records in the fact table The PTE Academic Test - Star Schema without Determinant Dim. (cont.) Superior 70 Solution: Step1 Get all possible combinations from all dimensions using Cartesian product between all dimensions: create table AllDimensions as select CO.CountryCode, CI.Citizenship, Y.Year, G.Grade from CountryVenueDim CO,CitizenshipDim CI, YearDim Y, GradeDim G; The PTE Academic Test - Star Schema without Determinant Dim. (cont.) 71 Solution: Step 2 Re-create five temporary fact tables, one for each test component, using Outer Join operation between AllDimensions table and each of the temporary fact tables created in the previous section (see slide #35). The PTE Academic Test - Star Schema without Determinant Dim. (cont.) create table OverallFactNew as select A.CountryCode, A.Citizenship, A.Year, A.Grade, nvl(O.Total_Students_Overall, 0) as Total_Students_Overall from AllDimensions A, OverallFact O where A.CountryCode = O.CountryCode(+) and A.Citizenship = O.Citizenship(+) and A.Year = O.Year(+) and A.Grade = O.Grade(+); create table ListeningFactNew as select A.CountryCode, A.Citizenship, A.Year, A.Grade, nvl(O.Total_Students_Listening, 0) as Total_Students_Listening from AllDimensions A, ListeningFact O where A.CountryCode = O.CountryCode(+) and A.Citizenship = O.Citizenship(+) and A.Year = O.Year(+) and A.Grade = O.Grade(+); create table ReadingFactNew as select ...; create table WritingFactNew as select ...; create table SpeakingFactNew as select ...; 72 Solution: Step 3 Create the final fact table using the join operation The PTE Academic Test - Star Schema without Determinant Dim. (cont.) create table FinalFact2 as select O.CountryCode, O.Citizenship, O.Year, O.Grade, O.Total_Students_Overall, L.Total_Students_Listening, R.Total_Students_Reading, W.Total_Students_Writing, S.Total_Students_Speaking from OverallFactNew O, ListeningFactNew L, ReadingFactNew R, WritingFactNew W, SpeakingFactNew S where O.CountryCode = L.CountryCode and L.CountryCode = R.CountryCode and R.CountryCode = W.CountryCode and W.CountryCode = S.CountryCode and O.Citizenship = L.Citizenship and L.Citizenship = R.Citizenship and R.Citizenship = W.Citizenship and W.Citizenship = S.Citizenship and O.Year = L.Year and L.Year = R.Year and R.Year = W.Year and W.Year = S.Year and O.Grade = L.Grade and L.Grade = R.Grade and R.Grade = W.Grade and W.Grade = S.Grade; 73 Solution: Step 4 Because our sample data is small, there are many records with zeroes values in the Total Students columns. Delete records where Total Students in all of the five test components are equal to zero. The PTE Academic Test - Star Schema without Determinant Dim. (cont.) delete from FinalFact2 where Total_Students_Overall = 0 and Total_Students_Listening = 0 and Total_Students_Reading = 0 and Total_Students_Writing = 0 and Total_Students_Speaking = 0; Non-Type as a Determinant Dimension 75 ▪ Type Dimension maybe a candidate for a Determinant Dimension: – PetroTypeDim is a Determinant Dimension in the Petrol Price case study, – MedalTypeDim in the Olympic Games case study is a not a Determinant Dimension ▪ Type Dimension is a good candidate for a Determinant Dimension, but needs further examination to check if it is really a Determinant Dimension ▪ Other dimensions - non-type dimension can potentially be a Determinant Dimension Non-Type as a Determinant Dimension 76 ▪ The Class Type Dimension in this example is not a Determinant Dimension because we could get Number of Students without referring to any of the class types. ▪ E.g. retrieving number of students in the Science Faculty last year is certainly a useful information, even without Class Type information. Non-Type as a Determinant Dimension – University Enrolment Case Study Type Dimension 77 Total student of Science Faculty Non-Type as a Determinant Dimension – University Enrolment Case Study Total student of Science Faculty last year vs. The information is dependent on the YearDIM - determinant dimension 78 Converting a Determinant Dimension to a Pivoted Fact Table Non-Type as a Determinant Dimension – University Enrolment Case Study 79 Question 1: What if the number of records in the Determinant Dimension is huge? – Number of fact measures increased to 20, 50? – May not be practical Solution: Enforce Determinant Dimension through the User-Interface level ▪ Users are forced to enter/choose a criteria from the Determinant Dimension. ▪ In filling a web-based form, this enforcement is normally indicated by a star (*) next to the text entry to indicate that this text entry must be filled in, before the form can proceed. Pivoted Fact Table for University Enrolment Case Study 80 Question 2: What if there are a lot of attributes in the Determinant Dimension? – If we shift the Determinant Dimension to the fact measures, only the key identifier is kept in the fact measure. – All other attributes from the Determinant Dimension will be lost from the star schema. This issue will be discussed in the next section. Pivoted Fact Table for University Enrolment Case Study (cont.) Multiple Relationship between a Dimension and the Fact 82 ▪ Private Taxi Case Study – The dimension (or a Determinant Dimension) to be shifted to the fact measure (e.g. Pivoted Fact Table) has many attributes. – If only the key identifier of the dimension (or the Determinant Dimension) is used as the fact measure, all other attributes of the dimension (or the Determinant Dimension) are no longer exists in the star schema Multiple Relationship between a Dimension and the Fact 83 ▪ Assume that – The business rules stated that there is no Determinant Dimension. Even without a Determinant Dimension, we could potentially shift a dimension to become fact measures – The company is a small private taxi company, has only five cars ▪ In the fact table, instead of having three fact measures, for each fact measure, we could replicate five times – one for each car. Private Taxi Case Study Replicate for each car ===> 15 fact
measures
Remove this
CarDIM?
84
▪ CarDIM should not be removed
– Other details of the car are
included in the data retrieval
▪ Because Car Dimension is not a
Determinant Dimension, we can
simply keep the original star
schema
– Include only CarNo in the fact
table
– Use join operation to get more
details about car in query
Private Taxi Case Study (cont.)
select F.MonthYear, D.Make,
D.Model,
sum(F.Total_Kilometers)
from PrivateTaxiFact1 F,
CarDim D
where F.CarNo = D.CarNo
and F.CarNo = 5
and F.MonthYear = ’1908’
group by F.MonthYear, D.Make,
D.Model;
85
Insist that fact measures are broken into five cars
▪ Need to have five Car Dimension’s key
identifier in the fact
Private Taxi Case Study (cont.)
To retrieve Total Kilometers of five cars:
select F.MonthYear, D1.CarMake, D1.Year,
D2.CarMake, D2.Year, D3.CarMake, D3.Year,
D4.CarMake, D4.Year, D5.CarMake, D5.Year,
sum(Total_Kilometers)
from PrivateTaxiFact2 F, CarDim D1, CarDim
D2, CarDim D3, CarDim D4, CarDim D5
where F.CarNo1 = D1.CarNo
and F.CarNo2 = D2.CarNo
and F.CarNo3 = D3.CarNo
and F.CarNo4 = D4.CarNo
and F.CarNo5 = D5CarNo
and MonthYear = ’1908’
group by F.MonthYear, D1.CarMake, D1.Year,
D2.CarMake, D2.Year, D3.CarMake, D3.Year,
D4.CarMake, D4.Year, D5.CarMake, D5.Year;
86
▪ Although conceptually the dimension is already shifted to the fact measure,
we still need to keep the original dimension in the star schema.
▪ This is the only way to preserve all other attributes of that dimension.
▪ This relationship replication is applicable to shifting a Determinant
Dimension to the fact measure.
Multiple Relationship between a Dimension and the Fact
87
Two alternatives to enforce the implementation of a star schema that features a
Determinant Dimension
1. Enforce through the user-interface, whereby the user has to fill in or to
choose the Determinant Attribute during the search.
2. Shift the Determinant Dimension into a Pivoted Fact Table, whereby the key
identifier of the Determinant Dimension is incorporated into the fact
measure.
Determinant Dimension – Conclusion
Oracle/FIT5195-5-MoreComplexProcesses_update (1)
Week 5 – More Complex Processes
Semester 1, 2020
FIT5195 – Business Intelligence
and Data Warehousing
Developed by:
Agnes Haryanto
Agnes.Haryanto@monash.edu
Soon Lay-Ki
Soon.LayKi@monash.edu
MONASH
INFORMATION
TECHNOLOGY
Agenda
1. More Complex Processes
1. Use of count Function
2. Average in the Fact
3. Outer Join
2. One Attribute Dimensions
Using FLUX
1. Visit http://flux.qa/ on your internet enabled device
2. Log in using your Monash account (not required if
you are already logged in to Monash)
3. Click on the “+” to join audience
4. Enter the Audience Code:
• Caulfield – 3GANT7
• Fully Flex – 39WRG8
• Malaysia – VTVPLW
5. Select FIT5195 in the Active Presentation menu
6. Answer questions when they pop up
http://flux.qa/1AW6N8
More Complex Processes
in Creating Fact
Recall – Star Schema Components
▪ There are Three main components of the Star Schema:
1. Facts
2. Dimensions
3. Attributes
Recall – Fact
▪ A Fact Table consists of key attributes from each dimension,
and fact measures.
▪ A Fact Table is created by a join operation, that joins several
tables from the operational database.
▪ Fact tables are created either through TempFact or directly
retrieval from the tables in the operational database.
▪ The fact measure itself is an aggregated value.
➢ In the SQL command, the fact measure attribute in the Fact
Table is created using an aggregate function, such as count
or sum, and the group by operation.
Use of count Function
▪ The count function is one of the most common aggregate functions used to
create the fact measures.
▪ In SQL, there are three ways of using the count function:
1. count(*) – counts number of records in the query result
2. count(attribute) – counts number of records of that attribute excluding
null values
3. count(distinct attribute) – counts number of unique values in that
mentioned attribute
Case Study #1: Mobile Apps Repositories
Monalisa University is an international university. It has an online mobile app store: Monalisa App Store.
The Monalisa App Store allows students from any university in the world to publish their applications and
receive feedbacks. This app store is considered as a research environment where applications developed
during researches and studies can be tested and used by real users. The applications are free and open
source.
Basically, the operational system has the following features:
▪ Every user details are kept in the database, which includes the universities that the students enroll.
▪ Users can publish their apps or download other apps.
▪ Users can give feedbacks and ratings to other apps.
▪ The download and feedback statistics are stored in the database.
▪ Apps are organized into different categories; and many authors may have more than one app in
different categories.
Case Study #1: Mobile Apps Repositories
▪ A data warehouse is needed
analyze the ratings and feedbacks
by different authors and apps, so
that the staff at Monalisa University
can connect with the talented
authors and send them awards
annually.
▪ The author with the highest
average apps rating will be named
author of the year, and will receive
an award.
Case Study #1: Mobile Apps Repositories
Case Study #1: Mobile Apps Repositories
create table CategoryDim as
select * from Category;
Case Study #1: Mobile Apps Repositories
create table UniversityDim as
select distinct UniversityID,
UniversityName
from University;
Case Study #1: Mobile Apps Repositories
create table LocationDim as
select distinct
Country || City as LocationID,
City,
Country
from University;
Case Study #1: Mobile Apps Repositories
create table TimeDim as
select distinct
to_char(DownloadDate, ’YYYYMM’) as TimeID,
to_char(DownloadDate, ’MM’) as Month,
to_char(DownloadDate, ’YYYY’) as Year
from Download;
Case Study #1: Mobile Apps Repositories
Case Study #1: Mobile Apps Repositories
create table AppsDownloadFact as
select
to_char(D.DownloadDate, ’YYYYMM’) as DownloadMonth,
U.Country || U.City as LocationID,
A.CategoryID,
A.ApplicationID,
U.UniversityID,
count(*) as TotalDownloads
from University U, App_User R, Download D, Application A
where
U.UniversityID = R.UniversityID and
R.UserID = D.DownloaderID and
D.ApplicationID = A.ApplicationID
group by
to_char(D.DownloadDate, ’YYYYMM’),
U.Country || U.City,
A.CategoryID,
A.ApplicationID,
U.UniversityID;
Case Study #1: Mobile Apps Repositories
Case Study #1: Mobile Apps Repositories
create table TimeDim as
select distinct
to_char(CreationDate, ’YYYYMM’) as TimeID,
to_char(CreationDate, ’MM’) as Month,
to_char(CreationDate, ’YYYY’) as Year
from Application;
Case Study #1: Mobile Apps Repositories
create table AppsFact as
select
to_char(A.CreationDate, ’YYYYMM’) as CreationMonth,
U.Country || U.City as LocationID,
A.CategoryID,
A.ApplicationID,
U.UniversityID,
count(distinct ApplicationID) as TotalApps
from University U, App_User R, Download D, Application A
where
U.UniversityID = R.UniversityID and
R.UserID = D.DownloaderID and
D.ApplicationID = A.ApplicationID
group by
to_char(A.CreationDate, ’YYYYMM’),
U.Country || U.City,
A.CategoryID,
A.ApplicationID,
U.UniversityID;
Average in the Fact
Average in the Fact
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
The operational database
contains:
• 9 records of Semester one
• 7 records of Semester two
Average in the Fact
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
The operational database
contains:
• 9 records of Semester one
• 7 records of Semester two
• 8 records of Database Unit
(6 Semester one and 2
Semester two)
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
Average Score for the Database Unit in:
• Semester one: (81+41+74+85+87+75)/6 = 73.833
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
Average Score for the Database Unit in:
• Semester one: (81+41+74+85+87+75)/6 = 73.833
• Semester two: (64+32)/2 = 48
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
Average Score for the Database Unit in:
• Semester one: (81+41+74+85+87+75)/6 = 73.833
• Semester two: (64+32)/2 = 48
These are actually
incorrect!
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
The SQL command:
select avg(Average_Score)
from EnrolmentFact
where UnitCode = ’IT001’;
Query: Calculate Average Score for the Database Unit:
• (73.833 + 48) / 2 = 60.9165
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database
Query: Calculate Average Score for the Database Unit.
Calculation using Fact:
• (73.833 + 48) / 2 = 60.9165
Calculation based on the Operational Database:
• (81+41+74+85+87+75+64+32) / 8 =
539 / 8 = 67.375
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database
Query: Calculate Average Score for the Java Unit in both Semesters.
Calculation using Fact:
• (56+71.667) / 2 = 63.833
Calculation based on the Operational Database:
• (65+47+78+73+64) / 5 = 65.4
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database
Query: Calculate Average Score for Semester One.
Calculation using Fact:
• (73.833+56+63) / 3 = 64.287
Calculation based on the Operational Database:
• (81+41+74+85+87+75+65+47+63) / 9 = 68.667
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Unit Code Semester Average_Score
IT001 1 73.833
IT001 2 48
IT002 1 56
IT002 2 71.667
IT003 1 63
IT004 2 52.5
(a) Fact
(b) Operational Database
Query: Calculate Average Score for Semester Two.
Calculation using Fact:
• (48+71.667+52.5) / 3 = 57.389
Calculation based on the Operational Database:
• (64+32+78+73+64+53+53) / 7 = 59.4286
Unit Code Unit Title Semester Student First Name Score
IT001 Database 1 Mirriam 81
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
Average in the Fact
Average in the Fact
Unit Code Semester Total_Score Number_of_Students
IT001 1 443 6
IT001 2 96 2
IT002 1 112 2
IT002 2 215 3
IT003 1 63 1
IT004 2 105 2
(a) Fact Version 2
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(b) Subject Dimension
(c) Semester Dimension
Average in the Fact
Unit Code Semester Total_Score Number_of_Students
IT001 1 443 6
IT001 2 96 2
IT002 1 112 2
IT002 2 215 3
IT003 1 63 1
IT004 2 105 2
(a) Fact Version 2
The SQL command:
select sum(Total_Score)/
sum(Number_of_Students)
as Average_Score
from EnrolmentFact2
where UnitCode = ’IT001’;
Query: Calculate Average Score for
the Database Unit:
• (443+96)/(6+2) = 67.375
Average in the Fact
▪ The problem of Average in the Fact is known as the Average of an Average
problem.
➢ This problem is well known in Mathematics and Statistics.
➢ Average of an average will simply produce an incorrect average result.
▪ Hence, it is not desirable to have an average measure in the fact.
➢ Exceptional case: when the analysis ALWAYS uses all the dimensions (e.g.
Determinant Dimensions).
Min & Max in the Fact
▪ If Average should not be used in the fact, how about Min or Max?
➢ Yes, we can.
➢ Because Max of Max is always a global Max, and Min of Min is always a global
Min.
Ucode Semester Min_Score Max_Score
IT001 1 41 87
IT001 2 32 64
IT002 1 47 65
IT002 2 64 78
IT003 1 63 63
IT004 2 52 53
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
Min & Max in the Fact
▪ Query: Find the Maximum Score of Database Unit.
• Max of {87, 64} is 87.
▪ The SQL command:
select max(Max_Score) from EnrolmentFact
where UnitCode = ’IT001’;
Ucode Semester Min_Score Max_Score
IT001 1 41 87
IT001 2 32 64
IT002 1 47 65
IT002 2 64 78
IT003 1 63 63
IT004 2 52 53
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
Min & Max in the Fact
▪ Query: Find the Minimum Score of Database Unit.
• Min of {41, 32} is 32.
▪ The SQL command:
select min(Min_Score) from EnrolmentFact
where UnitCode = ’IT001’;
Ucode Semester Min_Score Max_Score
IT001 1 41 87
IT001 2 32 64
IT002 1 47 65
IT002 2 64 78
IT003 1 63 63
IT004 2 52 53
Unit Code Unit Title
IT001 Database
IT002 Java
IT003 SAP
IT004 Network
Semester
1
2
(a) Fact
(b) Subject Dimension
(c) Semester Dimension
Average in the Fact – Conclusion
▪ Average in the Fact is not desirable, although technically it satisfies the two
criteria of the fact (e.g. must be a numerical and aggregate value).
▪ Min and Max in the Fact can still be used, since Min Score and Max Score are
valid fact measures (e.g. they are numerical and aggregated values).
▪ In general, count and sum are more common.
Outer Join
Case Study #2: Employment Agency
The Employment Agency has a file of candidates who are willing to work.
▪ They record the candidates number, name, contact address, contact phone number and
maximum hours the candidate is available per week.
▪ Each candidate may have several qualifications.
▪ The Agency uses special codes to record the candidate’s qualification for a job opening.
▪ As well as recording the code, the Agency also records the experience of the candidate in
each qualification area, expressed as the number of months experience.
Case Study #2: Employment Agency
The Agency also has a list of companies that request temporaries.
▪ Each company is assigned a company number as an identifier.
▪ The company name, type of business and principal contact for employment placements is
also recorded.
▪ Each time a company requests a temporary employee, the Agency makes an entry in the
(job) openings file. This file contains an opening number, the company requesting an
employee, required qualification, starting date, anticipated ending date, and hourly pay.
▪ Each opening requires only one specific or main qualification.
▪ The Agency may be able to fill the opening from the staff on its books, however in some
circumstances it cannot fill the request.
Case Study #2: Employment Agency
▪ When a candidate matches the qualification, and is available, he/she is given the job, and
an entry is made in the placement record folder.
➢ This folder contains an opening number, candidate number, actual start date, total
hours worked to date, and end date if the placement is completed.
➢ The placement record folder is used by the Agency as a source of placement
histories for it’s various temporaries.
Case Study #2: Employment Agency
▪ Note:
➢ Not all Openings have
Placements, but each Placement
comes from the Opening.
➢ A Candidate may have several
Placements.
➢ A Company may open several
Openings.
➢ Each Opening has one main
qualification requirement.
Case Study #2: Employment Agency
▪ A data warehouse to analyze
the total openings and total
placements by job durations,
qualification requirements, and
placement starting month.
▪ Several types of job durations:
➢ Short-term (<10 days) ➢ Medium-term (10 to 30 days) ➢ Long-term (>30 days)
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
▪ Several types of job durations:
➢ Short-term (<10 days) ➢ Medium-term (10 to 30 days) ➢ Long-term (>30 days)
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
Case Study #2: Employment Agency
▪ Star Schema
➢ Fact:
• Total Opening
• Total Placement
➢ Dimensions:
• Duration
• Qualification
• Month
1. How many short-term job openings?
2. How many jobs requiring certain qualifications?
3. How many people started working in a certain
month?
Case Study #2: Employment Agency
create table QualificationDim as
select * from Qualification;
Case Study #2: Employment Agency
create table MonthDim as
select distinct
to_char(ActualStartDate, ’Month’)
as MonthName
from Placement;
Case Study #2: Employment Agency
create table DurationDim
(DurationID number,
DurationDesc varchar2(20));
insert into DurationDim values (1, ’Short-Term’);
insert into DurationDim values (2, ’Medium-Term’);
insert into DurationDim values (3, ’Long-Term’);
Case Study #2: Employment Agency
create table TempFact as
select O.QCode,
O.StartDate,
O.EndDate,
to_char(P.ActualStartDate, ’Month’)
as MonthName,
O.OpenNo,
P.CandNo
from Opening O, Placement P
where O.OpenNo = P.OpenNo (+);
Case Study #2: Employment Agency
create table TempFact as
select O.QCode,
O.StartDate,
O.EndDate,
to_char(P.ActualStartDate, ’Month’)
as MonthName,
O.OpenNo,
P.CandNo
from Opening O, Placement P
where O.OpenNo = P.OpenNo (+);
Left Outer Join
Case Study #2: Employment Agency
alter table TempFact
add (DurationID number);
update TempFact
set DurationID = 1
where EndDate – StartDate < 10; update TempFact set DurationID = 2 where EndDate - StartDate >= 10
and EndDate – StartDate <=30; update TempFact Set DurationID = 3 where EndDate - StartDate > 30;
Case Study #2: Employment Agency
create table AgencyFact as
select
QCode, DurationID, MonthName,
count(OpenNo) as TotalOpening,
count(CandNo) as TotalPlacement
from TempFact
group by QCode, DurationID, MonthName;
One-Attribute
Dimensions
61
One-attribute dimensions
One-attribute dimensions
● Dimensions with only one attribute, i.e. key attribute
62
Column-based Solution in the Fact
● Medal Type Dimension table seems unnecessary, because Medal Type also
exists in the fact table
● Hence, there is no need to reference to the Medal Type Dimension
63
Column-based Solution in the Fact (cont.)
Column-based solution
● Convert the dimension attribute into
new columns (fact measures) in the
fact table
● Only applicable for dimensions with
low cardinality, because it is not
practical to create a new column in
the fact table for each record of the
dimension
64
Row-based Solution in the Fact
● What happens if the one-attribute dimension has a lot of records?
● The solution:
○ Remove the dimension and keep the attribute of the dimension as a dimension-
less key in the fact table
Row-based solution whereby
each medal type is
represented by records in the
fact table
65
Row-based Solution in the Fact – Sales Case Study
● Assume that there are only four quarters (e.g. Q1, Q2, …, Q4), regardless
the year
● The Quarter could be shifted to the fact measures
● Result: Pivoted fact table
Q1 to Q4
A Sales Star Schema with Pivoted Fact
66
▪ What if the Quarter also embeds the Year, e.g. Q12020, Q22018, and etc.?
– Pivot Fact table cannot be the solution, because the number of possible records
in the Time Dimension can be large
▪ The solution:
– Include the Quarter attribute as a Dimension-less Key in the fact, and to remove
the Time Dimension from the star schema
– Result: Row-based solutions – not restricted by the number of records in the
one-attribute dimension
Row-based Solution in the Fact (cont.)
67
▪ Similarity:
– The contents of the fact table are identical.
▪ Difference:
– The dimension-less key attribute has no references to the Time Dimension,
because the Time Dimension has been removed, whereas the Quarter in the star
schema has a reference to the Time Dimension.
Star Schema with Dimension-less Key vs. One-Attribute Dimension
vs
Star Schema with Dimension-less key
Star Schema with TimeDIM
68
▪ Looking at the high level design, it is
often more desirable to keep the
one-attribute dimension, i.e. the Time
Dimension in this case
– Provides a clearer design,
considering that the fact table
remain the same, with or without
the one-attribute dimensions
Star Schema with Dimension-less Key vs. One-Attribute Dimension (cont.)
69
One-attribute dimensions
70
▪ If there is more than one dimension with single attribute, and these
dimensions are not related to each other, it is often desirable to combine
them into a junk dimension
▪ Case study:
– Analyse the number of sales based on
these dimensions
Combine all one-attribute dimensions
Sales Fact Table
71
Combine all one-attribute dimensions (cont.)
Cartesian product of 4
dimension tables
72
73
▪ Assuming that each book can only have one category, then Category is
related to Book
▪ In this case, we can simply move the category into the book dimension
Combine with other normal dimensions – Book Sales Case Study
74
Combine with other normal dimensions – Book Sales Case Study (cont.)
75
Determinant Dimension with one-attribute only – Weather Data Case Study
76
Determinant Dimension with one-attribute only – Weather Data Case Study
▪ WeatherMonth attribute is a Determinant Dimension in this case.
▪ Hence, WeatherMonth Dimension must be kept in the star schema
– To enforce that WeatherMonth Dimension (or WeatherMonth attribute)
must be used in all queries to this star schema
WeatherMonthDI
M
77
Questions:
1. What is the granularity of this
star schema?
2. How can we have a higher
granularity in this star schema
(no category to go for)?
Answer to Question #2:
▪ Combine books into groups
based on the authors
One-attribute dimension with bridge – Book Sales Case Study
78
▪ The ISBN Group List can be
created by using the
listagg function
▪ The Book Group List
Dimension is a one-attribute
dimension
▪ It is not possible to remove
this dimension as it links the
fact and the bridge table
One-attribute dimension with bridge – Book Sales Case Study (cont.)
79
1. To move the one attribution dimensions to the fact table
– The one-attribute dimension must have low cardinality
– Else, keep the dimension attribute in the fact as a dimension-less key
– From high level point of view, sometimes it is desirable to keep the
attribute dimension in the design
2. To combine the one-attribute dimension with other dimensions
– Form a junk dimension that combines several one-attribute dimensions
– Move the one-attribute dimension into another normal dimension
– One-attribute dimensions cannot be removed if:
▪ It is a determinant dimension
▪ It is connected to a bridge table
Summary for One-Attribute Dimensions
End of Week 5 Lecture
Oracle/Requiremnts-Oracle
FIT5195 S1 2020 – Take Home Test (10%)
Due date: Week 6, Friday 1-May-2020, 11:55pm
This Take Home Test consists of two Case Studies. You are required to answer all questions.
Case Study 1: Government Hospital Data Warehouse (50 marks)
Currently, the Victorian government has a simple database system implemented to store the
collective data of all government-run hospitals. The database system includes information about their
services, staff, and patients. The simple database system consists of the following tables and
attributes:
Figure 1 : A simplified structure of the Government Hospital’s current E/R Diagram
Due to the nationwide impact of a new pandemic, the Premier of Victoria (i.e. the head of government
in the Australian state of Victoria) requires a data warehouse for analyzing trends and forecasting
purposes. The goal of the analyzing is to identify which frequently used services, popular hospital
locations, etc.
You are required to design a small Data Warehouse to keep track of the statistics. The Premier is
particularly interested in analyzing the total number (population quantity) of patients and total
service charged by services used, time periods, age groups, and hospital locations.
The Premier also wants to be able to drill down based on the cost of services range (low price < $20, medium price between $20 and $50, and high price >$50).
Tasks:
[1]. Create the operational database based on the above E/R Diagram, and then populate with
sufficient numbers of records in each table. It is suggested that each table would have 5-20
records depending on the use of that table in the operational database. You need to create a script
file containing the CREATE TABLE and INSERT INTO commands.
You can follow the list of services from this link:
https://www.monash.edu/health/medical/services.
[2]. Then develop a GOVERNMENT HOSPITAL star schema. Identify the fact table, dimensions
and attributes required to support the schema. If the star schema consists of a Bridge Table, you
have to also include the Weight Factor and List Aggregate.
The result of this task is a star schema diagram. You can use any drawing tool, such as
Lucidchart, to draw the star schema.
[3]. Validate your star schema using the Two-Column Table Methodology. You are required to
illustrate some two-column tables for this task based on your star schema design.
[4]. Write the SQL commands to create the fact and dimension tables. You need to create a script
file containing the appropriate SQL commands to create the fact and dimension tables.
[5]. Write the SQL commands to answer the following queries: (you need to make sure that there are
records in your fact and dimensions tables. For each of the following queries, write the SQL and
show the results):
a) Show the total number of patients making appointments during Winter. (attribute1 data_type attribute2 data_type, … PRIMARY KEY (attribute1), FOREIGN KEY (attribute1) REFERENCES 88 Create Tables – Data Types • Data type denotes a kind of data of an attribute value • Character data types: VARCHAR2 and CHAR Make VARCHAR2(20) Model VARCHAR2(30) • Number data types: NUMBER Year NUMBER(4) StampDuty NUMBER(8,2) • Date data type: DATE Salesdate DATE B 99 Create Tables – Data Types • Date Data Type • DATE stores dates from 1/1/4712 BC to 12/31/4712 AD • Default date format: DD-MON-YYYY • example: 05-JUN-2011 • Example declaration: salesDate DATE • DATE data type also stores time values B 1010 Create Tables – Data Types • Default time format: HH:MI:SS A.M. • If no time value is given when a date is inserted, default value is 12:00:00 A.M. • If no date value is given when a time is inserted, default date is first day of current month • Example salesDate field: 07-JUN-2016 12:00:00 A.M. B 1111 Create Tables – Constraints • Integrity Constraints • Primary Key attribute • NOT NULL constraints • Specifies that a field cannot be NULL • Sample Declaration: Field_name data_type NOT NULL • Foreign Key attribute in a table refers to another record in another table B 1212 Create Tables – Example 1 • Example: create CAR, CARSALES and CUSTOMER tables • Commands: • CAR table CREATE TABLE Car (carID NUMBER(5) NOT NULL, make VARCHAR2(20) NOT NULL, model VARCHAR2(30), year NUMBER(4), colour VARCHAR2(25), advertisedPrice NUMBER(10), PRIMARY KEY (carID) ); B 1313 Create Tables – Example 2 • CUSTOMER table CREATE TABLE Customer (customerID NUMBER(5) NOT NULL, cNAME VARCHAR2(20) NOT NULL, cPhone VARCHAR2(10), address VARCHAR2(50), PRIMARY KEY (customerID) ); B 1414 Create Tables – Example 3 • CARSALES table CREATE TABLE Carsales (carID NUMBER(5) NOT NULL, customerID NUMBER(5) NOT NULL, salesDate DATE NOT NULL, purchasedPrice NUMBER(10) NOT NULL, stampDuty NUMBER(8,2) NOT NULL, PRIMARY KEY (carID,customerID), FOREIGN KEY (carID) REFERENCES Car(carID), FOREIGN KEY (customerID) REFERENCES Customer(customerID) ); B 1515 Create Tables – by Copying B • You can create a new table by copying from an existing table: CREATE TABLE AS SELECT * FROM … ; • For example: CREATE TABLE car AS SELECT * FROM dtaniar.car; • Notes: • It creates and copy the records from the existing table • However, it does not copy the PK and FK • In the above example, it copies the table from dtaniar account. 1616 Create Tables – View Tables • Viewing Information about Tables • To view all tables in the database, the general syntax is: SELECT * FROM TAB; • For example: SELECT * FROM tab; TNAME TABTYPE CLUSTERID —————————— ——- ———- CAR TABLE CARSALES TABLE CUSTOMER TABLE B 1717 Create Tables – Describe Tables • To view the table structure, the general syntax is: DESCRIBE • For example: DESC car; Name Null Type ————— ——– ———— CARID NOT NULL NUMBER(5) MAKE NOT NULL VARCHAR2(20) MODEL VARCHAR2(30) YEAR NUMBER(4) COLOUR VARCHAR2(25) ADVERTISEDPRICE NUMBER(10) B 1818 Create Tables – Drop Tables • To drop an unwanted table, the general syntax is: DROP TABLE • For example: DROP TABLE Car; • If the table that you want to delete (e.g. table Car) is being used as a FK by another table (e.g. table Carsales), then you cannot delete table Car. • In this case, you need to delete table Carsales first, before deleting table Car. B 1919 C. Insert Into • General Syntax • To insert values for every single attribute in a record: INSERT INTO VALUES (attribute1_value, attribute2_value,….); • Example: INSERT INTO Car VALUES (1, ‘Holden’, ‘Cruze’, 2015, ‘Black’, 25780); • Strings are enclosed in single quotes (‘) and are case-sensitive (e.g. ‘General Practice’ is different from ‘general practice’) 20 Insert Into • To insert a value of selected attributes: INSERT INTO VALUES (attribute1_value, attribute2_value,….) • Example: INSERT INTO Car (carID, make) VALUES (’16’, ’Audi’); C 2121 Insert Into • The TO_DATE function: • TO_DATE (‘date_value’, ’format mask’); • Example: INSERT INTO Carsales VALUES (1,4, TO_DATE(’04/Feb/2015′, ‘DD/MON/YYYY’),25780,824.96); C 2222 Insert Into • The common DATE format masks Format Mask Formatted Data DD-MON-YYYY 05-FEB-2007 MM/DD/YYYY 02/05/2007 HH:MI AM 02:30 PM MONTH DAY, YYYY FEB 5, 2007 MM/DD/YYYY HH:MI AM 02/05/2007 02:30 PM Sample DATE format masks C 23 Insert Into • Insert multiple records one-by-one: INSERT INTO Car VALUES (2, ‘BMW’, ‘520d’, 2016, ‘Grey’, 98800); INSERT INTO Car VALUES (3, ‘Audi’, ‘A5’, 2016, ‘Black’, 68200); INSERT INTO Car VALUES (4, ‘Holden’, ‘Commodore’, 2008, ‘Grey’, 12650); • Insert multiple records at once: INSERT ALL INTO Car VALUES (2, ‘BMW’, ‘520d’, 2016, ‘Grey’, 98800) INTO Car VALUES (3, ‘Audi’, ‘A5’, 2016, ‘Black’, 68200) INTO Car VALUES (’ 4, ‘Holden’, ‘Commodore’, 2008, ‘Grey’, 12650) SELECT * FROM DUAL; C 2424 D. Simple Query Retrieval • Simple Retrieval • Retrieve all Records • General Syntax : SELECT * FROM • Example: retrieve everything from the CAR table: SELECT * FROM car; • Output: 2525 Simple Query Retrieval • Retrieve Specific Fields • General Syntax: SELECT FROM • Example: select only the model from the CAR table SELECT model FROM car; D • Output: 2626 Simple Query Retrieval • Eliminating Duplicated Records (DISTINCT qualifier) • General Syntax: SELECT DISTINCT FROM • Example: eliminating duplicates for the MAKE values: SELECT DISTINCT make FROM car; D 2727 Simple Query Retrieval • Output (with DISTINCT): D • Original Output (without DISTINCT): 2828 Simple Query Retrieval • DISTINCT Multiple Attributes • Example: eliminating duplicates for the MAKE and YEAR values: SELECT DISTINCT make, year FROM car; D MAKE YEAR ——————– ———- Volkwagen 2015 Toyota 2013 Mini 2009 Holden 2015 BMW 2016 Honda 2012 Subaru 2014 Holden 2008 Jeep 2006 Mazda 2011 Audi 2016 Lexus 2009 Honda 2010 Nissan 2010 Mercedes Benz 2006 15 rows selected 2929 Simple Query Retrieval • Conditional Retrieval • Search Conditions specified for more complex data retrieval • The WHERE Clause • Operators: • equal (=) • greater than (>) • less than (<)
• greater than or equal to (>=) • less than or equal to (<=)
• Not equal (<>) D 3030 Simple Query Retrieval • General Syntax: SELECT FROM WHERE • Example: SELECT year, model, advertisedPrice FROM car WHERE make=’Holden’ YEAR MODEL ADVERTISEDPRICE ———- ———— ————— 2015 Cruze 25780 2008 Commodore 12650 D 3131 Simple Query Retrieval • “BETWEEN” • Example: list the carid, salesDate and purchasedPrice after 2014 and before 2016 SELECT carid, salesdate, purchasedPrice FROM carsales WHERE salesdate BETWEEN TO_DATE(’01-JAN-2014′, ‘DD-MON-YYYY’) AND TO_DATE(’31-DEC-2015′,’DD-MON-YYYY’); CARID SALESDATE PURCHASEDPRICE ———- ———– ————– 1 04-FEB-2015 25780 8 12-DEC-2015 13999 D 3232 Simple Query Retrieval • Using a String Comparison • Example: list the carid, salesDate and purchasedPrice after 2014 and before 2016 SELECT carid, salesdate, purchasedPrice FROM carsales WHERE TO_CHAR(salesdate, ‘YYYYMMDD’) >’20140101’ AND TO_CHAR(salesdate, ‘YYYY’MMDD’) < ‘20151231’;
CARID SALESDATE PURCHASEDPRICE
---------- ----------- --------------
1 04-FEB-2015 25780
8 12-DEC-2015 13999
D
3333
Simple Query Retrieval
• “AND” or “OR”
• AND: both conditions must be true
• Example:
SELECT make, model, year
FROM car
WHERE year>2014 AND make=’Holden’; MAKE MODEL YEAR ——— —————- ———- Holden Cruze 2015 D 3434 Simple Query Retrieval • OR: either one of the condition is true • Example: SELECT make, model, colour, year, advertisedPrice FROM car WHERE colour=’Red’ OR year>2014; MAKE MODEL COLOUR YEAR ADVERTISEDPRICE ——— ———- ———- ———- —————- Holden Cruze Black 2015 25780 BMW 520d Grey 2016 98800 Audi A5 Black 2016 68200 Honda Civic Red 2012 13488 Lexus RX350 Red 2009 33500 Volkwagen Golf White 2015 39888 6 rows selected D 3535 Simple Query Retrieval • Other Conditions • LIKE/NOT LIKE • Example: displaying all MAKE from car table that has their first character as ‘M’ SELECT make FROM CAR WHERE make LIKE ‘M%’; MAKE —————- Mazda Mercedes Benz Mini D 3636 Simple Query Retrieval • IN/NOT • suitable to perform a set member search • Example: • IN – displaying all MODEL that its MAKE is either ‘Holden’ or ‘Honda’ SELECT make, model FROM car WHERE make IN (‘Holden’, ‘Honda’); MAKE MODEL ——————– ————- Holden Cruze Holden Commodore Honda Civic Honda City D 3737 Simple Query Retrieval • IN/NOT • Example: • NOT IN – displaying make, model and colour of cars that are not ‘Black’ and ‘White’ SELECT make, model, colour FROM car WHERE colour NOT IN (‘Black’, ‘White’); MAKE MODEL COLOUR ————– ———— ———— BMW 520d Grey Holden Commodore Grey Honda Civic Red Lexus RX350 Red Mazda 3Series Blue Mercedes Benz C200 Blue Honda City Grey D 3838 Simple Query Retrieval • NULL/NOT NULL • Example: • NULL Operator: SELECT carid, make FROM car WHERE colour IS NULL; CARID MAKE —— ——— 12 Subaru D 3939 Simple Query Retrieval SELECT carid, make FROM car WHERE colour IS NOT NULL; CARID MAKE ——- ——————– 1 Holden 2 BMW 3 Audi 4 Holden 5 Honda 6 Jeep 7 Lexus 8 Mazda 9 Nissan 10 Volkwagen 11 Mercedes Benz 13 Honda 14 Mini 15 Toyota 14 rows selected D 4040 Simple Query Retrieval • MULTIPLE OPERATORs • Example: list cars make start with ‘M’ and was made before 2016 SELECT year, make, model FROM car WHERE make LIKE ‘M%’ AND year<2016;
YEAR MAKE MODEL
---- ---------------- -------------
2011 Mazda 3Series
2006 Mercedes Benz C200
2009 Mini Cooper
D
4141
Simple Query Retrieval
• Sorting
• specify to sort the output by using ORDER BY
• General Syntax:
SELECT FROM ORDER BY D 4242 Simple Query Retrieval order of the advertisedPrice FROM car ORDER BY advertisedPrice DESC; MAKE MODEL ADVERTISEDPRICE COLOUR ————— —————— ————— ——– BMW 520d 98800 Grey Audi A5 68200 Black Volkwagen Golf 39888 White Subaru Outback 33950 NULL Lexus RX350 33500 Red Holden Cruze 25780 Black Toyota Corolla 23888 White Jeep Grand Cherokee 21999 Black Mini Cooper 19750 Black Nissan Dualis 18888 White Mercedes Benz C200 16995 Blue Mazda 3Series 13999 Blue Honda Civic 13488 Grey Holden Commodore 12650 Grey Honda City 7490 Grey 15 rows selected D 43 Simple Query Retrieval • “AS”: to rename a column • Example: list all SalesDate, PurchasedPrice, StampDuty and TotalPrice (TotalPrice = PurchasedPrice+StampDuty ) SELECT salesdate, purchasedPrice, stampduty, (purchasedPrice+StampDuty) as TotalPrice FROM carSales; SALESDATE PURCHASEDPRICE STAMPDUTY TOTALPRICE ———– ————– ———- ———- 04-FEB-2015 25780 824.96 26604.96 13-JUL-2016 12650 506 13156 12-DEC-2015 13999 559.96 14558.96 14-JUN-2016 39888 1276.42 41164.42 18-MAY-2016 98800 5631.6 104431.6 4444 E. Updating and Deleting Records • In the created tables • UPDATE command – updating • DELETE command – deletion 4545 Updating Records • Update • General Syntax: UPDATE SET WHERE • records can be updated in only one table at a time • update multiple fields that are within the same table • WHERE clause – make the command updates specific records only E 46 Updating Records • Example: • Updating colour of carID ‘5’ from ‘Red’ to ‘Grey’ UPDATE car SET colour=’Grey’ WHERE carID =5; 46 E 47 Deleting Records • Delete • General Syntax: DELETE FROM WHERE • remove specific records from a database table • use WHERE clause to specify multiple records to delete multiple records at one time • If the search condition is omitted, all records in the table are deleted. 47 E 48 Deleting Records • Example: • deleting a single record from the CAR table DELETE FROM car WHERE carid= 5; • deleting multiple records from the CAR table that contain MAKE starting with ‘M’ DELETE FROM car WHERE make LIKE ‘M%’; • deleting all records from the CAR table DELETE FROM car; 48 E 49 Deleting Records • Notice: • not allowed to delete a primary key record that has its corresponded foreign key somewhere else in another table • Example: delete a primary key CarId is 5 in the CAR table that has a foreign key record that CarId is 5 in the CARSALES table DELETE FROM car WHERE carid=5; DELETE FROM car * ERROR at line 1: ORA-02292: integrity constraint (SYSTEM.SYS_C005453) violated – child record found 49 E 5050 F. Commit • Commit • When inserted data by issuing the INSERT command • the changes are only saved in the local database buffer • are not saved in the database • until you COMMIT the transaction • it is important to remember to COMMIT whenever you have finished inserting values or make changes to the database values 5151 Commit • General Syntax: • Sample of inserting a new record with commit Effect of Inserting New Record with Commit F Insert a new record ‘16’ with commit All original records are currently in CAR table 5252 G. Joining Multiple Tables • Join • database query to join multiple database tables together • the data needed or the conditions specified come from more than one table. • Syntax: SELECT FROM WHERE AND 53 Joining Multiple Tables • joining the CAR, CARSALES and CUSTOMER tables: SELECT customer.cname, customer.cphone, carsales.salesdate, carsales.purchasedPrice, carsales.stampDuty, car.make, car.model FROM customer, carsales, car WHERE customer.customerID = carsales.customerID AND car.carID = carsales.carID; • using prefix when joining tables: • when more than one table is involved, a prefix for each attribute is recommended to avoid ambiguity SELECT ct.cname, ct.cphone, cs.salesdate, cs.purchasedPrice,cs.stampDuty, c.make, c.model FROM customer ct, carsales cs, car c WHERE ct.customerID = cs.customerID AND c.carID = cs.carID;; 53 G • The output of both example: 54 Joining Multiple Tables • Subquery • Example: retrieving the make, model and advertisedPrice of the cars that are not sold SELECT make, model, advertisedPrice FROM car WHERE carid NOT IN (SELECT carid FROM carsales); MAKE MODEL ADVERTISEDPRICE —————– —————– ————— Subaru Outback 33950 Lexus RX350 33500 Mini Cooper 19750 Nissan Dualis 18888 Mercedes Benz C200 16995 Jeep Grand Cherokee 21999 6 rows selected 5555 H. Aggregate Functions and Group By • Aggregate Functions • summarize the input table • often used include: • COUNT – count number of records in the input table • SUM – calculate the sum of a numerical attribute • MIN and MAX – find the smallest and the largest value of a certain attribute 56 Aggregate Functions • Count(*) • Example: returning the number of records that is available from the CAR table SELECT COUNT(*) FROM car; COUNT(*) —————— 15 H 57 Aggregate Functions • Example: returning the number of values of colour that is available from the CAR table SELECT COUNT(colour) FROM car; COUNT(COLOUR) ————- 14 • Example: returning the number of distinct values of colour that is available from the CAR table SELECT COUNT(DISTINCT colour) FROM car; COUNT(DISTINCTCOLOUR) ——————— 5 COUNT(attribute) & COUNT(DISTINCT attribute) Cause by a record with null value include one NULL value Output: 58 Aggregate Functions • MIN and MAX • Example: calculate the minimum and maximum advertisedPrice of the cars SELECT MIN(advertisedPrice) AS “Lowest Price”, MAX(advertisedPrice) AS “Highest Price” FROM car; Lowest Price Highest Price ———— ————– 7490 98800 5959 Aggregate Functions and Group By • Group By • group an input table into a number of groups based on one or more nominated attributes • often used in conjunction with aggregate functions • CAR table: group by Make SELECT make, count(*) FROM car GROUP BY make; H MAKE COUNT(*) ——————– ———- Holden 2 Lexus 1 Subaru 1 BMW 1 Mazda 1 Nissan 1 Audi 1 Volkwagen 1 Toyota 1 Honda 2 Mercedes Benz 1 Jeep 1 Mini 1 60 Aggregate Functions and Group By • CAR table: grouping by Make with combination of getting only the record groups that contain the count value greater than 1 is as follows SELECT make, COUNT(DISTINCT model) FROM car GROUP BY make HAVING COUNT(DISTINCT model)>1; MAKE COUNT(DISTINCTMODEL) ——————– ——————– Holden 2 Honda 2 H 61 Aggregate Functions and Group By • multiple tables: selecting the number of sold cars for each colour SELECT colour, COUNT(DINSTINCT customerId) FROM car c, carsales cs WHERE c.carid=cs.carid GROUP BY colour; COLOUR COUNT(DISTINCTCUSTOMERID) ———– ————————- White 2 Grey 4 Blue 1 Black 2 Red 1 H 6262 Aggregate Functions • Count vs. Sum • COUNT – count number of records in the input table • SUM – calculate the sum of a numerical attribute SELECT FROM GROUP BY SELECT FROM GROUP BY H Aggregate Functions 63 • Sum without Group By • Example: calculate the sum of purchasedPrice SELECT SUM(purchasedPrice) AS totalsales FROM carsales; TOTALSALES ———- 402983 Aggregate Functions 64 • Sum with Group By • Example: calculate the sum of purchasedPrice for each make in carsales table SELECT c.make, SUM(cs.purchasedPrice) AS totalsales FROM carsales cs, car c WHERE c.CARID=cs.CARID GROUP BY c.make; MAKE TOTALSALES ——————– ———- Holden 38430 BMW 197600 Mazda 13999 Audi 68200 65 Simple Query Retrieval • SUM with Calculations • Example: list all Cname, SalesDate and calculate the TotalPrice for each customer (TotalPrice = PurchasedPrice+StampDuty ) SELECT ct.cname, cs.salesdate, SUM(purchasedPrice+StampDuty) as TotalPrice FROM carSales cs, customer ct WHERE cs.customerid=ct.customerid GROUP BY ct.cname, cs.salesdate; CNAME SALESDATE TOTALPRICE ——————– ———– ———- Jone 14-JUN-2016 14027.52 Rex 12-DEC-2015 14558.96 Lily 13-JUL-2016 13156 6666 I. Alter Tables – Add New Fields • Syntax: ALTER TABLE ADD (attribute_name data_type_declare constraints_declare); • attribute name: referring to the new attribute that you want to add into the existing table • data type: defines the data type and the size of the new attribute • constraint: defines the constraints that the new attribute might be enforced by certain constraints • Example: add a transmission attribute to the CAR table ALTER TABLE car ADD (transmission VARCHAR2(10)); 67 Alter Tables – Modify Fields • Syntax: ALTER TABLE MODIFY (attribute_name new_data_type); • attribute name: refers to the attribute that you want to modify • new data: defines the new data type that you want to use replacing the old one • Example: change the data type of the transmission to CHAR with a size of 30 ALTER TABLE car MODIFY (transmission CHAR(30)); I 68 Alter Tables – Drop Columns • Syntax: ALTER TABLE DROP COLUMN attribute_name; • Example: delete the attribute transmission from the CAR table ALTER TABLE car DROP COLUMN transmission ; I Oracle/Week-4a-StarSchema-StudentEnrollment 1 More on Student Enrollment Case Study (Tutorial 2) 1. Two dimensions vs. One dimension The star schema of the case study used in Tutorial 2 (Student-Enrollment Star Schema-1 (one semester-year dimension) The SQL codes to create dimension and fact tables are as follows: –Campus dimension Create table campus_dim as SELECT distinct Ocampus FROM Offering; –Semester_year dimension Create table sem_year_dim as SELECT distinct Oyear||Osem as sem_id, Oyear, Osem FROM Offering; –Subject Dimension Create table subject_dim as SELECT * FROM subject; –Grade Dimension Create table grade_dim as SELECT distinct Grade FROM Enrollment; 2 –The Fact CREATE TABLE student_enrollment_fact as SELECT o.Ocampus, o.Oyear||o.Osem as sem_id, s.Ucode, e.Grade, count(st.sid) as num_of_student, sum(e.score) as Total_score FROM subject s, enrollment e, offering o, student st WHERE e.OID = o.OID and s.Ucode = o.Ucode and st.SID = e.SID GROUP BY o.Ocampus, o.Oyear||o.Osem, s.Ucode, e.Grade; The records in the data warehouse are as follows: SQL> select * from campus_dim; OCAMPUS ——— City Main DE SQL> select * from sem_year_dim; SEM_ID OYEAR OSEM ———- ———- —- 20092 2009 2 20091 2009 1 SQL> select * from subject_dim; UCODE UTITLE UCREDIT ———- ——————– ———- IT001 Database 5 IT002 Java 5 IT003 SAP 10 IT004 Network 5 IT005 ASP.NET 5 SQL> select * from grade_dim; GRADE —– HD P D C N 3 SQL> select * from student_enrollment_fact; OCAMPUS SEM_ID UCODE GRADE NUM_OF_STUDENT TOTAL_SCORE ——— ——— ———- —– ————– ———– Main 20091 IT001 HD 3 253 Main 20091 IT001 D 2 149 Main 20091 IT002 C 1 65 Main 20091 IT002 N 1 47 Main 20092 IT002 D 2 151 City 20092 IT001 C 1 64 Main 20091 IT001 N 1 41 Main 20092 IT002 C 1 64 DE 20092 IT004 P 2 105 City 20091 IT003 C 1 63 City 20092 IT001 N 1 32 11 rows selected. Look at particularly the sem_year_dim dimension, both in the SQL code and Can we have two separate dimensions: one for semester, and the other for year? 4 Star Schema-2 (two separate dimensions for semester and year) The SQL codes to create these two new dimensions (sem_dim and year_dim) and –Semester dimension Create table sem_dim as SELECT distinct Osem as sem_id FROM Offering; –Year dimension Create table year_dim as SELECT distinct Oyear as year_id FROM Offering; 5 –The Fact CREATE TABLE student_enrollment_fact_2 as SELECT o.Ocampus, o.Osem as sem_id, o.Oyear as year_id, s.Ucode, e.Grade, count(st.sid) as num_of_student, sum(e.score) as Total_score FROM subject s, enrollment e, offering o, student st WHERE e.OID = o.OID and s.Ucode = o.Ucode and st.SID = e.SID GROUP BY o.Ocampus, o.Osem, o.Oyear, s.Ucode, e.Grade; The contents of the two dimensions and the new fact are as follows: SQL> select * from sem_dim; SEM_ID ———- 1 2 SQL> select * from year_dim; YEAR_ID ———- 2009 SQL> select * from student_enrollment_fact_2; OCAMPUS SEM_ID YEAR_ID UCODE GR NUM_OF_STUDENT TOTAL_SCORE ——— —— ——– ———- — ————– ———– DE 2 2009 IT004 P 2 105 Main 1 2009 IT001 N 1 41 Main 1 2009 IT001 D 2 149 Main 2 2009 IT002 C 1 64 Main 1 2009 IT002 N 1 47 Main 2 2009 IT002 D 2 151 City 2 2009 IT001 N 1 32 Main 1 2009 IT002 C 1 65 City 1 2009 IT003 C 1 63 Main 1 2009 IT001 HD 3 253 City 2 2009 IT001 C 1 64 11 rows selected. Now compare the contents of Fact-1 and Fact-2. They are almost the same, 6 Fact-2 has two attributes, called sem_id (from semester_dim) and year_id (from 7 2. Dimension Hierarchy (Hierarchy vs. No Hierarchy) Suppose we have richer information about campus, as follows: Campus Dimension Campus Clayton Melbourne Victoria 3800 Australia CA Caulfield Dandenong Caulfield Melbourne Victoria 3145 Australia PA Parkville Royal Parkville Melbourne Victoria 3052 Australia SY Sydney Opera Sydney Sydney New 2001 Australia MA Malaysia Jalan Bandar Sydney Selangor 47500 Malaysia SA South Africa Peter St Johanne Johannesb Johanne 1725 South In the star schema, we have two options: Option 1 is to have one dimension called campusDIM containing all of these Option 1 (no hierarchy) 8 The SQL to create the Campus Dimension is shown as follows: Create Table campus_dim As SELECT Distinct Ocampus, < FROM Offering; Option 2 is to have a hierarchy of three dimensions (e.g. campusDIM connected Option 2 (with hierarchy) The Campus dimension is still linked to the Fact, whereas the City dimension is 9 to the City dimension. Therefore, there is a hierarchy from Campus to City, and to Note that the information about city is pulled out from the campus, and then The SQL to create the three dimensions in the hierarchy is as follows: –Campus dimension Create Table campus_dim As SELECT Distinct Ocampus, < FROM <
–City dimension Create Table city_dim As SELECT Distinct Ocity, < FROM <
–Country dimension Create Table country_dim As SELECT Distinct Ocountry, < FROM <
The contents of the dimensions in the hierarchy (campus, city, and country Campus Dimension City Dimension Country Dimension Note that the creation of the fact table is not affected at all. Either model (option 10 Hierarchy should use the higher detail dimension (e.g. campus) as the parent, One of the most common mistakes is the wrong ordering of the dimensions in An incorrect hierarchy (upside down) 11 Hierarchy vs No Hierarchy Since there are two options in implementing a star schema: with and without 1. One table vs. many tables When you want to produce a report involving the fact and campus, with With the hierarchy option, we need to join the fact with three tables 2. Normalized vs. Unnormalized Since no update, insert, or delete operations are performed to the 3. Drilling down and rolling up How about rolling up? Rolling up is exploring information from the detail With the hierarchy option, we still need to do two queries: one for the 12 Select CampusName, Sum(Num_of_student) Select CountryName, Sum(Num_of_student) Whereas with the non-hierarchy option, we still need to do two queries: Select CampusName, Sum(Num_of_student) Select Country, Sum(Num_of_student) From the query point of view, both need two queries. If we insist to have a drill down, that is from country to campus, we still Oracle/Week-4b-Determinant Dimensions 1 Determinant 2 Price Min Max Monday 1. In 2. One The Min Total Num Unleaded Monday 3 We Total Num U Monday Total Min U Box Total Min U Caltex 4 5 3. 6 as Oracle/Week-4c-Determinant -vs-non-determinant 1 Determinant Version-1 2 Version-‐2 Version-2 3 These Gold Num Swimming Gold Num London 4 Olympic 5 6 So Gold Num USA 7 Fact Oracle/Week-5-AVG-in-Fact Page 1/5 Average in the Fact? Should we store an “average” fact measure in the fact table? Consider the following example. Suppose we have the following 16 records as An Operational Database IT001 Database 1 Mirriam 81 The star schema of the above operational database contains one fact and two The fact table aggregates these score records based on their dimensions, which Page 2/5 Fact Table IT001 1 73.833 IT001 2 48 IT002 1 56 IT002 2 71.667 IT003 1 63 IT004 2 52.5 The dimension tables look like as follows: SubjectDIM Table IT001 Database IT004 Network SemesterDIM Table 1 Looking at the Fact Table, the average score for the unit Database in Semester Is this fact table correct? For example, if we want to query the fact table to find out what is the average The SQL to query the Fact Table is as follows: Select Avg(Average_Score) Is this correct? No. In the Operational Database, there are 8 records for Database unit in Semester Let’s do further comparisons: Page 3/5 The average score for Java unit in Semester One and Two using the above fact is Ok, now let’s calculate further. The average score for Semester One using the For Semester Two, using the above fact the average scare for Semester Two is So, storing average as a fact measurement is not a good idea. In the fact table, we should store the “total score” and “number of students” in Fact Table 2 IT001 1 443 6 IT001 2 96 2 IT002 1 112 2 IT002 2 215 3 IT003 1 63 1 IT004 2 105 2 Note that the dimension tables remain unchanged: SubjectDIM Table IT001 Database IT004 Network SemesterDIM Table 1 Using the correct fact table above, it is easy to calculate the average score of The SQL to query the Correct Fact Table is as follows: Page 4/5 Select Sum(Total_Score)/Sum(NumberofStudents) The correct star schema is then as follows: Conclusion The problem of AVG in the Fact is known as the “Average of an Average” How about Min or Max in the Fact? Can we do it? – Yes we can. Fact Table 3 IT001 1 41 87 IT001 2 32 64 IT002 1 47 65 IT002 2 64 78 IT003 1 63 63 IT004 2 52 53 SubjectDIM Table IT001 Database IT004 Network Page 5/5 SemesterDIM Table 1 (*Note: the dimension tables are unchanged) Assuming we want to get the Max_Score of IT001, then the max of {87, 64} will The SQL to retrieve the maximum score of IT001 is as follows: Select Max(Max_Score) The same applies to “Min of Min”. If we want to get the minimum score of IT001, Select Min(Min_Score) We certainly don’t want to mix between min and max. For example, retrieving As a final conclusion: • Average in the fact is not desirable, although technically it satisfies the • Min and Max in the fact can still be used, since min_score and max_score • In general, count and sum are more common. Count is “number of”, and
b) Show the total service charged for each service cost type.
c) Show the total number of patients by each age group (infant <1, children <18, adult 18+,
senior 65+) in April 2020.
d) Show the total service charged for general medical consultations in each suburb.
Submission checklist for Case Study 1:
A. A pdf file or word document containing:
i. The SQL commands to create tables and insert records to the operational database –
Task 1
ii. The star schema diagram – Task 2
iii. The Two-Column Table Methodology illustration – Task 3
iv. The SQL commands to create the dimension and fact tables, as well as the contents of
these tables – Task 4
v. The SQL commands to answer the queries in Task 5 and the query results
B. The .sql files containing:
i. The SQL commands to create the operational database.
ii. The SQL commands to create the data warehouse.
iii. The SQL commands to answer the queries in Task 5.
(Notes: The marker of this test will simply run this file. So you need to make sure that the
SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,
especially if you cut and paste from a Word file).
C. A folder:
i. Save all the files from [A] and the SQL files from [B] in one folder.
ii. Name the folder CaseStudy1_yourStudentID.
https://www.monash.edu/health/medical/services
Case Study 2: Accident Records Data Warehouse (50 marks)
Figure 2 : A simplified structure of the Accident Record’s current database system
The above figure displays how the Victorian Roads and Safety (VicR&S) Department stores accident
related data. When there are any road accidents, the details about the accidents such as the location,
date, time and event (e.g. what exactly happened during the accident) are recorded. For each accident,
the information (e.g. name, contact number, employment start date, current employment branch)
about one police officer who is in charge of overseeing the events of the accident (e.g finding out
who/what caused the accident) are stored. In addition, to identify a particular accident, each accident
is given a unique accident number.
On the other hand, one accident may be caused by several vehicles (e.g. a collision between 2
vehicles). All vehicles have owners who may or may not be the same as the vehicle driver during the
accident. Therefore, the VicR&S database stores information about vehicles involved in the accident
(i.e. vehicle number, vehicle model, vehicle make and vehicle manufacturer), the information about
the owners of the vehicles (i.e. owner name, owner address, owner contact number) and the
information about the drivers at the time of the accident (driver name, driver address, driver contact
number, driver licence no, and when the driver started to drive the vehicle). Furthermore, for each
accident, the severity of damage incurred by each vehicle is recorded.
Currently, to reduce the number of road accidents VicR&S department is interested in the following:
- The total number of accidents happening by different locations and by different lighting
periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).
- The total number of accidents by each vehicle model.
- The number of vehicles involved in every accident event on different locations.
- The number of accidents taken care of different police officer branches.
Tasks:
[1]. Develop an ACCIDENT RECORDS star schema. Identify the fact table, dimensions and
attributes required to support the schema. If the star schema consists of a Bridge Table, you have
to also include the Weight Factor and List Aggregate.
The result of this task is a star schema diagram. You can use any drawing tool, such as
Lucidchart, to draw the star schema.
[2]. Validate your star schema using the Two-Column Table Methodology. You are required to
illustrate some two-column tables for this task based on your star schema design.
[3]. Write the SQL commands to create the fact and dimension tables. You need to create a script
file containing the appropriate SQL commands to create the fact and dimension tables. The
operational tables are accessible from the ACCIDENT account. The result of this task is the SQL
commands. You will also need to show the contents of the tables that you have created.
[4]. Write the SQL commands to answer the following queries: (you need to make sure that there are
records in your fact and dimensions tables. For each of the following queries, write the SQL and
show the results):
a) Show the total number of accidents happening by different locations and by different
lighting periods (daytime: 6AM - 5:59PM and nighttime 6PM - 5:59AM).
b) Show the total number of accidents by each vehicle model.
c) Show the number of vehicles involved in every accident event on different locations.
d) Show the number of accidents taken care of by different police officer branches.
[5]. You need to come up with additional two more questions and answer these questions using the
SQL commands. Also explain the reason for why the management would like to have such
information.
Submission checklist for Case Study 2:
A. A pdf file or word document containing:
i. The star schema diagram – Task 1
ii. The Two-Column Table Methodology illustration – Task 2
iii. The SQL commands to create the dimension and fact tables, as well as the contents of
these tables – Task 3
iv. The SQL commands to answer the queries in Task 4 and the query results
v. The additional two questions with the SQL commands, query result, and your
explanation – Task 5
B. The .sql files containing:
i. The SQL commands to create the data warehouse.
ii. The SQL commands to answer the queries in Task 4 and Task 5.
(Notes: The marker of this test will simply run this file. So you need to make sure that the
SQL commands are written correctly in the .sql file. Pay a particular attention to the quote,
especially if you cut and paste from a Word file).
C. A folder:
i. Save all the files from [A] and the SQL files from [B] in one folder.
ii. Name the folder CaseStudy2_yourStudentID.
Submission Method:
1. A zip file:
a. Save the CaseStudy1_yourStudentID and CaseStudy2_yourStudentID folders into one
folder named FIT5195_Test_yourname_studentID.
b. ZIP the folder FIT5195_Test_yourname_studentID.zip. This must be a ZIP file and
not other types of compressed folder. The zip file should contain the prescribed folder
structure as follows:
● FIT5195_Test_yourname_studentID/
○ CaseStudy1_yourStudentID/
○ Files for tasks
○ CaseStudy2_yourStudentID/
○ Files for tasks
2. Upload your zip file on Moodle by the due date: Friday 1-May-2020, 11:55pm (No late
submission is accepted).
END OF TAKE HOME TEST
Oracle/Week-1c-SQL-revision
11
SQL - Revision
Assoc. Prof. David Taniar
22
33
SQL - Revision
Outline:
A. Introduction to SQL client environment
B. Create tables
C. Insert into
D. Simple query retrieval
E. Updating and deleting records
F. Commit
G. Joining multiple tables
H. Aggregate functions and group by
I. Alter tables
44
A. Introduction to SQL client
• There are several SQL client software:
• SQL Developer
• SQL*Plus (Windows, Mac, Unix)
• Login details:
• Username: S12345678 (12345678 is your student id)
• Password: student
• Host string: the unit code
5
Introduction to SQL Developer
Oracle username which starts
with S and followed by your
student ID. Password is student
Location of
the server
This is the hoststring, which
is the database name
rhino.its.monash.edu
FIT5195
FIT5195
6
Introduction to SQL Developer
Single Statement Execution Script Execution
Output of Execution
Type Command
77
B. Create Tables
• General Syntax:
CREATE TABLE
• NOT NULL Operator:
• Example: retrieving all make, model, advertisedPrice and colour in a descending
SELECT make, model, advertisedPrice, colour
15 records
system) is shown as follows.
the contents of this dimension. In the above version of the star schema, the
semester and the year information are actually combined into one dimension,
called sem_year_dim dimension.
The answer is “yes, you can”. The new star schema would look like as follows.
Because the original sem_year_dim is now split into two dimensions: sem_dim,
and year_dim, consequently, in the fact table, we will have two attributes:
semester and year, instead of just one attribute sem_id.
the fact tables are as follows:
except that Fact-1 has one attribute from the sem_year_dim dimension, whereas
year_dim).
Conclusion: From the FACT point of view, it does not really matter whether we
have two separate dimensions or just one dimension. They are the same.
However, if two information, like semester and year, is often seen as one entity
or one piece of information, it would be easier if the two information is present
in one dimension.
CampusID CampusName Address Suburb City State Postcode Country
CL Clayton
Wellington
Rd
Campus
Rd
East
Campus
Parade
Campus
Boulevard
South
Wales
Campus
Lagoon
Sunway
Campus
sburg
urg
sburg
Africa
attributes.
–Campus dimension
to the Fact, campusDIM is also the parent of cityDIM – because campus-city
forms a hierarchy, and cityDIM becomes the parent of countryDIM).
linked (through hierarchy) to the Campus, and the Country dimension is linked
Country.
create a hierarchy between campus and city. The information about country is
also pulled out from the campus and city. >;
>;
>;
dimensions are as follows):
CampusID CampusName Address Suburb Postcode CityID
CL Clayton Campus Wellington Rd Clayton 3800 MEL
CA Caulfield Campus Dandenong Rd Caulfield East 3145 MEL
PA Parkville Campus Royal Parade Parkville 3052 MEL
SY Sydney Campus Opera Boulevard Sydney 2001 SYD
MA Malaysia Campus Jalan Lagoon Bandar Sunway 47500 KUL
SA South Africa Campus Peter St Johannesburg 1725 JNB
CityID CityName State CountryID
MEL Melbourne Victoria AU
SYD Sydney New South Wales AU
KUL Kuala Lumpur Selangor MA
JNB Johannesburg Johannesburg SA
CountryID CountryName
AU Australia
MA Malaysia
SA South Africa
1 or option 2) will not affect the fact and other dimensions (e.g. grade, subject,
semester_year dimensions).
which is linked to the fact. The lower detail dimension (e.g. city) becomes the
child in the hierarchy. The lowest detail dimension (e.g. country) becomes the
lowest in the hierarchy.
the hierarchy, such as in the following star schema:
hierarchy, which option is more preferred?
Without hierarchy, there is only one table (e.g. the campus dimension
table). With hierarchy, there are three tables (e.g. campus, city, and
country dimension tables). The consequence is in the join query
processing when producing reports.
the no hierarchy option, we only need to use two tables: the campus
dimension table, and the fact table. Hence, the join query processing is
simple.
(campus, city, and country). Consequently, the join query processing will
be more complex
With the hierarchy option, the tables are normalized, which follows the
relational model. In contrast, the non-hierarchy option, the table (e.g. the
campus dimension table) is unnormalized, which is basically 1NF with
visible replication on information which is prone to anomalies (e.g. insert,
update, and delete anomalies).
dimension tables (once they have created), hence, anomalies become
irrelevant. Therefore, the unnormalized option may be preferred, also due
to a simpler join query processing.
People often mistakenly thought that the hierarchy model is good for
drilling down. The hierarchy is not from country to city and to campus.
The correct hierarchy is from the detail to the more general, such as from
campus to city, and to country. Hence, we are not drilling down from
campus to country, because drilling down should from something that is
general (such as country) to something more detail (such as city and
campus). In other words, the hierarchy model is not about drilling down
information exploration.
to a more general. For example, retrieve number of students located at a
campus level, and then rolling up to the country level.
campus and the second for the country.
From Fact F, CampusDim C
Where F.CampusID = C.CampusID
Group By CampusName;
From Fact F, CampusDim C1, CityDim C2, CountryDim C3
Where F.CampusID = C1.CampusID
And C1.CityID = C2.CityID
And C2.CountryID = C3.CountryID
Group By CountryName;
From Fact F, CampusDim C
Where F.CampusID = C.CampusID
Group By CampusName;
From Fact F, CampusDim C
Where F.CampusID = C.CampusID
Group By Country;
From the query processing point of view, the non-hierarchy option uses
one join operation only, because it only needs to join the fact and one
dimension.
From the conceptual point of view, the hierarchy model does not actually
offer a better roll up or drill down features.
need to do the two queries above, but we will do the country query first,
and then the campus query second.
Dimensions
The
Petrol
Price
Case
Study
1.
Snapshot
of
an
operational
database
The
following
is
a
sample
operational
database
that
keeps
the
petrol
prices
(of
different
kind
of
petrol)
of
petrol
stations
within
a
period
of
time
2.
Star
Schema
The
requirements
for
the
data
warehouse
are
to
answer
questions
related
to
(a)
average
petrol
price,
(b)
min
petrol
price,
and
(c)
max
petrol
prices.
From
the
sample
data
above,
we
choose
the
following
three
angles
to
view
the
fact
measures:
(i)
day
of
week,
(ii)
suburb,
and
(iii)
petrol
company.
From
the
Day
of
Week
point
of
view,
it
is
sensible
to
answer
the
question
something
like
which
day
is
has
the
lowest
petrol
price.
We
use
a
two-‐column
table
methodology
to
visualize
a
conceptual
model
of
the
star
schema.
Hence,
the
two-‐column
table
for
the
day
of
week
is
as
follows:
Day
of
week
Total
Petrol
Num
of
Petrol
Station
Petrol
Price
Petrol
Price
Tuesday
…
Notes:
the
above
2-‐column
table,
the
first
column
(say
column
A,
which
is
Day
of
Week)
is
a
category.
The
second
columns
(say,
columns
B1,
B2,
B3,
and
B4)
are
the
fact
measures.
of
the
requirements
for
the
fact
measure
is
average
petrol
price.
Since,
the
fact
table
should
not
include
an
average
as
a
fact
measure,
we
include
two
attributes:
total
petrol
price,
and
number
of
petrol
stations,
instead.
visualization
of
the
above
two-‐column
table
for
Day
of
Week
is
rather
incomplete,
because
there
is
no
data
in
the
fact
measure
columns.
Take
a
look
at
the
Min
Petrol
Price
column
in
the
above
table.
What
kind
of
value
should
be
in
that
column?
Supposed
Monday
has
a
value
in
the
Min
Petrol
Price
column,
what
does
that
value
min?
It
does
not
specify
which
petrol
type,
whether
it
is
Unleaded
91,
or
Premium
95,
or
Premium
98,
for
example.
Therefore,
the
Min
Petrol
Price
column,
although
seems
right,
does
not
make
sense
at
all.
How
can
this
problem
be
solved?
One
solution
is
to
have
one
Min
Petrol
Price
for
each
petrol
type
(e.g.
Unleaded
91,
Premium
95,
Premium
98,
E10,
Diesel,
and
LPG).
Therefore,
we
have
6
attributes
to
represent
Min
Petrol
Prices,
namely:
Min
Unleaded
91
Price,
Min
Premium
95
Price,
Min
Premium
98
Price,
Min
E10
Price,
Min
Diesel
Price,
and
Min
LPG
Price.
Petrol
Price
Day
of
week
Petrol
Price
of
Petrol
Station
P95
P98
E10
Diesel
LPG
Max
Petrol
Price
119.9c
132.9c
139.9c
114.9c
115.9c
55.7c
Tuesday
118.9c
131.9c
136.9c
114.9c
115.9c
56.9c
…
need
to
do
the
same
for
the
Total
Price,
and
Max
Petrol
Price;
each
of
which
should
be
divided
into
6
columns.
The
new
two-‐column
table
for
Day
of
Week
is
then
shown
as
follows:
Petrol
Price
Min
Petrol
Price
Max
Petrol
Price
Day
of
Week
U
95
98
E
D
LPG
of
Petrol
Station
95
98
E
D
LPG
U
95
98
E
D
LPG
Tuesday
The
matter
will
become
more
complicated
if
number
of
petrol
stations
that
has
Unleaded
91
is
different
from
that
of
Premium
98,
for
example.
Then
to
calculate
an
average
of
Unleaded
91
on
Monday,
and
an
average
of
Premium
98
on
Monday
cannot
be
simply
a
division
between
Tot91
and
Num
of
Petrol
Stations,
and
between
Tot98
and
Num
of
Petrol
Stations,
because
the
based
line,
which
is
Num
of
Petrol
Stations
for
different
fuel
type
is
different.
Therefore,
we
need
to
divide
the
Num
of
Petrol
Stations
column
into
6
columns
–
one
for
each
petrol
type.
As
a
result,
we
will
end
up
with
24
columns
to
store
the
fact
measures,
instead
of
just
4
columns,
as
initially
thought.
Using
the
same
principle,
we
can
have
a
two-‐column
table
for
Suburb,
and
another
one
for
Company.
Petrol
Price
Num
of
Petrol
Station
Petrol
Price
Max
Petrol
Price
Suburb
95
98
E10
D
L
U
95
98
E10
D
L
U
95
98
E10
D
L
U
95
98
E10
D
L
Hill
Clayton
Petrol
Price
Num
of
Petrol
Station
Petrol
Price
Max
Petrol
Price
Company
95
98
E
D
L
U
95
98
E
D
L
U
95
98
E
D
L
U
95
98
E
D
L
7-‐Eleven
Based
on
the
above
three
two-‐column
tables,
we
are
now
confident
to
have
the
following
star
schema:
Star
Schema
version-2
Is
there
another
way
to
reduce
the
number
of
fact
measures?
The
above
star
schema
has
24
fact
measures,
because
there
are
6
different
kinds
of
petrol.
The
answer
is
yes!!!
We
can
have
a
new
dimension
called
Petrol
Type
DIM,
which
stores
the
different
kinds
of
petrol.
With
this
new
dimension,
the
fact
measure
is
reduced
to
4,
which
is
the
original
fact
measures
we
had
in
the
beginning.
Therefore,
the
new
star
schema
becomes
as
follows:
With
this
new
star
schema,
as
an
example,
we
can
retrieve
data
that
shows
the
Min
Petrol
Price
of
Premium
98
on
Monday.
Or
another
example
is
to
get
the
Average
Petrol
Price
(which
is
Total
Petrol
Price/Num
of
Petrol
Station)
of
Unleaded
Petrol
in
7-‐Eleven
petrol
station
located
at
Clayton.
Note
that
when
retrieving
data
from
a
data
warehouse,
there
is
no
obligation
that
we
must
use
all
dimensions.
In
the
Monday
Premium
98
example,
only
two
dimensions
are
used,
which
are
Day
of
Week
DIM,
and
Petrol
Type
DIM.
And
in
the
second
example,
only
three
dimensions
are
used:
Petrol
Type
DIM,
Company
DIM,
and
Suburb
DIM.
So,
to
retrieve
data
from
the
data
warehouse,
we
can
use
few
as
one
dimension
table;
but
we
can
also
use
all
dimension
tables.
We
are
not
restricted
to
which
dimension
we
have
to
use.
However,
this
imposes
a
new
problem.
For
example,
the
above
star
schema
allows
us
to
retrieve
the
Min
Petrol
Price
of
Monday,
which
uses
only
Day
of
Week
DIM.
This
data
retrieval
does
not
make
any
sense;
which
petrol
type
are
we
referring
to?
Based
on
this
case
study,
it
is
clear
that
the
Petrol
Type
DIM
holds
the
key
to
any
data
retrieval.
We
must
use
the
Petrol
Type
DIM
in
any
retrieval,
such
as
the
Min
Petrol
Price
of
Premium
98
on
Monday,
or
the
Average
Petrol
Price
of
Unleaded
in
7-‐Eleven
petrol
station
located
in
Clayton
suburb.
The
Petrol
Type
DIM,
in
this
example,
is
called
a
“Determinant
Dimension”
–
a
dimension
which
needs
to
be
used
in
the
data
retrieval,
because
the
fact
measures
are
determined
by
this
dimension.
In
order
to
differentiate
between
Determinant
Dimension
and
all
other
dimensions,
Determinant
Dimension
is
denoted
by
a
star.
The
fact
measures
are
also
denoted
by
a
star.
The
new
star
schema
with
a
Determinant
Dimension
is
as
follows:
vs.
Non-Determinant
Dimensions
1.
The
Olympic
Games
Case
Study
The
Olympic
Games
committee
maintains
an
operational
database
that
stores
all
matches,
games,
as
well
as
the
medal
winners
of
the
Olympic
Games
over
the
years
(https://www.olympic.org/).
We
would
like
to
build
a
data
warehouse
to
analyze
the
medal
counts,
by
each
country,
sport,
and
at
which
Olympic
Games.
There
are
two
possible
star
schemas.
Version-‐1
star
schema
contains
3
dimensions,
whereas
Version-‐2
star
schema
contains
4
dimensions.
For
simplicity,
a
very
minimal
number
of
attributes
are
included
in
each
dimension.
Version-‐1
star
schema
has
CountryDIM,
SportDIM,
and
OlympicGamesDIM
as
the
dimensions.
Three
fact
measures
are
included
in
the
fact
table,
namely
number
of
gold
medals,
silver
medals,
and
bronze
medals.
Star
Schema
star
schema
has
four
dimensions
–
with
an
additional
of
Medal
Type
DIM
(which
is
either
Gold,
Silver,
or
Bronze),
but
there
is
only
one
fact
measure
in
the
Fact
table,
which
is
Num
of
Medals.
Star
Schema
with
the
Medal
Type
DIM
What
is
the
difference
between
these
two
versions
of
star
schema?
Is
Medal
Type
DIM
a
Determinant
Dimension?
In
order
to
answer
these
questions,
we
need
to
visualize
two-‐column
tables
for
each
dimension
category.
2.
Two-Column
Tables
for
Version-1
Star
Schema
(without
the
Medal
Type
DIM)
The
first
two-‐column
table
is
from
the
Country
point
of
view,
which
is
as
follows:
Country
Num
of
Gold
Num
of
Silver
Num
of
Bronze
USA
733
602
488
China
199
143
133
Australia
167
170
189
are
the
number
of
Gold,
Silver
and
Bronze
medals
that
these
country
got
on
all
Olympic
Games
(several
Olympic
Games)
recorded
in
the
operational
database.
Note
that
this
two-‐column
table
methodology
is
to
help
the
data
warehouse
designer
to
visualize
the
view
of
the
fact
measures
from
each
dimension.
The
second
two-‐column
table
is
from
the
Sport
point
of
view.
Assuming
that
in
the
operational
database,
it
records
20
past
Olympic
Games,
and
at
each
Olympic,
there
is
only
one
gold
for
100m
Butterfly
Men,
for
instance.
Sport
Num
of
Num
of
Silver
of
Bronze
100m
Butterfly
Men
20
20
20
Swimming
400m
Freestyle
Women
20
20
20
Swimming
4x100m
Medley
Relay
Men
20
20
20
The
third
two-‐column
table
is
from
the
Olympic
Name
point
of
view.
Olympic
Name
Num
of
Num
of
Silver
of
Bronze
2012
302
304
356
Beijing
2008
302
303
353
Athens
2004
301
301
327
All
the
three
two-‐column
tables
above
seem
to
be
reasonably
correct.
The
first
columns
are
the
categories,
while
the
other
columns
are
the
fact
measures
which
are
numeric
and
aggregate
values.
Because
these
three
two-‐column
tables
make
sense,
we
are
confident
that
version-‐1
star
schema
is
correct.
3.
Two-Column
Tables
for
Version-2
Star
Schema
(with
the
Medal
Type
DIM)
The
two-‐column
tables
for
the
first
three
dimensions,
namely
Country,
Sport,
and
Olympic
Names
are
as
follows:
Country
Num
of
Medals
USA
1823
China
475
Australia
526
Sport
Num
of
Medals
Swimming
100m
Butterfly
Men
60
Swimming
400m
Freestyle
Women
60
Swimming
4x100m
Medley
Relay
Men
60
Name
Num
of
Medals
London
2012
962
Beijing
2008
958
Athens
2004
929
The
question
is
whether
these
two-‐column
tables
make
sense.
If
we
look
at
the
country,
it
makes
sense
to
see
how
many
medals
Australia
has
received
in
all
Olympic
Games;
the
same
with
Sport,
and
Olympic
Names.
Finding
how
many
medals
(regardless
the
medal
types)
for
each
country,
for
each
sport,
and
for
each
Olympic
seems
to
be
reasonable.
The
fourth
two-‐column
table
for
version-‐2
star
schema
is
the
Medal
Type,
which
is
as
follows:
Medal
Type
Num
of
Medals
Gold
4115
Silver
4095
Bronze
4474
This
two-‐column
table
on
Medal
Type
seems
to
be
reasonable
too.
Hence,
version-‐2
star
schema
(with
Medal
Type
DIM)
is
correct.
So
in
conclusion,
both
star
schemas
(with
or
without
Medal
Type
DIM)
are
correct.
Now
going
back
to
the
original
question:
Is
Medal
Type
DIM
a
Determinant
Dimension?
4.
Determinant
or
Non-Determinant
Dimensions
A
“Determinant
Dimension”
is
a
dimension
that
the
fact
measure
relies
on,
and
consequently,
all
data
retrieval
from
the
data
warehouse
must
include
this
dimension.
If
the
data
retrieval
from
the
data
warehouse
does
not
include
this
determinant
dimension,
the
retrieval
result
will
not
make
sense
at
all.
In
the
previous
case
study
on
Petrol
Price,
Petrol
Type
DIM
is
a
Determinant
Dimension
(refer
to
the
star
schema
below).
Note
that
a
Determinant
Dimension
is
denoted
by
a
star.
The
fact
measures
affected
by
the
determinant
dimension
are
also
starred.
The
fact
measures:
Total
Petrol
Price,
Num
of
Petrol
Station,
Min
Petrol
Price,
and
Max
Petrol
Price,
depend
on
the
Petrol
Type,
which
is
indicated
by
the
Petrol
Type
DIM.
That
means
analyzing
the
min
petrol
price
from
the
day
of
week
point
of
view,
must
include
the
petrol
type.
Otherwise,
it
doesn’t
make
sense
to
retrieve
data
to
show
that
on
Monday
the
lowest
(min)
petrol
price
is,
for
example,
109.90cents.
As
it
does
not
indicate
which
petrol
type
it
is,
this
lowest
petrol
price
is
meaningless.
Therefore,
a
better
data
retrieval
is
to
retrieve
the
record
to
show
that
for
example,
on
Monday,
the
min
“Unleaded”
petrol
price
is
109.90cents
(e.g.
Unleaded
is
a
petrol
type
obtained
from
the
Petrol
Type
DIM).
Now
going
back
to
the
Olympic
Games
case
study
(refer
to
Version-‐2
star
schema
with
Medal
Type
DIM).
Is
Medal
Type
DIM
a
Determinant
Dimension?
The
answer
to
this
question
can
be
answered
by
another
question.
To
retrieve
the
data
from
version-‐2
star
schema,
must
we
have
the
information
from
Medal
Type
DIM?
The
answer
is
clearly
no,
because
we
can
simply
retrieve
a
record
from
the
fact
to
show
that
Australia
in
London
2012
Olympic
Games
received
10
medals
in
Swimming.
This
covers
three
dimensions,
namely
Country
(Australia),
Olympic
Name
(London
2012),
and
Sport
(Swimming).
In
this
example,
Medal
Type
DIM
is
not
involved,
and
the
information
retrieved
still
makes
sense.
the
answer
to
the
question
whether
Medal
Type
DIM
is
a
Determinant
Dimension
or
not,
the
answer
is
clearly
No!!
The
next
question
is:
what
is
the
difference
between
the
Olympic
Games
case
study
and
the
Petrol
Price
case
study.
Both
are
very
similar,
but
the
Olympic
Games
Medal
Type
DIM
is
not
a
determinant
dimension,
whereas
the
Petrol
Type
DIM
is
a
determinant
dimension.
The
answer
is
the
aggregate
function
used
in
the
fact
measure.
In
the
Olympic
Games
case
study,
the
fact
measure
function
is
COUNT,
which
is
count
of
medals.
The
breakdown
of
the
medals
is
gold,
silver,
and
bronze;
but
the
main
aggregate
function
of
the
fact
measure
is
number
of
medals,
which
is
a
count.
If
the
fact
measure
is
a
count,
then
the
dimension
(e.g.
Medal
Type
DIM)
is
not
a
determinant
dimension,
because
we
can
still
analyze
the
fact
measure
which
is
the
total
medals
from
other
dimensions,
without
the
medal
type
dimension.
On
the
other
hand,
the
Petrol
Price
case
study
uses
AVG,
MIN,
and
MAX
as
the
aggregation
functions.
Note
that
we
do
not
store
average
as
a
fact
measure,
but
total
price
and
number
of
stations.
These
two
fact
measures
will
be
used
to
calculate
the
average.
Although
the
average
is
not
explicitly
stored
in
the
fact,
implicitly,
the
total
price
and
number
of
stations
represent
the
average.
If
the
aggregate
function
to
calculate
the
fact
measures
is
not
COUNT,
then
a
determinant
dimension
is
needed.
In
this
case,
Petrol
Type
DIM
is
hence
a
determinant
dimension,
because
Min
Petrol
Price,
for
example,
does
not
have
any
meaning
without
petrol
type.
5.
Version-1
(without
Medal
Type
DIM)
vs.
version-2
(with
Medal
Type
DIM)
As
both
versions
in
the
Olympic
Games
case
study
are
correct,
let’s
compare
and
contract
these
two
versions.
In
order
to
do
this,
let’s
have
a
look
at
the
records
in
the
respective
fact
tables.
The
fact
table
for
version-‐1
star
schema
(without
Medal
Type
DIM)
has
6
attributes:
three
from
the
dimensions,
and
the
other
three
for
the
fact
measures.
The
contents
of
the
fact
table
are
as
follows:
Fact
(version-1
star
schema)
Country
Sport
Olympic
Name
Num
of
Num
of
Silver
of
Bronze
Swimming
London
2012
16
9
6
China
Swimming
London
2012
5
1
4
Australia
Swimming
London
2012
1
6
3
The
fact
table
for
version-‐2
star
schema
(with
Medal
Type
DIM)
consists
of
5
columns:
4
from
the
dimension,
but
only
one
fact
measure.
(version-2
star
schema)
Country
Sport
Olympic
Name
Medal
Type
Num
of
Medals
USA
Swimming
London
2012
Gold
16
USA
Swimming
London
2012
Silver
9
USA
Swimming
London
2012
Bronze
6
China
Swimming
London
2012
Gold
5
China
Swimming
London
2012
Silver
1
China
Swimming
London
2012
Bronze
4
Australia
Swimming
London
2012
Gold
1
Australia
Swimming
London
2012
Silver
6
Australia
Swimming
London
2012
Bronze
3
From
the
storage
point
of
view,
it
is
clear
that
version-‐1
is
the
winner.
It
has
only
3
records,
whereas
in
version-‐2,
the
same
information
is
represented
in
9
records.
From
the
modeling
point
of
view,
some
may
prefer
version-‐2,
because
the
model
is
concise
and
more
compact.
When
the
number
of
fact
measure
is
reasonably
large
(like
in
the
Petrol
Price
case
study),
the
star
schema
with
a
determinant
dimension
looks
very
slim
and
compact
–
hence,
it
is
easy
to
understand.
But
consequently,
the
storage
requirement
increases
as
well.
In
contrast,
with
many
different
petrol
types,
if
the
star
schema
does
not
use
a
determinant
dimension,
the
number
of
attributes
in
the
fact
will
dramatically
increase,
and
the
schema
looks
more
complex
and
crowded;
but
the
storage
cost
is
lower.
No!!!!!!!
our data source in the operational database. Note that there are 9 Semester One
and 7 Semester Two records, respectively. Out of the 8 Database unit recodes, 6
of them are Semester One, and 2 of them are Semester Two.
Ucode Utitle Semester Sfname Score
IT001 Database 1 Allan 41
IT001 Database 1 Ben 74
IT001 Database 1 Kate 85
IT001 Database 1 Larry 87
IT001 Database 1 Leonard 75
IT001 Database 2 Juan 64
IT001 Database 2 Andy 32
IT002 Java 1 Ally 65
IT002 Java 1 Menson 47
IT002 Java 2 Mirriam 78
IT002 Java 2 Ben 73
IT002 Java 2 Larry 64
IT003 SAP 1 Ally 63
IT004 Network 2 Juan 53
IT004 Network 2 Menson 52
dimensions. The dimensions are: Subject and Semester (One or Two); and the
fact measure is Average Score.
are subject and semester. If we store Average Score in the fact table, this is how
the fact table will look like:
UCode Semester Average_Score
UCode UTitle
IT002 Java
IT003 SAP
Semester
2
One is 73.833 (average of the first 6 score records); the average score for the unit
Database in Semester Two is 48 ((64 + 32)/2).
It looks correct. But actually it is incorrect.
score of the Database unit, by looking at the above fact table, the answer would
be (73.833+48)/2= 60.9165.
From FactTable
Where UCode = ‘IT001’;
One and Two (see the first eight records in the operational database). If we sum
all the score of these eight records and divided by eight records, the result will be
539/8=67.375; not 60.9165.
(56+71.667)/2=63.833. The actual average score for Java unit in Semester One
and Two is not 63.833, but 65.4 (see the next 5 records in the above score list,
and sum these scores and then divide by 5, 327/5). So again, the above fact table,
which stores the average score, will not produce correct results.
above fact is (73.833+56+63)/3=64.278. In the above score list records, there are
nine Semester One records, and the average is in fact 68.667.
(48+71.667+52.5)/3=57.389; whereas the actual average score for the seven
Semester Two records is 59.4286.
How do we solve the above problems?
each aggregate group. Hence, the fact table should look like this.
UCode Semester Total_Score NumberofStudents
UCode UTitle
IT002 Java
IT003 SAP
Semester
2
Database unit, which is (443+96)/(6+2)=67.375.
From FactTable2
Where UCode = ‘IT001’;
problem. This problem is well known in Mathematics. Average of an average will
simply produce an incorrect average result (almost all the time). Hence, it is not
desirable to have an average measure in the fact – unless the analysis ALWAYS
uses all the dimensions.
Because Max of Max is always a global max, and Min of Min is always a global
min. For example, using the above sample data, assume we have Max_Score and
Min_Score in the Fact, as follows:
Ucode Semester Min_Score Max_Score
UCode UTitle
IT002 Java
IT003 SAP
Semester
2
produce 87, and 87 is the maximum score of IT001, because 87 is the max in
semester 1, which is greater than any max of IT001 (e.g. in semester 2). In other
words, “Max of Max” is correct.
From FactTable3
Where UCode = ‘IT001’;
the result will be 32, which is the minimum between 41 and 32.
From FactTable3
Where UCode = ‘IT001’;
the minimum of Max_Score would be meaningless; the same as retrieving the
maximum of Min_Score.
two criteria of the fact (e.g. must be a numerical and aggregate value)
are valid fact measures (e.g. they are numerical and aggregated values)
Sum is “total of”.