Experiences of A Curious Mind

Monday, 19 March 2018

Cricket Insights with Oracle BI

Target Audience

Oracle BI Professionals - You can learn building a new data model and challenges involved in the development.
Other BI Professionals - You may want to try this out in your BI tool (Cognos, PowerBI, Tableau, QlikView, etc.)
Others - Those who love cricket and want to analyze data from different perspectives.

Background

The sports industry has been using analytical tools extensively from a long time. In particular, cricket is leading in this space with rich visualizations for scorecards, player comparison and other stats. I don't know what tool / technology is used to build such appealing and interactive reports displayed during the match, but I thought I can apply my knowledge in Business Intelligence to do similar analysis using Oracle BI. In this blog, we will see how to build an analytics application for cricket from scratch.

Objective and Design Approach

It is important to have clear understanding of expectations from Cricket Analytics application so that we are on the same page. There can be different types of applications as listed below -

Basic Application
It should have a dashboard where you can filter on the match format, tournament and year. This should display a list of matches for the filter criteria. You should then be able to click on particular match to view the match summary and scorecards.

Advanced Application
As the name suggests, it should have advanced features such as metrics or calculations to do comparative analysis, rating system / methodology, etc. With this application, we should be able to identify best performances in batting, bowling and overall for different match formats.

Out of the Scope
Well, there are things that can only be identified with an expert eye and is beyond the capability of Cricket Analytics application. The best wicket taken by a bowler, the best catch by a fielder or the best shot played by a batsman fall in this category. An apt example for this case is the famous "Ball of the 21st Century" in the 3rd test match between Australia and New Zealand at WACA, Perth held during 14th-18th December 2017. The Mitchell Starc's dismissal of James Vince has been widely appreciated by many bowling legends. May be in the future, an Artificial Intelligence (AI) application with very detailed dataset will be able to identify such moments in the cricket history.

Let's continue with the basic application and later we can take up the advanced part. The below image is a visual representation of wireframe model of the basic application.

Image 1 - Wireframe Model of Cricket Insights Application

The standard data warehouse implementation involves data model design which means identifying the key entities / dimensions and events / facts and determine how they relate to each other. I know this process sounds very technical, so let's break it down so that its easier to understand the terminologies in data warehousing domain.

Let's look at the questions and answers in the below image.

Image 2 - Data Model Design 1

We can observe that the first set of questions have textual answers while the second set has numerical answers. In data warehousing, the dimensions have descriptive attributes and facts store numerical information. Here, the dimension Venue can have attributes like stadium name and city, the dimension Date can have attributes like month and year, the dimension Players can have attributes like name, date of birth, country. The facts basically represent an event or activity or process. The match between two teams is an activity in which runs are scored, so it's a good candidate for fact.

As we have basic understanding of dimensions and facts, we will now define the relationships among them. A fact is typically connected with a number of dimensions with one-to-many (1:M) relationship. The Players dimension can have one record for Kapil Dev and one record for Clive Lloyd, but they have scored / played in a lot of matches so the fact can have multiple records for them. The same can be said about Venue - mutiple matches played at a stadium and Date - mutiple matches played on a day. There are more factors involved in a match as showcased in the below image.

Image 3 - Data Model Design 2

In the above diagram, Player of the Match, Batsman and Bowler represent Players dimension which is used in the different context. In the same way, Team 1 (Batting) and Team 2 (Fielding) are different representations of Team dimension. We can also have a common dimension that will store various list of values like dismissal type and extras type. This concept of using a dimension in different context is referred as "Role Playing Dimension". The fact Match has references (foreign keys) for the connected dimensions and additional details like toss and match win summary. I have created following variations for facts -
   *    Fact - Score    →   This records runs scored and wickets on ball-by-ball basis.
   *    Fact - Scorecard Batting    →   This records performance details of each batsman.
   *    Fact - Scorecard Bowling    →   This records performance details of each bowler.

The first fact contains the information at the detailed level with runs scored on each ball, extras and wickets. The second and third facts have data aggregated at player level. The batting scorecard has batsman's performance recorded as runs scored, balls played, strike rate, fours and sixes hit. The last fact stores bowler's performance as overs bowled, wickets taken, economy and dot balls.

Acquiring the Data

We have talked about what to expect from Cricket Analytics application and some key concepts in data warehousing to design the data model for the application. In order to proceed further, we need data that will be displayed and analyzed in the reports. There are many sites that have current and historical cricket match results. Some of the prominent ones are ICC Cricket, Cricbuzz, ESPN and BBC. I chose to get the data from ICC Cricket (www.icc-cricket.com) as it is the official website of the governing body of world cricket.

It is fairly easy to pull match results and scorecards from any of the famous websites. However, as we are developing an analytics application for cricket we require details on ball-by-ball basis. On top of this, we need this information in the specific format (dimensions and facts) we discussed during design of the data model. This was a hard part, I wrote some code to scrap data from the commentary section of World Cup 2015 and then lot of data clean-up was done to bring it in the desired format. The intention of this blog is to focus on design and development of a new analytics application, so the data acquisition is described at high level only and steps for data scraping is not covered in detail. I understand some readers would be interested in the final dataset, so I have made it available here to download and analyze the data.

Download Here →

Building Oracle BI Repository

Our next move is to build the Oracle BI repository from scratch, this means we create a blank / new repository and then build all 3 layers one-by- one.

Physical Layer
We start with importing all the tables (dimensions and facts) from the database. Then we create the aliases for each table as this is one of the best practices in OBIEE development. Some tables have more than one alias for the reason discussed during the design phase which is Role Playing Dimension. There is also an alias created for the fact table as it joins to other facts.

Table Name	Alias Name	Comments
DIM_PLAYERS	D_Players	Connected with 3 facts - Score, Batting and Bowling.
DIM_PLAYERS	D_Player_of_Match	Connected with 1 facts - Matches.
DIM_TEAMS	D_Teams	Connected with 2 facts - Matches and Score.
DIM_TEAMS	D_Opposition	Connected with 1 fact - Matches.
FACT_MATCHES	F_Matches	Connected with multiple dimensions and 1 fact - Score.
FACT_MATCHES	F_Matches#1	Connected with 2 facts - Batting and Bowling.

The tables are joined as shown in the below image, there are multiple star schemas represented in the single diagram. A key point to be noted here is D_Opposition is joined with F_Matches but not with F_Score. This is because there are two teams playing a match but only one team is scoring at a time.

Image 4 - Physical Layer Star Schema Diagram

Logical Layer
The multiple physical star schemas are consolidated into one in the logical layer. This is achieved by combining all facts together to create a single logical fact table with multiple Logical Table Sources (LTS) which are mapped to physical fact tables. The key changes done in logical layer are -
   *    Inner Join    →   F_Matches#1 is inner joined with Batting and Bowling Scorecard facts.
   *    Inner Join    →   F_Matches is inner joined with Score fact.
   *    Aggregation Rule    →   Rule is set appropriately for fact columns.
   *    Dimension Hierarchy    →   Hierarchy is created for each logical dimension.
   *    Content Setting    →   Logical level set in Content tab of fact LTS.

There are various reports created on this logical model. In some cases, we expect the background SQL to have Batting Scorecard to be joined with Matches#1 while in other case it should be joined with Matches. These types are scenarios are handled by proper design of LTS and inner joins, and I consider this as the one of the most important aspects of OBIEE development.

Image 5 - Logical Layer Star Schema Diagram

Image 6 - Logical Layer Configuration

Presentation Layer
In this layer, the logical tables are carried over and then renamed appropriately. The logical fact is broken down into 4 separate facts in presentaion layer just like our original design.

Image 7 - Presentation Layer

Download Here →

Password - Admin123, Version - 325 / 11.1.1.7.0

Creating Cricket Dashboard

As per the wireframe model, we will first create a master report "Match List" and then a set of detail reports "Batting Scorecard", "Bowling Scorecard", "Fall of Wickets", etc. The navigation from master to detail reports will be achieved by passing Match Number as parameter to display scorecards for a specific match.

Master Report
The master report in the tabular format is easy to create, just select the columns for report criteria and it is done. But it does not look good enough in this format, it needs to be modified for better user experience. We can use Narrative View to enhance visualization of data with HTML contents. Below image shows the data displayed in the basic table and customized narrative format, you can see the difference.

Image 8 - Master Report - Match List

Navigation
When a user clicks on the "View Scorecard" button, the HTML <a> tag leverages GO URL functionality of OBIEE to navigate to detail page and also passes Match Number as parameter.

< a href="saw.dll?
Portal&Path=/shared/MyCricketCatalog/_portal/Tournaments
&page=Match%20Detail&Action=Navigate
&P0=1&P1=eq&P2="Fact - Matches"."Match Number"&P3=@1" >
View Scorecard < /a >

Detail Reports
We have a following set of detail reports -

→ Inning 1 - Batting Scorecard	→ Inning 2 - Batting Scorecard
→ Inning 1 - Bowling Scorecard	→ Inning 2 - Bowling Scorecard
→ Inning 1 - Fall of Wickets	→ Inning 2 - Fall of Wickets
→ Inning 1 - Extras	→ Inning 2 - Extras

All of these reports are filtered for Match Number which is dynamically passed during navingation from master report. There is lot of formatting (colour, font, spacing, etc.) done on these reports to give it the look of a typical scorecard.

Image 9 - Detail Reports - Scorecards

Download Here →

Advanced Application

While the basic application allows us to search matches and view data, the advanced application has the capability to explore the insights hidden in the data. We just need good statistical measures and detailed dataset to find these insights. Below are some examples of what kind of reports can be built for advanced analysis -

Image 10 - Advanced Application - Example Reports

If anyone is interested in advanced analysis and detailed dataset is available, then you can develop your own formulae or models to analyze the performances based on different parameters. This will require modification or extension of data model design to accomodate new data points and customized calculations.

Friday, 5 January 2018

Visualizing Mandelbrot Set in Oracle BI using D3

Target Audience

Oracle BI Professionals - You can learn building new visualizations using JavaScript library.
Other BI Professionals - You may want to try this out in your BI tool (Cognos, PowerBI, Tableau, QlikView, etc.)
Others 1 - Those who want to learn a mathematical concept.
Others 2 - Those who love intricate / artistic patterns. You can just download the file and enjoy the beautiful patterns.
Others 3 - Those who want to test the processing power of their computer.

Background

For a long time, I was thinking about bringing together mathematics (my favorite subject since childhood) and Oracle BI (my passion and skillset). While browsing the internet for various math topics, I came across Mandelbrot Set which comes under the family of fractals. The thing that amazed me was that a small mathematical formula can result in very visually appealing patterns. In this blog, we will see what is a Mandelbrot Set and how to create Oracle BI report to visualize the set.

What is Mandelbrot Set?

It is a set of numbers "c" in the complex plane for which the absolute value of the quadratic function remains bounded when iterated from "z = 0". Below is the function -

z_n+1 = z_n² + c

It appears very difficult to understand but actually it's quite simple when you know the terminology used. Below image breaks down the definition-

Image 1 - Terminologies in the Mandelbrot Set definition

When we plot the points in Mandelbrot Set on the complex plane, the resulting image is an intricate pattern which repeats itself over-and-over again when we zoom in. The points in the green color belong to the Mandelbrot Set and points in other colors have become un-bounded after certain range of iterations.

Image 2 - Mandelbrot Set diagram

Preparing the Dataset

To visualize Mandelbrot Set in Oracle BI, we need to have the dataset. So let’s see the detail steps and calculations involved in identifying the points in Mandelbrot Set. The first step is to generate a grid of points and we are specifically interested in the points that lie in the range -2 and 2 on the real axis and -2i and 2i on the imaginary axis. This is because it is known that any number outside this grid will definitely break one of the criteria (bounded) and the absolute value will go to infinity when iterated for the quadratic equation.

Image 3 - Complex Planes with different density

It is fairly easy to generate the grid of points by nesting two loops in a PL-SQL procedure. As each point on the grid represents a complex number, there are two associated values – real part and imaginary part. For example, the complex number -1.3 + 0.7i has the real part as -1.3 and imaginary part as 0.7. As we have the grid of points, we can proceed with the calculations. The formula has 3 complex numbers – output number, input number and constant number. Let’s represent these numbers –

Complex Number	Real Part	Imaginary Part
Output Number, z_n+1	R3	I3
Input Number, z_n	R2	I2
Constant / Grid Point, c	R1	I1

z_n+1 = z_n² + c
R3 + I3 * i = (R2 + I2 * i)² + (R1 + I1 * i)	..... Expressing in terms of Real and Imaginary parts
R3 + I3 * i = (R2² + 2 * R2 * I2 * i + I2² * i²) + (R1 + I1 * i)	..... Expanding square expression
R3 + I3 * i = (R2² + 2 * R2 * I2 * i - I2²) + (R1 + I1 * i)	..... Replacing i² by -1
R3 + I3 * i = (R2² - I2² + R1) + (2 * R2 * I2 + I1) * i	..... Re-arranging the parts

This gives us the formula for real and imaginary parts of the output number as –

R3 = R2² - I2² + R1

I3 = 2 * R2 * I2 + I1

Absolute Value = √(R3² + I3²)

Let’s try out two complex numbers on the grid and see if they qualify to be part of the Mandelbrot Set.

z₁ = z₀² + c

→

z₂ = z₁² + c

→

z₃ = z₂² + c

→

z₄ = z₃² + c

→

z₅ = z₄² + c

*Grid Point (c) = -1.996 + 0.01 i**			*Grid Point (c) = 0.5 - 0.85 i**
Seq	Complex Number	Abs. Value	Seq	Complex Number	Abs. Value
z₀	0 + 0 * i	0	z₀	0 + 0 * i	0
z₁	- 1.996 + 0.01 * i	1.99602505	z₁	0.5 – 0.85 * i	0.986154146
z₂	1.987916 – 0.02992 * i	1.988141149	z₂	0.0275 – 1.7 * i	1.700222412
z₃	1.954914817 – 0.108956893 * i	1.957948811	z₃	- 2.38924375 - 0.9435 * i	2.568789199
z₄	1.813820336 – 0.416002891 * i	1.860914457	z₄	No need for further calculations as value greater than 2, hence unbounded.
z₅	1.120885805 – 1.499109006 * i	1.871820718	z₅

And the winner after 5 iterations → -1.996 + 0.01 * i
Here we have stopped after doing 5 iterations, but this complex number may not remain bounded if we do more iterations. Here is another challenge for you -

Challenge #2 - How many iterations can you do with your computer?

Let’s insert this information in the database table.

REAL_PART	IMG_PART	BOUNDED	ITERATIONS
- 1.996	0.01	YES	5
0.5	- 0.85	NO	3

Modelling in Oracle BI

In the Oracle BI Repository,

Physical Layer	→	Import the table from database and create two aliases - dimension and fact. A join can be created on any column or even dummy column (1=1) to complete the formality of building Star Schema.
Logical Layer	→	Next, move this model to logical layer. No need to create hierarchies or metrics here.
Presentation Layer	→	Bring only one table to this layer - dimension or fact, anyone is fine.

In the Oracle BI Answers,
For visualizing this data set, we will use Scatter Plot. The points will be displayed in different colors based on the number of iterations. For example, points with 5 iterations can be displayed in green color, 4 iterations as blue, 3 iterations as yellow and so on.

Image 4 - Mandelbrot Set diagram with standard graph in Oracle BI

This is how it looks with standard chart option in Oracle BI, but it does not look similar to actual diagram. We need something that will give us complete control on the formatting like point size, color, axis, interaction, etc. There are lot of JavaScript libraries available as open-source that can be integrated with Oracle BI and be utilized for advanced or very interactive visualizations. Here, we will use D3 (Data Driven Documents). The pre-requisite for this implementation is basic understanding of HTML, CSS and Javascript, this would help in writing and modifying the code to suit our visualization requirements.

The integration of D3 library requires following actions –
1. Download the library (https://d3js.org/).
     → It is just one file with .js extension. The file size is less than 1 MB.
     → Please note the version, there are syntax differences while coding.
2. Deploy the downloaded library to the Oracle BI Server.
     → It can be copied to Managed Server path (tmp\_WL_user\analytics_11.1.1\7dezjl\war\res \b_mozilla).
     → It can also be "deployed as an application" from the Admin Console.

Now we are all set to start visualizing Mandelbrot Set as an Oracle BI report using D3 libraries, we need to use Narrative View for this purpose. The process involves referencing D3 library (prefix section), binding data (narrative section) and manipulating the dataset (postfix section). The first two parts are pretty easy, however the last part involves writing lot of javascript code for drawing x-y axes, plotting points, scaling, coloring, etc.

Prefix Section
     → Reference the D3 library using <script> tag.
     → Declare an empty array. Example - mbset = [];

Narrative Section
     → Bind the data to array using push method.
     → Example - mbset.push({BOUNDED:@1, REAL:@2, IMG:@3});
     → Please note @1, @2 and @3 are column sequence in the Criteria tab of the report.

Postfix Section
     → Define the axes using axisBottom() and axisLeft() functions.
     → To utilize more screen space, scale-up the axes using scaleLinear() function.
     → To plot and format(color, size and position) grid points, use SVG elements.

Image 5 - JavaScript Coding in the Narrative View

It is now time to see the final result -

Image 6 - Mandelbrot Set diagram with D3 Visualization in Oracle BI

About The Challenges

I have tried different combinations of grid density and iterations and below are the performances on my personal laptop -

Grid Density	Unit Step	Iterations	Execution Time
400 x 400	0.0100	40	13.34 seconds
800 x 800	0.0050	40	62.744 seconds
800 x 800	0.0050	80	76.22 seconds
2000 x 2000	0.0020	200	708.73 seconds
4000 x 4000	0.0010	100	1994.74 seconds
8000 x 8000	0.0005	100	8711.18 seconds

If you want to test performance / limits of your computer, you can go for denser grid like 100k x 100k, 1 million x 1 million, 2 million x 2 million, etc. These challenges may also help you out to check efficiency of PL-SQL code. There is absolutely no limitation to how far you can go with respect to grid density or iterations, you can search and check out YouTube videos to understand this.

Download

At the below link, there is a zip file which contains -
→ HTML file with JavaScript code
→ CSV file with Mandelbrot Set data
→ Close-up images of different parts of Mandelbrot Set

Download Here →

The HTML file can be modified to see how different aspects of the code work, you can play with the chart area by modifying iteration ranges or color codes.

Note : I have tested this in Firefox 57.0.4 and it works fine. However, there could be browser or version specific issues. Also, it takes a while to render the image in the browser.

Thursday, 22 October 2015

My First Technical Solution

Background

It was year 2008, about a year in the IT career. I was doing a proof-of-concept for modelling dimension hierarchy table in OBIEE. I had created a table with sample data and next step was to create the hierarchy table for it. The available method to achieve this was using a complex Informatica mapping and I hated Informatica that time. The idea of using Informatica mapping for such a simple task(IMO that time) of converting dimension table to dimension hierarchy didn't make sense to me. I was looking for a simple SQL that would do the job. I soon realized that there is no SQL method for doing this, a fact that my mind refused to accept. So I decided to build it myself.

The Problem

To display all hierarchial reporting manager details on a single record.

Convert this information-

Unique_ID	Emp_Name	Position	Parent_ID
1032	Sandhya Shataram	President	NULL
1033	Jayshree Gadkar	Vice President	1032
1034	Varsha Usgaonkar	Senior Manager	1033
1035	Sonalee Kulkarni	Manager	1034
1036	Ketaki Mategaonkar	Sales Representative	1035

Into hierarchial format on one line like this-

Level5_Emp_Name	Level_5_Position	Level4_Emp_Name	Level4_Position	Level3_Emp_Name	Level3_Position	Level2_Emp_Name	Level2_Position	Level1_Emp_Name	Level1_Position
Sandhya Shataram	President	Sandhya Shataram	President	Sandhya Shataram	President	Sandhya Shataram	President	Sandhya Shataram	President
Jayshree Gadkar	Vice President	Jayshree Gadkar	Vice President	Jayshree Gadkar	Vice President	Jayshree Gadkar	Vice President	Sandhya Shataram	President
Varsha Usgaonkar	Senior Manager	Varsha Usgaonkar	Senior Manager	Varsha Usgaonkar	Senior Manager	Jayshree Gadkar	Vice President	Sandhya Shataram	President
Sonalee Kulkarni	Manager	Sonalee Kulkarni	Manager	Varsha Usgaonkar	Senior Manager	Jayshree Gadkar	Vice President	Sandhya Shataram	President
Ketaki Mategaonkar	Sales Representative	Sonalee Kulkarni	Manager	Varsha Usgaonkar	Senior Manager	Jayshree Gadkar	Vice President	Sandhya Shataram	President

Sample Data

Click here for the code --> CREATE TABLE and INSERT INTO Scripts

PL-SQL Solution

Click here for the code --> PL-SQL Script

This solution involves inserting records in the dimension hierarchy table for each level in dimension table one-by-one. The logical steps in the PL-SQL procedure are as follows -
1. Identify the records with highest level in the dimension table and load these into a dummy table.
2. For each record in the dummy table, insert a record in the dimension hierarchy table.
3. Next, identify the records with second highest level in the dimension table and load these into another dummy table.
4. For each record in this new dummy table, insert a record in the dimension hierarchy table.
5. Repeat steps 3 and 4 for all lower levels in the dimension table.

Wow!!!!! I did it, found a solution for the problem and it feels great. But this euphoria didn't last long because the SQL solution is not simple. There are too many lines, loops, conditions in the code. THE SOLUTION MUST BE SIMPLE! Couple of weeks passed thinking about alternate solution, is it possible to do it another way? One day I realized that the PL-SQL solution, at the design level, does the job on a row-by-row basis. If I have to do it another way, it should be, at the design level, on a column-by-column basis. I got the direction for my second solution and so started working on it.

SQL Solution

Click here for the code --> SQL Script

This solution involves preparing an SQL statement with multiple self-joins to dimension table and decode statements to select correct values for each column.

Afterthoughts

So I found the solution for the problem, twice! However, it does not replace Informatica mapping as it's a proven ETL tool with many large scale implementations. But these solutions provides alternative ways to achieve the same objective. This gave me a sense of satisfaction and confidence to solve the problems.

Thanks for spending your time to read my blog!