Monday, 19 March 2018

Cricket Insights with Oracle BI

Cricket Insights with Oracle BI

Target Audience

Oracle BI Professionals - You can learn building a new data model and challenges involved in the development.
Other BI Professionals - You may want to try this out in your BI tool (Cognos, PowerBI, Tableau, QlikView, etc.)
Others - Those who love cricket and want to analyze data from different perspectives.

Background

The sports industry has been using analytical tools extensively from a long time. In particular, cricket is leading in this space with rich visualizations for scorecards, player comparison and other stats. I don't know what tool / technology is used to build such appealing and interactive reports displayed during the match, but I thought I can apply my knowledge in Business Intelligence to do similar analysis using Oracle BI. In this blog, we will see how to build an analytics application for cricket from scratch.

Objective and Design Approach

It is important to have clear understanding of expectations from Cricket Analytics application so that we are on the same page. There can be different types of applications as listed below -

Basic Application
It should have a dashboard where you can filter on the match format, tournament and year. This should display a list of matches for the filter criteria. You should then be able to click on particular match to view the match summary and scorecards.

Advanced Application
As the name suggests, it should have advanced features such as metrics or calculations to do comparative analysis, rating system / methodology, etc. With this application, we should be able to identify best performances in batting, bowling and overall for different match formats.

Out of the Scope
Well, there are things that can only be identified with an expert eye and is beyond the capability of Cricket Analytics application. The best wicket taken by a bowler, the best catch by a fielder or the best shot played by a batsman fall in this category. An apt example for this case is the famous "Ball of the 21st Century" in the 3rd test match between Australia and New Zealand at WACA, Perth held during 14th-18th December 2017. The Mitchell Starc's dismissal of James Vince has been widely appreciated by many bowling legends. May be in the future, an Artificial Intelligence (AI) application with very detailed dataset will be able to identify such moments in the cricket history.

Let's continue with the basic application and later we can take up the advanced part. The below image is a visual representation of wireframe model of the basic application.

Image 1 - Wireframe Model of Cricket Insights Application


The standard data warehouse implementation involves data model design which means identifying the key entities / dimensions and events / facts and determine how they relate to each other. I know this process sounds very technical, so let's break it down so that its easier to understand the terminologies in data warehousing domain.

Let's look at the questions and answers in the below image.

Image 2 - Data Model Design 1


We can observe that the first set of questions have textual answers while the second set has numerical answers. In data warehousing, the dimensions have descriptive attributes and facts store numerical information. Here, the dimension Venue can have attributes like stadium name and city, the dimension Date can have attributes like month and year, the dimension Players can have attributes like name, date of birth, country. The facts basically represent an event or activity or process. The match between two teams is an activity in which runs are scored, so it's a good candidate for fact.

As we have basic understanding of dimensions and facts, we will now define the relationships among them. A fact is typically connected with a number of dimensions with one-to-many (1:M) relationship. The Players dimension can have one record for Kapil Dev and one record for Clive Lloyd, but they have scored / played in a lot of matches so the fact can have multiple records for them. The same can be said about Venue - mutiple matches played at a stadium and Date - mutiple matches played on a day. There are more factors involved in a match as showcased in the below image.

Image 3 - Data Model Design 2


In the above diagram, Player of the Match, Batsman and Bowler represent Players dimension which is used in the different context. In the same way, Team 1 (Batting) and Team 2 (Fielding) are different representations of Team dimension. We can also have a common dimension that will store various list of values like dismissal type and extras type. This concept of using a dimension in different context is referred as "Role Playing Dimension". The fact Match has references (foreign keys) for the connected dimensions and additional details like toss and match win summary. I have created following variations for facts -
   *    Fact - Score    →   This records runs scored and wickets on ball-by-ball basis.
   *    Fact - Scorecard Batting    →   This records performance details of each batsman.
   *    Fact - Scorecard Bowling    →   This records performance details of each bowler.

The first fact contains the information at the detailed level with runs scored on each ball, extras and wickets. The second and third facts have data aggregated at player level. The batting scorecard has batsman's performance recorded as runs scored, balls played, strike rate, fours and sixes hit. The last fact stores bowler's performance as overs bowled, wickets taken, economy and dot balls.

Acquiring the Data

We have talked about what to expect from Cricket Analytics application and some key concepts in data warehousing to design the data model for the application. In order to proceed further, we need data that will be displayed and analyzed in the reports. There are many sites that have current and historical cricket match results. Some of the prominent ones are ICC Cricket, Cricbuzz, ESPN and BBC. I chose to get the data from ICC Cricket (www.icc-cricket.com) as it is the official website of the governing body of world cricket.

It is fairly easy to pull match results and scorecards from any of the famous websites. However, as we are developing an analytics application for cricket we require details on ball-by-ball basis. On top of this, we need this information in the specific format (dimensions and facts) we discussed during design of the data model. This was a hard part, I wrote some code to scrap data from the commentary section of World Cup 2015 and then lot of data clean-up was done to bring it in the desired format. The intention of this blog is to focus on design and development of a new analytics application, so the data acquisition is described at high level only and steps for data scraping is not covered in detail. I understand some readers would be interested in the final dataset, so I have made it available here to download and analyze the data.

      Download Here    →  


Building Oracle BI Repository

Our next move is to build the Oracle BI repository from scratch, this means we create a blank / new repository and then build all 3 layers one-by- one.

Physical Layer
We start with importing all the tables (dimensions and facts) from the database. Then we create the aliases for each table as this is one of the best practices in OBIEE development. Some tables have more than one alias for the reason discussed during the design phase which is Role Playing Dimension. There is also an alias created for the fact table as it joins to other facts.

  Table Name   Alias Name   Comments
  DIM_PLAYERS   D_Players   Connected with 3 facts - Score, Batting and Bowling.
  DIM_PLAYERS   D_Player_of_Match   Connected with 1 facts - Matches.
  DIM_TEAMS   D_Teams   Connected with 2 facts - Matches and Score.
  DIM_TEAMS   D_Opposition   Connected with 1 fact - Matches.
  FACT_MATCHES   F_Matches   Connected with multiple dimensions and 1 fact - Score.
  FACT_MATCHES   F_Matches#1   Connected with 2 facts - Batting and Bowling.

The tables are joined as shown in the below image, there are multiple star schemas represented in the single diagram. A key point to be noted here is D_Opposition is joined with F_Matches but not with F_Score. This is because there are two teams playing a match but only one team is scoring at a time.

Image 4 - Physical Layer Star Schema Diagram


Logical Layer
The multiple physical star schemas are consolidated into one in the logical layer. This is achieved by combining all facts together to create a single logical fact table with multiple Logical Table Sources (LTS) which are mapped to physical fact tables. The key changes done in logical layer are -
   *    Inner Join    →   F_Matches#1 is inner joined with Batting and Bowling Scorecard facts.
   *    Inner Join    →   F_Matches is inner joined with Score fact.
   *    Aggregation Rule    →   Rule is set appropriately for fact columns.
   *    Dimension Hierarchy    →   Hierarchy is created for each logical dimension.
   *    Content Setting    →   Logical level set in Content tab of fact LTS.

There are various reports created on this logical model. In some cases, we expect the background SQL to have Batting Scorecard to be joined with Matches#1 while in other case it should be joined with Matches. These types are scenarios are handled by proper design of LTS and inner joins, and I consider this as the one of the most important aspects of OBIEE development.

Image 5 - Logical Layer Star Schema Diagram

Image 6 - Logical Layer Configuration


Presentation Layer
In this layer, the logical tables are carried over and then renamed appropriately. The logical fact is broken down into 4 separate facts in presentaion layer just like our original design.

Image 7 - Presentation Layer

      Download Here    → Password - Admin123, Version - 325 / 11.1.1.7.0


Creating Cricket Dashboard

As per the wireframe model, we will first create a master report "Match List" and then a set of detail reports "Batting Scorecard", "Bowling Scorecard", "Fall of Wickets", etc. The navigation from master to detail reports will be achieved by passing Match Number as parameter to display scorecards for a specific match.

Master Report
The master report in the tabular format is easy to create, just select the columns for report criteria and it is done. But it does not look good enough in this format, it needs to be modified for better user experience. We can use Narrative View to enhance visualization of data with HTML contents. Below image shows the data displayed in the basic table and customized narrative format, you can see the difference.

Image 8 - Master Report - Match List


Navigation
When a user clicks on the "View Scorecard" button, the HTML <a> tag leverages GO URL functionality of OBIEE to navigate to detail page and also passes Match Number as parameter.

  < a href="saw.dll?
Portal&Path=/shared/MyCricketCatalog/_portal/Tournaments
&page=Match%20Detail&Action=Navigate
&P0=1&P1=eq&P2="Fact - Matches"."Match Number"&P3=@1" >
View Scorecard < /a >
 

Detail Reports
We have a following set of detail reports -
   →   Inning 1 - Batting Scorecard    →   Inning 2 - Batting Scorecard
   →   Inning 1 - Bowling Scorecard    →   Inning 2 - Bowling Scorecard
   →   Inning 1 - Fall of Wickets    →   Inning 2 - Fall of Wickets
   →   Inning 1 - Extras    →   Inning 2 - Extras
All of these reports are filtered for Match Number which is dynamically passed during navingation from master report. There is lot of formatting (colour, font, spacing, etc.) done on these reports to give it the look of a typical scorecard.

Image 9 - Detail Reports - Scorecards


      Download Here    →  


Advanced Application

While the basic application allows us to search matches and view data, the advanced application has the capability to explore the insights hidden in the data. We just need good statistical measures and detailed dataset to find these insights. Below are some examples of what kind of reports can be built for advanced analysis -

Image 10 - Advanced Application - Example Reports


If anyone is interested in advanced analysis and detailed dataset is available, then you can develop your own formulae or models to analyze the performances based on different parameters. This will require modification or extension of data model design to accomodate new data points and customized calculations.

No comments:

Post a Comment