Why we need data integration and what can we do?

The first question is why we need data integration?

Let me give you an example here to answer this question.

Every company has many departments, and different departments use different tools to store their data. For example, marketing team may use hubspot tool.

Now, we have different departments which store different types of data in a company.

However, insightful information is needed to make business decisions through those large amount of data.

What can we do?

Maybe we can connect all the databases everytime to generate reports. However, it will cost us large amount of time, then the term of data integration is raised.

What is data integration?

Data integration is a process in which heterogeneous data is retrieved and combined as an incorporated form and structure.

There are some advantages of data integration:

  1. Reduce data complexity
  2. Data integrity
  3. Easy data collaboration
  4. Smarter business decisions

There are also some well-known tools can do data integration, e.g., Microsoft SQL Server Integration Services (SSIS), Informatica, Oracle Data Integrator, AWS Glue etc.

This blog we will talk about SSIS, which is one of the most popular ETL tool in New Zealand.

Why SSIS?

  1. Data can be loaded in parallel to many varied destinations
  2. SSIS removes the need of hard core programmers
  3. Tight integration with other products of Microsoft
  4. SSIS is cheaper than most other ETL tools

SSIS stands for SQL Server Integration Services, which is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data integration and data transformation tasks.

What can SSIS do?

It combines the data residing in different sources and provides users with an unified view of these data.

Also, it can also be used to automate maintenance of SQL Server databases and update to multidimensional analytical data.

How is works?

These tasks of data transformation and workflow creation is carried out using SSIS package:

Operational Data–>ETL–>Data Warehouse

In the first place, operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. This is the place where most of the data used in current operation is housed before it’s tranferred to the data warehouse for longer tem storage or achiiving.

Next step is ETL(Extract, Transform and Load), the process of extracting the data from various sources, tranforming this data to meet your requirement and then loading into a target data warehouse.

The third step is data warehousing, which is a large set of data accumulated which is used for assembling and managing data from various sources of answering business questions. Hence, helps in making decisions.

What is SSIS package?

It is an object that implements integration services functionality to extract, transform and load data. A package is composed of:

  1. connection
  2. control flow elements(handle workflow)
  3. data flow elements(handle data transform)

If you want to investigate more about SSIS, check it out on Microsoft official documents.

If you are interested in or have any problems with SSIS, feel free to contact me.

Or you can connect with me through my LinkedIn.

Dimensional Modelling

First of all, a concept of Data Warehouse is supposed to be clear. Data Warehouse is not a copy of source database with a name prefixed with ‘DW’.

DW is we can store data from multiple data sources to be used for historical or analysis report. Some data sources are .csv format, some are google docs, etc. So we need to aggregate them into one data source.

This is called Data Warehouse.

How to design it? The procedure is Dimensional Modelling.

What is the difference between relational and dimensional?

It is same as the difference like normalised VS denormalised.

Normalised:

  1. Minimal data redundancy
  2. Optimised for fast read and fast write
  3. Current data only
  4. Realtime data

Denormalised:

  1. Redundancy data storage for performance
  2. Fast read only
  3. Non-realtime data
  4. Current and historical data

The next part is about fact table and dimension table in dimensional design.

Fact table:

  1. Data that can be measured
  2. Contains surrogate key, linking the associated measures or facts

Dimension table:

  1. Descriptive information

Some types of dimensional models:

  1. Accumalating snapshot table
  2. Aggregate fact
  3. Fact table
  4. Factless Fact table
  5. Snapshot table

If you are interested in or have any problems with Dimensional Modelling, feel free to contact me.

Or you can connect with me through my LinkedIn.

SQL Correlated Subqueries

Recently I often made some mistakes about subqueries, so I wrote this blog about correlated subqueries.

Firstly, here is a SQL practicing website: https://sqlzoo.net.

It is free and easy for SQL beginners to do SQL exercise step by step.

A correlated subquery works like a nested loop: the subquery only has access to rows related to a single record at a time in the outer query.

The technique relies on table aliases to identify two different usages of the same table, which means one usage is in the outer query and another one is in the subquery.

Here is a table called world, which is an example on the sqlzoo website:

namecontinentareapopulationgdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000

Question: Find the largest country (by area) in each continent, show the continent, the name and the area.

SQL answer using subquery:

SELECT continent, name, area 
FROM world x
WHERE area >= ALL
    (SELECT area FROM world y
     WHERE y.continent=x.continent AND area>0)

One way to interpret the SQL line in the WHERE clause that references the two table is “… where the correlated values are the same”.

In this example, we can tell “select the country details from the world table where the area is larger than or equal to the area of all countries where the continent is the same”.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

Rules of Creating Dimension Tables and Fact Tables

What is a dimension table? What is a fact table?

Why we need both of them in Data Warehouse?

Because Data Warehouse is used to make reports for business decisions.

Every report is made of two parts: Fact and Dimension.

Here is a picture of fact tables and dimension tables in a star schema in data warehouse.

So, this blog we will talk about what are the rules of creating dimension tables and fact tables.

First of all, we need to illustrate a definition.

What is surrogate key?

It is the primary key in demension tables.

Rules of creating dimension tables:

  1. Primary key (surrogate key, auto-increase number, only unique number in data warehouse)
  2. Business key (the key can be linked back to data source, with business meaning)
  3. Attributes (descriptive information from data source)

There are two kinds of data: Master data and transactional data.

Master data refers to the entity (e.g. employee) whereas transactional data refers to all the transactions that are carried out using that entity.

Master data is limited whereas transactional data can be billions.

In dimension tables, most data is master data.

Rules of creating fact tables:

  1. Primary key (surrogate key/alternate key, auto-increase number)
  2. Foreign key (primary key/surrogate key/alternate key from dimension tables)
  3. Measure (addictive number/semi-addictive number)

Tips: No descriptive data in Fact tables.

If you are interested in or have any problems with fact tables and dimension tables, feel free to contact me .

Or you can connect with me through my LinkedIn.

Troubleshooting SSMS Error: 15517

If you meet with the same error as me in SSMS 2016.

Error:

“Cannot execute as the database principal because the principal “dbo” does not exist, this type of principal cannot be impersonated, or you do not have permission (Microsoft SQL Server, Error: 15517)”

Solution:

use [databasename]
GO
EXEC sp_changedbowner 'sa'
GO

Hope this solution can help you.

If you are interested in or have any problems with SQL and SSMS, feel free to contact me .

Or you can connect with me through my LinkedIn.

Stored Procedure in SQL

This article I will give you some basic syntax about stored procedure in SQL.

Stored Procedure:

Stored Procedure is the prepared SQL codes that we can save, so the code can be reused over and over again.

It is suitable for some SQL queries we need to use frequently.

There are three kinds of stored procedure:

  1. No parameter
  2. One parameter
  3. multiple parameters

Stored Procedure (No parameter) Syntax

CREATE PROCEDURE procedure_name
AS
sql_statement
GO
EXEC procedure_name

Stored Procedure (One parameter) Syntax

CREATE PROCEDURE [dbo].[oneparameter]
@ProductCatergoryID int
AS
SELECT *FROM Production.ProductCategoryID
GO
EXEC oneparameter @ProductCatergoryID = '4'

Stored Procedure (multiparameter) Syntax

CREATE PROCEDURE [dbo].[multiparameter]
@ProductCategoryID int, @Name varchar(50)
AS
SELECT *FROM Production.ProductCategory pc
WHERE pc.ProductCategoryID = @ProductCategoryID and pc.Name = @Name
GO
EXEC multiparameter @ProductCategoryID = '4', @Name = 'Accessories'

I will record all knowledge I touch in my Business Intelligence journey.

Next blog will be around Data Warehouse.

If you are interested in or have any problems with Business Intelligence, feel free to contact me .

Or you can connect with me through my LinkedIn.

Some Cheat Sheets about Business Intelligence

There are several parts of my Business Intelligence journey.

So I write this blog for recording some cheat sheets.

Installation:

Firstly, some tools need to be installed on the laptop.

All of them are Microsoft softwares, so I prepare a surface laptop 2.

  1. SQL Server 2016 Developer Edition: Seclect the features we need to install, such as SQL Server Integration Services(SSIS), SQL Server Analysis Services(SSAS) and SQL Server Reporting Services(SSRS).
  2. SQL Server Management Studio(SSMS)
  3. SQL Server Data Tools(SSDT)
  4. Report Builder

What is Business Intelligence(BI)?

I have mentioned it in the previous blog. It is to transform raw data into meaningful information for analysis.

BI Process:

  1. Import data from different data sources
  2. Produce reports or graghs through SSIS, SSAS and SSRS
  3. Make business decisions

Some terms:

ETL: Extraction, Transformation and Loading

OLTP: Online Transaction Processing

Cube: Multi-dimensional data structure built using dimensions and facts, using MDX (Multi dimensional expression) language

SQL and Relational Database: which I have mentioned them in the previous blog

Primary Key: A single attribute or an unique id in a Relational Database

Foreign Key: A set of attributes that references a candidate key in a Relational Database

Some differences:

Traditional Normalized Database VS Data Warehouse: Formal one has no duplicate data but latter one has duplicate data for efficiency.

Null VS Blank(where clause): For null, the clause is where xxxxx is null; For blank, the clause is where xxxxx is ”

Delete VS truncate: truncate table xxxxx ->no rollback&faster; delete from xxxxx -> rollback

SQL Language:

Learning sources:

  1. w3schools.com
  2. SQL Server tutorial for beginner

Some common commands:

One table:

SELECT…FROM

WHERE CLAUSE

ORDER BY CLAUSE

Two tables:

Inner Join

Other Joins, e.g., LEFT OUTER JOIN

GROUP BY

If you are interested in or have any problems with Business Intellgence, feel free to contact me .

Or you can connect with me through my LinkedIn.