Why we need data integration and what can we do?

The first question is why we need data integration?

Let me give you an example here to answer this question.

Every company has many departments, and different departments use different tools to store their data. For example, marketing team may use hubspot tool.

Now, we have different departments which store different types of data in a company.

However, insightful information is needed to make business decisions through those large amount of data.

What can we do?

Maybe we can connect all the databases everytime to generate reports. However, it will cost us large amount of time, then the term of data integration is raised.

What is data integration?

Data integration is a process in which heterogeneous data is retrieved and combined as an incorporated form and structure.

There are some advantages of data integration:

  1. Reduce data complexity
  2. Data integrity
  3. Easy data collaboration
  4. Smarter business decisions

There are also some well-known tools can do data integration, e.g., Microsoft SQL Server Integration Services (SSIS), Informatica, Oracle Data Integrator, AWS Glue etc.

This blog we will talk about SSIS, which is one of the most popular ETL tool in New Zealand.

Why SSIS?

  1. Data can be loaded in parallel to many varied destinations
  2. SSIS removes the need of hard core programmers
  3. Tight integration with other products of Microsoft
  4. SSIS is cheaper than most other ETL tools

SSIS stands for SQL Server Integration Services, which is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data integration and data transformation tasks.

What can SSIS do?

It combines the data residing in different sources and provides users with an unified view of these data.

Also, it can also be used to automate maintenance of SQL Server databases and update to multidimensional analytical data.

How is works?

These tasks of data transformation and workflow creation is carried out using SSIS package:

Operational Data–>ETL–>Data Warehouse

In the first place, operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. This is the place where most of the data used in current operation is housed before it’s tranferred to the data warehouse for longer tem storage or achiiving.

Next step is ETL(Extract, Transform and Load), the process of extracting the data from various sources, tranforming this data to meet your requirement and then loading into a target data warehouse.

The third step is data warehousing, which is a large set of data accumulated which is used for assembling and managing data from various sources of answering business questions. Hence, helps in making decisions.

What is SSIS package?

It is an object that implements integration services functionality to extract, transform and load data. A package is composed of:

  1. connection
  2. control flow elements(handle workflow)
  3. data flow elements(handle data transform)

If you want to investigate more about SSIS, check it out on Microsoft official documents.

If you are interested in or have any problems with SSIS, feel free to contact me.

Or you can connect with me through my LinkedIn.

View VS SYNONYM VS TRIGGER in SQL

What is a view?

A view is a virtual table based on the result set of an SQL statement.

Here is an example.

create view [sales].[v.salesbyperson2]
as
select
  salespersonid, round(totaldue,2) as salesamount
 from
sales.sales

What is a synonym?

A synonym, like the name, is an alternate name we create for another database object. 

Here is the syntax:

create synonym [synonym_name]
for [server_name].[database_name].[schema_name].[object_name]

What is the trigger?

A trigger is a special kind of stored procedure that automatically executes when an event occurs in the database server. 

Regardless of whether or not any table rows are affected.

Here is an example:

create trigger reminder1
on sales.customer
after insert, update
as raiserror ('notify customer relations', 16, 10);
go

If you are interested in or have any problems with SQL or Business Intelligence, feel free to contact me.

Or you can connect with me through my LinkedIn.

Dimensional Modelling

First of all, a concept of Data Warehouse is supposed to be clear. Data Warehouse is not a copy of source database with a name prefixed with ‘DW’.

DW is we can store data from multiple data sources to be used for historical or analysis report. Some data sources are .csv format, some are google docs, etc. So we need to aggregate them into one data source.

This is called Data Warehouse.

How to design it? The procedure is Dimensional Modelling.

What is the difference between relational and dimensional?

It is same as the difference like normalised VS denormalised.

Normalised:

  1. Minimal data redundancy
  2. Optimised for fast read and fast write
  3. Current data only
  4. Realtime data

Denormalised:

  1. Redundancy data storage for performance
  2. Fast read only
  3. Non-realtime data
  4. Current and historical data

The next part is about fact table and dimension table in dimensional design.

Fact table:

  1. Data that can be measured
  2. Contains surrogate key, linking the associated measures or facts

Dimension table:

  1. Descriptive information

Some types of dimensional models:

  1. Accumalating snapshot table
  2. Aggregate fact
  3. Fact table
  4. Factless Fact table
  5. Snapshot table

If you are interested in or have any problems with Dimensional Modelling, feel free to contact me.

Or you can connect with me through my LinkedIn.

SQL Correlated Subqueries

Recently I often made some mistakes about subqueries, so I wrote this blog about correlated subqueries.

Firstly, here is a SQL practicing website: https://sqlzoo.net.

It is free and easy for SQL beginners to do SQL exercise step by step.

A correlated subquery works like a nested loop: the subquery only has access to rows related to a single record at a time in the outer query.

The technique relies on table aliases to identify two different usages of the same table, which means one usage is in the outer query and another one is in the subquery.

Here is a table called world, which is an example on the sqlzoo website:

namecontinentareapopulationgdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000

Question: Find the largest country (by area) in each continent, show the continent, the name and the area.

SQL answer using subquery:

SELECT continent, name, area 
FROM world x
WHERE area >= ALL
    (SELECT area FROM world y
     WHERE y.continent=x.continent AND area>0)

One way to interpret the SQL line in the WHERE clause that references the two table is “… where the correlated values are the same”.

In this example, we can tell “select the country details from the world table where the area is larger than or equal to the area of all countries where the continent is the same”.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

Rules of Creating Dimension Tables and Fact Tables

What is a dimension table? What is a fact table?

Why we need both of them in Data Warehouse?

Because Data Warehouse is used to make reports for business decisions.

Every report is made of two parts: Fact and Dimension.

Here is a picture of fact tables and dimension tables in a star schema in data warehouse.

So, this blog we will talk about what are the rules of creating dimension tables and fact tables.

First of all, we need to illustrate a definition.

What is surrogate key?

It is the primary key in demension tables.

Rules of creating dimension tables:

  1. Primary key (surrogate key, auto-increase number, only unique number in data warehouse)
  2. Business key (the key can be linked back to data source, with business meaning)
  3. Attributes (descriptive information from data source)

There are two kinds of data: Master data and transactional data.

Master data refers to the entity (e.g. employee) whereas transactional data refers to all the transactions that are carried out using that entity.

Master data is limited whereas transactional data can be billions.

In dimension tables, most data is master data.

Rules of creating fact tables:

  1. Primary key (surrogate key/alternate key, auto-increase number)
  2. Foreign key (primary key/surrogate key/alternate key from dimension tables)
  3. Measure (addictive number/semi-addictive number)

Tips: No descriptive data in Fact tables.

If you are interested in or have any problems with fact tables and dimension tables, feel free to contact me .

Or you can connect with me through my LinkedIn.

Troubleshooting SSMS Error: 15517

If you meet with the same error as me in SSMS 2016.

Error:

“Cannot execute as the database principal because the principal “dbo” does not exist, this type of principal cannot be impersonated, or you do not have permission (Microsoft SQL Server, Error: 15517)”

Solution:

use [databasename]
GO
EXEC sp_changedbowner 'sa'
GO

Hope this solution can help you.

If you are interested in or have any problems with SQL and SSMS, feel free to contact me .

Or you can connect with me through my LinkedIn.

Stored Procedure in SQL

This article I will give you some basic syntax about stored procedure in SQL.

Stored Procedure:

Stored Procedure is the prepared SQL codes that we can save, so the code can be reused over and over again.

It is suitable for some SQL queries we need to use frequently.

There are three kinds of stored procedure:

  1. No parameter
  2. One parameter
  3. multiple parameters

Stored Procedure (No parameter) Syntax

CREATE PROCEDURE procedure_name
AS
sql_statement
GO
EXEC procedure_name

Stored Procedure (One parameter) Syntax

CREATE PROCEDURE [dbo].[oneparameter]
@ProductCatergoryID int
AS
SELECT *FROM Production.ProductCategoryID
GO
EXEC oneparameter @ProductCatergoryID = '4'

Stored Procedure (multiparameter) Syntax

CREATE PROCEDURE [dbo].[multiparameter]
@ProductCategoryID int, @Name varchar(50)
AS
SELECT *FROM Production.ProductCategory pc
WHERE pc.ProductCategoryID = @ProductCategoryID and pc.Name = @Name
GO
EXEC multiparameter @ProductCategoryID = '4', @Name = 'Accessories'

I will record all knowledge I touch in my Business Intelligence journey.

Next blog will be around Data Warehouse.

If you are interested in or have any problems with Business Intelligence, feel free to contact me .

Or you can connect with me through my LinkedIn.

Some Cheat Sheets about Business Intelligence

There are several parts of my Business Intelligence journey.

So I write this blog for recording some cheat sheets.

Installation:

Firstly, some tools need to be installed on the laptop.

All of them are Microsoft softwares, so I prepare a surface laptop 2.

  1. SQL Server 2016 Developer Edition: Seclect the features we need to install, such as SQL Server Integration Services(SSIS), SQL Server Analysis Services(SSAS) and SQL Server Reporting Services(SSRS).
  2. SQL Server Management Studio(SSMS)
  3. SQL Server Data Tools(SSDT)
  4. Report Builder

What is Business Intelligence(BI)?

I have mentioned it in the previous blog. It is to transform raw data into meaningful information for analysis.

BI Process:

  1. Import data from different data sources
  2. Produce reports or graghs through SSIS, SSAS and SSRS
  3. Make business decisions

Some terms:

ETL: Extraction, Transformation and Loading

OLTP: Online Transaction Processing

Cube: Multi-dimensional data structure built using dimensions and facts, using MDX (Multi dimensional expression) language

SQL and Relational Database: which I have mentioned them in the previous blog

Primary Key: A single attribute or an unique id in a Relational Database

Foreign Key: A set of attributes that references a candidate key in a Relational Database

Some differences:

Traditional Normalized Database VS Data Warehouse: Formal one has no duplicate data but latter one has duplicate data for efficiency.

Null VS Blank(where clause): For null, the clause is where xxxxx is null; For blank, the clause is where xxxxx is ”

Delete VS truncate: truncate table xxxxx ->no rollback&faster; delete from xxxxx -> rollback

SQL Language:

Learning sources:

  1. w3schools.com
  2. SQL Server tutorial for beginner

Some common commands:

One table:

SELECT…FROM

WHERE CLAUSE

ORDER BY CLAUSE

Two tables:

Inner Join

Other Joins, e.g., LEFT OUTER JOIN

GROUP BY

If you are interested in or have any problems with Business Intellgence, feel free to contact me .

Or you can connect with me through my LinkedIn.

The Simplest Way to Install SQL Server 2017 on macOS

This article we will show how to install Microsoft SQL Server 2017 on macOS, as we talk it in previous blog.

Which method we use?

Prior to SQL Server 2017, if we want to install it on macOS, a virtual machine (like Parallels Desktop) is essential. Then we can install a Windows system in it and then we install and run SQL Server.

Luckily, from SQL Server 2017, we can choose to install SQL Server on Docker containers.

How to install SQL Server on Docker?

1 What is Docker?

Docker is a platform which can make softwares to run in it. It is called a container, which is an isolated environment.

2 Download and install Docker

If you haven’t installed Docker on your Mac, the next step is to install Docker.

Go to Docker page to download the .dmg file and then double click and install it according to the instructions.

3 Run Docker and increase the memory

Run Docker as you used to run other softwares and the next step is to increase the memory. This is because the default value of Docker memory is 2 GB but SQL needs at least 3.25 GB.

I recommend we set the memory value to 4 GB.

  1. Click Docker icon on top menu of your Mac
  2. Click Preferences
  3. Set Memory under Advanced to 4 GB.

Click Apply&Restart

4 Download Microsoft SQL Server 2017

The next step is to download SQL Server 2017 from Terminal, which is an easy way.

Copy and paste the command in Terminal of your Mac:

docker pull microsoft/mssql-server-linux

Through it, the latest version of MS Server SQL can be downloaded.

5 Run a Docker image

Copy, change and paste the command in Terminal of your Mac:

docker run -d --name xxxxxxx -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=xxxx\xxxx' -p 1433:1433 microsoft/mssql-server-linux

We need to change the name and password as your own here.

This step is to run an instance of Docker image. A Docker image is a file, which is used to execute codes in a Docker container.

If you want to check, copy and paste the following command to see whether the Docker container is running:

docker ps

If it works, it will show like this:

6 Install the sql-cli command line tool

The next step is to install the sql-cli command line tool. It can allow you to run commands against your SQL Server instance.

Copy and paste the command in Terminal of your Mac:

npm install -g sql-cli

If an error happened and shows you do not have the permissions to access this file as the current user, just try add sudo in your command:

sudo npm install -g sql-cli

7 Connect to SQL Server

After sql-cli is installed, now we can connect to SQL Server using the mssql command.

mssql -u xxxxxxx -p xxxxxxx

Here xxxxxxx and xxxxxxx means your name and password.

Then you will see:


Now you’ve successfully connected to your instance of SQL Server.

Next blog we will continue our journey with SQL Server and Azure Data Studio. Maybe you will want to read more about Azure and Microsoft Learn in the previous blog.

If you are interested in or have any problems with Business Intelligence, feel free to contact me .

Or you can connect with me through my LinkedIn.