Why we need data integration and what can we do?

The first question is why we need data integration?

Let me give you an example here to answer this question.

Every company has many departments, and different departments use different tools to store their data. For example, marketing team may use hubspot tool.

Now, we have different departments which store different types of data in a company.

However, insightful information is needed to make business decisions through those large amount of data.

What can we do?

Maybe we can connect all the databases everytime to generate reports. However, it will cost us large amount of time, then the term of data integration is raised.

What is data integration?

Data integration is a process in which heterogeneous data is retrieved and combined as an incorporated form and structure.

There are some advantages of data integration:

  • Reduce data complexity
  • Data integrity
  • Easy data collaboration
  • Smarter business decisions

There are also some well-known tools can do data integration, e.g., Microsoft SQL Server Integration Services (SSIS), Informatica, Oracle Data Integrator etc.

This blog we will talk about SSIS, which is one of the most popular ETL tool.

Why SSIS?

  • Data can be loaded in parallel to many varied destinations
  • SSIS removes the need of hard core programmers
  • Tight integration with other products of Microsoft
  • SSIS is cheaper than most other ETL tools

SSIS stands for SQL Server Integration Services, which is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data integration and data transformation tasks.

What can SSIS do?

It combines the data residing in different sources and provides users with an unified view of these data.

Also, it can also be used to automate maintenance of SQL Server databases and updates to multidimensional analytical data.

How is works?

These tasks of data transformation and workflow creation is carried out using SSIS package:

Operational Data–>ETL–>Data Warehouse

In the first place, operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. This is the place where most of the data used in current operation is housed before it’s tranferred to the data warehouse for longer tem storage or achiiving.

Next step is ETL(Extract, Transform and Load), the process of extracting the data from various sources, tranforming this data to meet your requirement and then loading into a target data warehouse.

The third step is data warehouse, which is a large set of data accumulated which is used for assembling and managing data from various sources of answering business questions. Hence, helps in making decisions.

What is SSIS package?

It is an object that implements integration services functionality to extract, transform and load data. A package is composed of:

  • connection
  • control flow elements(handle workflow)
  • data flow elements(handle data transform)

If you are interested in or have any problems with SSIS, feel free to contact me.

Or you can connect with me through my LinkedIn.

How to design a Data Warehouse(Part 1)

Last blog I wrote why we need a Data Warehouse.

First, what is the data warehouse?

It is a centralized relational database that pulls together data from different sources (CRM, marketing stack, etc.) for better business insights.

It stores current and historical data are used for reporting and analysis.

However, here is the problem:

How we can design a Data Warehouse?

1 Define Business Requirements

Because it touches all areas of a company, all departments need to be onboard with the design. Each department needs to understand what the benefits of data warehouse and what results they can expect from it.

What objectives we can focus on:

  • Determine the scope of the whole project
  • Find out what data is useful for analysis and where our dat is current siloed
  • Create a backup plan in case of failure
  • Security: monitoring, etc.

2 Choose a data warehouse platform

There are four types of data warehouse platforms:

  • Traditional database management systems: Row-based relational platforms, e.g., Microsoft SQL Server
  • Specialized Analytics DBMS: Columnar data stores designed specifically for managing and running analytics, e.g., Teradata
  • Out-of-box data warehouse appliances: the combination with a software and hardware with a DBMS pre-installed, e.g., Oracle Exadata
  • Cloud-hosted data warehouse tools

We can choose the suitable one for the company according to budget, employees and infrastructure.

We can choose between cloud or on-premise?

Cloud solution pros:

  • Scalability: easy, cost-effective, simple and flexible to scale with cloud services
  • Low entry cost: no servers, hardware and operational cost
  • Connectivity: easy to connect to other cloud services
  • Security: cloud providers supply security patches and protocols to keep customers safe

Choices:

  • Amazon Redshift
  • Microsoft Azure SQL Data Warehouse
  • Google Bigquery
  • Snowflake Computing

On-premise solution pros:

  • Reliability: With good staff and exceptional hardware, on-premise solutions can be highly available and reliable
  • Security: Organizations have full control of the security and access

Choices:

  • Oracle Database
  • Microsoft SQL Server
  • MySQL
  • IBM DB2
  • PostgreSQL

What we can choose between on-premise and cloud solution, in the big picture, it depends on our budget and existing system.

If we look for control, then we can choose on-premise solution. Conversely, if we look for scalability, we can choose a cloud service.

3 Set up the physical environments

There are three physical environments in Data Warehouse: development, testing and production.

  • We need to test changes before they move into the production environment
  • Running tests against data typically uses extreme data sets or random sets of data from the production environment.
  • Data integrity is much easier to track and issues are easier to contain if we have three environments running.

4 Data Modelling

It is the most complex phase of Data Warehouse design. It is the process of visualizing data distribution in the warehouse.

  • Visualize the relationships between data
  • Set standardized naming conventions
  • Create relationships between data sets
  • Establish compliance and security processes

There are bunches of data modeling techniques that businesses use for data warehouse design. Here are top 3 popular ones:

  • Snowflake Schema
  • Star Schema
  • Galaxy Schema

4 Choosing the ETL solution

ETL, stands for Extract, Transform and Load is the process we pull data from the storage solutions to warehouse.

We need to build an easy, replicable and consistent data pipeline because a poor ETL process can break the entire data warehouse.

Wrapping up

This post explored the first 4 steps about designing a Data Warehouse in the company. In the future, I will write the next steps.

If you are interested in or have any problems with Data Warehouse, feel free to contact me.

Or you can connect with me through my LinkedIn.

What is normalization and what are normalization forms?

Normalization is a technique that decomposes the table to reduce the redundacy of data.

There are 3 normalization forms we need to check and follow to normalize tables.

1st Normalization Form:

  • Each coloum of the table should have a single value
  • Each column should belong to a same domain
  • Two colums should not have a same name
  • It need not to be a proper order, e.g., we need not to sort the records according to the date of DOB.

2nd Normalization Form:

  • The table should be satisfy 1st Normalization Form
  • All the non-key attributes myst be functionally dependently on primary key

3rd Normalization Form:

  • The table should be satisfy 2nd Normalization Form
  • There is no transitive dependency for non-prime attributes

The transitive functional dependency is as follows:

A is functional dependent on B and B is functional dependent on C. So, C is transitive dependent on A through B.

If you are interested in or have any problems with Business Intelligence, feel free to contact me.

Or you can connect with me through my LinkedIn.

Create a SSIS Project in Visual Studio 2015

Installation:

  • SQL Server Data Tools 2015(install a shell of Visual Studio 2015)
  • Visual Studio 2015

What is SSIS?

SSIS stands for SQL Server Intergration Services. SSIS is to do the ETL(Extract, Transform and Load) task for data warehouse.

SSIS can also update data warehouse, clean and mine data, create ‘packages’, manipulate data, etc.

To design ETL task in Visual Studio, we use the data flow in Visual Studio.

What is data flow?

A data flow defines a flow of data from a source to a destination.

Before designing the data flow, we can try to drag a data flow and drop it in the control flow.

What is a control flow?

It is a flow we can use to control the flow for different tasks. It provides the logic for when data flow components are run and how they run.

For example:

  • perform lopping
  • call stored procedures
  • move files
  • manage error handling
  • check a condition and call different tasks

It defines a workflow of tasks to be executed, often a particular order (assuming your included precedence constraints).

we can rename the data flow task in control flow.

For example, we can drag OLE DB Source, so we can connect to Relational Database.

Then we can follow the Microsoft official toturial to do it step by step.

https://docs.microsoft.com/en-us/sql/integration-services/lesson-1-1-creating-a-new-integration-services-project?view=sql-server-ver15

If you are interested in or have any problems with SSIS, feel free to contact me.

Or you can connect with me through my LinkedIn.

Dimensional Modelling

First of all, a concept of Data Warehouse is supposed to be clear. Data Warehouse is not a copy of source database with a name prefixed with ‘DW’.

DW is we can store data from multiple data sources to be used for historical or analysis report. Some data sources are .csv format, some are google docs, etc. So we need to aggregate them into one data source.

This is called Data Warehouse.

How to design it? The procedure is Dimensional Modelling.

What is the difference between relational and dimensional?

It is same as the difference like normalised VS denormalised.

Normalised:

  • Minimal data redundancy
  • Optimised for fast read and fast write
  • Current data only
  • Realtime data

Denormalised:

  • Redundancy data storage for performance
  • Fast read only
  • Non-realtime data
  • Current and historical data

The next part is about fact table and dimension table in dimensional design.

Fact table:

  • Data that can be measured
  • Contains surrogate key, linking the associated measures or facts

Dimension table:

  • Descriptive information

Some types of dimensional models:

  • Accumalating snapshot table
  • Aggregate fact
  • Fact table
  • Factless Fact table
  • Snapshot table

If you are interested in or have any problems with Dimensional Modelling, feel free to contact me.

Or you can connect with me through my LinkedIn.

The Simplest Way to Understand SCD in Data Warehouse

What is SCD?

SCD stands for Slowly Changing Dimensions.

It is very important in Data Warehouse.

As we know, ETL (Extract, Transform, Load) is between data sources and data warehouse.

When ETL runs, it will pick up all records and update them in the Dimension tables.

Why we need SCD?

Because we have some problems in updating data in Data Warehouse when data in data sources are changing.

In the dimension tables, if we want to keep some old records, how we can do this?

Using SCD.

Types of SCD:

Note:The types of SCD are defined on Column level, not on the table level.

There are two popular types in SCD.

Type 1: Overwrite

An old record is updated by the new record. It means covering the old records.

Type 2: Store history by creating another row

Type 2 is to add new records rather than covering the old records.

As long as we have a type 2 in tables, we must have two extra values: ‘StartDate’ and ‘EndDate’. The ‘EndDate’ is when the changes happen, as the end date of historical data.

We can have a third one, ‘IsCurrent’, to identify or mark the current record.

We don’t need to update the records, and we just need to update ‘StartDate’ and ‘EndDate’.

If you are interested in or have any problems with fact tables and dimension tables, feel free to contact me .

Rules of Creating Dimension Tables and Fact Tables

What is a dimension table? What is a fact table?

Why we need both of them in Data Warehouse?

Because Data Warehouse is used to make reports for business decisions.

Every report is made of two parts: Fact and Dimension.

Here is a picture of fact tables and dimension tables in a star schema in data warehouse.

So, this blog we will talk about what are the rules of creating dimension tables and fact tables.

First of all, we need to illustrate a definition.

What is surrogate key?

It is the primary key in demension tables.

Rules of creating dimension tables:

  • Primary key (surrogate key, auto-increase number, only unique number in data warehouse)
  • Business key (the key can be linked back to data source, with business meaning)
  • Attributes (descriptive information from data source)

There are two kinds of data: Master data and transactional data.

Master data refers to the entity (e.g. employee) whereas transactional data refers to all the transactions that are carried out using that entity.

Master data is limited whereas transactional data can be billions.

In dimension tables, most data is master data.

Rules of creating fact tables:

  • Primary key (surrogate key/alternate key, auto-increase number)
  • Foreign key (primary key/surrogate key/alternate key from dimension tables)
  • Measure (addictive number/semi-addictive number)

Tips: No descriptive data in Fact tables.

If you are interested in or have any problems with fact tables and dimension tables, feel free to contact me .

Or you can connect with me through my LinkedIn.

4 Reasons Why We Need Data Warehouse

Here is a basic process in Business Intelligence.

Maybe some people will be confused, why we need data warehouse?

Without data warehouse, we can also analyze the data.

We can get the data and create the report directly.

So, what the benefits of data warehouse in an organization?

Here we list 4 reasons why we need data warehouse.

  • Integrate data from various data sources and centralize the data into one place.
  • Have data loaded into data warehouse so that reporting won’t impact live system or database.

That is why we have a seperate data warehouse and the data is stored in the data warehouse.

We can make a scheduled job running at night to centralize the data from operational databases to data warehouse.

  • Easy access (one place of data and single source of truth).

It is easy for people to go to data warehouse to get the data, and they don’t need to worry other problems, e.g., we have so many data sources and where can I get the data?

We can trust the data warehouse where we can get the data.

  • Build model: choose the best design model to get the best flexibility and performance, especially for those large datasets.

We usually use kimball methodology – star schema/snowflake schema (de-normalization).

For example, we use the star schema to improve the query performance.

It is a methodology developed by Ralph Kimball which includes a set of methods, techniques and concepts for use in data warehouse design.

There are also some other methodologies we can use in data warehouse, e.g., inmon methodology, datavault methodology.

If you are interested in or have any problems with data warehouse, feel free to contact me.

Or you can connect with me through my LinkedIn.

Stored Procedure in SQL

This article I will give you some basic syntax about stored procedure in SQL.

Stored Procedure:

Stored Procedure is the prepared SQL codes that we can save, so the code can be reused over and over again.

It is suitable for some SQL queries we need to use frequently.

There are three kinds of stored procedure:

  • No parameter
  • One parameter
  • multiple parameters

Stored Procedure (No parameter) Syntax

CREATE PROCEDURE procedure_name
AS
sql_statement
GO
EXEC procedure_name

Stored Procedure (One parameter) Syntax

CREATE PROCEDURE [dbo].[oneparameter]
@ProductCatergoryID int
AS
SELECT *FROM Production.ProductCategoryID
GO
EXEC oneparameter @ProductCatergoryID = '4'

Stored Procedure (multiparameter) Syntax

CREATE PROCEDURE [dbo].[multiparameter]
@ProductCategoryID int, @Name varchar(50)
AS
SELECT *FROM Production.ProductCategory pc
WHERE pc.ProductCategoryID = @ProductCategoryID and pc.Name = @Name
GO
EXEC multiparameter @ProductCategoryID = '4', @Name = 'Accessories'

I will record all knowledge I touch in my Business Intelligence journey.

Next blog will be around Data Warehouse.

If you are interested in or have any problems with Business Intelligence, feel free to contact me .

Or you can connect with me through my LinkedIn.

Business Intelligence Tutorial

This blog is Business Intelligence tutorial, which contains lots of definitions and terms.

There are three steps in Business Intelligence: Data forms, Data Warehouse and Reporting.

Data forms:

  • Structured: Standardized and easy for computers to read and query.
  • Semistructured
  • Unstructured: Not stored in rows and columns, so it can’t easily read by computers.

As we talked in previous blog, company data can be found in several locations, such as CRM programs, which is also shown in the picture below.

Data Warehouse:

Data warehouse uses a process (ETL, i.e., extract, transform and load) to standardize data, which allows it can be queried.

How does information get to a central location?

ETL———————>Data Warehouse

  • Extract: unstructutred data is tagged with metadata to make it easier to find
  • Transform: normalize data
  • Load: tranfer data to central warehouse or data mart

Turning Data into Powerpoints (Business Intelligence Reporting)

  • Data visualization: Graphic display of results
  • Dashboard: Interfaces that represent specific analyses

If you are interested in or have any problems with Business Intelligence, feel free to contact me .

Or you can connect with me through my LinkedIn.