Create a SSIS Project in Visual Studio 2015

Installation:

  1. SQL Server Data Tools 2015(install a shell of Visual Studio 2015)
  2. Visual Studio 2015

What is SSIS?

SSIS stands for SQL Server Intergration Services. SSIS is to do the ETL(Extract, Transform and Load) task for data warehouse.

SSIS can also update data warehouse, clean and mine data, create ‘packages’, manipulate data, etc.

To design ETL task in Visual Studio, we use the data flow in Visual Studio.

What is data flow?

A data flow defines a flow of data from a source to a destination.

Before designing the data flow, we can try to drag a data flow and drop it in the control flow.

What is a control flow?

It is a flow we can use to control the flow for different tasks. It provides the logic for when data flow components are run and how they run.

For example:

  1. perform lopping
  2. call stored procedures
  3. move files
  4. manage error handling
  5. check a condition and call different tasks

It defines a workflow of tasks to be executed, often a particular order (assuming your included precedence constraints).

we can rename the data flow task in control flow.

For example, we can drag OLE DB Source, so we can connect to Relational Database.

Then we can follow the Microsoft official toturial to do it step by step.

https://docs.microsoft.com/en-us/sql/integration-services/lesson-1-1-creating-a-new-integration-services-project?view=sql-server-ver15

If you are interested in or have any problems with SSIS, feel free to contact me.

Or you can connect with me through my LinkedIn.

Some Prerequisite Knowledge About Spatial Data

Spatial Data Types:

Two kinds of spatial data types:

  1. geometry: flat 2D surface with two dimensions. Supposed X = 3 and Y = 4, then our point representation will be like POINT (3 4).
  2. geography: uses the same methods but the data type reflects the fact the we live on a curved 2D surface.

However, the two kinds of spatial data types is the need for the aforementioned Spatial Reference IDs (SRID).

SRID:

Both geometry and geography data types have two parts, the coordinates of the object and the SRID number

To check the list of SRID in SQL server, we can execute query statement as belows:

Geography::Point(Latitude,Longitude,SRID); 

The SRID number is set by EPSG standard. It dictates that the SRID of any geometry data is 0 and for Geography the default of SRID is 4326.

An example:

Here is a link of my Github of stored procedure, which aims to get the nearest suburb for each public transport stop:

https://github.com/jacquiwuc/PropertyAnalysis/blob/master/SQ_StationForSuburb.sql

In this stored procedure, it can be seen that

SET @geo1=geography::Point(@stationlat,@stationlong,4326);

Which is an application of transformation of the two spatial data types.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

SQL Correlated Subqueries

Recently I often made some mistakes about subqueries, so I wrote this blog about correlated subqueries.

Firstly, here is a SQL practicing website: https://sqlzoo.net.

It is free and easy for SQL beginners to do SQL exercise step by step.

A correlated subquery works like a nested loop: the subquery only has access to rows related to a single record at a time in the outer query.

The technique relies on table aliases to identify two different usages of the same table, which means one usage is in the outer query and another one is in the subquery.

Here is a table called world, which is an example on the sqlzoo website:

namecontinentareapopulationgdp
Afghanistan Asia 652230 25500100 20343000000
Albania Europe 28748 2831741 12960000000
Algeria Africa 2381741 37100000 188681000000
Andorra Europe 468 78115 3712000000
Angola Africa 1246700 20609294 100990000000

Question: Find the largest country (by area) in each continent, show the continent, the name and the area.

SQL answer using subquery:

SELECT continent, name, area 
FROM world x
WHERE area >= ALL
    (SELECT area FROM world y
     WHERE y.continent=x.continent AND area>0)

One way to interpret the SQL line in the WHERE clause that references the two table is “… where the correlated values are the same”.

In this example, we can tell “select the country details from the world table where the area is larger than or equal to the area of all countries where the continent is the same”.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

Rules of Creating Dimension Tables and Fact Tables

What is a dimension table? What is a fact table?

Why we need both of them in Data Warehouse?

Because Data Warehouse is used to make reports for business decisions.

Every report is made of two parts: Fact and Dimension.

Here is a picture of fact tables and dimension tables in a star schema in data warehouse.

So, this blog we will talk about what are the rules of creating dimension tables and fact tables.

First of all, we need to illustrate a definition.

What is surrogate key?

It is the primary key in demension tables.

Rules of creating dimension tables:

  1. Primary key (surrogate key, auto-increase number, only unique number in data warehouse)
  2. Business key (the key can be linked back to data source, with business meaning)
  3. Attributes (descriptive information from data source)

There are two kinds of data: Master data and transactional data.

Master data refers to the entity (e.g. employee) whereas transactional data refers to all the transactions that are carried out using that entity.

Master data is limited whereas transactional data can be billions.

In dimension tables, most data is master data.

Rules of creating fact tables:

  1. Primary key (surrogate key/alternate key, auto-increase number)
  2. Foreign key (primary key/surrogate key/alternate key from dimension tables)
  3. Measure (addictive number/semi-addictive number)

Tips: No descriptive data in Fact tables.

If you are interested in or have any problems with fact tables and dimension tables, feel free to contact me .

Or you can connect with me through my LinkedIn.

4 Reasons Why We Need Data Warehouse

Here is a basic process in Business Intelligence.

Maybe some people will be confused, why we need data warehouse?

Without data warehouse, we can also analyze the data.

We can get the data and create the report directly.

So, what the benefits of data warehouse in an organization?

Here we list 4 reasons why we need data warehouse.

Integrate data from various data sources and centralize the data into one place.

Have data loaded into data warehouse so that reporting won’t impact live system or database.

That is why we have a seperate data warehouse and the data is stored in the data warehouse.

We can make a scheduled job running at night to centralize the data from operational databases to data warehouse.

Easy access (one place of data and single source of truth).

It is easy for people to go to data warehouse to get the data, and they don’t need to worry other problems, e.g., we have so many data sources and where can I get the data?

We can trust the data warehouse where we can get the data.

Build model: choose the best design model to get the best flexibility and performance, especially for those large datasets.

We usually use kimball methodology – star schema/snowflake schema (de-normalization).

For example, we use the star schema to improve the query performance.

It is a methodology developed by Ralph Kimball which includes a set of methods, techniques and concepts for use in data warehouse design.

There are also some other methodologies we can use in data warehouse, e.g., inmon methodology, datavault methodology.

If you are interested in or have any problems with data warehouse, feel free to contact me.

Or you can connect with me through my LinkedIn.

A mind map for SQL

In this article, I make a basic SQL mind map for people who want to kick-start their career into business intelligence and data analysis industry.

Hope it can give you a basic understanding about SQL.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

Troubleshooting SSMS Error: 15517

If you meet with the same error as me in SSMS 2016.

Error:

“Cannot execute as the database principal because the principal “dbo” does not exist, this type of principal cannot be impersonated, or you do not have permission (Microsoft SQL Server, Error: 15517)”

Solution:

use [databasename]
GO
EXEC sp_changedbowner 'sa'
GO

Hope this solution can help you.

If you are interested in or have any problems with SQL and SSMS, feel free to contact me .

Or you can connect with me through my LinkedIn.

Stored Procedure in SQL

This article I will give you some basic syntax about stored procedure in SQL.

Stored Procedure:

Stored Procedure is the prepared SQL codes that we can save, so the code can be reused over and over again.

It is suitable for some SQL queries we need to use frequently.

There are three kinds of stored procedure:

  1. No parameter
  2. One parameter
  3. multiple parameters

Stored Procedure (No parameter) Syntax

CREATE PROCEDURE procedure_name
AS
sql_statement
GO
EXEC procedure_name

Stored Procedure (One parameter) Syntax

CREATE PROCEDURE [dbo].[oneparameter]
@ProductCatergoryID int
AS
SELECT *FROM Production.ProductCategoryID
GO
EXEC oneparameter @ProductCatergoryID = '4'

Stored Procedure (multiparameter) Syntax

CREATE PROCEDURE [dbo].[multiparameter]
@ProductCategoryID int, @Name varchar(50)
AS
SELECT *FROM Production.ProductCategory pc
WHERE pc.ProductCategoryID = @ProductCategoryID and pc.Name = @Name
GO
EXEC multiparameter @ProductCategoryID = '4', @Name = 'Accessories'

I will record all knowledge I touch in my Business Intelligence journey.

Next blog will be around Data Warehouse.

If you are interested in or have any problems with Business Intelligence, feel free to contact me .

Or you can connect with me through my LinkedIn.

Business Intelligence Tutorial

This blog is Business Intelligence tutorial, which contains lots of definitions and terms.

There are three steps in Business Intelligence: Data collecting, Data Warehousing and Reporting.

Data Collecting:

  1. Structured: Standardized and easy for computers to read and query.
  2. Semistructured
  3. Unstructured: Not stored in rows and columns, so it can’t easily read by computers.

As we talked in previous blog, company data can be found in several locations, such as CRM programs, which is also shown in the picture below.

Data Warehouse:

Data warehouse uses a process (ETL, i.e., extract, transform and load) to standardize data, which allows it can be queried.

How does information get to a central location?

ETL———————>Data Warehouse

  1. Extract: unstructutred data is tagged with metadata to make it easier to find
  2. Transform: normalize data
  3. Load: tranfer data to central warehouse or data mart

Turning Data into Powerpoints (Business Intelligence Reporting)

  1. Data visualization: Graphic display of results
  2. Dashboard: Interfaces that represent specific analyses

If you are interested in or have any problems with Business Intelligence, feel free to contact me .

Or you can connect with me through my LinkedIn.

Some Cheat Sheets about Business Intelligence

There are several parts of my Business Intelligence journey.

So I write this blog for recording some cheat sheets.

Installation:

Firstly, some tools need to be installed on the laptop.

All of them are Microsoft softwares, so I prepare a surface laptop 2.

  1. SQL Server 2016 Developer Edition: Seclect the features we need to install, such as SQL Server Integration Services(SSIS), SQL Server Analysis Services(SSAS) and SQL Server Reporting Services(SSRS).
  2. SQL Server Management Studio(SSMS)
  3. SQL Server Data Tools(SSDT)
  4. Report Builder

What is Business Intelligence(BI)?

I have mentioned it in the previous blog. It is to transform raw data into meaningful information for analysis.

BI Process:

  1. Import data from different data sources
  2. Produce reports or graghs through SSIS, SSAS and SSRS
  3. Make business decisions

Some terms:

ETL: Extraction, Transformation and Loading

OLTP: Online Transaction Processing

Cube: Multi-dimensional data structure built using dimensions and facts, using MDX (Multi dimensional expression) language

SQL and Relational Database: which I have mentioned them in the previous blog

Primary Key: A single attribute or an unique id in a Relational Database

Foreign Key: A set of attributes that references a candidate key in a Relational Database

Some differences:

Traditional Normalized Database VS Data Warehouse: Formal one has no duplicate data but latter one has duplicate data for efficiency.

Null VS Blank(where clause): For null, the clause is where xxxxx is null; For blank, the clause is where xxxxx is ”

Delete VS truncate: truncate table xxxxx ->no rollback&faster; delete from xxxxx -> rollback

SQL Language:

Learning sources:

  1. w3schools.com
  2. SQL Server tutorial for beginner

Some common commands:

One table:

SELECT…FROM

WHERE CLAUSE

ORDER BY CLAUSE

Two tables:

Inner Join

Other Joins, e.g., LEFT OUTER JOIN

GROUP BY

If you are interested in or have any problems with Business Intellgence, feel free to contact me .

Or you can connect with me through my LinkedIn.