
Introduction to Terraform

Imagine you are a customer of a cloud provider and you want to spin up some machines. You can go into a web console, fill in some forms, click some buttons and then launch an instance.

But you can also use Terraform.

Terraform allows you to do the same thing in code, i.e., Infrastructure as Code. It automates your infrastructure and keeps it in a defined state. For example, if you declare 5 small instances, then whenever you run Terraform it will ensure that those 5 instances are running on the cloud platform. If something is changed manually, Terraform will try to bring the actual infrastructure back in line with the code.

It also makes your infrastructure auditable: just by looking at the .tf files, we can see what the infrastructure is made of. Even better, we can keep the changes in a version control system, e.g., Git.

Download and Install

Terraform can be downloaded and installed on different operating systems; please check the official website for details.

After downloading and installing Terraform, let’s verify it.

Run “terraform -v” in a terminal. If the result shows the version, Terraform is installed and working properly. However, if the result is “Terraform is not a command” or something similar, something went wrong with the installation.

The last step of the setup is to pick an IDE to manage the project, e.g., Visual Studio Code.

What language does Terraform use?

Terraform’s language is called HashiCorp Configuration Language (HCL). It is written in files with a .tf extension, so all of our Terraform code will be stored in .tf files.

Let’s start!

Now we can create a new project in our own IDE and name it. Then we create a file called

At this point, we can check on the Terraform website which providers Terraform supports.

For example, if we click into the AWS Provider, we can see an example of how to configure it.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

# Configure the AWS Provider
provider "aws" {
  region = "us-east-1"
}

# Create a VPC
resource "aws_vpc" "example" {
  cidr_block = "" # set the CIDR range of your VPC here
}

We can copy and paste the “Configure the AWS Provider” part of the code into our file. Then we can check which region we are using in AWS and modify the value accordingly.

Now the provider is set up. If we go back to the Terraform tutorial, it continues to teach us how to set things up, and the next thing to set up is authentication.

We can take the hard-coded approach first: static credentials. It is not recommended, because if we publish the .tf file to GitHub or anywhere else, the credentials are published with it, which is a security vulnerability.

For now this approach keeps things simple, but later we will use a more secure way.
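Before committing a .tf file, a quick check for hard-coded keys can catch this mistake. The sketch below is a hypothetical helper, not part of Terraform. It only assumes that static credentials appear as access_key = "..." or secret_key = "..." assignments, and the sample values are made up.

```python
import re
from typing import List

# Hypothetical pattern: matches lines such as  access_key = "some-value"
HARD_CODED_KEY = re.compile(
    r'^\s*(access_key|secret_key)\s*=\s*"[^"]+"', re.MULTILINE
)


def find_hard_coded_credentials(tf_source: str) -> List[str]:
    """Return the names of credential arguments that have literal string values."""
    return [match.group(1) for match in HARD_CODED_KEY.finditer(tf_source)]


# Made-up provider block with static credentials
example = '''
provider "aws" {
  region     = "us-east-1"
  access_key = "my-access-key"
  secret_key = "my-secret-key"
}
'''

print(find_hard_coded_credentials(example))
```

An empty list means no literal keys were found with this simple pattern; it is not a replacement for a proper secret scanner.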

From the tutorial, we can see that we need three parameter values: “region”, “access_key” and “secret_key”. We can go to the AWS console and find this information in the Identity and Access Management (IAM) service.

We can create an access key in the IAM service, then click “show access key” and store the two values: “access_key” and “secret_key”.

Next, let’s try to create and provision resources within AWS. The syntax is quite simple and is the same whether you configure Azure, AWS or GCP in Terraform.

The resource syntax is shown below:

resource "<provider>_<resource_type>" "name" {
    config options....
    key  = "value"
    key2 = "another value"
}

So this is how to create a resource within a provider. Let us try to create and deploy an EC2 instance with Terraform.

We need to refer back to the documentation now, because we need a value for “resource_type”:

resource "aws_instance" "web" {
  ami           = "<your-ami-id>" # placeholder: use the AMI ID shown when launching an instance in EC2
  instance_type = "t3.micro"

  tags = {
    Name = "HelloWorld"
  }
}

We can find an example like this in the documentation, and it is roughly the minimum we need. We can copy and paste this code into our file.

The “resource_type” in the example is “aws_instance”. We can replace “web” with any name we want. The “ami” value can be found by starting to launch an instance in the EC2 service on AWS, and we can then put that value into the example. The “instance_type” is likewise what we would select in the AWS console.

According to the documentation, “tags” is optional, so we can delete it.

The next step is to go to the terminal of your IDE, which is recommended because it opens in your project directory automatically. The first Terraform command we need to learn is “terraform init”. This command looks at the configuration, i.e., all the .tf files in our project. Because we have just one provider, AWS, it will download the plugins needed to interact with the AWS API.

After we run it in the terminal, we can see it initializing the backend, initializing the provider plugins and so on. If we add another provider for Azure, it will also download a plugin for Azure.

Once that succeeds, the second Terraform command we need to know is “terraform plan”. Although it is completely optional, it is a quick sanity check to make sure we won’t break anything. If we look at the output, it color-codes things depending on the action.

  • “+” means creating a resource
  • “-” means deleting a resource
  • “~” means modifying an existing resource
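To make the three symbols concrete, here is a small conceptual sketch of how a plan can be derived by comparing what the code declares with what is actually deployed. It is only an illustration of the idea, not how Terraform is implemented, and the resource names and settings are made up.

```python
from typing import Dict, List


def plan(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[str]:
    """Compare desired and actual resources and list the actions with plan-style symbols."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(f"+ {name}")  # declared in code, not deployed yet: create
        elif name not in desired:
            actions.append(f"- {name}")  # deployed, but removed from code: delete
        elif desired[name] != actual[name]:
            actions.append(f"~ {name}")  # settings differ: modify
    return actions


# Hypothetical state: what the .tf files declare vs. what is running
desired = {
    "aws_instance.web": {"instance_type": "t3.micro"},
    "aws_instance.api": {"tags": {"Name": "api-v2"}},
}
actual = {
    "aws_instance.api": {"tags": {"Name": "api-v1"}},
    "aws_instance.old": {"instance_type": "t3.micro"},
}

for action in plan(desired, actual):
    print(action)
```

When the desired and actual states are identical, the list is empty, which corresponds to a plan with nothing to change.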

The final command is “terraform apply”. It will ask you to type “yes” to confirm, and then it will create our server. When it finishes, it shows “Apply complete! Resources: 1 added, 0 changed, 0 destroyed.”

Then we can verify it in our AWS console. At this point, we can see that the first EC2 instance has been deployed through Terraform.

There is more to learn about Terraform, and I may cover it in the next blog.

If you are interested in or have any problems with Terraform, feel free to contact me.

Or you can connect with me through my LinkedIn.


CloudFormation 101

What is CloudFormation?

CloudFormation is a tool which can spin up resources on AWS. If someone wants to be an AWS expert, CloudFormation is an essential service to master.

Before we jump into writing a CloudFormation template, let’s take a brief look at how AWS infrastructure was managed before CloudFormation.

Without CloudFormation, provisioning is manual and time-consuming: we log in to the AWS console and provision servers by hand, or we build our own tools to assist with automation.

Writing some scripts may be a fast way to get the job done, and it generally works fine at a small scale. However, if we need to manage more systems in more environments, it becomes tedious.

So that is why configuration management tools exist.

There are some popular tools such as Chef, Puppet and Salt. They can be used to maintain consistency and track changes. However, their disadvantage is that they are not necessarily needed for containerized application deployments.

So there is another way: Infrastructure as Code.

Infrastructure as Code is all about having a single source of truth that serves as a blueprint for what we want the infrastructure to look like.

Templates are written in a DSL that describes resources and their relationships. CloudFormation and Terraform are examples of such tools.

Their pros:

  • Consistency
  • Auditing
  • Compliance
  • Rollbacks

What languages does CloudFormation support?

The answer is JSON and YAML.

Let’s take a look at a JSON template (a template represents a stack, and the components within the stack are known as resources):

  "Description":"Create a S3 bucket",

Inside Resources, uniquely named keys are mapped to specific AWS resources. In our example, the CloudFormation template is used to create an S3 bucket in AWS.
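As a rough sketch, a minimal template of this kind can be built as a Python dictionary and serialized to JSON. Only the Description is taken from the example; the logical ID ExampleBucket is a made-up name, and a real template would usually carry more properties.

```python
import json

# Minimal CloudFormation template; "ExampleBucket" is a hypothetical logical ID
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Create a S3 bucket",
    "Resources": {
        # A uniquely named key mapped to a specific AWS resource type
        "ExampleBucket": {"Type": "AWS::S3::Bucket"},
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

The printed JSON can be saved to a file and used as the template body when creating a stack.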


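CloudFormation also supports some intrinsic functions. The most useful is Ref, which returns the value of the parameter or resource it refers to, so values can be passed from one part of the template to another.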
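Here is a small sketch of Ref in use, again written as a Python dictionary. The names BucketNameParam, ExampleBucket and BucketNameOut are made up for illustration.

```python
import json

template_with_ref = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Create a S3 bucket",
    "Parameters": {
        # Hypothetical parameter supplied when the stack is created
        "BucketNameParam": {"Type": "String"},
    },
    "Resources": {
        "ExampleBucket": {
            "Type": "AWS::S3::Bucket",
            # Ref passes the parameter's value into the resource property
            "Properties": {"BucketName": {"Ref": "BucketNameParam"}},
        },
    },
    "Outputs": {
        # Ref on an S3 bucket resource returns the bucket name
        "BucketNameOut": {"Value": {"Ref": "ExampleBucket"}},
    },
}

print(json.dumps(template_with_ref, indent=2))
```

The same function is used in both directions here: once to read a parameter and once to refer to a resource defined in the same template.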

Next blog we will continue our journey with AWS.

If you are interested in or have any problems with CloudFormation, feel free to contact me.

Or you can connect with me through my LinkedIn.

Medium-Hard Data Analyst SQL questions(Part 1)

As we know, the first 70% of SQL is pretty straightforward but the remaining 30% can be pretty tricky.

So, in this blog, some popular hard SQL interview questions will be covered for people to sharpen their skills.

Self-Join Practice Problems

Part 1: Measure how much a key metric, e.g., monthly active users (MAU), changes between months. The example uses a table named ‘logins’ with the columns user_id and date.


Q: Find the month-over-month percentage change for monthly active users.


WITH mau AS (
  SELECT
    DATE_TRUNC('month', date) AS month_timestamp,
    COUNT(DISTINCT user_id) AS mau
  FROM logins
  GROUP BY
    DATE_TRUNC('month', date)
)
SELECT
  a.month_timestamp AS previous_month,
  a.mau AS previous_mau,
  b.month_timestamp AS current_month,
  b.mau AS current_mau,
  ROUND(100.0*(b.mau - a.mau)/a.mau,2) AS percent_change
FROM mau a
JOIN mau b
  ON a.month_timestamp = b.month_timestamp - interval '1 month'
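To try the self-join without a database server, the same idea can be run against an in-memory SQLite database from Python. This is a sketch under a few assumptions: SQLite has no DATE_TRUNC, so date(date, 'start of month') stands in for it, and the sample rows are made up.

```python
import sqlite3


def month_over_month(rows):
    """Load (user_id, date) rows into SQLite and return month-over-month MAU changes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logins (user_id INTEGER, date TEXT)")
    conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)
    query = """
        WITH mau AS (
          SELECT date(date, 'start of month') AS month_start,
                 COUNT(DISTINCT user_id) AS mau
          FROM logins
          GROUP BY date(date, 'start of month')
        )
        SELECT a.month_start AS previous_month, a.mau AS previous_mau,
               b.month_start AS current_month, b.mau AS current_mau,
               ROUND(100.0 * (b.mau - a.mau) / a.mau, 2) AS percent_change
        FROM mau a
        JOIN mau b
          ON a.month_start = date(b.month_start, '-1 month')
        ORDER BY a.month_start
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Made-up logins: 2 active users in January, 3 in February
sample_logins = [
    (1, "2024-01-05"), (2, "2024-01-20"),
    (1, "2024-02-03"), (2, "2024-02-10"), (3, "2024-02-15"),
]
print(month_over_month(sample_logins))
```

Each output row pairs a month with the month right after it, which is exactly what joining the mau table to itself on a one-month offset achieves.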

Next blog we will continue our journey with SQL medium-hard questions.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

Some query samples to interpret Google Analytics data using Bigquery

As a BI Analyst working at an online travel agency, turning customer behavior data into meaningful insights is a business-as-usual task.

Google Analytics is a popular web analytics service for tracking website traffic. Therefore, for a BI & Reporting team, interpreting Google Analytics data is an essential skill.

On the technical side, SQL can be used in Google BigQuery (a cloud-based data warehousing platform) to generate insights from Google Analytics data. The samples below use functions from BigQuery’s legacy SQL dialect, such as EXACT_COUNT_DISTINCT and CONTAINS.

Let’s look at some sample queries. First, some dynamic values need to be understood before we write the first query:

  • Dimension: a parameter used to analyze sessions/users/hits, e.g., device.deviceCategory, device.browser, hits.type
  • projectID: ID of the project in BigQuery
  • dataSetName: data set in BigQuery with the information about Google Analytics sessions

With these, we can write a sample query to calculate the number of users, sessions and hits across the chosen dimensions:

  EXACT_COUNT_DISTINCT(fullVisitorId) AS users,
  EXACT_COUNT_DISTINCT(fullVisitorId_visitId) AS sessions,
  COUNT(hits.hitNumber) AS hits
    fullVisitorId+'_'+ visitId AS fullVisitorId_visitId,
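The three metrics in the query can be illustrated on a handful of rows in plain Python. This is only a sketch of the counting logic, with made-up hit rows and deviceCategory standing in for the chosen dimension: users are distinct fullVisitorId values, sessions are distinct fullVisitorId and visitId combinations, and hits are simply counted.

```python
from collections import defaultdict

# Hypothetical hit-level rows: (fullVisitorId, visitId, deviceCategory)
sample_hits = [
    ("A", 1, "desktop"),
    ("A", 1, "desktop"),
    ("A", 2, "mobile"),
    ("B", 1, "mobile"),
]


def summarize(hit_rows):
    """Count distinct users, distinct sessions and total hits per dimension value."""
    users = defaultdict(set)
    sessions = defaultdict(set)
    hit_counts = defaultdict(int)
    for visitor_id, visit_id, dimension in hit_rows:
        users[dimension].add(visitor_id)
        # Same session key as fullVisitorId_visitId in the query
        sessions[dimension].add(f"{visitor_id}_{visit_id}")
        hit_counts[dimension] += 1
    return {
        dim: {
            "users": len(users[dim]),
            "sessions": len(sessions[dim]),
            "hits": hit_counts[dim],
        }
        for dim in hit_counts
    }


print(summarize(sample_hits))
```

Visitor A appears under both device categories, so that visitor counts as a user in each group, just as a GROUP BY on the dimension would count them.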

Another sample sets up a funnel with steps suited to your business. As with the first query, here are the dynamic values we need:

  • projectID: ID of the project in BigQuery
  • dataSetName: data set in BigQuery with the information about Google Analytics sessions
  • tableName: name of the table with the standard BigQuery export data

Then we can write a query sample as follows:

  table_step1.step1 AS count_step1, 
  table_step2.step2 AS count_step2, 
  table_step3.step3 AS count_step3
 --calculate transitions to Step 1 (Pageviews), grouped by days
  COUNT(*) AS step1
  WHERE hits.eCommerceAction.action_type='2'
    date) AS table_step1
--calculate the number of transitions to Step 2 (Product description view) by days
  COUNT(*) AS step2
    hits.eventInfo.eventCategory CONTAINS 'Click' 
    AND hits.eventInfo.eventAction CONTAINS 'Characteristics of product'
  date) AS table_step2
--calculate transitions to Step 3 (Adding product to cart) by days
  COUNT(*) AS step3
  date) AS table_step3
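The shape of the funnel result, one row per day with a count for each step, can be sketched in Python as follows. The hit rows are made up, and each row is assumed to be already classified into step 1, 2 or 3 by conditions like the ones in the query.

```python
from collections import Counter

# Hypothetical hits: (date, step), where step is 1, 2 or 3 as defined by the funnel
funnel_hits = [
    ("20160801", 1), ("20160801", 1), ("20160801", 2),
    ("20160802", 1), ("20160802", 2), ("20160802", 3),
]


def funnel_by_day(hit_rows):
    """Return {date: (count_step1, count_step2, count_step3)}."""
    counts = Counter(hit_rows)  # counts each (date, step) pair
    dates = sorted({date for date, _ in hit_rows})
    return {d: tuple(counts[(d, step)] for step in (1, 2, 3)) for d in dates}


for day, steps in funnel_by_day(funnel_hits).items():
    print(day, steps)
```

Unlike a join between three subqueries, this sketch keeps days on which a step had no hits and reports a count of 0 for that step.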

There is more we can do to track website traffic with BigQuery, and I may cover it in the next blog.

If you are interested in or have any problems with BigQuery, feel free to contact me.

Or you can connect with me through my LinkedIn.

Send Google Analytics payload length as a Custom Dimension

The maximum length of a Google Analytics payload is 8192 bytes. It is useful to check if you are approaching this value with some of your hits because if the payload length exceeds this, the hit is never sent to GA.

How can we know the payload size with each hit?

Today I will show you how to send the payload size to GA as a custom dimension with each hit. The tool is Google Tag Manager (GTM).

Before starting, create a new hit-scoped custom dimension in GA named ‘Hit Payload Length’ and note its index.

Then create a custom task in GTM. A custom task lets us modify the request to GA before it is sent. The example below also sends the Client ID as a custom dimension.

The custom task is implemented as a Custom JavaScript variable. Here is the JavaScript code that we create in GTM.

function () {
  // clientIdIndex: the Custom Dimension index to which the visitor's Client ID is sent; in my example it is 7.
  var clientIdIndex = 7;

  // payloadLengthIndex: the Custom Dimension index to which the payload length of the request is sent; in my example it is 18.
  var payloadLengthIndex = 18;

  // Read a value from a cookie when Web Storage is unavailable, otherwise from localStorage
  var readFromStorage = function (key) {
    if (!window.Storage) {
      // From:
      var value = '; ' + document.cookie;
      var parts = value.split('; ' + key + '=');
      if (parts.length === 2) {
        return parts.pop().split(';').shift();
      }
    } else {
      return window.localStorage.getItem(key);
    }
  };

  // Write a value to a cookie when Web Storage is unavailable, otherwise to localStorage
  var writeToStorage = function (key, value, expireDays) {
    if (!window.Storage) {
      var expiresDate = new Date();
      expiresDate.setDate(expiresDate.getDate() + expireDays);
      document.cookie = key + '=' + value + ';expires=' + expiresDate.toUTCString();
    } else {
      window.localStorage.setItem(key, value);
    }
  };

  var globalSendHitTaskName   = '_ga_originalSendHitTask';

  return function (customTaskModel) {

    // Keep a reference to the original sendHitTask so it can still be called later
    window[globalSendHitTaskName] = window[globalSendHitTaskName] || customTaskModel.get('sendHitTask');

    // clientIdIndex
    if (typeof clientIdIndex === 'number') {
      customTaskModel.set('dimension' + clientIdIndex, customTaskModel.get('clientId'));
    }
    // /clientIdIndex

    customTaskModel.set('sendHitTask', function (sendHitTaskModel) {

      var originalSendHitTaskModel = sendHitTaskModel,
          originalSendHitTask      = window[globalSendHitTaskName],
          canSendHit               = true;

      try {

        // payloadLengthIndex
        if (typeof payloadLengthIndex === 'number') {
          var _pl_hitPayload = sendHitTaskModel.get('hitPayload');
          _pl_hitPayload += '&cd' + payloadLengthIndex + '=';
          // Length so far plus the number of digits about to be appended
          _pl_hitPayload += (_pl_hitPayload.length + _pl_hitPayload.length.toString().length);
          sendHitTaskModel.set('hitPayload', _pl_hitPayload, true);
        }
        // /payloadLengthIndex

        if (canSendHit) {
          originalSendHitTask(sendHitTaskModel);
        }

      } catch(err) {
        // If anything fails, fall back to sending the unmodified hit
        originalSendHitTask(originalSendHitTaskModel);
      }

    });

  };
}



The last step is to add the custom task to your tags. Open a tag, scroll down and add a new field whose value is the custom task variable we just created.

After this step, any tag that has this custom task will send the hit payload length as a custom dimension.

How to debug your work?

Verify it through the developer tools in your browser.

  • Open the Network tab in the developer tools
  • Click the request sent to the collect endpoint
  • Check that the request contains the custom dimension parameter with the payload length

With this custom dimension, we can monitor whether hits are getting close to the maximum payload size.

If you are interested in or have any problems with GTM, feel free to contact me.

Or you can connect with me through my LinkedIn.

A Fundamental guide about Setting up PySpark For ETL

Spark is built to handle big data: it copes with massive volumes of data in many use cases. It is an open-source Apache project.

Spark can use data stored in a variety of formats, including parquet files.

What is Spark?

Spark is a general-purpose distributed data processing engine.

On top of the Spark core data processing engine, there are libraries for SQL, machine learning, etc. Tasks most frequently associated with Spark include ETL and SQL batch jobs across large data sets.

What Does Spark Do?

It has an extensive set of developer libraries and APIs and supports languages such as Java, Python, R, and Scala. Its flexibility makes it well-suited for a range of use cases; in this blog, we will just talk about data integration. PySpark is the Spark Python API, which exposes the Spark programming model to Python for working with structured data.

Data produced by different application systems across a business needs to be processed for reporting and analysis. Spark is used to reduce the cost and time required for this ETL process.

How can we set up PySpark?

There are heaps of ways to set up PySpark, including VirtualBox, Databricks, AWS EMR, AWS EC2, Anaconda etc. In this blog, I will just talk about setting up PySpark with Anaconda.

  1. Download the Anaconda version for your operating system and install it
  2. Create a new named environment
  3. Install pyspark through the “Anaconda Prompt” terminal. Be careful: the Python version of the environment needs to be 3.7 or lower, because at the time of writing pyspark doesn’t support Python 3.8.
conda create --name yournamedenvironment python=3.7
conda activate yournamedenvironment
conda install -c conda-forge pyspark

Then we can launch the different IDEs on Anaconda home:

JupyterLab is highly recommended here.

After we launch JupyterLab, an .ipynb file can be created on localhost.

Spark DataFrame Basics

DataFrames and Spark SQL are the things we need to get familiar with in PySpark. If we’ve worked with pandas in Python, SQL, R or Excel before, the DataFrame will feel familiar.

Starting a SparkSession is the essential first step.

#start a simple Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Basics").getOrCreate()  # "Basics" is an arbitrary application name

After running it in a single cell, we can see whether PySpark is installed successfully.

If you are interested in or have any problems with PySpark, feel free to contact me.

Or you can connect with me through my LinkedIn.

Why we need data integration and what can we do?

The first question is: why do we need data integration?

Let me give you an example to answer this question.

Every company has many departments, and different departments use different tools to store their data. For example, the marketing team may use HubSpot.

So within one company, we have different departments storing different types of data.

However, insightful information has to be drawn from this large amount of data to make business decisions.

What can we do?

We could connect to all the databases every time we generate reports. However, that would cost a large amount of time, which is why data integration came about.

What is data integration?

Data integration is a process in which heterogeneous data is retrieved and combined as an incorporated form and structure.

There are some advantages of data integration:

  1. Reduce data complexity
  2. Data integrity
  3. Easy data collaboration
  4. Smarter business decisions

There are also some well-known tools that can do data integration, e.g., Microsoft SQL Server Integration Services (SSIS), Informatica, Oracle Data Integrator, AWS Glue etc.

In this blog we will talk about SSIS, which is one of the most popular ETL tools in New Zealand.


  1. Data can be loaded in parallel to many varied destinations
  2. SSIS removes the need of hard core programmers
  3. Tight integration with other products of Microsoft
  4. SSIS is cheaper than most other ETL tools

SSIS stands for SQL Server Integration Services, which is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data integration and data transformation tasks.

What can SSIS do?

It combines the data residing in different sources and provides users with a unified view of this data.

It can also be used to automate the maintenance of SQL Server databases and updates to multidimensional analytical data.

How does it work?

These tasks of data transformation and workflow creation are carried out using an SSIS package:

Operational Data -> ETL -> Data Warehouse

In the first place, an operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. This is where most of the data used in current operations is housed before it is transferred to the data warehouse for longer-term storage or archiving.

The next step is ETL (Extract, Transform and Load): the process of extracting the data from various sources, transforming it to meet your requirements and then loading it into a target data warehouse.

The third step is data warehousing. A data warehouse is a large set of accumulated data, assembled and managed from various sources for the purpose of answering business questions. Hence, it helps in making decisions.
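The three steps can be sketched in a few lines of Python. This is only a toy illustration of the extract, transform and load flow, not how SSIS works internally; the two sources, the table name customer_sales and all values are made up, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical operational data from two sources with different shapes
CRM_CSV = "customer_id,name\n1,Alice\n2,Bob\n"
ORDERS = [
    {"customer": 1, "amount": "20.00"},
    {"customer": 1, "amount": "5.50"},
    {"customer": 2, "amount": "7.00"},
]


def run_etl():
    # Extract: read each source in its own format
    customers = list(csv.DictReader(io.StringIO(CRM_CSV)))

    # Transform: unify types and total the order amounts per customer
    totals = {}
    for order in ORDERS:
        key = order["customer"]
        totals[key] = totals.get(key, 0.0) + float(order["amount"])
    rows = [
        (int(c["customer_id"]), c["name"], totals.get(int(c["customer_id"]), 0.0))
        for c in customers
    ]

    # Load: write the unified rows into the warehouse table
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE customer_sales (customer_id INTEGER, name TEXT, total REAL)"
    )
    warehouse.executemany("INSERT INTO customer_sales VALUES (?, ?, ?)", rows)
    return warehouse.execute(
        "SELECT customer_id, name, total FROM customer_sales ORDER BY customer_id"
    ).fetchall()


print(run_etl())
```

After the load step, a report only needs to query one table instead of connecting to every source system.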

What is SSIS package?

It is an object that implements integration services functionality to extract, transform and load data. A package is composed of:

  1. connection
  2. control flow elements(handle workflow)
  3. data flow elements(handle data transform)

If you want to learn more about SSIS, check out the official Microsoft documentation.

If you are interested in or have any problems with SSIS, feel free to contact me.

Or you can connect with me through my LinkedIn.

An advanced SQL mind map

Previously, I wrote a blog about a basic SQL mind map for people who want to kick-start their career in the business intelligence and data analysis industry.

As we know, SQL is essential to becoming a Business Intelligence Developer/Data Analyst.

So, in this blog, I drew an advanced SQL mind map for people who want to dive deeper into their SQL journey.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

Some Cloud Computing Fundamentals

What is cloud computing?

Cloud computing is the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer.


On-premise:

  1. You own the servers
  2. You hire the IT people
  3. You pay or rent the real-estate
  4. You take all the risk

Cloud providers:

  1. Someone else owns the servers
  2. Someone else hires the IT people
  3. Someone else pays or rents the real-estate
  4. You are responsible for configuring your cloud services and code; someone else takes care of the rest.

Different kinds of hosting:

  1. Dedicated Server: One physical machine dedicated to a single business. Runs a single web-app/site. (Very expensive, high maintenance, high security)
  2. Virtual Private Server: One physical machine dedicated to a single business. The physical machine is virtualized into sub-machines and runs multiple web-apps/sites.
  3. Shared Hosting: One physical machine, shared by hundreds of businesses. Relies on most tenants under-utilizing their resources. (Very cheap, Very limited)
  4. Cloud Hosting: Multiple physical machines that act as one system. The system is abstracted into multiple cloud services. (Flexible, Scalable, Secure, Cost-Effective, High Configurability)

Common Cloud Services

A cloud provider can have hundreds of cloud services, grouped into various types. The four most common types of cloud services for Infrastructure as a Service (IaaS) are:

  1. Compute: Imagine having a virtual computer that can run applications, programs and code
  2. Storage: Imagine having a virtual hard drive that can store files
  3. Networking: Imagine having a virtual network in which you can define Internet connections or network isolation
  4. Database: Imagine a virtual database for storing reporting data, or a database for a general-purpose web application

The term ‘cloud computing’ can be used to refer to all categories, even though it has ‘compute’ in the name.

Benefits of Cloud Computing

Cost-effective: You pay for what you consume, with no up-front cost. With pay-as-you-go (PAYG), thousands of customers share the cost of the resources.

Global: Launch workloads anywhere in the world, just choose a region

Secure: The cloud provider takes care of physical security. Cloud services can be secure by default, or you can configure access down to a granular level.

Reliable: Data backup, disaster recovery, data replication and fault tolerance.

Scalable: Increase or decrease resources and services based on demand

Elastic: Automate scaling during spikes and drops in demand

Current: The underlying hardware and managed software are patched, upgraded and replaced by the cloud provider without interruption to you.

Next blog I will write some fundamentals about Microsoft Azure, which is the cloud provider service of Microsoft.

If you are interested in or have any problems with cloud computing, feel free to contact me.

Or you can connect with me through my LinkedIn.

How to design a Data Warehouse(Part 1)

In the last blog I wrote about why we need a Data Warehouse.

First, what is a data warehouse?

It is a centralized relational database that pulls together data from different sources (CRM, marketing stack, etc.) for better business insights.

It stores current and historical data that is used for reporting and analysis.

However, here is the problem:

How can we design a Data Warehouse?

1 Define Business Requirements

Because a data warehouse touches all areas of a company, all departments need to be on board with the design. Each department needs to understand the benefits of the data warehouse and what results it can expect from it.

Objectives we can focus on:

  1. Determine the scope of the whole project
  2. Find out what data is useful for analysis and where our data is currently siloed
  3. Create a backup plan in case of failure
  4. Security: monitoring, etc.

2 Choose a data warehouse platform

There are four types of data warehouse platforms:

  1. Traditional database management systems: Row-based relational platforms, e.g., Microsoft SQL Server
  2. Specialized Analytics DBMS: Columnar data stores designed specifically for managing and running analytics, e.g., Teradata
  3. Out-of-the-box data warehouse appliances: a combination of software and hardware with a DBMS pre-installed, e.g., Oracle Exadata
  4. Cloud-hosted data warehouse tools

We can choose the one that suits the company according to its budget, employees and infrastructure.

Should we choose cloud or on-premise?

Cloud solution pros:

  1. Scalability: easy, cost-effective, simple and flexible to scale with cloud services
  2. Low entry cost: no servers, hardware and operational cost
  3. Connectivity: easy to connect to other cloud services
  4. Security: cloud providers supply security patches and protocols to keep customers safe


  1. Amazon Redshift
  2. Microsoft Azure SQL Data Warehouse
  3. Google BigQuery
  4. Snowflake Computing

On-premise solution pros:

  1. Reliability: With good staff and exceptional hardware, on-premise solutions can be highly available and reliable
  2. Security: Organizations have full control of the security and access


  1. Oracle Database
  2. Microsoft SQL Server
  3. MySQL
  4. IBM DB2
  5. PostgreSQL

Whether we choose an on-premise or a cloud solution depends, in the big picture, on our budget and existing systems.

If we are looking for control, we can choose an on-premise solution. Conversely, if we are looking for scalability, we can choose a cloud service.

3 Set up the physical environments

There are three physical environments for a data warehouse: development, testing and production.

  1. We need to test changes before they move into the production environment
  2. Running tests against data typically uses extreme data sets or random sets of data from the production environment.
  3. Data integrity is much easier to track and issues are easier to contain if we have three environments running.

4 Data Modelling

Data modelling is the most complex phase of data warehouse design. It is the process of visualizing how data is distributed in the warehouse.

  1. Visualize the relationships between data
  2. Set standardized naming conventions
  3. Create relationships between data sets
  4. Establish compliance and security processes

There are many data modeling techniques that businesses use for data warehouse design. Here are the 3 most popular ones:

  1. Snowflake Schema
  2. Star Schema
  3. Galaxy Schema
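As a small illustration of the star schema, the sketch below builds one fact table that references two dimension tables and then runs a typical reporting query. It uses an in-memory SQLite database, and all table names, column names and values are made up.

```python
import sqlite3


def build_star_schema():
    """Create a tiny star schema: one fact table referencing two dimension tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales  (
            date_id    INTEGER REFERENCES dim_date(date_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount     REAL
        );
        INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
        INSERT INTO dim_product VALUES (10, 'Widget');
        INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 10, 50.0), (2, 10, 25.0);
    """)
    return db


def sales_by_month(db):
    """Join the fact table to its dimensions and total the sales per month and product."""
    return db.execute("""
        SELECT d.year, d.month, p.name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d    ON d.date_id = f.date_id
        JOIN dim_product p ON p.product_id = f.product_id
        GROUP BY d.year, d.month, p.name
        ORDER BY d.year, d.month
    """).fetchall()


print(sales_by_month(build_star_schema()))
```

The fact table sits in the middle and every dimension joins to it directly, which is what gives the star schema its name. In a snowflake schema, the dimension tables would be normalized further into sub-dimensions.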

5 Choose the ETL solution

ETL stands for Extract, Transform and Load. It is the process by which we pull data from the source storage solutions into the warehouse.

We need to build an easy, replicable and consistent data pipeline because a poor ETL process can break the entire data warehouse.

Wrapping up

This post explored the first 5 steps of designing a data warehouse for a company. In the future, I will write about the next steps.

If you are interested in or have any problems with Data Warehouse, feel free to contact me.

Or you can connect with me through my LinkedIn.