
Introduction to Terraform

Imagine you are a customer of a cloud provider and you want to spin up some machines: you can go into a web console, fill in some forms, click some buttons and then launch an instance.

But you can also use Terraform.

Terraform allows you to do the same thing in code, i.e., Infrastructure as Code. It automates your infrastructure and keeps it in a desired state. For example, if you want to spin up 5 small instances, then whenever you run Terraform, it will ensure those 5 instances are running on the cloud platform. If something is changed manually, Terraform will reconcile the actual infrastructure with the code.

It also makes your infrastructure auditable: just by looking at the .tf files, we can see what the infrastructure is made of. Even better, we can keep the changes in a version control system, e.g., git.

Download and Install

Terraform can be downloaded and installed on different operating systems; please check the official website for instructions.

After downloading and installing Terraform, let's verify it.

Run “terraform -v” in a terminal. If the result shows the version, Terraform is installed and everything works properly. However, if the result is something like “terraform: command not found”, there is an issue.

The last step in setting up is to pick an IDE to manage the code, e.g., Visual Studio Code.

What language does Terraform support?

Terraform's language is called HashiCorp Configuration Language (HCL) and lives in files with a .tf extension, so all of our Terraform code will be stored in files ending in .tf.

Let’s start!

Now we can create a new project in our own IDE and name it. Then we create a file called main.tf.

At this point, the providers Terraform supports can be checked on their website.

For example, if we click into the AWS Provider, we can see an example of how to configure it.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}
# Configure the AWS Provider
provider "aws" {
  region = "us-east-1"
}
# Create a VPC
resource "aws_vpc" "example" {
  cidr_block = "10.0.0.0/16"
}

We can copy and paste the “Configure the AWS Provider” part of the code into our main.tf file, then check which region we are in on AWS and modify it accordingly.

Now we have the provider set up, and if we go back to the Terraform tutorial, it continues to show us how to set things up. The next thing to configure is authentication.

We can take the hard-coded way first to have a look: static credentials. This is not recommended, because if we publish the .tf file to GitHub or somewhere else, the credentials will be stored there, which creates a security vulnerability.

Using this way just keeps things simpler for now; later we will use a more secure approach.

From the tutorial, we can see that we need three parameter values: region, “access_key” and “secret_key”. We can go to the AWS console and find that information in the Identity and Access Management (IAM) service.

We can create an access key in the IAM service, then click “show access key” and store the two values: “access_key” and “secret_key”.
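For reference, a hard-coded provider block would look roughly like the sketch below (the key values are placeholders, not real credentials):

provider "aws" {
  region     = "us-east-1"        # your AWS region
  access_key = "my-access-key"    # placeholder - paste the access key ID from IAM
  secret_key = "my-secret-key"    # placeholder - paste the secret access key from IAM
}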

Then let's try to create and provision resources within AWS. The syntax is quite simple and stays the same whether you configure Azure, AWS or GCP in Terraform.

The resources syntax is shown as below:

resource "<provider>_<resource_type>""name"{
    config options....
    key = "value"
    key2= "another value"
}

So this is how to create a resource within a provider. Let's try to create and deploy an EC2 instance with Terraform.

We need to refer back to the documentation now, because we need to know what value to put as the “resource_type”:

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
  tags = {
    Name = "HelloWorld"
  }
}

We can find such an example in the documentation, and it is roughly the minimum we need. We can copy and paste this code into our main.tf file.

The “resource_type” in the example is “aws_instance”. We can replace “web” with whatever name we want to give the resource. The “ami” value can be found when launching an instance in the EC2 service on AWS, and we can put that value into the example. The “instance_type” is also what we would select in the AWS console.

As noted in the documentation, “tags” is optional, so we can delete it.
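Putting it together, a minimal main.tf resource might look like the sketch below (the resource name and AMI ID are placeholders; use the AMI and instance type from your own AWS console):

resource "aws_instance" "my_first_server" {
  ami           = "ami-0123456789abcdef0"   # placeholder AMI ID from the EC2 launch wizard
  instance_type = "t2.micro"                # the instance type selected in the console
}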

The next step is to go to the terminal of your IDE, which is convenient because it is already in your project directory. The first Terraform command we need to learn is “terraform init”. This command looks at the config, i.e., all the .tf files in our project. Because we only have one provider, AWS, it will download all the necessary plugins to interact with the AWS API.

After we run it in the terminal, we can see it initializing the backend, initializing provider plugins, and so on. If we added another provider for Azure, it would also download the Azure plugin.

After that succeeds, the second Terraform command we need to know is “terraform plan”. Even though it is completely optional, it is a quick sanity check to make sure we won't break anything. If we look at the output, it colour-codes things depending on the action:

  • “+” means creating a resource
  • “-” means deleting a resource
  • “~” means modifying an existing resource

The final command is “terraform apply”. It will ask you to type “yes”, then it will create our server and show “Apply complete! Resources: 1 added, 0 changed, 0 destroyed”.
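As a quick recap, the whole workflow in the terminal looks like this:

terraform init    # download the provider plugins declared in the config
terraform plan    # preview what will be created, changed or destroyed
terraform apply   # create the resources (asks for confirmation with "yes")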

Then we can verify it in our AWS console. At this point, we can see the first EC2 instance is deployed through Terraform.

There is more to learn in Terraform; maybe I will talk about it in the next blog.

If you are interested in Terraform or have any problems with it, feel free to leave me a comment.


CloudFormation 101

What is CloudFormation?

CloudFormation is a tool for spinning up resources on AWS. If you want to be an AWS expert, CloudFormation is an essential service to master.

Before we jump into writing a CloudFormation template, let's briefly look at how AWS infrastructure was managed before CloudFormation.

Without CloudFormation, you either log into the AWS console and manually provision servers, or spend time building your own tools to assist with automation, which is time-consuming.

Writing some scripts to get the job done may be fast and will generally work fine at a small scale; however, if we need to manage more systems in more environments, it becomes tedious.

So that is why configuration management tools exist.

There are some popular tools like Chef, Puppet, Salt, etc. They can be used to maintain consistency and track changes. However, their disadvantage is that they are not necessarily needed for containerized application deployments.

So there is another way: Infrastructure as Code

Infrastructure as Code is all about having a single source of truth that serves as a blueprint for what we want the infrastructure to look like.

Templates are written using a DSL that describes resources and relationships, e.g., CloudFormation and Terraform.

Their pros:

  • Consistency
  • Auditing
  • Compliance
  • Rollbacks

What languages does CloudFormation support?

The answer is JSON and YAML.

Let's take a look at a JSON template (a template represents a stack; the components within the stack are known as resources):

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Create a S3 bucket",
  "Resources": {
    "datalake": {
      "Type": "AWS::S3::Bucket"
    }
  }
}

Inside Resources, uniquely named keys are mapped to specific AWS resources. In our example, this CloudFormation template is used to create an S3 bucket in AWS.

Functions:

CloudFormation also supports a number of functions; the most useful is Ref, which is used to pass a value, e.g., the value of a parameter, into a resource.
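For example, the sketch below (the parameter name and default value are made up for illustration) uses Ref to pass a parameter value into the bucket definition:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Create a S3 bucket named by a parameter",
  "Parameters": {
    "BucketName": {
      "Type": "String",
      "Default": "my-datalake-bucket"
    }
  },
  "Resources": {
    "datalake": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": { "Ref": "BucketName" }
      }
    }
  }
}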

If you are interested or have any problems, feel free to leave me a comment.

Some query samples to interpret Google Analytics data using Bigquery

As a data analyst working at an online travel agency, turning customer behaviour data into meaningful insights is a business-as-usual task.

Google Analytics is a popular web analytics service for tracking website traffic. Therefore, for a data department, knowing how to interpret Google Analytics data is an essential skill.

On the technical side, SQL can be used in Google BigQuery (a cloud-based data warehousing platform) to generate insights from Google Analytics data.

Let's look at some query samples. First, a few dynamic values need to be understood before we write the first query:

  • dimension: a parameter used for analysing sessions/users/hits, e.g., device.deviceCategory, device.browser, hits.type
  • projectID: the ID of the project in BigQuery
  • dataSetName: the data set in BigQuery with the information about Google Analytics sessions

So we can write a query sample to calculate the number of users, sessions and hits across a chosen dimension:

SELECT
  {{dimension}},
  EXACT_COUNT_DISTINCT(fullVisitorId) AS users,
  EXACT_COUNT_DISTINCT(fullVisitorId_visitId) AS sessions,
  COUNT(hits.hitNumber) AS hits
FROM (
  SELECT
    fullVisitorId,
    CONCAT(fullVisitorId, '_', STRING(visitId)) AS fullVisitorId_visitId,
    hits.hitNumber,
    hits.type,
    device.browser,
    device.deviceCategory
  FROM
    [projectID:dataSetName.tableName])
GROUP BY
  {{dimension}}

Another sample is to set up a funnel with steps better suited to your business, as we did for the first query. Here are the dynamic values we need to use:

  • projectID: ID of the project in BigQuery
  • dataSetName: data set in BigQuery with the information about Google Analytics sessions
  • tableName: name of the table with the standard BigQuery export data

Then we can write a query sample as follows:

SELECT
  table_step1.date AS date,
  table_step1.step1 AS count_step1,
  table_step2.step2 AS count_step2,
  table_step3.step3 AS count_step3
FROM (
  -- calculate transitions to Step 1 (Pageviews), grouped by days
  SELECT
    date,
    COUNT(*) AS step1
  FROM
    [projectID:dataSetName.tableName]
  WHERE
    hits.eCommerceAction.action_type = '2'
  GROUP BY
    date) AS table_step1
JOIN (
  -- calculate the number of transitions to Step 2 (Product description view) by days
  SELECT
    date,
    COUNT(*) AS step2
  FROM
    [projectID:dataSetName.tableName]
  WHERE
    hits.eventInfo.eventCategory CONTAINS 'Click'
    AND hits.eventInfo.eventAction CONTAINS 'Characteristics of product'
  GROUP BY
    date) AS table_step2
ON
  table_step1.date = table_step2.date
JOIN (
  -- calculate transitions to Step 3 (Adding product to cart) by days
  SELECT
    date,
    COUNT(*) AS step3
  FROM
    [projectID:dataSetName.tableName]
  WHERE
    hits.eCommerceAction.action_type = '3'
  GROUP BY
    date) AS table_step3
ON
  table_step1.date = table_step3.date

There is more we can do to track website traffic with BigQuery; maybe I will talk about it in the next blog.

If you are interested or have any problems, feel free to leave me a comment.

An advanced SQL mind map

A previous blog, which included a basic SQL mind map, was aimed at people who want to kick-start their career in the business intelligence and data analysis industry.

As we know, SQL is essential to become a Business Intelligence Developer/Data Analyst.

So, in this blog, I drew an advanced SQL mind map for people who want to go deeper on their SQL journey.

If you are interested or have any problems, feel free to leave me a comment.

CrUX Dashboard & Data Strategy Lifecycle

In the last blog, I demonstrated the data pipeline we can use with CrUX to analyse site performance. That was from a data analyst's perspective.

However, for a company, and especially the leadership team, what they want is the final dashboard produced by the data department, from which a management plan can be derived.

I have already written about how to query BigQuery and which site speed metrics we can use in the introduction-to-CrUX blog and the public dataset analysis blog.

So in this blog, I will show you what kind of dashboard we can generate after the steps of data collection from the Google public dataset and ETL.

What data visualization tool do we need to use?

There are plenty of data visualization tools we can use, e.g., Data Studio, Power BI, etc. This time I will take Tableau as an example.

I took www.flightcentre.co.nz (blue line) and www.rentalcars.com (red line) as the origins for comparison and set the customer device to ‘desktop’ (we can also put a filter on it).

There are 4 sheets on the dashboard, i.e., Slow FCP Percentage, Fast FCP Percentage, Fast FID Percentage and Slow FID Percentage.

What do they actually mean?

  1. Slow FCP Percentage (the percentage of users that experienced a first contentful paint time of 2.5 seconds or more)
  2. Fast FCP Percentage (the percentage of users that experienced a first contentful paint time of 1 second or less)
  3. Fast FID Percentage (the percentage of users that experienced a first input delay time of 50 ms or less)
  4. Slow FID Percentage (the percentage of users that experienced a first input delay time of 250 ms or more)
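For reference, the fast/slow FCP percentages behind sheets like these can be computed in BigQuery with a query roughly like the sketch below (the https:// origins, the 202003 table and the 1s/2.5s thresholds are my assumptions based on the thresholds in the PageSpeed Insights table later in this series; the FID sheets are built the same way from the FID histogram):

SELECT
  origin,
  ROUND(SUM(IF(fcp.start < 1000, fcp.density, 0)) / SUM(fcp.density), 4) AS fast_fcp_pct,
  ROUND(SUM(IF(fcp.start >= 2500, fcp.density, 0)) / SUM(fcp.density), 4) AS slow_fcp_pct
FROM
  `chrome-ux-report.all.202003`,
  UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE
  origin IN ('https://www.flightcentre.co.nz', 'https://www.rentalcars.com')
  AND form_factor.name = 'desktop'
GROUP BY
  origin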

From this graph, we can roughly see that flightcentre has a higher site speed than rentalcars in terms of user experience.

What can we do in the next step?

After that, we can inform the devs and communicate the impact by showing exactly where the site is falling down. We can point to the fact that this comes from real users and shows how people actually experience the site.

The second part is the data strategy lifecycle in a company.

What is the data strategy lifecycle in a company?

Develop the strategy –> Create the roadmap –> Change management plan –> Analytics lifecycle –> Measurement plan

Perspectives:

  1. Scope and purpose: What data will we manage? How much is our data worth? How do we measure success?
  2. Data collection: Archiving, what data, where and when, a single source of truth (data lake), integrating data silos
  3. Architecture: Real time vs batch, data sharing, data management and security, data modelling, visualization
  4. Insights and analysis: Data exploration, self-service, collaboration, managing results
  5. Data governance: Identify data owners, strategic leadership, data stewardship, data lineage, quality, and cost
  6. Access and security: RBAC, encryption, PII, access processes, audits, regulatory requirements
  7. Retention and SLAs: Data tiers and retention, SLAs to the business

Wrapping up

This post explored the CrUX dashboard that a data team can generate and the data strategy in a company. In the future, I will write more.

If you are interested or have any problems, feel free to leave me a comment.

Tutorial: Using BigQuery to Analyze CrUX Data

In the last blog, I gave some examples of how we can use the Chrome User Experience Report (CrUX) to gain insights into site speed. In this blog, I will continue by showing you how to use BigQuery to compare your site with your competitors'.

Prerequisite:

  1. Log into Google Cloud
  2. Create a project for the CrUX work
  3. Navigate to the BigQuery console
  4. Add the chrome-ux-report dataset and explore the way the tables are structured in ‘preview’

Step one: Figure out the origin of your site and the competitor sites

LIKE syntax is preferred here (take care of the syntax differences between Standard SQL and T-SQL).

-- created by: Jacqui Wu
-- data source: Chrome-ux-report(202003)
-- last update: 12/05/2020
  
SELECT
  DISTINCT origin
FROM
  `chrome-ux-report.all.202003`
WHERE
  origin LIKE '%yoursite'

Step two: Figure out what should be queried in the SELECT clause

What can we query from CrUX?

The specific elements that Google is sharing are:

  1. “Origin”, which consists of the protocol and hostname, as we used in step one, and identifies the site
  2. Effective Connection Type (4G, 3G, etc.), which can be queried as the network
  3. Form Factor (desktop, mobile, tablet), which can be queried as the device
  4. Percentile histogram data for First Paint, First Contentful Paint, DOM Content Loaded and onLoad (these are all nested, so if we want to query them, we need to unnest them)

Here I create a SQL query of the FCP density across different sites. FCP measures the time from navigation to when the browser renders the first bit of content from the DOM.

This is an important milestone for users because it provides feedback that the page is actually loading.

SQL query:

  -- created by: Jacqui Wu
  -- data source: Chrome-ux-report(202003) in different sites
  -- last update: 12/05/2020
  -- Comparing the FCP metric in different sites
SELECT
  origin,
  form_factor.name AS device,
  effective_connection_type.name AS conn,
  "first contentful paint" AS metric,
  bin.start / 1000 AS bin,
  SUM(bin.density) AS volume
FROM (
  SELECT
    origin,
    form_factor,
    effective_connection_type,
    first_contentful_paint.histogram.bin AS bins
  FROM
    `chrome-ux-report.all.202003`
  WHERE
    origin IN ("your site URL link", "competitor A site URL link", "competitor B site URL link")
)
CROSS JOIN UNNEST(bins) AS bin
GROUP BY origin, device, conn, bin

Step 3: Export the results to Data Studio (Google's visualization tool)

Here are some tips that may be useful:

  1. A line chart is preferred for comparing different sites in Visual Selection
  2. Set the x-axis to bin (which we already converted to seconds) and the y-axis to the percentage of FCP
  3. Set filters (origin, device, conn) in the Filtering section

Wrapping up

This post explored the data pipeline we can use with the CrUX report to analyse site performance. In the future, I will write more about CrUX.

If you are interested or have any problems, feel free to leave me a comment.

How to use CrUX to analyze your site?

What is CrUX?

CrUX stands for the Chrome User Experience Report. It provides real-world, real-user metrics gathered from the millions of Google Chrome users who load millions of websites (including yours) each month. Of course, they have all opted in to syncing their browsing history and have usage statistics reporting enabled.

According to Google, its goal is to ‘capture the full range of external factors that shape and contribute to the final user experience’.

In this post, I will walk you through how to use it to get insights into your site's performance.

Why do we need CrUX?

We all know a faster site results in a better user experience and better customer loyalty compared to competitors' sites, which in turn increases revenue. Google has confirmed some details about how it understands speed, and they are available in CrUX.

What are CrUX metrics?

  1. FP (First Paint): when the browser first renders anything on the page
  2. FCP (First Contentful Paint): when some text or an image is first rendered
  3. DCL (DOM Content Loaded): when the DOM has loaded
  4. ONLOAD: when any additional scripts have loaded
  5. FID (First Input Delay): the time between when a user first interacts with your site and when the browser actually responds to that interaction

How to generate the CrUX report on PageSpeed Insights?

PageSpeed Insights is a tool for people to understand what a page’s performance is and how to improve it.

It uses Lighthouse to audit the given page and identify opportunities to improve performance. It also integrates with CrUX to show how real users experience performance on the page.

Take Yahoo as an example: after a few seconds, the Lighthouse audits are performed and we see sections for field and lab data.

In the field data section, we can see FCP and FID (see the table below for how FCP and FID values are categorised).

Metric | Fast     | Average        | Slow
FCP    | 0-1000ms | 1000ms-2500ms  | 2500ms+
FID    | 0-50ms   | 50-250ms       | 250ms+

According to the table, the Yahoo site is ‘average’. To achieve ‘fast’, both FCP and FID must be categorised as fast.

Also, a percentile is shown for each metric: for FCP, the 75th percentile is used, and for FID, the 95th. For example, 75% of FCP experiences on the page are 1.5s or less.

How to use it in BigQuery?

In BigQuery, we can also extract insights about UX on our site.

SELECT origin, form_factor.name AS device, effective_connection_type.name  AS conn, 
       ROUND(SUM(onload.density),4) as density
FROM `chrome-ux-report.all.201907`,
    UNNEST (onload.histogram.bin) as onload
WHERE origin IN ("https://www.yahoo.com")
GROUP BY origin, device, conn

Then we can see the result in BigQuery.

The raw data is organized like a histogram, with bins that have a start time, an end time and a density value. For example, we can query for the percentage of ‘fast’ FCP experiences, where ‘fast’ is defined as happening in under a second.

We can compare Yahoo with Bing. Here is how the query looks:

SELECT
  origin,
  SUM(fcp.density) AS fast_fcp
FROM
  `chrome-ux-report.all.201907`,
  UNNEST (first_contentful_paint.histogram.bin) AS fcp
WHERE
  fcp.start<1000
  AND origin IN ('https://www.bing.com',
    'https://www.yahoo.com')
GROUP BY
  origin

Wrapping up

This post explored some methods for getting site insights with the CrUX report. In the future, I will write more about CrUX.

If you are interested or have any problems, feel free to leave me a comment.

Google Analytics cheat sheets

Data-driven VS Data-informed

Data-driven: Decisions made only based upon statistics, which can be misleading.

Data-informed: Decisions made by combining statistics with insights and our knowledge of human wants & needs. 

We will be able to use data and human creativity to come up with innovative solutions in a business.

When we click into Google Analytics, we see a large number of lines, full of data and strange names.

But don't panic; we can break things down.

In this blog, I will illustrate: Analytics + Art = Creative Data Scientist

Agenda

  1. Installing And Customising Google Analytics
  2. Learning Dashboard
  3. Analysing Behaviours
  4. User Acquisition
  5. Generate And Share Reports

1. Installing And Customising Google Analytics

How to Setup Google Analytics & Install on Website - YouTube

When we install Google Analytics, there are several terms we should pay attention to:

  1. Tracking code: a basic code snippet for a website. It starts with UA, which stands for Universal Analytics.
  2. Data collection: turn it on and it allows us to collect data about users.
  3. User-ID: allows us to track users. Generate a unique user ID, make sure the right ID is assigned to the right user, and associate the data in Google Analytics.
  4. Session settings: a session is any time a user has loaded up your site on their device. We can set the session timeout here.
  5. Referral exclusion list: mostly, we add our own site URL here.

2. Learning Dashboards:

Admin page: account, property, view

Under the account, there are several properties.

Take one account as an example: there are many URLs associated with it.

In other words, we have a site. Under this site, we have a whole bunch of other properties that we’ve associated here.

The tracking Id is the same, except for a number at the very end.

Google Analytics Admin Page

Google Ads and Google AdSense: 

Ads are the words we buy from Google; they are links or text that appear at the top of Google search pages. AdSense is the ads Google sells that we can insert on our own site. We use AdSense if we are a publisher and want to monetise our content.

Set ‘Bot Filtering’ in the view settings under the view page of Admin.

It excludes all hits from known bots and spiders. Google, Yahoo, etc. have programs to analyse and index content. ‘Bots’ is short for robots; they are called spiders because these little programs crawl the web.

It is not mandatory to turn it on, but if we have a big site and want to do professional analytics, we should get insights about what the humans are doing (the content is for humans), excluding the robots. If we are interested in what users are doing on our site, this is an easy switch to turn on.

Because the robots just crawl the data from our sites.

Sidebar

Then we can have a look at the sidebar.

  1. Customisation: I suggest installing a custom report to give it a try.
  2. Real-time: what users are doing this very second.
  3. Audience: we will see who, what, when, where and from where. A big module.
  4. Acquisition: where traffic comes from and how marketing efforts are working. If we look at channels, we will see direct, paid search, organic search, (other), referral, email and social. This tells us how people find us, how our marketing is working, and whether we are just wasting our money.
  5. Behaviour: this is kind of fascinating; think of it as the security camera in the store. We are watching our users picking up items, checking out or running out of the exits.
  6. Conversions: the happy part. It is where we track and figure out how well our site is turning visitors into customers.

3. Analysing Audience Behaviour

(1) Conversion vs Engagement:

Conversion: A one-time interaction. Granted, this is a powerful interaction, but it is the end goal of a chain of events.

Engagement: Repeated use, that results in an emotional, psychological and sometimes near-physical tie that users have to products, e.g., apple fans.

Build a hypothesis via the audience overview

There are a lot of opportunities to grow if we were to make this site available in other languages.

(2) Active users

From the active users line, we can see whether the marketing is having no effect on traffic, or whether there is really no marketing being done at all.

(3) Cohort analysis

A cohort is a group of users who all share a common characteristic; in this case the acquisition date, the day they came to your site, which here is known as day 0. The metrics here are used to analyse user behaviour.

We can see how things are going on day 1.

We can see how many people came back the next day. 

  1. Track individual users with user explorer
  2. How to use segmentation to refine demographics and interests

It helps us to know who our audience is and what type of content we are trying to expose them to, which impacts the design choices.

(4) Demographics

If it is young people who use smartphones mostly, we should simplify the navigation choices.

(5) Interests

A target-rich environment for the site can be a place that combines the top 3 interests.

We can create a ‘tablet traffic’ segment to give it a try and see whether there are differences from all users.

(6) Geo

Language & location. We can set a segment like ‘converters’ to compare with ‘all users’ and find some differences.

(7) Behaviours:

We can set ‘mobile and tablet traffic’ as the segment to compare with ‘all users’. We can find out after how many seconds people are paying more attention, whether the numbers are trending up, and whether we held their attention for a long span of time.

(8) Technology

Browser & OS: the Flash version matters if we want to run ads on the website; we need to make sure they actually display.

Network: this can be a really big deal if we are working with users in areas where we know they have very slow connections. Do we need to build a simplified page for them? This is called adaptive design.

(9) Mobile

If something is strange but not significant, we can just move on.

  1. Benchmarking and users flow
  2. Page Analytics (a plugin we can find in Chrome store)

4. User Acquisition

(1) Learning about channels, sources and mediums

There are many questions here:

Well, how do they get to the site?

Sources, searches and referrals

SEO and what users are looking for

Social statistics and …

Channels:  The general, top-level categories that our traffic is coming from, such as search, referral or social

Sources: A subcategory of a channel. For example, search is a channel. Inside that channel, Yahoo Search is a source.

Mediums: how the traffic from a source gets to our site. That is, if the traffic is coming from Google, is it organic search or paid search?

(2) Differentiating between channels: organic search and direct

(direct): direct traffic is where someone comes directly to our site, i.e., they type the address into the browser bar or click on a bookmark.

‘not provided’: the data that comes through Google is now encrypted, to keep governments, hackers or spies from getting value from it.

(3) Unlocking Mysterious Dark Social Traffic

There are 6 ways that the dark social traffic can come to the site.

  1. Email, messages. The traffic is from someone’s email program. This is not tracked by Google Analytics because GA lives in browsers.
  2. Links in docs: it is in an application that is not tracked by Google Analytics
  3. Bit.ly, Ow.ly, etc.
  4. Mobile social: Twitter, etc.
  5. From https to http
  6. Search errors

(4) Drilling down to track who goes where

From source/mediums: trigger email

(5) Spotting the ‘Ghost Spam’ in referrals

Ghost spam: 

It isn't really hurting anything. In fact, if we are organized and we are the only ones looking at the reports, we can leave them alone and nothing happens.

These are noxious visits to our site made with the nefarious intent of getting us to click on the links and visit the spammer's site. They are not actual visits; these sessions and pageviews are from bots that either hit our site and execute the Google Analytics scripts, or bypass the server and hit Google Analytics directly.

Firstly, we need to find the ghost referrals:

They come from hostnames that are not our site.

Check it through:  Acquisition–>All traffic–>Referrals.

We can see there are some websites that have a 100% bounce rate and a 0s average session duration.

We can also see entries like xxxxxx.com / referral in Acquisition–>All traffic–>Source/Medium.

If you are interested or have any problems, feel free to leave me a comment.

Analyzing Google Analytics Data in BigQuery (Part1)

What is BigQuery?

Among Google Cloud Platform family products, there are Google App Engine, Google Compute Engine, Google Cloud Datastore, Google Cloud Storage, Google BigQuery (for analytics), and Google Cloud SQL.

The most important product for a data analyst is BigQuery. It is a fully managed OLAP data warehouse that supports joins and lets developers use SQL to query massive amounts of data in seconds.

Why BigQuery?

The main advantage is that BigQuery can integrate with Google Analytics. It means we can easily synchronise session/event data to BigQuery to build custom analytics, beyond the built-in Google Analytics functions.

In other words, raw GA data can be dumped into BigQuery, so custom analytics that can't be performed through the GA interface can now be generated in BigQuery.

Moreover, we can also bring third-party data into it.

The difficulty for a data analyst is that we need to calculate every metric in queries ourselves.

Which SQL dialect is preferred in BigQuery?

Standard SQL syntax is preferred in BigQuery nowadays.

How can we get the data from Google Analytics?

A daily export of data from GA to BigQuery is available. Within each dataset, a table is imported for each day of export; its name format is ga_sessions_YYYYMMDD.
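Because there is one table per day, queries over a date range typically use a wildcard table. A minimal sketch in standard SQL, assuming placeholder projectID and datasetID names, might look like this:

SELECT
  date,
  SUM(totals.visits) AS visits,
  SUM(totals.pageviews) AS pageviews
FROM
  `projectID.datasetID.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20200501' AND '20200507'
GROUP BY
  date
ORDER BY
  date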

We can also set some steps to make sure the tables, dashboards and data transfers are always up-to-date.

How to give it a try?

Firstly, set up a Google Cloud billing account. With one, we can use the BigQuery web UI with Google Analytics 360.

The next step is to run a SQL query and visualize the output. The query editor is standard and follows the SQL syntax.

For example, here is a sample query that queries user-level data, total visits and page views.

SELECT fullVisitorId,
       visitId,
       trafficSource.source,
       trafficSource.medium,
       totals.visits,
       totals.pageviews
FROM `ga_sessions_YYYYMMDD`

In this step, to get a good understanding of the ga_sessions_ table in BigQuery, we need to know which raw GA data fields are available in BigQuery.

We can use an interactive visual representation as the reference.

In the next blog, I will give more examples of how to analyse GA data in BigQuery by date ranges and other dimensions such as users, sessions and traffic sources.

If you are interested or have any problems, feel free to leave me a comment.

What is normalization and what are normalization forms?

Normalization is a technique that decomposes tables to reduce data redundancy.

There are 3 normalization forms we need to check and follow to normalize tables.

1st Normalization Form:

  1. Each column of the table should hold a single value
  2. All values in a column should belong to the same domain
  3. Two columns should not have the same name
  4. The records need not be in any particular order, e.g., we do not need to sort them by DOB

2nd Normalization Form:

  1. The table should satisfy the 1st Normalization Form
  2. All non-key attributes must be functionally dependent on the primary key

3rd Normalization Form:

  1. The table should satisfy the 2nd Normalization Form
  2. There should be no transitive dependency for non-prime attributes

A transitive functional dependency works as follows:

If A is functionally dependent on B, and B is functionally dependent on C, then A is transitively dependent on C through B.
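As a small illustration (the table and column names are made up), consider an employees table where dept_name depends on dept_id, which in turn depends on emp_id; dept_name is therefore transitively dependent on emp_id. Splitting the table removes that transitive dependency and brings the design to the 3rd Normalization Form:

-- Before: dept_name is transitively dependent on emp_id via dept_id
CREATE TABLE employees (
  emp_id    INT PRIMARY KEY,
  emp_name  VARCHAR(100),
  dept_id   INT,
  dept_name VARCHAR(100)
);

-- After: the department attributes move into their own table
CREATE TABLE departments (
  dept_id   INT PRIMARY KEY,
  dept_name VARCHAR(100)
);

CREATE TABLE employees (
  emp_id   INT PRIMARY KEY,
  emp_name VARCHAR(100),
  dept_id  INT REFERENCES departments(dept_id)
);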

If you are interested or have any problems, feel free to leave me a comment.