As we know, the first 70% of SQL is pretty straightforward, but the remaining 30% can be pretty tricky. So in this blog, I will cover some popular hard SQL interview questions to help you sharpen your skills.
Self-Join Practice Problems
Part 1: How much does a key metric, e.g., monthly active users, change between months? For example, a table named ‘logins’ is shown below.
Q: Find the month-over-month percentage change for monthly active users (MAU).
WITH mau AS (
  SELECT
    DATE_TRUNC('month', date) AS month_timestamp,
    COUNT(DISTINCT user_id) AS mau
  FROM logins
  GROUP BY DATE_TRUNC('month', date)
)
SELECT
  a.month_timestamp AS previous_month,
  a.mau AS previous_mau,
  b.month_timestamp AS current_month,
  b.mau AS current_mau,
  ROUND(100.0 * (b.mau - a.mau) / a.mau, 2) AS percent_change
FROM mau a
JOIN mau b
  ON a.month_timestamp = b.month_timestamp - INTERVAL '1 month'
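The self-join logic above can be sanity-checked locally. Here is a minimal sketch using Python's sqlite3 with made-up login rows; since SQLite has no DATE_TRUNC or INTERVAL, months are truncated with strftime and the "previous month" join uses date() arithmetic instead:

```python
import sqlite3

# Hypothetical in-memory 'logins' table to sanity-check the self-join logic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [(1, "2020-01-05"), (2, "2020-01-09"), (1, "2020-01-20"),   # Jan MAU = 2
     (1, "2020-02-03"), (2, "2020-02-10"), (3, "2020-02-21")],  # Feb MAU = 3
)

rows = conn.execute("""
    WITH mau AS (
        SELECT strftime('%Y-%m', date) AS month, COUNT(DISTINCT user_id) AS mau
        FROM logins
        GROUP BY 1
    )
    SELECT a.month, a.mau, b.month, b.mau,
           ROUND(100.0 * (b.mau - a.mau) / a.mau, 2) AS percent_change
    FROM mau a
    JOIN mau b ON b.month = strftime('%Y-%m', date(a.month || '-01', '+1 month'))
""").fetchall()
print(rows)  # → [('2020-01', 2, '2020-02', 3, 50.0)]
```

Going from 2 to 3 monthly active users is a 50% month-over-month increase, which matches the query output.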
In the next blog, we will continue our journey with medium-hard SQL questions.
If you are interested in or have any problems with SQL, feel free to contact me.
As a BI Analyst working at an online travel agency, interpreting customer behaviour data into meaningful insights is a business-as-usual task.
Google Analytics is a popular web analytics service for tracking website traffic, so for a BI & Reporting team, knowing how to interpret Google Analytics data is an essential skill.
On the technical side, SQL can be used in Google BigQuery (a cloud-based data warehousing platform) to generate data insights from Google Analytics.
Let’s take some query samples to have a look. Firstly, some dynamic values need to be understood before we write the first query:
Dimension: a parameter used for analyzing sessions/users/hits, e.g., device.deviceCategory, device.browser, hits.type
projectID: ID of the project in BigQuery
dataSetName: data set in BigQuery with the information about Google Analytics sessions.
So we can write a sample query to calculate the number of users, sessions and hits across the chosen dimensions:
SELECT
  EXACT_COUNT_DISTINCT(fullVisitorId) AS users,
  EXACT_COUNT_DISTINCT(fullVisitorId_visitId) AS sessions,
  COUNT(hits.hitNumber) AS hits
FROM (
  SELECT fullVisitorId, hits.hitNumber,
    CONCAT(fullVisitorId, '_', STRING(visitId)) AS fullVisitorId_visitId
  FROM [projectID:dataSetName.tableName])
Another sample is to set up a funnel with steps better suited to your business, as we did for the first query. Here are some dynamic values we need to use:
projectID: ID of the project in BigQuery
dataSetName: data set in BigQuery with the information about Google Analytics sessions
tableName: name of the table with the standard BigQuery export data
Then we can write a query sample as follows:
SELECT
  table_step1.date AS date,
  table_step1.step1 AS count_step1,
  table_step2.step2 AS count_step2,
  table_step3.step3 AS count_step3
FROM
  --calculate transitions to Step 1 (Pageviews), grouped by days
  (SELECT date, COUNT(*) AS step1
   FROM [projectID:dataSetName.tableName]
   WHERE hits.type = 'PAGE'
   GROUP BY date) AS table_step1
JOIN
  --calculate the number of transitions to Step 2 (Product description view) by days
  (SELECT date, COUNT(*) AS step2
   FROM [projectID:dataSetName.tableName]
   WHERE hits.eventInfo.eventCategory CONTAINS 'Click'
     AND hits.eventInfo.eventAction CONTAINS 'Characteristics of product'
   GROUP BY date) AS table_step2
  ON table_step1.date = table_step2.date
JOIN
  --calculate transitions to Step 3 (Adding product to cart) by days
  (SELECT date, COUNT(*) AS step3
   FROM [projectID:dataSetName.tableName]
   WHERE hits.eventInfo.eventAction CONTAINS 'Add to cart' --adjust to your own event tracking
   GROUP BY date) AS table_step3
  ON table_step1.date = table_step3.date
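The shape of this funnel logic, one per-day aggregate per step joined on date, can be checked locally. Here is a toy sketch in Python's sqlite3; the flattened table, column names and event names are hypothetical stand-ins for the nested GA export:

```python
import sqlite3

# Toy 'hits' table standing in for the GA export, to illustrate the funnel logic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (date TEXT, hit_type TEXT, event_action TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", [
    ("2020-05-01", "PAGE", None),
    ("2020-05-01", "PAGE", None),
    ("2020-05-01", "EVENT", "Characteristics of product"),
    ("2020-05-01", "EVENT", "Add to cart"),
])

# Each funnel step is a per-day aggregate; the steps are then joined on date.
rows = conn.execute("""
    SELECT s1.date, s1.step1, s2.step2, s3.step3
    FROM (SELECT date, COUNT(*) AS step1 FROM hits
          WHERE hit_type = 'PAGE' GROUP BY date) s1
    JOIN (SELECT date, COUNT(*) AS step2 FROM hits
          WHERE event_action = 'Characteristics of product' GROUP BY date) s2
      ON s1.date = s2.date
    JOIN (SELECT date, COUNT(*) AS step3 FROM hits
          WHERE event_action = 'Add to cart' GROUP BY date) s3
      ON s1.date = s3.date
""").fetchall()
print(rows)  # → [('2020-05-01', 2, 1, 1)]
```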
There is more we can do to track website traffic with BigQuery; maybe I will talk about it in the next blog.
If you are interested in or have any problems with BigQuery, feel free to contact me.
The first question is: why do we need data integration?
Let me give you an example to answer this question.
Every company has many departments, and different departments use different tools to store their data. For example, the marketing team may use HubSpot.
So we have different departments storing different types of data in one company.
However, insightful information is needed to make business decisions from that large amount of data.
What can we do?
Maybe we can connect to all the databases every time we need to generate reports. However, that would cost a large amount of time, which is where the term data integration comes in.
What is data integration?
Data integration is a process in which heterogeneous data is retrieved and combined into a unified form and structure.
There are some advantages of data integration:
Reduce data complexity
Easy data collaboration
Smarter business decisions
There are also some well-known tools that can do data integration, e.g., Microsoft SQL Server Integration Services (SSIS), Informatica, Oracle Data Integrator, AWS Glue, etc.
In this blog, we will talk about SSIS, one of the most popular ETL tools in New Zealand. Its advantages include:
Data can be loaded in parallel to many varied destinations
SSIS reduces the need for hard-core programming
Tight integration with other products of Microsoft
SSIS is cheaper than most other ETL tools
SSIS stands for SQL Server Integration Services, which is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data integration and data transformation tasks.
What can SSIS do?
It combines data residing in different sources and provides users with a unified view of that data.
It can also be used to automate the maintenance of SQL Server databases and updates to multidimensional analytical data.
How does it work?
The tasks of data transformation and workflow creation are carried out using SSIS packages:
Operational Data–>ETL–>Data Warehouse
In the first place, an operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. This is where most of the data used in current operations is housed before it’s transferred to the data warehouse for longer-term storage or archiving.
The next step is ETL (Extract, Transform and Load): the process of extracting data from various sources, transforming the data to meet your requirements and then loading it into a target data warehouse.
The third step is data warehousing: a large, accumulated set of data used for assembling and managing data from various sources in order to answer business questions, and hence to help in making decisions.
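The Operational Data –> ETL –> Data Warehouse flow above can be sketched in a few lines of Python with sqlite3. This is only an illustration of the pattern SSIS implements graphically; the table and column names are made up:

```python
import sqlite3

# A minimal, hypothetical ETL sketch: extract rows from an operational store,
# transform them, and load them into a warehouse table.
ods = sqlite3.connect(":memory:")  # stands in for the operational data store
dwh = sqlite3.connect(":memory:")  # stands in for the data warehouse

ods.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
ods.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 399)])

# Extract
rows = ods.execute("SELECT id, amount_cents FROM orders").fetchall()
# Transform: convert cents to dollars
rows = [(order_id, cents / 100.0) for order_id, cents in rows]
# Load
dwh.execute("CREATE TABLE fact_orders (id INTEGER, amount_dollars REAL)")
dwh.executemany("INSERT INTO fact_orders VALUES (?, ?)", rows)

print(dwh.execute("SELECT * FROM fact_orders").fetchall())  # → [(1, 12.5), (2, 3.99)]
```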
What is SSIS package?
It is an object that implements integration services functionality to extract, transform and load data. A package is composed of:
control flow elements(handle workflow)
data flow elements(handle data transform)
If you want to investigate more about SSIS, check out Microsoft’s official documentation.
If you are interested in or have any problems with SSIS, feel free to contact me.
It is a centralized relational database that pulls together data from different sources (CRM, marketing stack, etc.) for better business insights.
It stores current and historical data that are used for reporting and analysis.
However, here is the problem:
How can we design a Data Warehouse?
1 Define Business Requirements
Because it touches all areas of a company, all departments need to be on board with the design. Each department needs to understand what the benefits of a data warehouse are and what results they can expect from it.
What objectives we can focus on:
Determine the scope of the whole project
Find out what data is useful for analysis and where our data is currently siloed
Create a backup plan in case of failure
Security: monitoring, etc.
2 Choose a data warehouse platform
There are four types of data warehouse platforms:
Traditional database management systems: Row-based relational platforms, e.g., Microsoft SQL Server
Specialized Analytics DBMS: Columnar data stores designed specifically for managing and running analytics, e.g., Teradata
Out-of-box data warehouse appliances: a combination of software and hardware with a DBMS pre-installed, e.g., Oracle Exadata
Cloud-hosted data warehouse tools
We can choose the suitable one for the company according to budget, employees and infrastructure.
Should we choose cloud or on-premise?
Cloud solution pros:
Scalability: easy, cost-effective, simple and flexible to scale with cloud services
Low entry cost: no servers, hardware and operational cost
Connectivity: easy to connect to other cloud services
Security: cloud providers supply security patches and protocols to keep customers safe
Microsoft Azure SQL Data Warehouse
On-premise solution pros:
Reliability: With good staff and exceptional hardware, on-premise solutions can be highly available and reliable
Security: Organizations have full control of the security and access
Microsoft SQL Server
Which to choose between an on-premise and a cloud solution depends, in the big picture, on our budget and existing systems.
If we are looking for control, we can choose an on-premise solution. Conversely, if we are looking for scalability, we can choose a cloud service.
3 Set up the physical environments
There are three physical environments in Data Warehouse: development, testing and production.
We need to test changes before they move into the production environment
Running tests against data typically uses extreme data sets or random sets of data from the production environment.
Data integrity is much easier to track and issues are easier to contain if we have three environments running.
4 Data Modelling
It is the most complex phase of Data Warehouse design. It is the process of visualizing data distribution in the warehouse.
Visualize the relationships between data
Set standardized naming conventions
Create relationships between data sets
Establish compliance and security processes
There are plenty of data modelling techniques that businesses use for data warehouse design, such as star schema, snowflake schema and data vault modelling.
5 Choose the ETL solution
ETL, which stands for Extract, Transform and Load, is the process by which we pull data from the storage solutions into the warehouse.
We need to build an easy, replicable and consistent data pipeline because a poor ETL process can break the entire data warehouse.
This post explored the first steps of designing a Data Warehouse in a company. In the future, I will write about the next steps.
If you are interested in or have any problems with Data Warehouse, feel free to contact me.
And there are 4 sheets on a dashboard, i.e., Slow FCP Percentage, Fast FCP Percentage, Fast FID Percentage and Slow FID Percentage.
What do they actually mean?
Slow FCP Percentage (the percentage of users that experienced a first contentful paint time of 2.5 seconds or more)
Fast FCP Percentage (the percentage of users that experienced a first contentful paint time of 1 second or less)
Fast FID Percentage(the percentage of users that experienced a first input delay time of 50 ms or less)
Slow FID Percentage(the percentage of users that experienced a first input delay time of 250 ms or more)
From this graph, we can roughly see that flightcentre has a higher site speed than rentalcars in terms of user experience.
What can we do in the next step?
We can inform the devs and communicate the impact by showing exactly where the site is falling down. We can point to the fact that the data comes from real users and reflects how people actually experience the site.
The second part is the data strategy lifecycle in a company.
What is the data strategy lifecycle in a company?
Develop the strategy–>Create the roadmap–>Change management plan–>Analytics lifecycle–>Measurement plan
Scope and Purpose: What data will we manage? How much is our data worth? How do we measure success?
Data collection: Archiving, what data where and when, single source of truth(data lake), integrating data silos
Architecture: Real time vs Batch, data sharing, data management and security, data modelling, visualization
Insights and analysis: Data Exploration, self-service, collaboration, managing results
Data governance: Identify data owners, strategic leadership, data stewardship, data lineage, quality, and cost
Access and security: RBAC, encryption, PII, access processes, audits, regulatory
Retention and SLAs: Data tiers and retention, SLAs to the business
This post explored the CrUX dashboard BI team can generate and the data strategy in a company. In the future, I will write more.
If you are interested in or have any problems with CrUX or Business Intelligence, feel free to contact me.
Last blog, I gave some examples of how we can use the Chrome User Experience Report (CrUX) to gain some insights about site speed. In this blog, I will continue by showing you how to use BigQuery to compare your site with competitors.
Log into Google Cloud,
Create a project for the CrUX work
Navigate to the BigQuery console
Add the chrome-ux-report dataset and explore the way the tables are structured in ‘preview’
Step one: Figure out the origin of your site and the competitor’s site
The LIKE syntax is preferred (take care of the syntax differences between Standard SQL and T-SQL)
-- created by: Jacqui Wu
-- data source: Chrome-ux-report(202003)
-- last update: 12/05/2020
SELECT DISTINCT origin
FROM `chrome-ux-report.all.202003`
WHERE origin LIKE '%yoursite'
Step two: Figure out what should be queried in the SELECT clause
What can we query from CrUX?
The specific elements that Google is sharing are:
“Origin”, which consists of the protocol and hostname, as we used in step one, and which identifies the site
Effective Connection Type (4G, 3G, etc), which can be queried as the network
Form Factor (desktop, mobile, tablet), which can be queried as the device
Percentile Histogram data for First Paint, First Contentful Paint, DOM Content Loaded and onLoad (these are all nested, so if we want to query them, we need to unnest them)
Here I create a SQL query of the FCP percentage across different sites. FCP measures the time from navigation to when the browser renders the first bit of content from the DOM.
This is an important milestone for users because it provides feedback that the page is actually loading.
-- created by: Jacqui Wu
-- data source: Chrome-ux-report(202003) in different sites
-- last update: 12/05/2020
-- Comparing FCP metric in different sites
SELECT origin, form_factor.name AS device, effective_connection_type.name AS conn,
  "first contentful paint" AS metric, bin.start/1000 AS bin, SUM(bin.density) AS volume
FROM (
  SELECT origin, form_factor, effective_connection_type,
    first_contentful_paint.histogram.bin AS bins
  FROM `chrome-ux-report.all.202003`
  WHERE origin IN ("your site URL link", "competitor A site URL link", "competitor B site URL link"))
CROSS JOIN UNNEST(bins) AS bin
GROUP BY origin, device, conn, bin
Step 3: Export the results to Data Studio (Google’s visualization tool)
Here are some tips that may be useful:
A line chart is preferred for comparing different sites in Visual Selection
Set the x-axis to bin (which we have already converted to seconds) and the y-axis to the FCP percentage
Set filters (origin, device, conn) in the Filtering section
This post explored the data pipeline we can build with the CrUX report to analyze site performance. In the future, I will write more about CrUX.
If you are interested in or have any problems with CrUX or Business Intelligence, feel free to contact me.
CrUX stands for the Chrome User Experience Report. It provides real-world, real-user metrics gathered from the millions of Google Chrome users who load millions of websites (including yours) each month. Of course, they have all opted in to syncing their browsing history and have usage statistic reporting enabled.
According to Google, its goal is to ‘capture the full range of external factors that shape and contribute to the final user experience’.
In this post, I will walk you through how to use it to get insights into your site’s performance.
Why do we need CrUX?
We all know that a faster site results in a better user experience and better customer loyalty compared to competitors’ sites, which in turn increases revenue. Google has confirmed some details about how it understands speed, and they are available in CrUX.
What are CrUX metrics?
FP (First Paint): when the browser first renders anything on the page
FCP (First Contentful Paint): when some text or an image is loaded
DCL (DOM Content Loaded): when the DOM is loaded
ONLOAD: when any additional scripts have loaded
FID (First Input Delay): the time between when a user interacts with your site and when the browser can actually respond to that interaction
How to generate the CrUX report on PageSpeed Insights?
PageSpeed Insights is a tool for people to understand what a page’s performance is and how to improve it.
It uses Lighthouse to audit the given page and identify opportunities to improve performance. It also integrates with CrUX to show how real users experience performance on the page.
Take Yahoo as an example: after a few seconds, the Lighthouse audits will be performed and we will see sections for field and lab data.
In the field data section, we can see FCP and FID (please see the table below for the FCP and FID values).
We can see that the Yahoo site is ‘average’ according to the table. To achieve ‘fast’, both FCP and FID must be categorized as fast.
Also, a percentile is shown for each metric. For FCP, the 75th percentile is used, and for FID, the 95th. For example, 75% of FCP experiences on the page are 1.5s or less.
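The way such a percentile is read off histogram data can be sketched in a few lines of Python: walk the bins in order, accumulating density until the target mass is reached. The bins and densities below are made-up values, not real CrUX data:

```python
# CrUX-style bins: (start_ms, end_ms, density), with densities summing to 1.0.
bins = [(0, 500, 0.30), (500, 1000, 0.25), (1000, 1500, 0.25), (1500, 2500, 0.20)]

def percentile(bins, p):
    """Return the upper edge (ms) of the bin where cumulative density reaches p."""
    cumulative = 0.0
    for start, end, density in bins:
        cumulative += density
        if cumulative >= p:
            return end
    return bins[-1][1]

print(percentile(bins, 0.75))  # → 1500
```

With these toy bins, 75% of experiences fall at or below 1500 ms, so the 75th-percentile FCP would be reported as 1.5s.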
How to use it in BigQuery?
In BigQuery, we can also extract insights about UX on our site.
SELECT origin, form_factor.name AS device, effective_connection_type.name AS conn,
  ROUND(SUM(onload.density), 4) AS density
FROM `chrome-ux-report.all.202003`,
  UNNEST(onload.histogram.bin) AS onload
WHERE origin IN ("https://www.yahoo.com")
GROUP BY origin, device, conn
Then we can see the result in BigQuery.
The raw data is organized like a histogram, with bins that have a start time, an end time and a density value. For example, we can query for the percentage of ‘fast’ FCP experiences, where ‘fast’ is defined as happening in under a second.
We can compare Yahoo with Bing. Here is how the query looks:
SELECT origin, SUM(fcp.density) AS fast_fcp
FROM `chrome-ux-report.all.202003`,
  UNNEST(first_contentful_paint.histogram.bin) AS fcp
WHERE fcp.start < 1000
  AND origin IN ('https://www.bing.com', 'https://www.yahoo.com')
GROUP BY origin
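The bin arithmetic behind this kind of query, summing density over the sub-second bins, can be checked locally. Here is a small Python sketch with invented bin values, not real CrUX data:

```python
# Made-up CrUX-style histogram bins for one origin; densities sum to 1.0.
bins = [
    {"start": 0,    "end": 500,  "density": 0.30},
    {"start": 500,  "end": 1000, "density": 0.25},
    {"start": 1000, "end": 1500, "density": 0.20},
    {"start": 1500, "end": 2500, "density": 0.15},
    {"start": 2500, "end": None, "density": 0.10},  # open-ended final bin
]

# 'Fast' FCP = total density of bins ending at or below 1000 ms.
fast_fcp = sum(b["density"] for b in bins
               if b["end"] is not None and b["end"] <= 1000)
print(round(fast_fcp, 2))  # → 0.55
```

So in this toy example, 55% of page loads would count as having a ‘fast’ FCP.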
This post explored some methods to get site insights with CrUX report. In the future, I will write more about CrUX.
If you are interested in or have any problems with CrUX, feel free to contact me.
Among Google Cloud Platform family products, there are Google App Engine, Google Compute Engine, Google Cloud Datastore, Google Cloud Storage, Google BigQuery (for analytics), and Google Cloud SQL.
The most important product for a BI Analyst is BigQuery, a fully managed OLAP data warehouse that supports data warehousing workloads and joins. It lets developers use SQL to query massive amounts of data in seconds.
The main advantage is that BigQuery can integrate with Google Analytics. This means we can easily synchronize session/event data to BigQuery for custom analytics beyond the built-in Google Analytics functions.
In other words, raw GA data can be dumped into BigQuery, so custom analytics that can’t be performed through the GA interface can now be generated with BigQuery.
Moreover, we can also bring in third-party data into it.
The difficulty for a BI Analyst is that we need to calculate every metric ourselves in queries.
Which SQL is preferred in BigQuery?
Standard SQL syntax is preferred in BigQuery nowadays.
How can we get the data from Google Analytics?
A daily dataset can be exported from GA to BigQuery. Within each dataset, a table is imported for each day of export. Its name follows the format ga_sessions_YYYYMMDD.
We can also set some steps to make sure the tables, dashboards and data transfers are always up-to-date.
How to give it a try?
Firstly, set up a Google Cloud Billing account. With a Google Cloud Billing account, we can use BigQuery web UI with Google Analytics 360.
The next step is to run a SQL query and visualize the output. The query editor is standard and follows the SQL syntax.
For example, here is a sample query that queries user-level data, total visits and page views.
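Since the sample query itself is not shown here, a toy version can sketch its shape. The flattened table below is a made-up, local stand-in for the nested ga_sessions export; in BigQuery the real query would read a `project.dataset.ga_sessions_YYYYMMDD` table and use the nested totals.visits / totals.pageviews fields:

```python
import sqlite3

# Toy, flattened stand-in for ga_sessions: one row per session.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ga_sessions (fullVisitorId TEXT, visits INTEGER, pageviews INTEGER)")
conn.executemany("INSERT INTO ga_sessions VALUES (?, ?, ?)", [
    ("A", 1, 5), ("A", 1, 3), ("B", 1, 2),
])

# User-level aggregation: total visits and page views per fullVisitorId.
rows = conn.execute("""
    SELECT fullVisitorId, SUM(visits) AS total_visits, SUM(pageviews) AS total_pageviews
    FROM ga_sessions
    GROUP BY fullVisitorId
    ORDER BY total_pageviews DESC
""").fetchall()
print(rows)  # → [('A', 2, 8), ('B', 1, 2)]
```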
In this step, if we want a good understanding of the ga_sessions table in BigQuery, we need to know which raw GA data fields are available in BigQuery.
Google Cloud is seen as a leader in areas including data analytics, machine learning and open source. Digital transformation through the cloud allows companies to deliver personalised, high-quality experiences.
During this lockdown time in New Zealand, working from home means spending less time in traffic and having more time to learn advanced techniques.
So stay positive and stay safe!
Thanks to GCP Fundamentals, it is a perfect opportunity for those who want to learn Google Cloud Platform.
If you are interested in or have any problems with Business Intelligence, feel free to contact me.