As we know, the first 70% of SQL is pretty straightforward but the remaining 30% can be pretty tricky.
So, in this blog, some popular hard SQL interview questions will be covered for people to sharpen their skills.
Self-Join Practice Problems
Part 1: How much a key metric, e.g., monthly active users, changes between months, e.g., a table named ‘logins’ is shown as below.
userz-id
date
1
2020-07-01
234
2020-07-02
3
2020-07-02
1
2020-07-02
…
…
234
2020-10-04
table:logins
Q: find the month-over-month percentage change for monthly active users
Solutions:
WITH mau AS
(
SELECT
DATE_TRUNC('month', date) AS month_timestamp,
COUNT(DISTINCT user_id) AS mau
FROM logins
GROUP BY
DATE_TRUNC('month', date)
)
SELECT
a.month_timestamp AS previous_month,
a.mau AS previous_mau,
b.month_timestamp AS current_month,
b.mau AS current_mau,
ROUND(100.0*(b.mau - a.mau)/a.mau,2) AS percent_change
FROM mau a
JOIN mau b
ON a.month_timestamp = b.month_timestamp - interval '1 month'
Next blog we will continue our journey with SQL medium-hard questions.
If you are interested in or have any problems with SQL, feel free to contact me .
The first question is why we need data integration?
Let me give you an example here to answer this question.
Every company has many departments, and different departments use different tools to store their data. For example, marketing team may use hubspot tool.
Now, we have different departments which store different types of data in a company.
However, insightful information is needed to make business decisions through those large amount of data.
What can we do?
Maybe we can connect all the databases everytime to generate reports. However, it will cost us large amount of time, then the term of data integration is raised.
What is data integration?
Data integration is a process in which heterogeneous data is retrieved and combined as an incorporated form and structure.
There are some advantages of data integration:
Reduce data complexity
Data integrity
Easy data collaboration
Smarter business decisions
There are also some well-known tools can do data integration, e.g., Microsoft SQL Server Integration Services (SSIS), Informatica, Oracle Data Integrator, AWS Glue etc.
This blog we will talk about SSIS, which is one of the most popular ETL tool in New Zealand.
Why SSIS?
Data can be loaded in parallel to many varied destinations
SSIS removes the need of hard core programmers
Tight integration with other products of Microsoft
SSIS is cheaper than most other ETL tools
SSIS stands for SQL Server Integration Services, which is a component of the Microsoft SQL Server database software that can be used to perform a broad range of data integration and data transformation tasks.
What can SSIS do?
It combines the data residing in different sources and provides users with an unified view of these data.
Also, it can also be used to automate maintenance of SQL Server databases and update to multidimensional analytical data.
How is works?
These tasks of data transformation and workflow creation is carried out using SSIS package:
Operational Data–>ETL–>Data Warehouse
In the first place, operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. This is the place where most of the data used in current operation is housed before it’s tranferred to the data warehouse for longer tem storage or achiiving.
Next step is ETL(Extract, Transform and Load), the process of extracting the data from various sources, tranforming this data to meet your requirement and then loading into a target data warehouse.
The third step is data warehousing, which is a large set of data accumulated which is used for assembling and managing data from various sources of answering business questions. Hence, helps in making decisions.
What is SSIS package?
It is an object that implements integration services functionality to extract, transform and load data. A package is composed of:
connection
control flow elements(handle workflow)
data flow elements(handle data transform)
If you want to investigate more about SSIS, check it out on Microsoft official documents.
If you are interested in or have any problems with SSIS, feel free to contact me.
Before, i wrote a blog about a basic SQL mind map for people who want to kick-start their career into business intelligence and data analysis industry.
As we know, SQL is essential to become a Business Intelligence Developer/Data Analyst.
So, in this blog, i drew an advanced SQL mind map for people who want to dive their SQL journey.
If you are interested in or have any problems with SQL, feel free to contact me.
It is a centralized relational database that pulls together data from different sources (CRM, marketing stack, etc.) for better business insights.
It stores current and historical data are used for reporting and analysis.
However, here is the problem:
How we can design a Data Warehouse?
1 Define Business Requirements
Because it touches all areas of a company, all departments need to be onboard with the design. Each department needs to understand what the benefits of data warehouse and what results they can expect from it.
What objectives we can focus on:
Determine the scope of the whole project
Find out what data is useful for analysis and where our dat is current siloed
Create a backup plan in case of failure
Security: monitoring, etc.
2 Choose a data warehouse platform
There are four types of data warehouse platforms:
Traditional database management systems: Row-based relational platforms, e.g., Microsoft SQL Server
Specialized Analytics DBMS: Columnar data stores designed specifically for managing and running analytics, e.g., Teradata
Out-of-box data warehouse appliances: the combination with a software and hardware with a DBMS pre-installed, e.g., Oracle Exadata
Cloud-hosted data warehouse tools
We can choose the suitable one for the company according to budget, employees and infrastructure.
We can choose between cloud or on-premise?
Cloud solution pros:
Scalability: easy, cost-effective, simple and flexible to scale with cloud services
Low entry cost: no servers, hardware and operational cost
Connectivity: easy to connect to other cloud services
Security: cloud providers supply security patches and protocols to keep customers safe
Choices:
Amazon Redshift
Microsoft Azure SQL Data Warehouse
Google Bigquery
Snowflake Computing
On-premise solution pros:
Reliability: With good staff and exceptional hardware, on-premise solutions can be highly available and reliable
Security: Organizations have full control of the security and access
Choices:
Oracle Database
Microsoft SQL Server
MySQL
IBM DB2
PostgreSQL
What we can choose between on-premise and cloud solution, in the big picture, it depends on our budget and existing system.
If we look for control, then we can choose on-premise solution. Conversely, if we look for scalability, we can choose a cloud service.
3 Set up the physical environments
There are three physical environments in Data Warehouse: development, testing and production.
We need to test changes before they move into the production environment
Running tests against data typically uses extreme data sets or random sets of data from the production environment.
Data integrity is much easier to track and issues are easier to contain if we have three environments running.
4 Data Modelling
It is the most complex phase of Data Warehouse design. It is the process of visualizing data distribution in the warehouse.
Visualize the relationships between data
Set standardized naming conventions
Create relationships between data sets
Establish compliance and security processes
There are bunches of data modeling techniques that businesses use for data warehouse design. Here are top 3 popular ones:
Snowflake Schema
Star Schema
Galaxy Schema
4 Choosing the ETL solution
ETL, stands for Extract, Transform and Load is the process we pull data from the storage solutions to warehouse.
We need to build an easy, replicable and consistent data pipeline because a poor ETL process can break the entire data warehouse.
Wrapping up
This post explored the first 4 steps about designing a Data Warehouse in the company. In the future, I will write the next steps.
If you are interested in or have any problems with Data Warehouse, feel free to contact me.
Last Blog I demonstrated the data pipeline we can use CrUX to analyze the site performance. This is from a BI developer perspective.
However, for a company, especially the leadership team, what they want is the final dashboard that generated from Data department, so management plan can be gained.
So this blog I will show you what kind of dashboard we can generate after the steps of data collection from Google public dataset and ETL.
What data visualization tool we need to use?
There are bunches of data visualization tools we can use, e.g., Data Studio, Power BI etc. This time I take Tableau for an example.
I took www.flightcentre.co.nz(blue line) and www.rentalcars.com(red line) as the origin for comparison, set customer’s device is ‘desktop’ (we also can put a filter on it too).
And there are 4 sheets on a dashboard, i.e., Slow FCP Percentage, Fast FCP Percentage, Fast FID Percentage and Slow FID Percentage.
What they actually mean?
Slow FCP Percentage(the percentage of users that experienced a first contentful paint time of 1 second or less)
Fast FCP Percentage(the percentage of users that experienced a first contentful paint time of 2.5 seconds or more)
Fast FID Percentage(the percentage of users that experienced a first input delay time of 50 ms or less)
Slow FID Percentage(the percentage of users that experienced a first input delay time of 250 ms or more)
After this graph, we can roughly see that flightcentre has a higher site speed than rentalcars in user experience.
What we can do in the next step?
After that, we can inform devs and communicate impact according to show exactly the area that the site is falling down. We can point to the fact that it’s from real users and how people actually experiencing the site.
The second part is the data strategy lifecycle in a company.
What is the data strategy lifecycle in a company?
Develop the strategy–>Create the roadmap–>Change management plan–>Analytics lifestyle–>Measurement plan
Perspectives:
Scope and Purpose: What data will we manage? How much does our data worth? How do we measure success?
Data collection: Archiving, what data where and when, single source of truth(data lake), integrating data silos
Architecture: Real time vs Batch, data sharing, data management and security, data modelling, visualization
Insights and analysis: Data Exploration, self-service, collaboration, managing results
Data governance: Identify data owners, strategic leadership. data stewardship. data lineage, quality, and cost
Access and security: RBAC, encryption, PII, access processes, audits, regulatory
Retention and SLAs: Data tiers and retention, SLA’s to the business
Wrapping up
This post explored the CrUX dashboard BI team can generate and the data strategy in a company. In the future, I will write more.
If you are interested in or have any problems with CrUX or Business Intelligence, feel free to contact me.
It is free and easy for SQL beginners to do SQL exercise step by step.
A correlated subquery works like a nested loop: the subquery only has access to rows related to a single record at a time in the outer query.
The technique relies on table aliases to identify two different usages of the same table, which means one usage is in the outer query and another one is in the subquery.
Here is a table called world, which is an example on the sqlzoo website:
name
continent
area
population
gdp
Afghanistan
Asia
652230
25500100
20343000000
Albania
Europe
28748
2831741
12960000000
Algeria
Africa
2381741
37100000
188681000000
Andorra
Europe
468
78115
3712000000
Angola
Africa
1246700
20609294
100990000000
…
…
…
…
…
Question: Find the largest country (by area) in each continent, show the continent, the name and the area.
SQL answer using subquery:
SELECT continent, name, area
FROM world x
WHERE area >= ALL
(SELECT area FROM world y
WHERE y.continent=x.continent AND area>0)
One way to interpret the SQL line in the WHERE clause that references the two table is “… where the correlated values are the same”.
In this example, we can tell “select the country details from the world table where the area is larger than or equal to the area of all countries where the continent is the same”.
If you are interested in or have any problems with SQL, feel free to contact me.
If you meet with the same error as me in SSMS 2016.
Error:
“Cannot execute as the database principal because the principal “dbo” does not exist, this type of principal cannot be impersonated, or you do not have permission (Microsoft SQL Server, Error: 15517)”
Solution:
use [databasename]
GO
EXEC sp_changedbowner 'sa'
GO
Hope this solution can help you.
If you are interested in or have any problems with SQL and SSMS, feel free to contact me .