Last blog I wrote why we need a Data Warehouse.
First, what is the data warehouse?
It is a centralized relational database that pulls together data from different sources (CRM, marketing stack, etc.) for better business insights.
It stores current and historical data are used for reporting and analysis.
However, here is the problem:
How we can design a Data Warehouse?
1 Define Business Requirements
Because it touches all areas of a company, all departments need to be onboard with the design. Each department needs to understand what the benefits of data warehouse and what results they can expect from it.
What objectives we can focus on:
- Determine the scope of the whole project
- Find out what data is useful for analysis and where our dat is current siloed
- Create a backup plan in case of failure
- Security: monitoring, etc.
2 Choose a data warehouse platform
There are four types of data warehouse platforms:
- Traditional database management systems: Row-based relational platforms, e.g., Microsoft SQL Server
- Specialized Analytics DBMS: Columnar data stores designed specifically for managing and running analytics, e.g., Teradata
- Out-of-box data warehouse appliances: the combination with a software and hardware with a DBMS pre-installed, e.g., Oracle Exadata
- Cloud-hosted data warehouse tools
We can choose the suitable one for the company according to budget, employees and infrastructure.
We can choose between cloud or on-premise?
Cloud solution pros:
- Scalability: easy, cost-effective, simple and flexible to scale with cloud services
- Low entry cost: no servers, hardware and operational cost
- Connectivity: easy to connect to other cloud services
- Security: cloud providers supply security patches and protocols to keep customers safe
Choices:
- Amazon Redshift
- Microsoft Azure SQL Data Warehouse
- Google Bigquery
- Snowflake Computing
On-premise solution pros:
- Reliability: With good staff and exceptional hardware, on-premise solutions can be highly available and reliable
- Security: Organizations have full control of the security and access
Choices:
- Oracle Database
- Microsoft SQL Server
- MySQL
- IBM DB2
- PostgreSQL
What we can choose between on-premise and cloud solution, in the big picture, it depends on our budget and existing system.
If we look for control, then we can choose on-premise solution. Conversely, if we look for scalability, we can choose a cloud service.
3 Set up the physical environments
There are three physical environments in Data Warehouse: development, testing and production.
- We need to test changes before they move into the production environment
- Running tests against data typically uses extreme data sets or random sets of data from the production environment.
- Data integrity is much easier to track and issues are easier to contain if we have three environments running.
4 Data Modelling
It is the most complex phase of Data Warehouse design. It is the process of visualizing data distribution in the warehouse.
- Visualize the relationships between data
- Set standardized naming conventions
- Create relationships between data sets
- Establish compliance and security processes
There are bunches of data modeling techniques that businesses use for data warehouse design. Here are top 3 popular ones:
- Snowflake Schema
- Star Schema
- Galaxy Schema
4 Choosing the ETL solution
ETL, stands for Extract, Transform and Load is the process we pull data from the storage solutions to warehouse.
We need to build an easy, replicable and consistent data pipeline because a poor ETL process can break the entire data warehouse.
Wrapping up
This post explored the first 4 steps about designing a Data Warehouse in the company. In the future, I will write the next steps.
If you are interested in or have any problems with Data Warehouse, feel free to contact me.
Or you can connect with me through my LinkedIn.