Natural Language Processing: A Brief Agenda to Get You Started

Recently I started tutoring a paper (COMP700 Text and Vision Intelligence) for AUT students on weekends.

This paper mainly involves Computer Vision and Text Mining (Natural Language Processing).

These two areas are common applications of Artificial Intelligence.

So, in this blog, I will provide a brief agenda on Natural Language Processing for people who want a kick-start in the Artificial Intelligence area.

General Intro to NLP:
~General overview
~Tasks to describe languages

Language model:
~Statistical model
~Neural network model

NLP with deep learning:
~Recurrent Neural Network
~Convolutional Neural Network

NLP at different levels:
~Phonetic/Phonological Analysis
~OCR
~Morphological analysis
~Syntactic analysis
~Semantic interpretation
~Discourse processing

NLP in industry:
~Search
~Automated/assisted translation
~Speech recognition
~Sentiment analysis
~Chatbot

What needs to be represented:
~Morphology
~Sentence structure
~Semantics
~Vectors

Word vector:
~Wordnet: a resource containing lists of synonym sets and hypernyms
~One-hot: as discrete symbols
~TF-IDF
~Word embedding: A word’s meaning is given by the words that frequently appear close-by
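The difference between one-hot vectors and distributional vectors can be sketched in plain Python (a toy example with a made-up two-sentence corpus; real embeddings come from trained models such as Word2vec):

```python
# Toy illustration: one-hot vectors vs. co-occurrence-based vectors.
# The corpus is made up; real embeddings come from trained models.

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

vocab = sorted({w for sent in corpus for w in sent})

def one_hot(word):
    # Discrete symbol: a single 1, all other positions 0.
    return [1 if w == word else 0 for w in vocab]

def cooccurrence(word, window=2):
    # Distributional vector: count the words appearing near `word`.
    vec = {w: 0 for w in vocab}
    for sent in corpus:
        for i, w in enumerate(sent):
            if w != word:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vec[sent[j]] += 1
    return [vec[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# One-hot: "cat" and "dog" are orthogonal (similarity 0).
# Co-occurrence: they share contexts ("the", "sat", "on"), so similarity > 0.
print(cosine(one_hot("cat"), one_hot("dog")))            # 0.0
print(cosine(cooccurrence("cat"), cooccurrence("dog")))  # > 0
```

This is exactly the point of the "word embedding" bullet above: with one-hot vectors, no two words are ever similar; with vectors built from nearby words, similar words end up with similar vectors.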

Knowledge graphs (the semantics-driven way):
~Ontologies & Description Logic
~OWL & RDF
~Semantic web
~Dgraph, Neo4j

Tasks for describing language:
~Segmentation (for Chinese)
~POS: Part Of Speech
~NER: Named Entity Recognition
~Algorithms: rule-based taggers; probabilistic taggers: HMM and Viterbi; Perceptron; conditional models: CRF
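As a sketch of the probabilistic approach, here is a minimal Viterbi decoder for a toy HMM tagger. The tag set and all probabilities below are made up purely for illustration:

```python
# Minimal Viterbi decoding for an HMM POS tagger.
# Tags, transition and emission probabilities are invented for this toy example.

def viterbi(words, tags, start_p, trans_p, emit_p):
    # V[t][tag] = (probability of the best path ending in `tag` at step t, backpointer)
    V = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 1e-6), None) for tag in tags}]
    for t in range(1, len(words)):
        V.append({})
        for tag in tags:
            best_prev, best_p = max(
                ((prev, V[t - 1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(words[t], 1e-6))
                 for prev in tags),
                key=lambda x: x[1],
            )
            V[t][tag] = (best_p, best_prev)
    # Trace the backpointers from the best final tag.
    last = max(tags, key=lambda tag: V[-1][tag][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        last = V[t][last][1]
        path.append(last)
    return list(reversed(path))

tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"dogs": 0.6, "cats": 0.4}, "V": {"chase": 0.9, "sleep": 0.1}}

print(viterbi("dogs chase cats".split(), tags, start_p, trans_p, emit_p))
# ['N', 'V', 'N']
```

A real tagger estimates these probabilities from an annotated corpus; the decoding step stays the same.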

Statistical Language Model:
~NNLM 2003
~RNNLM 2010
~CBoW: Continuous Bag-of-Words Model
~Skip-gram
~Word2vec

Neural Network:
~Neuron
~Activation function
~Back propagation: what the cost function is, and how the cost function is used to update parameters

Activation function:
~Sigmoid
~Tanh
~ReLU (rectified linear unit)
~Softmax
~…
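The first four of these can be written out with nothing but the standard library:

```python
import math

# The common activation functions listed above, written out directly.

def sigmoid(x):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Like sigmoid but centred at 0; output is in (-1, 1).
    return math.tanh(x)

def relu(x):
    # Rectified linear unit: 0 for negatives, identity for positives.
    return max(0.0, x)

def softmax(xs):
    # Turns a vector of scores into probabilities that sum to 1.
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0))                 # 0.5
print(relu(-3), relu(3))          # 0.0 3
print(softmax([1.0, 2.0, 3.0]))   # roughly [0.09, 0.24, 0.67]
```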

Deep Learning:
~Learn (multiple levels of) representation and an output from ‘raw’ inputs x
~Universal, learnable framework for representing world, visual and linguistic information
~Can learn unsupervised (from raw text) and supervised
~Why popular now: large datasets, faster machines and multicore CPUs/GPUs
~Why it works

Deep Learning Models:
~Feed-Forward Networks
~Recurrent Neural Networks: simple RNN, LSTM, GRU
~Generative Neural Networks

If you are interested in or have any problems with Natural Language Processing, feel free to contact me.

Or you can connect with me through my LinkedIn.

The Simplest Way to Understand SCD in Data Warehouse

What is SCD?

SCD stands for Slowly Changing Dimensions.

It is very important in Data Warehousing.

As we know, ETL (Extract, Transform, Load) sits between the data sources and the data warehouse.

When ETL runs, it picks up all the records and updates them in the dimension tables.

Why do we need SCD?

Because we run into problems when updating data in the Data Warehouse as the data in the sources changes.

In the dimension tables, if we want to keep some old records, how can we do this?

By using SCD.

Types of SCD:

Note: The types of SCD are defined at the column level, not at the table level.

There are two popular types of SCD.

Type 1: Overwrite

The old record is updated with the new record's values, overwriting the old data.

Type 2: Store history by creating another row

Type 2 adds new records rather than overwriting the old ones.

As long as a table has a Type 2 column, we need two extra columns: ‘StartDate’ and ‘EndDate’. The ‘EndDate’ is set when a change happens, marking the end date of the historical row.

We can add a third column, ‘IsCurrent’, to identify or mark the current record.

We don’t overwrite the records; we just close the old row by updating its ‘EndDate’ (and ‘IsCurrent’) and insert a new row with the new values.
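A Type 2 change can be sketched with SQLite (the table and column names, such as DimCustomer, are made up for illustration):

```python
import sqlite3

# Sketch of an SCD Type 2 change using SQLite.
# Table and column names (DimCustomer, City, ...) are invented for this example.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE DimCustomer (
        CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        CustomerID  INTEGER,                            -- business key
        City        TEXT,
        StartDate   TEXT,
        EndDate     TEXT,
        IsCurrent   INTEGER
    )
""")
cur.execute("INSERT INTO DimCustomer (CustomerID, City, StartDate, EndDate, IsCurrent) "
            "VALUES (1, 'Auckland', '2020-01-01', '9999-12-31', 1)")

# The customer moves city on 2021-06-01: close the old row...
cur.execute("UPDATE DimCustomer SET EndDate = '2021-06-01', IsCurrent = 0 "
            "WHERE CustomerID = 1 AND IsCurrent = 1")
# ...and insert a new row carrying the new value.
cur.execute("INSERT INTO DimCustomer (CustomerID, City, StartDate, EndDate, IsCurrent) "
            "VALUES (1, 'Wellington', '2021-06-01', '9999-12-31', 1)")

rows = cur.execute("SELECT City, StartDate, EndDate, IsCurrent "
                   "FROM DimCustomer ORDER BY CustomerKey").fetchall()
for row in rows:
    print(row)
# ('Auckland', '2020-01-01', '2021-06-01', 0)
# ('Wellington', '2021-06-01', '9999-12-31', 1)
```

The history row keeps the old value with a closed date range, and the new row becomes the current record.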

If you are interested in or have any problems with SCD, feel free to contact me.

Rules of Creating Dimension Tables and Fact Tables

What is a dimension table? What is a fact table?

Why do we need both of them in a Data Warehouse?

Because a Data Warehouse is used to make reports for business decisions.

Every report is made of two parts: Fact and Dimension.

Here is a picture of fact tables and dimension tables in a star schema in a data warehouse.

So, in this blog we will talk about the rules for creating dimension tables and fact tables.

First of all, we need to clarify a definition.

What is a surrogate key?

It is the primary key in dimension tables.

Rules of creating dimension tables:

  1. Primary key (surrogate key, auto-incrementing number, the only unique number in the data warehouse)
  2. Business key (the key can be linked back to data source, with business meaning)
  3. Attributes (descriptive information from data source)

There are two kinds of data: Master data and transactional data.

Master data refers to the entity (e.g. employee) whereas transactional data refers to all the transactions that are carried out using that entity.

Master data is limited, whereas transactional data can run into the billions of rows.

In dimension tables, most data is master data.

Rules of creating fact tables:

  1. Primary key (surrogate key/alternate key, auto-incrementing number)
  2. Foreign key (primary key/surrogate key/alternate key from dimension tables)
  3. Measure (additive/semi-additive numbers)

Tip: no descriptive data in fact tables.
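The rules above can be sketched as SQLite DDL (all table and column names are made up for illustration):

```python
import sqlite3

# Sketch of the dimension/fact rules in SQLite.
# DimProduct and FactSales, and their columns, are invented for this example.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension: surrogate key + business key + descriptive attributes.
cur.execute("""
    CREATE TABLE DimProduct (
        ProductKey  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        ProductID   INTEGER,                            -- business key
        ProductName TEXT,                               -- attribute
        Category    TEXT                                -- attribute
    )
""")

# Fact: surrogate key + foreign keys to dimensions + additive measures.
# Note: no descriptive columns here.
cur.execute("""
    CREATE TABLE FactSales (
        SalesKey    INTEGER PRIMARY KEY AUTOINCREMENT,
        ProductKey  INTEGER REFERENCES DimProduct(ProductKey),
        Quantity    INTEGER,   -- additive measure
        Amount      REAL       -- additive measure
    )
""")

cur.execute("INSERT INTO DimProduct (ProductID, ProductName, Category) "
            "VALUES (101, 'Helmet', 'Accessories')")
# The first dimension row gets surrogate key 1, which the fact row references.
cur.execute("INSERT INTO FactSales (ProductKey, Quantity, Amount) VALUES (1, 2, 59.98)")

total = cur.execute("""
    SELECT d.Category, SUM(f.Quantity), SUM(f.Amount)
    FROM FactSales f JOIN DimProduct d ON f.ProductKey = d.ProductKey
    GROUP BY d.Category
""").fetchone()
print(total)
```

Reports join the additive measures in the fact table to the descriptive attributes in the dimensions, exactly as in the aggregate query above.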

If you are interested in or have any problems with fact tables and dimension tables, feel free to contact me.

Or you can connect with me through my LinkedIn.

4 Reasons Why We Need Data Warehouse

Here is a basic process in Business Intelligence.

Maybe some people will be confused: why do we need a data warehouse?

Without a data warehouse, we can still analyze the data.

We can get the data and create reports directly.

So, what are the benefits of a data warehouse in an organization?

Here we list 4 reasons why we need a data warehouse.

1. Integrate data from various data sources and centralize the data in one place.

2. Load data into the data warehouse so that reporting won’t impact the live system or database.

That is why we have a separate data warehouse where the data is stored.

We can schedule a job to run at night to centralize the data from the operational databases into the data warehouse.

3. Easy access (one place for the data and a single source of truth).

It is easy for people to go to the data warehouse to get the data, and they don’t need to worry about other problems, e.g., with so many data sources, where can I get the data?

We can trust the data warehouse as the place to get the data.

4. Build a model: choose the best design model to get the best flexibility and performance, especially for large datasets.

We usually use the Kimball methodology: star schema/snowflake schema (de-normalization).

For example, we use the star schema to improve query performance.

It is a methodology developed by Ralph Kimball that includes a set of methods, techniques and concepts for use in data warehouse design.

There are also some other methodologies we can use in data warehousing, e.g., the Inmon methodology and the Data Vault methodology.

If you are interested in or have any problems with data warehouse, feel free to contact me.

Or you can connect with me through my LinkedIn.

A mind map for SQL

In this article, I have made a basic SQL mind map for people who want to kick-start their career in the business intelligence and data analysis industry.

Hope it gives you a basic understanding of SQL.

If you are interested in or have any problems with SQL, feel free to contact me.

Or you can connect with me through my LinkedIn.

Troubleshooting SSMS Error: 15517

If you meet the same error as me in SSMS 2016, here is how to fix it.

Error:

“Cannot execute as the database principal because the principal “dbo” does not exist, this type of principal cannot be impersonated, or you do not have permission (Microsoft SQL Server, Error: 15517)”

Solution:

use [databasename]
GO
EXEC sp_changedbowner 'sa'
GO

Hope this solution can help you. Note that sp_changedbowner is deprecated in recent SQL Server versions; the modern equivalent is ALTER AUTHORIZATION ON DATABASE::[databasename] TO [sa].

If you are interested in or have any problems with SQL and SSMS, feel free to contact me.

Or you can connect with me through my LinkedIn.

Why Machine Learning is So Popular and What is Learning

The oldest and strongest emotion of mankind is fear, and the oldest and strongest kind of fear is the fear of the unknown. (H. P. Lovecraft)

In order to overcome this fear, humans can only explore the relationship between input and output from limited examples.

Then they can predict situations that have never been seen before.

This is what learning is.

Learning: Exploring the pattern between input and output from limited examples.

So the dataset is very important: not only does it need to be large, but it also needs to be precise.

This is why children are always curious: they want to gather all the examples they can to explore the world.

But up until high school, the learning method we are exposed to is actually Transfer Learning.

It is the procedure of transferring the knowledge of other people (e.g., teachers, parents) into your own brain.

But in this process, we have made too many mistakes.

Because this kind of Transfer Learning kills the natural human desire to learn.

At the same time, many people think this is what learning looks like.

They think Machine Learning means putting patterns into the machine by programming.

In fact, machine learning is about letting the machine learn to discover patterns.

And the patterns found are called knowledge.

This is actually why Deep Learning is so important, both in Machine Learning and in human learning.

In the next blog we will explore what deep learning is and why human learning should act like deep learning.

If you are interested in or have any problems with Machine Learning, feel free to contact me.

Or you can connect with me through my LinkedIn.

Stored Procedure in SQL

In this article, I will give you some basic syntax for stored procedures in SQL.

Stored Procedure:

A stored procedure is prepared SQL code that we can save, so the code can be reused over and over again.

It is suitable for some SQL queries we need to use frequently.

There are three kinds of stored procedure:

  1. No parameter
  2. One parameter
  3. Multiple parameters

Stored Procedure (No parameter) Syntax

CREATE PROCEDURE procedure_name
AS
sql_statement
GO
EXEC procedure_name

Stored Procedure (One parameter) Syntax

CREATE PROCEDURE [dbo].[oneparameter]
@ProductCategoryID int
AS
SELECT * FROM Production.ProductCategory pc
WHERE pc.ProductCategoryID = @ProductCategoryID
GO
EXEC oneparameter @ProductCategoryID = 4

Stored Procedure (Multiple parameters) Syntax

CREATE PROCEDURE [dbo].[multiparameter]
@ProductCategoryID int, @Name varchar(50)
AS
SELECT * FROM Production.ProductCategory pc
WHERE pc.ProductCategoryID = @ProductCategoryID AND pc.Name = @Name
GO
EXEC multiparameter @ProductCategoryID = 4, @Name = 'Accessories'

I will record all the knowledge I pick up along my Business Intelligence journey.

The next blog will be about Data Warehousing.

If you are interested in or have any problems with Business Intelligence, feel free to contact me.

Or you can connect with me through my LinkedIn.

Business Intelligence Tutorial

This blog is a Business Intelligence tutorial, which contains lots of definitions and terms.

There are three steps in Business Intelligence: Data collecting, Data Warehousing and Reporting.

Data Collecting:

  1. Structured: standardized and easy for computers to read and query.
  2. Semi-structured
  3. Unstructured: not stored in rows and columns, so it can’t easily be read by computers.

As we discussed in a previous blog, company data can be found in several locations, such as CRM programs, which is also shown in the picture below.

Data Warehouse:

A data warehouse uses a process (ETL, i.e., extract, transform and load) to standardize data, which allows it to be queried.

How does information get to a central location?

ETL———————>Data Warehouse

  1. Extract: unstructured data is tagged with metadata to make it easier to find
  2. Transform: normalize the data
  3. Load: transfer the data to the central warehouse or a data mart
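The three steps can be sketched as a toy ETL run in Python (all names and the sample data are made up for illustration):

```python
import sqlite3

# Toy ETL run: extract rows from a source, transform (normalize) them,
# load them into a warehouse table. Everything here is invented for illustration.

source_rows = [
    {"name": " Alice ", "country": "nz"},
    {"name": "Bob",     "country": "NZ"},
]

def extract():
    # In practice this would read from a CRM, files, or an operational database.
    return source_rows

def transform(rows):
    # Normalize: trim whitespace, standardize country codes.
    return [(r["name"].strip(), r["country"].upper()) for r in rows]

def load(rows):
    # Load into a central warehouse table (in-memory SQLite here).
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE DimCustomer (Name TEXT, Country TEXT)")
    cur.executemany("INSERT INTO DimCustomer VALUES (?, ?)", rows)
    return cur.execute("SELECT Name, Country FROM DimCustomer ORDER BY rowid").fetchall()

loaded = load(transform(extract()))
print(loaded)  # [('Alice', 'NZ'), ('Bob', 'NZ')]
```

Real ETL tools add scheduling, metadata tagging and error handling around this same extract → transform → load skeleton.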

Turning Data into PowerPoints (Business Intelligence Reporting)

  1. Data visualization: Graphic display of results
  2. Dashboard: Interfaces that represent specific analyses

If you are interested in or have any problems with Business Intelligence, feel free to contact me.

Or you can connect with me through my LinkedIn.

Why Is Communication So Important in the IT Industry?

Many people have an impression of IT developers: coding all the time.

Actually, that is a stereotype. In the IT industry, developers spend more time on communication than on computers.

So, if you want to become an IT professional, you must be good at communication, even if you enjoy coding more.

Think about how much time you spend interacting with others:

  1. Check the email box…
  2. Agile Scrum Meeting…
  3. Communicate with colleagues about the needs…

How can we improve our communication?

1 Don’t judge or criticize

People are good at putting themselves at the centre. Everyone thinks he or she is the most important.

That is wrong, and we shouldn’t belittle others.

We should respect them first, and then they will accept our opinions.

2 Think what others need

The key to communication is not to think of our own interests, like what I need and what I want.

Instead, we should consider what others need and what they want.

Why do they want this function?

Why does this part make them feel bored?

3 Avoid fighting

As IT developers, we sometimes assume that others will accept logical arguments.

Actually, most people are sensitive and follow their feelings.

As Dale Carnegie writes in How to Win Friends and Influence People, the only way to get the best of an argument is to avoid it.

If you never think about how to communicate with others, this is the right time!

Improve your soft skills, and you will find them valuable for your whole life.

People will like you and value you in return.

From my BI journey, I have learned more about soft skills, English at work and tech chit-chat.

If you are interested in or have any problems with BI, feel free to contact me.

Or you can connect with me through my LinkedIn.