Cluster Analysis and Decision Tree Induction

Solve a mystery in history,...

Machine Learning to Solve Mystery

WHEN Mar 2019

Role Data Scientist

Tools Used R, ML - Decision Tree & Cluster Analysis, Classification


This project uses cluster analysis and a decision tree algorithm to solve a mystery in history: who wrote the disputed essays of the Federalist Papers, Hamilton or Madison?

The Euclidean distance measures how far apart two vectors are; here we use it to measure the similarity between the files, where a smaller distance means more similar files. In the plot below, the files intersecting at a blue point are very similar and the ones intersecting at a red point are not.
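For two vectors x and y, the Euclidean distance is d(x, y) = sqrt(sum_i (x_i - y_i)^2). The sketch below shows one way this might look in R. The essay names and word-frequency values are hypothetical and are not the project's data; `euclid` is a helper written for illustration, while `dist` is the base R function that computes all pairwise distances.

```r
# Hypothetical feature vectors: relative frequencies of three function words.
# These values are made up for illustration only.
essays <- rbind(
  essay_a = c(0.30, 0.10, 0.05),
  essay_b = c(0.28, 0.12, 0.05),
  essay_c = c(0.02, 0.40, 0.30)
)

# Illustrative helper: Euclidean distance between two numeric vectors.
euclid <- function(x, y) sqrt(sum((x - y)^2))

d_ab <- euclid(essays["essay_a", ], essays["essay_b", ])  # near-identical profiles
d_ac <- euclid(essays["essay_a", ], essays["essay_c", ])  # very different profiles

# Base R computes every pairwise distance between rows in one call.
d <- dist(essays, method = "euclidean")
print(d)
```

In this toy data, essay_a and essay_b end up much closer to each other than either is to essay_c, which is the kind of pattern the plot's colors are meant to show.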

Clustering is an unsupervised learning technique. It groups a set of objects so that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is a measure of the strength of the relationship between two data objects. Clustering is mainly used for exploratory data mining, and it appears in many fields such as machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into clusters. The endpoint is a set of clusters, where each cluster is distinct from the others and the objects within each cluster are broadly similar to each other.
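A minimal sketch of hierarchical clustering in base R follows. The six essays and their word frequencies are hypothetical, and the choice of average linkage and of cutting the tree into two clusters are assumptions made for the example, not settings confirmed by the project.

```r
# Hypothetical word frequencies (per 1,000 words) for six essays.
# Row names h1..h3 and m1..m3 are illustrative labels only.
essays <- rbind(
  h1 = c(3.1, 0.0), h2 = c(2.9, 0.1), h3 = c(3.3, 0.0),
  m1 = c(0.2, 0.5), m2 = c(0.1, 0.4), m3 = c(0.3, 0.6)
)
colnames(essays) <- c("upon", "whilst")

# Pairwise Euclidean distances between essays.
d <- dist(essays, method = "euclidean")

# Agglomerative clustering; average linkage is an assumed choice.
hc <- hclust(d, method = "average")

# Cut the tree into two clusters and inspect the assignment.
groups <- cutree(hc, k = 2)
print(groups)
```

Calling `plot(hc)` draws the dendrogram, which shows the order in which essays were merged and at what distance.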

The decision tree algorithm belongs to the family of supervised learning algorithms. It can be used to solve both classification and regression problems.

The general goal of using a decision tree is to build a model that can predict the class or value of a target variable by learning decision rules inferred from prior data (the training data).
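As an illustration, a classification tree could be fitted with the `rpart` package, as sketched below. The training values, the two word features, and the control settings are all hypothetical and chosen only so the small example runs; they are not the project's data or configuration.

```r
library(rpart)

# Hypothetical training data: word frequencies per 1,000 words.
# Values are invented for illustration, not taken from the essays.
train <- data.frame(
  upon   = c(3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 0.2, 0.1, 0.4, 0.3, 0.0, 0.2),
  whilst = c(0.0, 0.1, 0.0, 0.0, 0.1, 0.0, 0.5, 0.4, 0.6, 0.3, 0.5, 0.4),
  author = factor(rep(c("Hamilton", "Madison"), each = 6))
)

# Fit a classification tree; the small minsplit/minbucket values are
# assumptions needed because this toy data set has only 12 rows.
fit <- rpart(author ~ upon + whilst, data = train, method = "class",
             control = rpart.control(minsplit = 4, minbucket = 2))

# Two hypothetical "disputed" essays to classify.
disputed <- data.frame(upon = c(0.1, 3.2), whilst = c(0.5, 0.0))
pred <- predict(fit, newdata = disputed, type = "class")
print(pred)
```

Printing `fit` shows the learned decision rule as a threshold on one of the word frequencies, which is what makes tree models easy to read.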

Cross-validation is a statistical method used to estimate how well a machine learning model will perform on data it was not trained on.
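A minimal sketch of k-fold cross-validation in R is shown below: the data are split into k folds, each fold is held out once while a tree is trained on the rest, and the held-out accuracies are averaged. The data, the number of folds (k = 5), the seed, and the `rpart` control settings are assumptions for illustration.

```r
library(rpart)

set.seed(42)  # make the random fold assignment reproducible

# Hypothetical data: 10 essays per author, one word-frequency feature.
essays <- data.frame(
  upon = c(3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.4, 3.2, 2.6,
           0.2, 0.1, 0.4, 0.3, 0.0, 0.2, 0.5, 0.1, 0.3, 0.2),
  author = factor(rep(c("Hamilton", "Madison"), each = 10))
)

k <- 5
# Assign each row to one of k folds at random.
fold <- sample(rep(1:k, length.out = nrow(essays)))

# For each fold: train on the other folds, test on the held-out fold.
accuracy <- sapply(1:k, function(i) {
  train <- essays[fold != i, ]
  test  <- essays[fold == i, ]
  fit <- rpart(author ~ upon, data = train, method = "class",
               control = rpart.control(minsplit = 4, minbucket = 2))
  pred <- predict(fit, newdata = test, type = "class")
  mean(pred == test$author)  # share of held-out essays classified correctly
})

mean_accuracy <- mean(accuracy)
print(mean_accuracy)
```

Averaging over folds gives a steadier estimate than a single train/test split, since every essay is used for testing exactly once.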