The basic objective of machine learning is to find patterns in data, and one of the most important forms these patterns take is classification. Classification Algorithms in Machine Learning help us determine the class of an object, which allows the computer to make decisions about it based on information about that class. Classification Algorithms in Machine Learning cover several different techniques, such as clustering, decision trees, random forests, support vector machines, Naive Bayes, and logistic regression.
There are some more Classification Algorithms in Machine Learning, but these are the most commonly used ones.
Clustering techniques are among the most common Classification Algorithms in Machine Learning (strictly speaking, clustering is an unsupervised technique, but it is widely used to assign observations to groups). The most basic clustering technique is hierarchical clustering.
In hierarchical clustering, the distance between each pair of observations is measured using Euclidean distance. The closest observations are put into the same cluster, and the average distance between the members of clusters is used for further merging. This continues until only one cluster is left. The number of clusters is chosen at the point where the jump in merge distance becomes too large; it can also be read off the dendrogram that tools such as SPSS produce.
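The merging logic above can be sketched in plain Python. This is a minimal illustration with made-up data, not a production implementation: every observation starts in its own cluster, the closest pair (by average linkage over Euclidean distances) is merged, and the distance of each merge is recorded so the "jump" can be seen.

```python
# Minimal agglomerative (hierarchical) clustering sketch with average
# linkage, standard library only. Data and structure are illustrative.
import math

def average_linkage(c1, c2):
    # Average pairwise Euclidean distance between two clusters' members.
    return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def hierarchical(points):
    # Start with every observation in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # distance at which each merge happened
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(d)
        # Merge the pair and continue until one cluster remains.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Two well-separated groups: the final merge distance jumps sharply,
# which is the signal that two clusters is the right number.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
merge_distances = hierarchical(data)
print(merge_distances)
```

The sharp jump in the last recorded distance plays the same role as the jump one would look for on a dendrogram.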
Once the number of clusters is identified, K-Means Clustering can be used. K-Means Clustering is a better method of grouping like observations together, but it requires the number of clusters as an input. Thus the number of clusters is identified through Hierarchical Clustering, and the membership of the clusters is identified through K-Means Clustering.
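A bare-bones k-means loop looks like this. It is a sketch with hand-picked toy data and initial centroids (a real implementation would choose starting centroids more carefully); the point is that k, the number of clusters, must be supplied up front.

```python
# Minimal k-means sketch in pure Python: alternate between assigning
# points to the nearest centroid and moving centroids to the mean of
# their members. Data and starting centroids are illustrative.
import math

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, centroids=[(0, 0), (10, 10)])
print(centroids)
```

With k = 2 (the number suggested by the hierarchical step above), the two centroids settle on the means of the two natural groups.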
Decision Trees are techniques in which no input is needed apart from the initial dimensions. While the clustering techniques are a bottom-up approach, decision trees are a top-down approach. All the observations are divided according to one dimension, and further dimensions are steadily added until the tree is formed.
One difficulty of choosing decision trees as the Classification Algorithm in Machine Learning is that a decision tree, when allowed to run its course, will end with every observation classified separately. This is called overfitting: the model adheres to the training data so closely that its ability to generalise to new observations is greatly reduced. To counter this problem, the decision tree is "pruned", which means the division stops once all the groups have a reasonable number of members.
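The effect of pruning can be seen in a toy sketch. This is an illustrative single-feature tree with made-up data, where pruning is approximated by a minimum-leaf-size rule: the fully grown tree isolates an outlier, while the pruned tree stops early and keeps a smoother boundary.

```python
# Minimal decision-tree sketch on one numeric feature. A minimum leaf
# size plays the role of pruning: it stops the splitting before every
# observation ends up in its own leaf. Data are made up.

def majority(labels):
    return max(set(labels), key=labels.count)

def build_tree(data, min_leaf=1):
    # data: list of (x, label). Stop when a node is pure or no split
    # would leave at least min_leaf observations on each side.
    labels = [y for _, y in data]
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    best = None
    for t in sorted({x for x, _ in data})[1:]:  # candidate thresholds
        left = [(x, y) for x, y in data if x < t]
        right = [(x, y) for x, y in data if x >= t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        # Score a split by how many points a majority vote misclassifies.
        err = sum(y != majority([l for _, l in left]) for _, y in left)
        err += sum(y != majority([r for _, r in right]) for _, y in right)
        if best is None or err < best[0]:
            best = (err, t, left, right)
    if best is None:
        return {"leaf": majority(labels)}  # no legal split: stop here
    _, t, left, right = best
    return {"threshold": t,
            "left": build_tree(left, min_leaf),
            "right": build_tree(right, min_leaf)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]

# One "a" at x=10 sits inside the "b" region, like a noisy observation.
data = [(1, "a"), (2, "a"), (3, "a"), (8, "b"), (9, "b"), (10, "a")]
full = build_tree(data, min_leaf=1)    # grows until leaves are pure
pruned = build_tree(data, min_leaf=3)  # stops early
print(predict(full, 10), predict(pruned, 10))
```

The full tree memorises the stray "a" at x = 10; the pruned tree classifies that whole region as "b", which is the overfitting trade-off the paragraph above describes.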
Random Forests are another technique used to overcome the problem of overfitting. Here, many bootstrap samples of the training data are drawn using sampling with replacement, and a separate decision tree is built on each sample. The number of trees formed is huge, which means that the bias in any single sample is averaged out over the ensemble. The overall model classifies an observation by taking a majority vote across the decision trees.
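The two ingredients, bootstrap sampling and majority voting, can be sketched with a deliberately weak learner (a one-split "stump") standing in for a full decision tree. Everything here is illustrative: the data, the number of trees, and the stump learner itself.

```python
# Minimal random-forest-style sketch: draw bootstrap samples (sampling
# with replacement), fit a weak one-split classifier on each, and
# classify new points by majority vote. Illustrative data and settings.
import random

def fit_stump(data):
    # Pick the threshold on x that misclassifies the fewest sample points.
    ys_all = [y for _, y in data]
    default = max(set(ys_all), key=ys_all.count)
    best = None
    for t, _ in data:
        def side_label(ge):
            ys = [y for x, y in data if (x >= t) == ge]
            return max(set(ys), key=ys.count) if ys else default
        lo, hi = side_label(False), side_label(True)
        err = sum(y != (hi if x >= t else lo) for x, y in data)
        if best is None or err < best[0]:
            best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: hi if x >= t else lo

def random_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # bootstrap sample
        stumps.append(fit_stump(sample))
    def predict(x):
        votes = [s(x) for s in stumps]
        return max(set(votes), key=votes.count)  # majority vote
    return predict

data = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
predict = random_forest(data)
print(predict(2), predict(9))
```

Individual stumps trained on different bootstrap samples can disagree, but the majority vote smooths out the quirks of any single sample, which is the averaging effect described above.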
Support Vector Machines can be used as a Prediction Algorithm in Machine Learning as well as a Classification Algorithm in Machine Learning; however, they are usually used for classification. In this method, the observations are plotted in an n-dimensional space, where n is the number of dimensions in the dataset. The ideal hyperplane is then chosen, that is, the one that best divides the observations into two classes. While SVM was developed as a Classification Algorithm in Machine Learning for just two classes, it has since been extended to handle more classes.
The dividing surface in the SVM also does not need to be linear in the original space. It is possible to transform the data using a technique called the kernel trick, which makes a linear division possible for a much wider range of datasets.
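The intuition can be shown with a toy feature map. The made-up data below (an inner ring and an outer ring) cannot be split by any straight line in two dimensions, but after adding the feature z = x² + y² a single linear threshold separates them. A real SVM never computes such a mapping explicitly; it evaluates a kernel function (such as the RBF kernel) instead, but the geometric idea is the same.

```python
# Sketch of the idea behind the kernel trick: lift the data into a
# higher-dimensional space where a linear boundary works. Data made up.

inner = [(1, 0), (0, 1), (-1, 0), (0, -1)]   # class -1, radius 1
outer = [(3, 0), (0, 3), (-3, 0), (0, -3)]   # class +1, radius 3

def feature_map(p):
    x, y = p
    return (x, y, x * x + y * y)  # add a squared-radius coordinate

# In the lifted space, the plane z = 5 separates the rings perfectly,
# even though no line in the original 2-D space could.
threshold = 5.0
def classify(p):
    return 1 if feature_map(p)[2] >= threshold else -1

inner_pred = [classify(p) for p in inner]
outer_pred = [classify(p) for p in outer]
print(inner_pred, outer_pred)
```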
Naive Bayes is a Classification Algorithm in Machine Learning that assumes all the dimensions are independent of each other (even if they are in fact interlinked). The classification model is built on this assumption.
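The independence assumption shows up directly in code as a product of per-feature likelihoods. Below is a minimal sketch over categorical features with made-up weather-style data; the add-one smoothing is a common simple choice to avoid zero probabilities, not a fixed part of the algorithm.

```python
# Minimal naive Bayes sketch for categorical features. The class score
# is prior * product of per-feature likelihoods -- the product is where
# the independence assumption lives. Toy data; add-one smoothing.
from collections import Counter, defaultdict

def train(rows):
    # rows: list of (features_tuple, label)
    labels = Counter(label for _, label in rows)
    counts = defaultdict(Counter)  # (feature_index, label) -> value counts
    for feats, label in rows:
        for i, v in enumerate(feats):
            counts[(i, label)][v] += 1
    return labels, counts

def predict(model, feats):
    labels, counts = model
    total = sum(labels.values())
    best = None
    for label, n in labels.items():
        p = n / total  # prior P(label)
        for i, v in enumerate(feats):
            c = counts[(i, label)]
            # Independence assumption: multiply the per-feature
            # likelihoods, smoothed so unseen values are not zero.
            p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)
        if best is None or p > best[0]:
            best = (p, label)
    return best[1]

rows = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes"),
        (("overcast", "hot"), "yes")]
model = train(rows)
print(predict(model, ("rain", "mild")))
```

Even though weather and temperature are clearly interlinked in reality, the model scores them as if they were independent, which is exactly the simplification the paragraph above describes.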
Logistic Regression is a modification of the Linear Regression technique. While regression techniques are generally used as Prediction Algorithms in Machine Learning, Logistic Regression is used as a Classification Algorithm in Machine Learning. Another distinction is that logistic regression predicts a nominal (categorical) outcome, while linear regression predicts a continuous value. The model still runs as a regression, but the predicted value is passed through the logistic function, and the output is 0 or 1 depending on how the result compares to a set threshold. This threshold is 0.5 by default, but it can be changed depending on the application, for example when the costs of the two kinds of misclassification differ.
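The linear-score-plus-threshold structure can be sketched end to end. This toy version fits a single weight and bias by plain stochastic gradient descent on made-up one-dimensional data; the essential point is the last step, where the sigmoid output is compared against the threshold.

```python
# Minimal logistic-regression sketch: a linear score passed through the
# logistic (sigmoid) function, thresholded at 0.5 to yield a 0/1 class.
# Trained by simple per-sample gradient updates on toy data.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit(xs, ys, lr=0.1, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)         # predicted probability
            w += lr * (y - p) * x          # gradient step on log-likelihood
            b += lr * (y - p)
    return w, b

def predict(w, b, x, threshold=0.5):
    # The regression output is a probability; the threshold turns it
    # into a hard 0/1 class decision.
    return 1 if sigmoid(w * x + b) >= threshold else 0

xs = [1, 2, 3, 8, 9, 10]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit(xs, ys)
print([predict(w, b, x) for x in xs])
```

Raising the threshold above 0.5 makes the model more reluctant to output class 1, which is how the trade-off between the two kinds of misclassification is tuned in practice.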
Classification Algorithms are among the most important techniques in Machine Learning since they have huge potential for implementation. The most common application of Classification Algorithms in Machine Learning is in marketing, where segmentation of the target audience is carried out with their help.
Apart from this application, Classification Algorithms in Machine Learning are also used in banks to determine the level of risk each potential customer represents when deciding whether or not to give the person a loan.
There are several other applications of Classification Algorithms in Machine Learning, and they are only going to grow in the future.