Top 10 Machine Learning Algorithms for Newbies

Machine learning starts with correctly interpreting and understanding the problem. One of the biggest hurdles people face is translating a real-world problem into one a machine can learn from, and identifying which machine learning approach will do the trick is key. For newbies, learning the basic families of tasks - classification, regression, clustering and recommendation - is of utmost importance, since they provide the frame for defining a problem statement, features and labels.

The size of the data set and the type of problem determine which approach to take, and it also helps to try several similar algorithms and compare their results on your problem. Most machine learning models are predictive models: they express a result, say Y, as a function of certain input variables, say X and Z. The values of X and Z, and the way these variables relate to each other, differ from model to model. Let us look at the top 10 machine learning algorithms that beginners could start with -

  • Linear Regression - predicts the value of a variable as a real number, such as estimating a budget or the mean of marks in a classroom. A set of input variables is used to predict the value of an output variable, and the two are related by an equation (for example, y = a + bx).
  • Logistic Regression - used for binary classification, such as 0 or 1, YES or NO. It applies the logistic function h(x) = 1 / (1 + e^-x) to map a score to a probability.
  • CART - Classification and Regression Trees (CART) is one way of building decision trees; ID3 and C4.5 are other well-known decision-tree algorithms.
  • Naïve Bayes - calculates the probability of an event happening or not happening, to decide a 'yes' or 'no' value for a variable. It is termed naive because the features are assumed to be independent of one another given the class.
  • KNN - this approach uses the entire dataset as the training set rather than splitting it into training and test sets. To classify a new record, the algorithm scans the whole set to find the k instances nearest to it and lets them vote.
  • Apriori - uses a transactional database to pull out frequently occurring itemsets and then generates association rules from them. For example: if a person buys a car, he is likely to buy fuel. Certain variables follow set patterns.
  • K-means - an iterative algorithm that groups data points into k clusters. Each data point is assigned to the cluster whose central point (centroid) is nearest; the centroids are then recomputed as the mean of their assigned points, and the process repeats until the assignments stop changing.
  • PCA - Principal Component Analysis (PCA) projects the dataset onto a new coordinate system whose axes capture the maximum variability in the data.
  • Bagging with Random Forests - improves on a single decision-tree learner by training many trees, each on a random sample of the records and with each split constructed from a random sample of the predictors, then combining their votes.
  • Boosting with AdaBoost - where bagging is "simple voting", Adaptive Boosting is "weighted voting": each model's vote is weighted when determining the outcome. Bagging builds its models in parallel, while boosting is sequential: each model is built to correct the miscalculations of the previous one.
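As a minimal sketch of the linear-regression idea above, here is the closed-form least-squares fit for one input variable, using made-up numbers (the data and function names are illustrative, not from any particular library):

```python
# Ordinary least squares for one input variable: slope and intercept
# from the textbook closed-form equations.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]            # made-up data lying exactly on y = 2x
slope, intercept = fit_line(xs, ys)
print(slope, intercept)          # 2.0 0.0
```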
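The logistic function named in the logistic-regression bullet can be written directly; this sketch assumes a score has already been computed from the inputs and simply maps it to a 0/1 class:

```python
import math

# The logistic (sigmoid) function from the text: h(x) = 1 / (1 + e^-x).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# A probability of 0.5 or more maps to class 1, otherwise class 0.
def classify(score, threshold=0.5):
    return 1 if sigmoid(score) >= threshold else 0

print(sigmoid(0))        # 0.5
print(classify(2.0))     # 1
print(classify(-2.0))    # 0
```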
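The Naïve Bayes bullet can be illustrated by counting: multiply the class prior by the per-feature likelihoods, assuming the features are independent given the class. The weather data below is hypothetical:

```python
from collections import Counter

# Toy Naive Bayes over categorical features: score each class by
# P(class) * product of P(feature value | class), pick the highest.
def predict(rows, labels, query):
    classes = Counter(labels)
    best, best_score = None, -1.0
    for c, count in classes.items():
        score = count / len(labels)          # prior P(c)
        for i, value in enumerate(query):
            match = sum(1 for row, lab in zip(rows, labels)
                        if lab == c and row[i] == value)
            score *= match / count           # likelihood, independence assumed
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical records: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"), ("rain", "no")]
labels = ["yes", "yes", "no", "yes"]
print(predict(rows, labels, ("sunny", "no")))   # yes
```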
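The KNN bullet translates almost directly into code: scan the whole training set, take the k closest points, and let them vote. The 2-D points below are invented for the sketch:

```python
import math
from collections import Counter

# k-nearest neighbours: no training step, the whole dataset is scanned
# for every query and the k nearest labels vote.
def knn_predict(points, labels, query, k=3):
    dists = sorted(
        (math.dist(p, query), lab) for p, lab in zip(points, labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))   # a
print(knn_predict(points, labels, (5.5, 5.5)))   # b
```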
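The k-means loop described above (assign each point to the nearest centre, then move each centre to the mean of its points) can be sketched in one dimension; the starting centres and data are made up:

```python
# Iterative k-means in 1-D: assignment step, then update step,
# repeated for a fixed number of rounds.
def kmeans(points, centres, rounds=10):
    for _ in range(rounds):
        clusters = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centres = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centres)

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans(points, centres=[1.0, 10.0]))   # [2.0, 11.0]
```

In practice the loop stops when the assignments no longer change rather than after a fixed number of rounds; a round count keeps the sketch short.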

These 10 approaches are essential for a newbie to understand and are part of 'Data Warehousing and Mining' (DWM) methods. The techniques above divide into Supervised, Unsupervised and Ensemble learning.