Apply your Machine learning skills

Vijayesh Vijayan
5 min read · Jun 29, 2021

If you are a machine learning enthusiast who knows the fundamentals, then during your early ML problem-solving days a few questions (sometimes a lot of them) will come to your mind. This happens to many people, and it can be extremely hard to work out the right approach on your own. Here I try to give some simple guidelines and approaches that may help, without requiring you to be an expert in data science or calculus.

Apply ML everywhere?

A simple question, but a little thought is needed here. We may say: any problem that needs prediction or classification (predict a price, classify spam) or recommendation (product suggestions) through data analysis, action-and-reward problems (a checkers game), or, in general, a problem that needs some analysis rather than a routine one. However, you need to weigh the outcome against the cost of development before applying it.

For example, predicting a search query on a simple shopping site using machine learning may not be needed, as a regex match could achieve the purpose within the limited product set the site handles; whereas predictive search using ML on a big e-commerce portal with a wide range of products may give more value for the effort. Likewise, a basic script that checks the email content to see whether an attachment is missing is handier than an ML algorithm that recommends one.

Long story short, ML algorithms are great tools when used on the right problems. Not every automation problem needs them. When there is data that needs analysis, a predicted output, a recommendation etc., and that adds value to the product or process, go for machine learning.

Will any algorithm give the result for my problem?

Classification, regression, supervised, unsupervised, reinforcement… you may have heard a lot of such terms and may understand the concepts too. Knowing when to use what, and what kind of problem is at hand, may be the tough part in your starting days.

Look at the data and problem statement you have.

Along with the input variables, is the result variable also available? Then it’s a supervised learning problem. E.g. you have data on temperature, cloud cover and month, along with whether it rained or not; then for a new set of values you can predict whether it will rain. In other words, if you have a labelled result, then it’s supervised learning.

If your labelled result has continuous values, like the percentage of growth in fuel price, then it’s a regression problem; go for regression algorithms.

Otherwise, if the labelled result can only be one of a set of discrete values, like {0, 1} or {apple, orange, banana, mango}, then go for a classification algorithm.
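As a minimal sketch of this split (using scikit-learn and toy weather numbers I made up for illustration), the same kind of labelled data drives either a regressor or a classifier, depending on whether the target is continuous or discrete:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy supervised data: [temperature, cloud_cover] per day (values invented).
X = [[30, 0.1], [22, 0.8], [25, 0.5], [18, 0.9], [28, 0.2], [20, 0.7]]

# Continuous target (say, rainfall in mm) -> regression.
y_mm = [0.0, 12.5, 3.2, 20.1, 0.5, 9.8]
reg = LinearRegression().fit(X, y_mm)
print(reg.predict([[21, 0.85]]))        # a rainfall estimate

# Discrete target (rained: 0 or 1) -> classification.
y_rained = [0, 1, 0, 1, 0, 1]
clf = LogisticRegression().fit(X, y_rained)
print(clf.predict([[21, 0.85]]))        # 0 or 1
```

The only difference between the two fits is the nature of the labelled target column.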

From the data, if you have to find some grouping, and the groups are not named or labelled, it’s unsupervised learning. So go for an unsupervised learning algorithm, like a clustering algorithm, where you get clusters or groupings based on data that was not labelled. Another type of problem is the recommendation system, in which the result group is also unlabelled.
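A clustering sketch along the same lines, using scikit-learn’s KMeans on made-up 2-D points; the two groups and the choice of two clusters are my assumptions for illustration:

```python
from sklearn.cluster import KMeans

# Unlabelled 2-D points that happen to form two rough groups (invented data).
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]

# Ask for two clusters; no labels are given, the grouping comes from the data alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)       # cluster index assigned to each point
```

The cluster indices themselves are arbitrary; it is the grouping that the algorithm discovers.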

Many real-world problems have actions and rewards associated with them. You make a move on the checkers board; that can have a positive or negative impact on your chances of winning. Learning that kind of problem is reinforcement learning.
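To make the action-and-reward idea concrete, here is a toy sketch of tabular Q-learning, one common reinforcement learning method, on a made-up one-dimensional walk rather than checkers; the states, rewards and hyperparameters are all invented for illustration:

```python
import random

# Toy tabular Q-learning: states 0..4 on a line, actions 0 = left, 1 = right;
# reaching state 4 yields reward 1 and ends the episode.
GOAL = 4
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0, 0.0] for _ in range(GOAL + 1)]

random.seed(0)
for _ in range(200):                                  # episodes
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        a = random.randrange(2) if random.random() < epsilon \
            else max((1, 0), key=lambda act: Q[s][act])
        s2 = min(GOAL, s + 1) if a == 1 else max(0, s - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: nudge Q toward reward + discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The learned action per non-goal state; "right" should win everywhere.
policy = [max((1, 0), key=lambda act: Q[s][act]) for s in range(GOAL)]
print(policy)
```

The same update rule drives far bigger problems like checkers; only the state and action spaces grow.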

Still, a lot of algorithms to choose from

Yes. I consider primarily two things: bias and variance. Use a comfortable algorithm suited to your problem type, then find out whether there is high bias or high variance. From there you can either choose another algorithm or try some optimization.

High variance (overfitting) means your algorithm learns so well that it fits all the training data, and fails to be flexible for any new data; i.e. testing errors will be high. High bias (underfitting) means the algorithm fails to fit even the given training data.

Linear algorithms tend to be high-bias algorithms (linear regression, logistic regression etc.); others can be considered high-variance ones (SVM, decision tree etc.). To find out bias and variance, go for a learning curve: a plot of prediction error vs size of the training data. Do this on both training and testing data. If the error is low on the training set and high on the testing set, it is high variance. If the training data has a high error, and the testing error stays about the same, it’s high bias.
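One way to compute such a curve is scikit-learn’s learning_curve helper; the synthetic dataset and the choice of a decision tree (a typically high-variance learner) are my assumptions here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data; an unpruned decision tree tends to overfit it.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Convert accuracy scores to error and compare the two curves.
train_err = 1 - train_scores.mean(axis=1)
test_err = 1 - test_scores.mean(axis=1)
for n, tr, te in zip(sizes, train_err, test_err):
    print(f"n={n:3d}  train error={tr:.2f}  test error={te:.2f}")
```

A near-zero training error alongside a noticeably higher test error is the high-variance signature described above.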

Tuning

Regularization is one shield for your algorithm against overfitting. Consider it based on the variance-bias factors. Other ways to reduce high variance are to use fewer input variables or lower polynomial complexity (if using polynomial degrees on inputs), and to add more data to the training set.
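As a sketch of regularization at work, this compares plain linear regression against scikit-learn’s Ridge (which adds an L2 penalty on large coefficients) on noisy toy data with degree-9 polynomial inputs; all numbers are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a sine curve; a degree-9 polynomial fit will chase the noise.
rng = np.random.RandomState(0)
X = np.linspace(0, 1, 15).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 15)

# Same polynomial features, with and without the L2 penalty.
plain = make_pipeline(PolynomialFeatures(9, include_bias=False),
                      LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(9, include_bias=False),
                      Ridge(alpha=1.0)).fit(X, y)

print("unregularized coef norm:",
      np.linalg.norm(plain.named_steps["linearregression"].coef_))
print("ridge coef norm:        ",
      np.linalg.norm(ridge.named_steps["ridge"].coef_))
```

The ridge model’s coefficients come out much smaller than the unregularized ones, which is what keeps it from chasing the noise.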

For high bias, go for more input variables, more complex input features, and no or minimal regularization. Here, adding more training data is unlikely to help. That’s why knowing whether your algorithm has high bias or high variance is so important: it helps avoid the cost of applying the wrong optimization method.

Bias, variance and other irreducible errors occur in your learning model. Finding the sweet spot between high variance and high bias is the key.

Another thing to consider is normalization of the input variables. If, for example, your input x1 ranges from 1 to 100 and x2 ranges from 0 to 1, this can have an impact on learning. Normalize all the input variables to the same range before applying the algorithm.
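A minimal normalization sketch, assuming scikit-learn’s MinMaxScaler and the example ranges above:

```python
from sklearn.preprocessing import MinMaxScaler

# x1 spans 1..100 while x2 spans 0..1, as in the example above (toy rows).
X = [[1, 0.2], [50, 0.9], [100, 0.5]]

scaler = MinMaxScaler()               # rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

Keep the fitted scaler around: the same transformation must be applied to any new data before prediction.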

Save the learning model

Many times, after ML models are created, verified and optimized, they are serialized to a file; in Python this is commonly a pickle file. You can deserialize such a saved pickle file to load the pre-trained model, which makes your program run faster. In other words, you don’t have to apply all the complex algorithms to the data and build the model every time you want a result; reuse the learned model. However, make sure to refresh the model with newer data, either periodically or based on some condition, depending on the problem you have.
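A sketch of the save-and-reload cycle with Python’s pickle module; the toy model and file name are my own inventions (for scikit-learn models, joblib is often suggested as an alternative):

```python
import pickle

from sklearn.linear_model import LogisticRegression

# A tiny stand-in model (toy data invented for illustration).
X, y = [[0], [1], [2], [3]], [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Serialize the trained model to a pickle file...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and later deserialize it, skipping the training step entirely.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict([[2.5]]))
```

Only unpickle files you trust; pickle can execute arbitrary code during loading.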

If you understand the idea of machine learning and some basic algorithms, you are all set to start applying them to ML problems. The above guidelines and approaches serve as tools to choose your algorithm and tune the right factor. A lot of trial and error goes on here, which should be fine. Having said this, if you dive a bit deeper into how each algorithm reduces its cost function, some of the performance metrics and so on, it will help you become an expert even faster.


Vijayesh Vijayan

IT professional, scrum enthusiast, cricket lover, multicultural sensitive, humorous, nature lover, and many more...