On Tuesday, I attended a Meetup event at General Assembly that included two talks, one from Dataiku and one from the Memorial Sloan Kettering Cancer Center. Hosted by Dataiku, the evening started off with Will Nowak, Solutions Architect at Dataiku, and his talk, “Do you do AutoML?”

## What is AutoML?

But first, what is ML? Machine learning is the study of algorithms and statistical models that let a machine (or a computer) learn how to solve a problem without explicit instructions. Essentially, a machine can learn on its own based on past experience. The machine learning workflow consists of several steps, and AutoML automates them so that you can solve a problem without working through each step by hand.

In the past, machine learning was written entirely in code. Today, machine learning can be done in platforms that replace much of the code with buttons for setting the features of your model and picking models to test. The future is AutoML, where automation platforms will do everything from start to finish.

### How can the ML Process be Automated?

Automation can be used throughout the machine learning process, from generating linear or multiplicative features to removing redundant or low-quality data. AutoML can use the type of prediction problem to pick which types of algorithm to test, a choice that is usually made manually. Additionally, it can run hyperparameter optimization, searching over a bounded set of parameter values for the model it chooses.
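To make the idea concrete, here is a minimal sketch of the "pick a model and tune it" step. It is not how Dataiku or any other AutoML platform works internally; it assumes a toy classification problem, uses a hand-written k-nearest-neighbors classifier and a majority-class baseline as stand-in candidates, and scores them on a simple holdout set. All function names (`knn_predict`, `auto_select`, and so on) are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_predict(train, x):
    """Baseline: always predict the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predict, train, valid):
    """Fraction of validation points that the predictor labels correctly."""
    hits = sum(predict(train, x) == y for x, y in valid)
    return hits / len(valid)


def auto_select(train, valid, k_grid=(1, 3, 5, 7)):
    """Score every candidate (baseline plus one k-NN per k) and keep the best."""
    candidates = {"majority_baseline": majority_predict}
    for k in k_grid:
        # Bind k as a default argument so each lambda keeps its own value.
        candidates[f"knn(k={k})"] = lambda tr, x, k=k: knn_predict(tr, x, k)
    scores = {name: accuracy(fn, train, valid) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


def make_blob(center, label, n):
    """Generate n noisy 2D points around a center (synthetic data)."""
    return [
        ((random.gauss(center[0], 0.5), random.gauss(center[1], 0.5)), label)
        for _ in range(n)
    ]


random.seed(0)
data = make_blob((0, 0), "a", 40) + make_blob((3, 3), "b", 40)
random.shuffle(data)
train, valid = data[:60], data[60:]

best, scores = auto_select(train, valid)
print("best candidate:", best)
print("holdout accuracy:", scores[best])
```

The loop in `auto_select` is the part a platform would hide behind a button: it enumerates candidate models and hyperparameter values, evaluates each one the same way, and reports the winner alongside a baseline for comparison.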

### When Should You use AutoML?

AutoML is useful when you are creating quick prototype models, need to measure baseline performance, or want to speed up everyday mundane tasks. AutoML can really change the landscape of work for data scientists. For more information on how, read Dataiku’s white paper, “The Importance of AutoML for Augmented Analytic and the Rise of the Citizen Data Scientist”.

## Bayesian Methods and Overlapping Clusters

The second half of the evening included a talk from Sandhya Prabhakaran, Research Fellow at Memorial Sloan Kettering Cancer Center, on: “A Bayesian Approach To Model Overlapping Objects Available As Distance Data”.

### What is a Cluster?

In machine learning, a cluster is simply a group of similar data observations. Clustering is a part of unsupervised machine learning, in which a machine works with unlabeled data and finds patterns in it to make inferences and group the observations. Clustering is one of the most common kinds of unsupervised learning and is applicable in many fields.
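As a small illustration of grouping unlabeled data, the sketch below runs a bare-bones k-means loop on one-dimensional values. K-means is used here only because it is a familiar example; the talk did not name a specific algorithm, and the function name `kmeans_1d`, the data, and the starting centers are all made up for this example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Each value joins exactly one group: the one with the closest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # no labels attached
centers, groups = kmeans_1d(values, centers=[0.0, 6.0])
print(groups)   # two groups, one around 1 and one around 5
print(centers)
```

Note that every value ends up in exactly one group. This kind of hard, one-group-per-observation assignment is what most clustering methods produce.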

Sandhya observed that most applications of clustering pertained to mutually exclusive clusters, while in reality data observations may actually belong to overlapping clusters. Therefore, Sandhya’s goal was to find a way to model data observations that can exist in multiple clusters.

### A Bayesian Approach

Sandhya found that the best way to approach this problem was to use Bayesian methods. Bayesian methods treat unknown quantities as probability distributions, meaning a model can be specified before any data is collected (the prior distribution) and then updated once data is observed (the posterior distribution).

This can be applied to finding overlapping clusters by taking the pairwise distances between data observations and using Bayesian inference to combine them with a prior, yielding a posterior over how the observations group together. Specifically, a Beta-Binomial model can estimate the probability that each pair of observations is similar or dissimilar, and observations are assigned to clusters accordingly. In this case, the model decides how many clusters are needed based on the data provided.
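The prior-to-posterior update at the heart of a Beta-Binomial model can be sketched in a few lines. This is only the textbook conjugate update, not the model from Sandhya's paper: it assumes a Beta prior on a single "probability of similarity", and the counts (7 "similar" outcomes in 10 comparisons) are invented for illustration. The function names are hypothetical.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Return Beta posterior parameters after observing binomial data.

    With a Beta(alpha, beta) prior and `successes` out of `trials`,
    the posterior is Beta(alpha + successes, beta + failures).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Prior: Beta(1, 1) is uniform, i.e. no initial preference (an assumption).
prior = (1.0, 1.0)

# Made-up evidence: 7 of 10 comparisons came out as "similar".
posterior = beta_binomial_update(*prior, successes=7, trials=10)

print("prior mean:", beta_mean(*prior))
print("posterior parameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
```

The prior mean starts at 0.5 and the posterior mean moves toward the observed rate of 0.7 without reaching it, which shows how the prior and the data are blended rather than one replacing the other.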

### Overlapping Clusters of HIV Protease Inhibitors

To demonstrate the benefits of this approach, Sandhya applied it to HIV protease inhibitors. There are currently 10 HIV protease inhibitors, but all exhibit similar behaviors and are toxic when administered as anti-HIV drugs. Sandhya wanted to know if there were any alternatives, specifically by examining any dissimilarities between the existing protease inhibitors through overlapping clusters.

Using the Bayesian approach, Sandhya was able to identify differences among these HIV protease inhibitors. Her full paper covers the details.