Comparative Analysis of Machine Learning Algorithms on Different Datasets

Abstract  

This project investigates the performance of various machine learning models on classification tasks involving medical data. The study is conducted on the Google Colab platform, using Python as the programming language and scikit-learn, a user-friendly machine learning library, as the primary tool for assessing and comparing the effectiveness of different classifiers. Each dataset is loaded and split into training and testing sets with an 80:20 ratio, different classifiers are applied, and accuracy is used as the evaluation metric. This research provides insights into the comparative performance of machine learning models in medical data classification, offering a practical guide for selecting appropriate algorithms in similar applications.

1. Introduction 

Machine learning is a field of computer science that uses data and algorithms to mimic human learning ability [1]. It plays a very important role in data science. Machine learning approaches fall into three primary categories [1]:

1. Supervised learning  

Supervised machine learning develops a model that makes predictions from the available data in the face of uncertainty. Using a known set of input data and the corresponding known output responses, a supervised learning algorithm trains a model to produce plausible predictions for the responses to new data.

2. Unsupervised learning  

Unsupervised learning is machine learning that takes place in the absence of human supervision. In contrast to supervised learning, unsupervised models are given unlabeled data and left to find patterns and insights on their own, without explicit direction or instruction.

3. Semi-supervised learning  

Semi-supervised learning is a broad category of machine learning techniques that make use of both labeled and unlabeled data; that is, it is a hybrid approach between supervised and unsupervised learning.

The main idea behind semi-supervision is to treat a data point differently depending on whether or not it has a label. If a point has a label, the algorithm updates the model weights using traditional supervision; if it does not, the algorithm minimizes the difference between its predictions for that point and for other, similar training examples (see the sketch below).
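As a concrete illustration of this idea (not part of the present study), the following minimal sketch uses scikit-learn's LabelSpreading, which propagates labels from labeled points to similar unlabeled points; the choice of the Iris data, the fraction of hidden labels, and the kernel are assumptions made purely for demonstration.

    # Illustrative semi-supervised example (not used in this project).
    # LabelSpreading propagates labels from labeled samples to similar unlabeled ones.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.semi_supervised import LabelSpreading

    X, y = load_iris(return_X_y=True)

    # Hide most labels: scikit-learn marks unlabeled points with -1.
    rng = np.random.RandomState(0)
    y_partial = y.copy()
    hidden = rng.rand(len(y)) < 0.7          # hide roughly 70% of the labels
    y_partial[hidden] = -1

    model = LabelSpreading(kernel="rbf")     # similarity-based label propagation
    model.fit(X, y_partial)

    # Agreement between the propagated labels and the true (hidden) labels.
    print("Accuracy on hidden labels:",
          (model.transduction_[hidden] == y[hidden]).mean())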

In the context of machine learning, a classifier is an algorithm that automatically sorts or groups data into one or more "classes". An email classifier, which filters emails based on their class label (Spam or Not Spam), is one of the most popular examples. Machine learning algorithms are valuable for automating tasks that were traditionally performed manually, and they have the potential to improve cost and time efficiency significantly. Data analysis, which involves many different types of data, is a crucial part of biomedical engineering. In this project, we used datasets from the UCI repository [2].

The aim of this project is to examine the effectiveness of different classifiers in the analysis of medical data. The first objective of this study is to develop a comprehensive understanding of the data, including its attributes, classes, and types. We then implement different classifiers, namely k-nearest neighbours, support vector machines, and decision trees, to determine their performance in medical data analysis.


2. Material and methods 

This section explains the materials and the research method. First, the datasets are described, and then the classifiers that are implemented are presented.

2.1. Dataset 

In machine learning, a “dataset” is a collection of information a computer uses to learn from.  It is similar to how, for example, a student learns from different subjects in school. One of the  most well-known sources for these datasets is the UCI Machine Learning Repository [3]. It is  a large online resource of various datasets for different machine-learning tasks. 

In this study, we use three famous datasets from the UCI repository: 

A. Iris Dataset: This dataset is about iris flowers. It includes measurements of 150 iris  flowers, focusing on the lengths and widths of their petals and sepals. It’s commonly used  in machine learning to help computers learn how to classify different species of iris flowers  based on these measurements. 

B. Wine Dataset: It contains data about the chemical analysis of wines, including attributes such as hue and colour intensity. This dataset is useful for teaching computers to differentiate between different types of wine based on their chemical properties.

C. Breast Cancer Dataset: This dataset provides detailed information about breast cancer  characteristics, focusing on the attributes of cell nuclei in breast cancer samples. The task  is to distinguish between benign (non-dangerous) and malignant (harmful) cancer cells,  which is crucial in medical diagnoses. 

Table 1 provides information about the datasets used in this project.

Table 1. The details of the datasets: number of attributes, classes, and samples

Dataset         Number of attributes   Classes   Samples
Iris            4                      3         150
Wine            13                     3         178
Breast cancer   30                     2         569
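For reference, the following is a minimal sketch, not taken from the project code, of how these three datasets (scikit-learn ships bundled copies of the UCI data) can be loaded and their dimensions inspected.

    from sklearn.datasets import load_breast_cancer, load_iris, load_wine

    # The three UCI datasets used in this study, as bundled with scikit-learn.
    datasets = {
        "Iris": load_iris(),
        "Wine": load_wine(),
        "Breast cancer": load_breast_cancer(),
    }

    for name, data in datasets.items():
        n_samples, n_attributes = data.data.shape
        n_classes = len(data.target_names)
        print(f"{name}: {n_attributes} attributes, {n_classes} classes, {n_samples} samples")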

2.2. Methods 

We used Google Colab as the platform for implementing the project. The code is written in Python, and scikit-learn is used for the machine learning implementations. The train-test split ratio is 80% to 20%. We used three popular machine learning techniques: k-nearest neighbours, decision trees, and support vector machines (SVM).

We evaluated the effectiveness of these techniques using a metric known as accuracy, i.e. the proportion of test samples for which the method predicted the correct class. This metric is useful in determining which technique is most appropriate for different types of data, and it allowed us to make a clear comparison and gain a better understanding of the advantages and disadvantages of each technique.
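A minimal sketch of this pipeline, assuming the setup described above (80:20 split, the three classifiers, and accuracy as the metric), is given below; the default scikit-learn hyperparameters and the random_state value are assumptions, since they are not specified in the text.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    # Example with the breast cancer dataset; load_iris or load_wine work the same way.
    X, y = load_breast_cancer(return_X_y=True)

    # 80:20 train-test split, as used in this project (random_state is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    classifiers = {
        "K-Nearest Neighbours": KNeighborsClassifier(),   # default k = 5
        "Decision Trees": DecisionTreeClassifier(random_state=42),
        "SVM": SVC(),                                     # default RBF kernel
    }

    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        print(f"{name}: accuracy = {acc:.3f}")

Replacing load_breast_cancer with load_iris or load_wine repeats the same comparison on the other two datasets.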


3. Results 

This section provides the experimental results of implementing the machine learning algorithms on the selected UCI datasets. Table 2 presents the accuracy of the three machine learning algorithms (k-nearest neighbours, decision trees, and support vector machines) applied to the three datasets described in Section 2.1.

Table 2. Accuracy of the machine learning algorithms on the three datasets

Dataset         K-Nearest Neighbour   Decision Trees   SVM
Iris            1.0                   1.0              1.0
Wine            0.805                 0.944            0.805
Breast cancer   0.929                 0.947            0.947

The results show a perfect accuracy of 1.0 for all three algorithms on the Iris dataset. For the Wine dataset, the performance varied: decision trees achieved the highest accuracy (0.944), while both k-nearest neighbours and SVM obtained a lower accuracy of 0.805. On the breast cancer dataset, decision trees and SVM showed similarly high performance with an accuracy of 0.947, slightly better than k-nearest neighbours, which achieved 0.929.

4. Discussion  

In this project, we have investigated and implemented different machine learning algorithms: k-nearest neighbours, decision trees, and SVM. The experimental results show that the effectiveness of these algorithms varies depending on the dataset characteristics. Increasing the number of features in a dataset complicates the classification task, because the algorithm must consider more factors when making decisions, which can lead to increased error. The number of classes and the number of samples also affect classifier performance: too few samples may not be enough for accurate classification, whereas a larger number of samples can improve the learning process. In conclusion, considering these factors is essential for careful dataset selection and preparation in machine learning projects.

References  

[1] IBM, "Machine learning," https://www.ibm.com/id-en/topics/machine-learning

[2] UCI Machine Learning Repository, https://archive.ics.uci.edu/

[3] Wolberg, W., Mangasarian, O., Street, N., and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B
