Comparative Analysis of Machine Learning Algorithms in Different Datasets
Abstract
This project investigates the performance of several machine learning models on classification tasks, with a focus on medical data. The study uses Python and scikit-learn, a user-friendly machine learning library, on the Google Colab platform to assess and compare different classifiers. Each dataset is loaded and split into training and testing sets with an 80:20 ratio. Different classifiers are then applied, and accuracy is used as the evaluation metric. The results offer insight into the comparative performance of machine learning models in medical data classification and a practical guide for selecting appropriate algorithms in similar applications.
1. Introduction
Machine learning is a field of computer science that uses data and algorithms to mimic the human ability to learn [1]. It plays an important role in data science. Machine learning methods fall into three primary categories [1]:
1. Supervised learning
Supervised machine learning builds a model that makes predictions from the available data in the face of uncertainty. Given a known set of input data and the known responses to that data (the outputs), a supervised learning algorithm trains a model to produce plausible predictions for the response to new data.
2. Unsupervised learning
In artificial intelligence, unsupervised learning is machine learning that takes place without human supervision. In contrast to supervised models, unsupervised models are given unlabeled data and left to find patterns and insights on their own, without explicit direction or instruction.
3. Semi-supervised learning
Semi-supervised learning is a broad category of machine learning techniques that make use of both labeled and unlabeled data; that is, it is a hybrid approach between supervised and unsupervised learning.
The main idea behind semi-supervision is to treat a data point differently depending on whether or not it has a label. If a point has a label, the algorithm updates the model weights using traditional supervision; if it does not, the algorithm minimizes the difference between its prediction for that point and its predictions for similar training examples.
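The idea of mixing labeled and unlabeled points can be sketched in scikit-learn. This is only an illustration: it uses label spreading, a graph-based method, rather than the consistency-based approach described above, and it was not part of the experiments in this project. The Iris data, the choice to keep every fifth label, and the `n_neighbors=7` setting are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)

# Keep the label of every fifth sample and mark the rest as unlabeled.
# scikit-learn uses -1 to mean "no label" in its semi-supervised estimators.
y_partial = np.full_like(y, -1)
y_partial[::5] = y[::5]

# Assumption: a k-nearest-neighbour graph with 7 neighbours (illustrative only).
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label assigned to every training point,
# including the points that started without a label.
unlabeled = y_partial == -1
agreement = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print(f"Unlabeled points: {unlabeled.sum()}, agreement with true labels: {agreement:.3f}")
```

Labeled points anchor the classes, and the unlabeled points receive labels from their neighbours in feature space.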
In the context of machine learning, a classifier is an algorithm that automatically sorts or groups data into one or more “classes.” One of the most popular examples is an email classifier, which filters emails based on their class label (Spam or Not Spam). Machine learning algorithms are valuable for automating tasks that were traditionally performed manually, and they can significantly improve cost and time efficiency. Data analysis is a crucial part of biomedical engineering. In this project, we used datasets from the UCI repository [2].
The aim of this project is to examine the effectiveness of different classifiers in the analysis of medical data. The first objective is to develop a thorough understanding of the data, including its attributes, classes, and types. The second is to implement different classifiers, namely k-nearest neighbours, support vector machines, and decision trees, and to compare their performance in medical data analysis.
2. Material and methods
This section describes the materials and the research method. The datasets are described first, followed by the classifiers that were implemented.
2.1. Dataset
In machine learning, a “dataset” is a collection of information that a computer uses to learn from, much as a student learns from different subjects in school. One of the most well-known sources of such datasets is the UCI Machine Learning Repository [2], a large online collection of datasets for different machine learning tasks.
In this study, we use three famous datasets from the UCI repository:
A. Iris Dataset: This dataset is about iris flowers. It includes measurements of 150 iris flowers, focusing on the lengths and widths of their petals and sepals. It’s commonly used in machine learning to help computers learn how to classify different species of iris flowers based on these measurements.
B. Wine Dataset: This dataset contains data about the chemical composition of different wines, including attributes such as hue and colour intensity. It is useful for teaching computers to differentiate between types of wine based on their chemical properties.
C. Breast Cancer Dataset [3]: This dataset provides detailed information about breast cancer characteristics, focusing on the attributes of cell nuclei in breast tissue samples. The task is to distinguish between benign (non-dangerous) and malignant (harmful) samples, which is crucial in medical diagnosis.
Table 1 summarizes the datasets used in this project.
Table 1. The details of the dataset; dimensionality, ….
Dataset | Number of attributes | Classes | Samples |
Iris | 4 | 3 | 150 |
Wine | 13 | 3 | 178 |
Breast cancer | 30 | 2 | 569 |
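The figures in Table 1 can be checked with a short script. This sketch assumes the copies of the three datasets bundled with scikit-learn (`load_iris`, `load_wine`, `load_breast_cancer`) match the UCI versions used here; the text does not state how the data were loaded.

```python
from sklearn.datasets import load_breast_cancer, load_iris, load_wine

# Assumption: the scikit-learn bundled copies stand in for the UCI downloads.
loaders = {
    "Iris": load_iris,
    "Wine": load_wine,
    "Breast cancer": load_breast_cancer,
}

summary = {}
for name, loader in loaders.items():
    data = loader()
    n_samples, n_features = data.data.shape
    n_classes = len(data.target_names)
    summary[name] = (n_features, n_classes, n_samples)
    print(f"{name}: {n_features} attributes, {n_classes} classes, {n_samples} samples")
```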
2.2. Methods
We used Google Colab as the platform for implementing the project. The code is written in Python, and scikit-learn is used for the machine learning implementations. Each dataset is split into training and testing sets with a ratio of 80% to 20%. We used three popular machine learning techniques: k-nearest neighbour, decision trees, and support vector machines (SVM).
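A minimal sketch of this procedure is shown below. It is not the project's original code: the bundled scikit-learn datasets, the default hyperparameters of each classifier, and the `random_state` value of 42 are all assumptions, so the scores it prints will not necessarily match the ones reported in this paper. The helper name `evaluate` is hypothetical.

```python
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(loader, seed=42):
    """Split one dataset 80:20 and return the accuracy of each classifier."""
    X, y = loader(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Assumption: default hyperparameters for all three classifiers.
    classifiers = {
        "k-nearest neighbour": KNeighborsClassifier(),
        "decision tree": DecisionTreeClassifier(random_state=seed),
        "SVM": SVC(),
    }
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, clf.predict(X_test))
    return scores, len(y_test)


datasets = {"Iris": load_iris, "Wine": load_wine, "Breast cancer": load_breast_cancer}
results = {name: evaluate(loader) for name, loader in datasets.items()}

for name, (scores, n_test) in results.items():
    print(name, f"(test samples: {n_test})")
    for clf_name, acc in scores.items():
        print(f"  {clf_name}: {acc:.3f}")
```

Fixing `random_state` makes the split repeatable; a different seed gives a different test set and may change the scores.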
We evaluated the effectiveness of these techniques using accuracy as the metric. Accuracy is the fraction of test samples for which a classifier predicts the correct class. Using the same metric for every classifier and dataset allows a clear comparison and helps to show which technique is most appropriate for different types of data, as well as the advantages and disadvantages of each technique.
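As a small worked example of the metric, the snippet below computes accuracy by hand and with scikit-learn's `accuracy_score`. The two label lists are made up for illustration and do not come from the experiments.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five test samples.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy = number of correct predictions / number of predictions.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
library = accuracy_score(y_true, y_pred)

print(manual, library)
```

Four of the five predictions match the true labels, so both values are 0.8.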
3. Results
This section provides the experimental results of implementing the machine learning algorithms on the selected UCI datasets. Table 2 presents the accuracy of three machine learning algorithms (k-nearest neighbour, decision trees, and support vector machines) applied to the three datasets described in Section 2.1.
Table 2. Accuracy of machine learning algorithms on different datasets
Dataset | K-Nearest Neighbour | Decision Trees | SVM |
Iris | 1.0 | 1.0 | 1.0 |
Wine | 0.805 | 0.944 | 0.805 |
Breast cancer | 0.929 | 0.947 | 0.947 |
All three algorithms achieved a perfect accuracy of 1.0 on the Iris dataset. For the Wine dataset, the performance varied: decision trees achieved the highest accuracy (0.944), while both k-nearest neighbour and SVM obtained a lower accuracy of 0.805. On the Breast cancer dataset, decision trees and SVM both reached 0.947, slightly better than k-nearest neighbour at 0.929.
4. Discussion
In this project, we investigated and implemented three machine learning algorithms: k-nearest neighbour, decision trees, and SVM. The experimental results show that the effectiveness of these algorithms varies with the characteristics of the dataset. Increasing the number of features can complicate the classification task, because the algorithm must consider more factors when making decisions, which can sometimes lead to increased error. The number of classes and the number of samples can also affect the performance of the classifiers. Too few samples may not be enough for accurate classification, whereas a larger number of samples can improve the learning process. In conclusion, considering these factors is essential for careful dataset selection and preparation in machine learning projects.
References
[1] https://www.ibm.com/id-en/topics/machine-learning
[2] https://archive.ics.uci.edu/
[3] Wolberg, William, Mangasarian, Olvi, Street, Nick, and Street, W. (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B