Model Card for Model ID
Model Details
Model Description
This is a predictive machine learning model that classifies data points into distinct disease categories based on symptoms. It was trained on a symptom-disease dataset.
- Developed by: Priyanka Kamila
- Model type: RandomForestClassifier, SVC
- Language(s) (NLP): EN
Uses
Direct Use
This model can be directly used for disease diagnosis based on binary encoded medical features. By inputting patient symptoms in the form of binary vectors, the model predicts the likely medical condition. Here’s how you can utilize the model:
Prepare Input Data:
Ensure that the input data is formatted as a binary matrix, where each row represents a patient and each column represents a symptom or feature. The target variable should be a categorical label representing the medical condition.
Load the Model:
Load the trained Random Forest Classifier or SVM Classifier from the repository. You can use libraries like joblib or pickle in Python to load the pre-trained model.
Make Predictions:
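Use the loaded model to make predictions on new input data. For instance, in Python:
```python
import joblib

# Load the pre-trained classifier from disk (replace with the actual file path).
model = joblib.load('path_to_model.pkl')

# new_input_data is a binary matrix with one row per patient and one column per symptom.
predictions = model.predict(new_input_data)
```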
Interpret Results:
The model will output the predicted medical condition for each input row. These predictions can be used by healthcare professionals to assist in diagnosing patients.
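As a minimal sketch of the input-preparation step above, the snippet below turns a list of reported symptoms into a single binary row. The symptom names and the `SYMPTOM_COLUMNS` ordering are hypothetical placeholders. The real column order must match the one used during training, which this card does not list.

```python
import numpy as np

# Hypothetical symptom vocabulary (an assumption for illustration only).
# The real list and its order must match the training feature columns.
SYMPTOM_COLUMNS = ["fever", "cough", "headache", "joint_pain", "skin_rash"]

def encode_symptoms(symptoms, columns=SYMPTOM_COLUMNS):
    """Return a 1 x n_features binary row: 1 if the symptom is present, else 0."""
    present = set(symptoms)
    unknown = present - set(columns)
    if unknown:
        # Fail loudly instead of silently dropping symptoms the model never saw.
        raise ValueError(f"Unknown symptoms: {sorted(unknown)}")
    return np.array([[1 if name in present else 0 for name in columns]], dtype=np.int8)

row = encode_symptoms(["fever", "joint_pain"])
print(row)
```

The resulting `row` has the shape that `model.predict` expects for a single patient. Several patients can be stacked with `np.vstack`.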
This model is intended for direct use in clinical decision support systems or healthcare applications where quick and accurate disease diagnosis is critical. It can be integrated into electronic health records (EHR) systems, patient management software, or used as a standalone diagnostic tool.
Out-of-Scope Use
This model is designed specifically for diagnosing diseases based on binary encoded medical features. It is important to recognize the limitations and potential misuse of the model:
Non-Medical Applications:
The model is not suitable for non-medical applications or any use cases outside of healthcare diagnostics. Using this model for unrelated classification tasks will yield inaccurate and irrelevant results.
Incomplete or Inaccurate Input Data:
The model relies on precise binary encoding of medical symptoms. Providing incomplete, inaccurate, or improperly formatted data can lead to incorrect diagnoses. It is crucial to ensure that input data is complete and correctly formatted according to the binary encoding schema used during model training.
Real-Time Critical Decisions:
While the model can aid in diagnosis, it should not be solely relied upon for real-time critical medical decisions without human oversight. Healthcare professionals should verify the model’s predictions and consider additional clinical information and diagnostics before making final decisions.
Malicious Use:
The model should not be used to intentionally misdiagnose or manipulate medical diagnoses for fraudulent purposes. Ensuring ethical use of the model is paramount, and it should only be used to assist in improving patient care.
Diagnostic Scope Limitation:
The model is trained on specific diseases included in the dataset. It may not perform well in diagnosing conditions outside the scope of its training data. For diseases not represented in the training data, the model might default to predicting "other," which should be interpreted with caution.
General Population Screening:
This model is not intended for general population screening or predicting disease prevalence in broad, non-clinical populations. It is designed for use with patients already presenting symptoms or those in a clinical setting.
By understanding these limitations and potential misuse scenarios, users can ensure that the model is applied appropriately and ethically in relevant healthcare contexts.
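One way to guard against the incomplete or improperly formatted input described above is to validate the matrix before calling the model. The helper below is a sketch under assumptions: the function name and the `n_features` argument are hypothetical, and the expected column count has to come from the actual training schema.

```python
import numpy as np

def validate_binary_matrix(X, n_features):
    """Raise ValueError unless X is a 2-D matrix of 0/1 values with n_features columns."""
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"Expected a 2-D matrix, got {X.ndim} dimension(s)")
    if X.shape[1] != n_features:
        raise ValueError(f"Expected {n_features} columns, got {X.shape[1]}")
    if X.dtype.kind not in "biuf":
        # Strings, objects, and None values are rejected here.
        raise ValueError(f"Expected numeric values, got dtype {X.dtype}")
    if not ((X == 0) | (X == 1)).all():
        # NaN also fails this check, so missing values are caught.
        raise ValueError("All values must be 0 or 1")
    return X.astype(np.int8)

checked = validate_binary_matrix([[1, 0, 1], [0, 0, 1]], n_features=3)
print(checked.shape)
```

Rejecting bad rows before prediction keeps a formatting error from being reported as a diagnosis.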
Training Details
Training Data
The training data used for this model consists of a custom dataset with binary encoded medical features. Each row in the dataset represents a patient's symptoms encoded as binary values, and the corresponding label represents the diagnosed disease. The dataset includes a wide range of medical conditions, with the aim of providing a comprehensive diagnostic tool.
Source of Data:
The dataset was compiled from https://huggingface.co/datasets/duxprajapati/symptom-disease-dataset on Hugging Face and then relabeled using Smabbler's QueryLab platform to ensure an accurate representation of data labels for common and rare diseases.
Pre-processing:
Pre-processing is crucial to building an accurate machine learning model and to ensuring that it is reliable enough for use in the medical domain. It involves data cleaning, which is labor-intensive: extensive manual consistency checks and iterative validation are needed to keep the final dataset at high quality. These processes are particularly complex when dealing with medical data.
Here the data was pre-processed to ensure consistency and accuracy. This involved cleaning the data, handling missing values, and normalizing the binary encoding. Each symptom was converted into a binary feature (0 or 1), indicating its absence or presence respectively. The labels were mapped to specific diseases using a detailed mapping file to ensure accurate representation.
Smabbler simplified pre-processing by providing automated labeling, which reduced manual effort, ensured consistency, and maintained high accuracy in the pre-processed dataset, making it a crucial asset in building a reliable disease diagnostic model.
The data cleaning process, which would otherwise have been labor-intensive and time-consuming, was significantly expedited by Smabbler's tools and features. The platform's automation, standardization, and validation capabilities made the pre-processing not only quicker but also more reliable and accurate.
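The cleaning and binarization steps described above could look roughly like the following pandas sketch. It does not reproduce the actual Smabbler-based pipeline. The column names, the toy rows, and the choice to treat a missing symptom value as "absent" (0) are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw data; the column names and values are hypothetical.
raw = pd.DataFrame({
    "fever": [1, 0, None, 1],
    "cough": [0, 1, 1, 0],
    "label": [3, 7, 7, 3],
})

def clean_binary_features(df, label_col="label"):
    """Drop unusable rows and coerce every symptom column to 0/1 integers."""
    # Rows without a label cannot be used for training; exact duplicates add nothing.
    df = df.dropna(subset=[label_col]).drop_duplicates()
    feature_cols = [c for c in df.columns if c != label_col]
    # Assumption: a missing symptom value is treated as "absent".
    features = df[feature_cols].fillna(0)
    # Any non-zero value counts as "present".
    features = (features != 0).astype("int8")
    return pd.concat([features, df[label_col].astype(int)], axis=1)

clean = clean_binary_features(raw)
print(clean)
```

Whether missing values should default to 0, be imputed differently, or cause the row to be dropped is a modeling decision that the card does not specify.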
Label Mapping:
The labels in the dataset correspond to various diseases. A mapping file (mapping.json) was used to translate encoded labels to human-readable disease names. Top labels include diseases like Psoriasis, Malaria, Bronchial Asthma, Dengue, Arthritis, Heart Attack, and many more.
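To illustrate the label translation described above, the sketch below decodes integer predictions into disease names. The structure of `mapping.json` is not shown in this card, so the JSON text here is a hypothetical stand-in that assumes string-encoded integer keys mapped to disease names.

```python
import json

# Hypothetical stand-in for the contents of mapping.json (structure is assumed).
mapping_text = '{"0": "Psoriasis", "1": "Malaria", "2": "Dengue"}'

# JSON object keys are always strings, so convert them back to integer labels.
label_to_name = {int(code): name for code, name in json.loads(mapping_text).items()}

def decode_predictions(predictions, mapping):
    """Translate encoded labels to disease names, flagging codes missing from the mapping."""
    return [mapping.get(int(code), f"unknown label {code}") for code in predictions]

names = decode_predictions([1, 0, 2], label_to_name)
print(names)
```

With the real file, `mapping_text` would be replaced by reading `mapping.json` from disk, for example with `json.load`.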
Additional Documentation:
Detailed documentation on data pre-processing and filtering steps is provided to ensure reproducibility and transparency. The dataset card includes information on the data sources, pre-processing steps, and any additional filtering or transformations applied.
Training Procedure
The training procedure for this model involves several key steps to ensure robust and accurate disease diagnosis using Random Forest and SVM classifiers. Below are the detailed steps and technical specifications related to the training procedure:
Data Splitting:
The dataset was split into training and testing sets using an 80-20 split ratio. The training set was used to train the classifiers, while the testing set was used to evaluate the model’s performance.
Feature Selection:
Binary encoded features representing the presence or absence of symptoms were selected as input features. The target variable was the disease label, which was mapped from encoded integers to human-readable disease names.
Model Initialization:
Two classifiers were initialized: Random Forest Classifier and Support Vector Machine (SVM) Classifier. Both classifiers were initialized with default parameters and a fixed random state to ensure reproducibility.
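The splitting and initialization steps above can be sketched with scikit-learn as follows. The data here is synthetic, and the seed value 42 is an assumption, since the card only states that a fixed random state was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 200 patients, 10 binary symptoms, 4 disease labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = rng.integers(0, 4, size=200)

# 80-20 split; test_size=0.2 holds out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters with a fixed seed (the seed value is an assumption).
rf_model = RandomForestClassifier(random_state=42)
svm_model = SVC(random_state=42)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` in both the split and the classifiers makes repeated runs produce the same partition and the same fitted forest.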
Training the Models:
Random Forest Classifier: The Random Forest model was trained on the training data using the fit method. Hyperparameters such as the number of trees and depth were tuned to optimize performance.
SVM Classifier: The SVM model was similarly trained using the fit method. Kernel type, regularization parameters, and other hyperparameters were adjusted for optimal classification.
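The card does not say how the hyperparameters were tuned. One common approach, shown here purely as an assumption, is a cross-validated grid search with `GridSearchCV`. The parameter grids and the synthetic data below are illustrative, not the values used for this model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic training data: labels depend on the first two symptoms so they are learnable.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(160, 10))
y_train = 2 * X_train[:, 0] + X_train[:, 1]

# Illustrative grids over the hyperparameters named in the text.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
svm_search = GridSearchCV(
    SVC(random_state=42),
    {"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]},
    cv=3,
)

# fit() runs the search and then refits the best configuration on all training rows.
rf_search.fit(X_train, y_train)
svm_search.fit(X_train, y_train)

rf_model = rf_search.best_estimator_
svm_model = svm_search.best_estimator_
print(rf_search.best_params_, svm_search.best_params_)
```

Tuning inside cross-validation on the training set keeps the held-out test set untouched until the final evaluation.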
Evaluation
The performance of both models was evaluated on the testing set. Metrics such as accuracy, precision, recall, and F1-score were calculated to assess model performance. Confusion matrices were generated to visualize the performance of each classifier in predicting the correct disease labels.
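The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them for a Random Forest trained on synthetic data, so the printed numbers say nothing about this model's actual performance. The same calls apply to the SVM.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; labels depend on the first two symptoms.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = 2 * X[:, 0] + X[:, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall accuracy plus per-class precision, recall, and F1-score.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)

print(f"accuracy: {accuracy:.3f}")
print(report)
print(cm)
```

Off-diagonal entries in the confusion matrix show which diseases are mistaken for one another, which an aggregate accuracy figure hides.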
Results
Summary
This model utilizes both Random Forest and SVM classifiers to accurately diagnose a variety of diseases based on binary encoded medical features. The training involved data pre-processing, feature selection, model training, and extensive evaluation to ensure reliability. Designed for healthcare applications, it aids professionals in making informed diagnostic decisions efficiently.
Model Card Authors
Priyanka Kamila