Contraceptive method choice among women varies across the world. Of the 1.9 billion women of reproductive age (15-49 years) worldwide in 2019, 1.1 billion had a need for family planning.
A woman's experience or awareness of the side effects and inconveniences of specific contraceptive methods, and of how effectively those methods prevent pregnancy, plays a role in which method she chooses.
At the population level, several factors contribute to this heterogeneous distribution, ranging from access to different contraceptive methods to the level of knowledge about contraceptives.
Because of the ethical restrictions surrounding clinical data, access to detailed data on the factors that influence women's contraceptive use is limited. The Contraceptive Method Choice Dataset compiles data collected in the 1987 National Indonesia Contraceptive Prevalence Survey. This article demonstrates how to use a decision tree to predict a woman's contraceptive usage from socio-economic factors.
#Loading and Setting up the data
Import pandas, NumPy, and the scikit-learn preprocessing helpers, load the cmc.data file with pandas, and then assign names to the columns.
#set up
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
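#Note: cmc.data has no header row, so read_csv uses the first record as column names (passing header=None would keep that record as data)
df = pd.read_csv('cmc.data')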
df.head()
| | 24 | 2 | 3 | 3.1 | 1 | 1.1 | 2.1 | 3.2 | 0 | 1.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 1 | 3 | 10 | 1 | 1 | 3 | 4 | 0 | 1 |
| 1 | 43 | 2 | 3 | 7 | 1 | 1 | 3 | 4 | 0 | 1 |
| 2 | 42 | 3 | 2 | 9 | 1 | 1 | 3 | 3 | 0 | 1 |
| 3 | 36 | 3 | 3 | 8 | 1 | 1 | 3 | 2 | 0 | 1 |
| 4 | 19 | 4 | 4 | 0 | 1 | 1 | 3 | 3 | 0 | 1 |
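The column labels in the output above (24, 2, 3, 3.1, ...) are really the first record of the file, which pandas has treated as a header. The following is a minimal sketch of one way to keep that record, assuming the file has no header line; it uses an in-memory copy of the first three records shown above rather than the real file, and the variable names are illustrative only.

```python
import io

import pandas as pd

# First three records of cmc.data as they appear in the output above
# (the first one shows up there as the mangled column header).
sample = io.StringIO(
    "24,2,3,3,1,1,2,3,0,1\n"
    "45,1,3,10,1,1,3,4,0,1\n"
    "43,2,3,7,1,1,3,4,0,1\n"
)

# Column names used later in this article.
column_names = [
    "age", "wife_education", "husband_education", "no#_children",
    "religion", "working", "husband_occupation", "std_index",
    "media_exposure", "contraceptive_method",
]

# Default behaviour: the first record is consumed as the header.
default_read = pd.read_csv(sample)

# Rewind the buffer and read again, telling pandas there is no header.
sample.seek(0)
explicit_read = pd.read_csv(sample, header=None, names=column_names)

print(default_read.shape)   # (2, 10): one record lost to the header
print(explicit_read.shape)  # (3, 10): all records kept
```

Applied to the real file, the same two arguments would retain the first record and make the later column-renaming step unnecessary.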
#Assigning column names to the dataset.
df.columns=['age','wife_education','husband_education','no#_children',
'religion', 'working', 'husband_occupation','std_index','media_exposure',
'contraceptive_method']
df.shape
(1472, 10)
#Mapping values to the coded dataset, and storing in a new dataframe
wife_education_mapping ={1:"low", 2:"medium-low", 3:"medium-high", 4:"high"}
husband_education_mapping = {1:"low", 2:"medium-low", 3:"medium-high", 4:"high"}
religion_mapping ={0:"Non-Islam", 1:"Islam"}
working_mapping={0:"Yes", 1:"No"}
husband_occupation_mapping ={1:"low", 2:"medium-low", 3:"medium-high", 4:"high"}
std_index_mapping ={1:"low", 2:"medium-low", 3:"medium-high", 4:"high"}
media_exposure_mapping = {0:"Good", 1:"Not good"}
#contraceptive_method_mapping ={1:"No use", 2:"Long term", 3:"Short term"}
df['wife_education_label'] = df['wife_education'].map(wife_education_mapping)
df['husband_education_label'] = df['husband_education'].map(husband_education_mapping)
df['religion_label'] = df['religion'].map(religion_mapping)
df['working_label'] = df['working'].map(working_mapping)
df['husband_occupation_label'] = df['husband_occupation'].map(husband_occupation_mapping)
df['std_index_label'] = df['std_index'].map(std_index_mapping)
df['media_exposure_label'] = df['media_exposure'].map(media_exposure_mapping)
new_df = df[['age','wife_education_label','husband_education_label','no#_children','religion_label','working_label',
'husband_occupation_label','std_index_label','media_exposure_label','contraceptive_method']]
new_df.shape
(1472, 10)
##Viewing the data types
new_df.dtypes
age int64
wife_education_label object
husband_education_label object
no#_children int64
religion_label object
working_label object
husband_occupation_label object
std_index_label object
media_exposure_label object
contraceptive_method int64
dtype: object
#Data Transformation
##One-hot encode the categorical columns
transformer = make_column_transformer((OneHotEncoder(),['wife_education_label','husband_education_label','religion_label','working_label',
'husband_occupation_label','std_index_label','media_exposure_label']), remainder = 'passthrough')
transformed = transformer.fit_transform(new_df)
#get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out(), which produces different column names
transformed_new_df = pd.DataFrame(transformed, columns = transformer.get_feature_names())
#print(transformed_new_df.head())
transformed_new_df.dtypes
onehotencoder__x0_high float64
onehotencoder__x0_low float64
onehotencoder__x0_medium-high float64
onehotencoder__x0_medium-low float64
onehotencoder__x1_high float64
onehotencoder__x1_low float64
onehotencoder__x1_medium-high float64
onehotencoder__x1_medium-low float64
onehotencoder__x2_Islam float64
onehotencoder__x2_Non-Islam float64
onehotencoder__x3_No float64
onehotencoder__x3_Yes float64
onehotencoder__x4_high float64
onehotencoder__x4_low float64
onehotencoder__x4_medium-high float64
onehotencoder__x4_medium-low float64
onehotencoder__x5_high float64
onehotencoder__x5_low float64
onehotencoder__x5_medium-high float64
onehotencoder__x5_medium-low float64
onehotencoder__x6_Good float64
onehotencoder__x6_Not good float64
age float64
no#_children float64
contraceptive_method float64
dtype: object
## Convert contraceptive method from a float to a category data type so it can serve as the classification target
s= transformed_new_df.loc[:,('contraceptive_method')].astype('category')
transformed_new_df.insert(len(new_df.columns),'contraceptive_method_label',s.values)
transformed_new_df = transformed_new_df.drop('contraceptive_method', axis=1)
## Convert age, no#_children to integer data type
transformed_new_df[['age','no#_children']]= transformed_new_df[['age','no#_children']].astype('int')
transformed_new_df.dtypes
onehotencoder__x0_high float64
onehotencoder__x0_low float64
onehotencoder__x0_medium-high float64
onehotencoder__x0_medium-low float64
onehotencoder__x1_high float64
onehotencoder__x1_low float64
onehotencoder__x1_medium-high float64
onehotencoder__x1_medium-low float64
onehotencoder__x2_Islam float64
onehotencoder__x2_Non-Islam float64
contraceptive_method_label category
onehotencoder__x3_No float64
onehotencoder__x3_Yes float64
onehotencoder__x4_high float64
onehotencoder__x4_low float64
onehotencoder__x4_medium-high float64
onehotencoder__x4_medium-low float64
onehotencoder__x5_high float64
onehotencoder__x5_low float64
onehotencoder__x5_medium-high float64
onehotencoder__x5_medium-low float64
onehotencoder__x6_Good float64
onehotencoder__x6_Not good float64
age int32
no#_children int32
dtype: object
#Implementing Decision Trees
#Preparing data for modeling
inputs = transformed_new_df.drop('contraceptive_method_label', axis='columns')
target = transformed_new_df['contraceptive_method_label']
from sklearn.model_selection import train_test_split
X = inputs
y = target
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)
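The split above is drawn at random on every run, so the accuracy figures reported later will vary between runs. A possible variation, sketched here on synthetic placeholder data rather than the survey data, fixes random_state for reproducibility and passes stratify so that the three classes keep similar proportions in both parts. The array names, class counts, and seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows, 4 features, 3 imbalanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.repeat([1, 2, 3], [130, 70, 100])

# Same test_size as the article, plus a fixed seed and stratification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42
)

# Share of each class in the training and test parts.
train_share = np.array([(y_tr == c).mean() for c in (1, 2, 3)])
test_share = np.array([(y_te == c).mean() for c in (1, 2, 3)])
print("train class shares:", train_share.round(3))
print("test class shares: ", test_share.round(3))
```

With a fixed seed, repeating the call returns the same rows in each part, which makes the reported scores comparable across runs.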
#Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
#Evaluating the Decision Tree Classifier on the training and test sets
from sklearn.metrics import accuracy_score
from sklearn import tree
#accuracy_score expects the true labels first, then the predictions
a = accuracy_score(y_train, clf.predict(X_train))
b = accuracy_score(y_test, predictions)
print("Training accuracy of the model was at:",a)
print("Validation accuracy of the model was at:",b)
Training accuracy of the model was at: 0.9665314401622718
Validation accuracy of the model was at: 0.5061728395061729
One-hot encoding is a powerful technique for handling categorical data, but it increases dimensionality and sparsity, which can contribute to overfitting, so it should be used with care. Our decision tree reached a training accuracy of 96.65% but a test accuracy of only 50.62%. A gap this large is a sign of overfitting, which occurs when a model fits the training data so closely that it becomes less accurate on new data. No remedies such as pruning were applied here, so the tree was free to grow until it had nearly memorized the training set. Predicting contraceptive method choice is a task in which many factors must be considered: more variables need to be included, and they must be interpreted in the correct context.
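To make the pruning remark concrete, here is a minimal sketch of cost-complexity pruning with scikit-learn's ccp_alpha parameter. It was not part of the original analysis, and it runs on synthetic three-class data from make_classification rather than the survey data, so the numbers it prints say nothing about the contraceptive dataset. The candidate alphas, the 5-fold cross-validation, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy three-class data standing in for the encoded features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    n_classes=3, flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=0
)

# Unpruned tree, grown until the training data is fit perfectly.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Candidate pruning strengths taken from the pruning path of the training data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))  # skip the root-only alpha
alphas = alphas[:: max(1, len(alphas) // 20)]               # keep roughly 20 candidates

# Pick the alpha with the best mean cross-validated accuracy on the training part.
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X_tr, y_tr, cv=5
    ).mean()
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(cv_scores))])

pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_tr, y_tr)

for name, model in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(
        name,
        "nodes:", model.tree_.node_count,
        "train:", round(model.score(X_tr, y_tr), 3),
        "test:", round(model.score(X_te, y_te), 3),
    )
```

Pruning trades training accuracy for a smaller tree; whether it also improves test accuracy depends on the data, which is why the alpha is chosen by cross-validation on the training part and the test set is used only for the final comparison.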