top of page

Medical Appointments No Show EDA in SAS

Medical Appointment Status CMS Analysis in SAS

In this project, I tried to analyze why some patients did not show up for his/her medical appointment and whether there are reasons for that using the data at hand. I tried to find some correlation between different attributes the data has and if they have any effect on whether the patient shows up or not. The dataset I used contains 110527 medical appointments and its 14 associated variables ( PatientId, AppointmentID, Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism, Handcap, SMS_received, No-show ).


Data Dictionary


01 — PatientId — Identification of a patient

02 — AppointmentID — Identification of each appointment

03 — Gender — Male or Female . Female is the greater proportion, woman takes way more care of they health in comparison to man.

04 — Appointmentday — The day of the actual appointment, when they have to visit the doctor.

05 — Sheduledday — The day someone called or registered the appointment, this is before appointment of course.

06 — Age — How old is the patient.

07 — Neighbourhood — Where the appointment takes place.

08 — Scholarship — True of False (1, 0)

09 — Hipertension — True or False (1, 0)

10 — Diabetes — True or False (1, 0)

11 — Alcoholism — True or False (1, 0)

12 — Handcap — True or False (1, 0)

13 — SMS_received — 1 or more messages sent to the patient.

14 — No-show — True or False (1, 0)


Questions I tried to answer -

  • What is the percentage of no-shows?

  • What are the factors which may affect if a patient will show up for their scheduled appointment or not?

  • Is there any effect of Gender on if a patient will show up for the booked appointment or not?

  • Are patients with scholarship more likely to miss their appointment?

  • Are patients who don’t receive sms more likely to miss their appointment?

  • Does the time difference between the scheduling and appointment have any impact on whether a patient will show up for their appointment or not?

  • Does a patient's age affect whether a patient will show up or not?

  • What are five neighborhoods with the highest number of patients missing their appointments from all the neighborhoods.?

  • On which weekdays people don’t show up most often?



A — SETUP


  1. After downloading the file in .CSV format, I uploaded the datafile to the SAS Studio folder created called 'Medical Appointment Data’. Then I used the PROC IMPORT procedure to read the .csv file into the SAS environment.


proc import datafile='Medical_Apt..csv' dbms=csv

out=EDA_PROJECT;

run;





B — DATA WRANGING


2. Checking the data to get the details of the variables in the dataset.


I could have used the PROC PRINT procedure to check the names of the variables in the data but it will show only the names of the columns and will print the whole dataset by default unless I use OBS= option to limit the number of observations printed. The PROC PRINT will not be giving out the type of data for each variable and the length of each variable.


So, Instead of PROC PRINT, I used the PROC CONTENTS to get the detailed information about data including the number of observations = 110527, Variables = 14 along with length and types of variables in the data set. The VARNUM option shows the variables in the order of their logical position in the data set.


Proc contents data=eda_project varnum;

run;




3. Check descriptive statistics of the numerical variables in the data.


I used PROC MEANS procedure to show the mean values for all the numeric variables in the data by default.

*A zero value for the following variables means -

Scholarship = 0 means no scholarship

Hipertension = 0 No Hipertension (as spelt in the original dataset)

Diabetes = 0 No Diabetes

Alcoholism = 0 No Alcoholism

Handicap = 0 No Handicap

SMS_received means = 0 - No sms received 1 — sms received


proc means data=eda_project;

run;



4. Check missing values in the data set


proc means data=eda_project nmiss;

run;




There are no Missing values in the data as evident from the PROC MEANS output.



NOTES ON DATA EXPLORATION

It is evident from the info we got from the initial analysis that there are some columns that need to have their type corrected like dates. Another finding is that our data doesn’t have any missing values. Most of the duplicate values are for the same person for a future appointment. Also, the column no-show can be a bit confusing with answers ‘Yes’ and ‘No’ so I inverted the values in data cleaning step to make it more intuitive (renamed it to ‘SHOW’ instead of ‘No-Show’) and also turned it into integer instead of yes or no.



C — DATA CLEANING

  • Convert No-Show to Show with integer values

  • Getting the date only part from datetime values

  • Rename hipertension column to Hypertension

  • Create a new column day_diff for difference between scheduling and appointment days

  • Drop patientid and appointmentid columns


6. Changing the name of variable ‘No-show’ to ‘Show’ and changing the label from No-Show to Show.


data apt_data1 (rename=('No-Show'n = Show));

set eda_project;

label 'No-Show'n='Show';

run;

Inverting No-Show → Show values with integer values.

data apt_data2;

set apt_data1;

if show = 'No' then Show = 1;

else if show = 'Yes' then Show = 0;

run;


/*to check the names of the variables*/


proc print data=apt_data1 (obs=5);

run;




7. Get the date only part out of datetime values of scheduledday and appointmentday in the data.


  • Rename hipertension column to Hypertension

  • Create a new column for difference between scheduling and appointment days

  • Drop patientid and appointmentid columns


data EDA (drop=patientid appointmentid);

set apt_data2 (rename=(Hipertension=Hypertension));

drop scheduledday appointmentday;

Schld_date = datepart(scheduledday);

Apt_date = datepart(Appointmentday);

format schld_date apt_date date9.;

day_diff = (apt_date - schld_date);

run;




8. Checking the data one last time


Proc contents data=eda;

Run;




NOTES ON DATA CLEANING

Now that we have our data cleaned and with the proper type for every column. We got the datepart out of datetime values in the data, renamed the hipertension variable to Hypertension and also created a new day_diff column. Now, we can analyze the data at hand and try to find the correlation between different variables and the show column.


D — EXPLORATORY DATA ANALYSIS


9. I used PROC FREQ procedure to get the percentage of Show and create a bar graph.


ods graphics on;

proc freq data=work.eda;

tables show/ nocum plots=freqplot(type=bar scale=percent);

run;

ods graphics off;




10. Is gender related to whether a patient will be there at the scheduled appointment or not.?


ods graphics on;

proc freq data=eda;

tables show*gender/ plots=freqplot(twoway=stacked orient=Horizontal);

run;

ods graphics off;





Females who missed their appointment — 14594/110527 = 0.132 * 100 = 13.2

Males who missed their appointment — 7725/110527 = 0.0699 * 100 = 6.99


Findings — As seen from the results, percentage of Females missing their appointment is double the number of the males missing their appointment. Based on findings we can say that females are more likely to miss their appointment.


11. Are patients with scholarships more likely to miss their appointment?


ods graphics on;

proc freq data=eda;

tables show*scholarship/ plots=freqplot(twoway=stacked orient=vertical);

run;

ods graphics off;





Patients without scholarship who show up for their appointment — 79925/99666 = 0.8019 * 100 = 80.19

Patients with scholarship who show up for their appointment — 8283/10861 = 0.7626 * 100 = 76.26


Finding — It seems that patients with scholarships are actually more likely to miss their appointment.


12. Are patients with hypertension more likely to miss their appointment?


ods graphics on;

proc freq data=eda;

Tables show*hypertension/ plots=freqplot(twoway=stacked orient=Horizontal);

run;

ods graphics off;




The patients who do not have hypertension and showed up for their appointment — 70179/88208=0.7956

Patients with hypertension who showed up for their appointments — 18029/21801=0.827


Finding — It seems that patients with hypertension are actually more likely to show up for their appointment




13. Are patients who don't receive SMS more likely to miss their appointment?


ods graphics on;

proc freq data=eda;

tables show*sms_received/ plots=freqplot(twoway=stacked orient=Horizontal);

run;

ods graphics off;




Patients who did not receive sms and showed up for their appointment — 62510/75045=0.833

Pateints who received sms and showed up for their appointment — 25698/35482 = 0.7243


Finding — A strange finding here suggests that patients who received an SMS are more likely to miss their appointment.




14. Is the time difference between the scheduling and appointment related to whether a patient will show up for an appointment or not?


data day_cat;

set eda;

length apt_name $ 16;

if day_diff <= 0 then day_diff2 = 'Same Day';

else if day_diff <= 4 then day_diff2 = 'Few Days';

else if day_diff > 4 and day_diff <= 15 then day_diff2 = 'More than 4';

else day_diff2 = 'More than 15';

run;


proc freq data=day_cat;

tables day_diff2/nocum;

run;


ods graphics on;

proc freq data=day_cat;

tables show*day_diff2 /plots=freqplot(twoway=grouphorizontal orient=vertical);

run;

ods graphics off;




No Show Percentages as per difference in dates of scheduling an appointment and actual appointment day

Few day — 4601/22319 = 0.2061 *100 = 20.61

More than 15 days — 8579/22319= 0.3844 *100 = 38.44

More than 4 days = 7333/22319 = 0.3286 *100 = 32.86

Same day = 1806/22319 = 0.0809 *100 = 8.09


Finding — It appears that the longer the period between the scheduling date and appointment date the more likely is that the patient won’t show up for their appointment.


15. Is there an effect of Age if the patient will turn up for their appointment or not.?


proc univariate data=eda;

   class show;

   var age ;

   histogram age/ overlay;

run;




Finding — There is no clear correlation between age and whether the patient shows up for an appointment or not but it is evident from the histogram above that younger patients between the age of 6 years and 42 years are likely to miss their appointments.


16. Which are the neighborhood where patients more than 500 are missing their appointments ?


I subset the data with No-Show values only and then used Bar chart to show the frequencies of neighborhoods with the number of no shows in each neighborhood with a reference line at 500 to show the neighborhoods with number of patients more than 500 who did not show up for their appointments. It gives us 22319 observations with No-Show.


data No_Show;

set eda;

if show = 0;

run;

ods graphics / reset width=14in height=6in imagemap;


proc sgplot data=WORK.NO_SHOW;

vbar Neighbourhood / group=Show groupdisplay=cluster

 fillattrs=(transparency=0.25) datalabelfitpolicy=none;

yaxis grid;

refline 508 / axis=y lineattrs=(thickness=2 color=Red) label

 labelattrs=(color=Red);

keylegend / location=inside;

run;


ods graphics / reset;




As evident from the graph above — The following neighborhoods are the ones where more than 500 patients missed their appointment. Jardim Camburi tops the chart with 1465 missed appointments.


  1. Jardim Camburi — 1465

  2. Maria Ortiz — 1219

  3. ITARARE — 923

  4. Resistencia — 906

  5. Centro — 703


17. On which weekdays people don’t show up most often?


/* After Changing the name of variable 'No-show' to 'Show' and changing the

label from No-Show to Show */

data apt_data1 (rename=('No-Show'n = Show));

set eda_project;

label 'No-Show'n='Show';

run;


/* Invered No-Show - > Show values with integer values  */


data apt_data2;

set apt_data1;

if show = 'No' then Show = 1;

else if show = 'Yes' then Show = 0;

run;


/*7. Getting the date only part out of datetime values of scheduledday and

appointmentday in data  */

/* Create a new column for difference between scheduling and appointment */


data EDA (drop=patientid appointmentid);

set apt_data2 (rename=(Hipertension=Hypertension));

drop scheduledday appointmentday;

Schld_date = datepart(scheduledday);

Apt_date = datepart(Appointmentday);

format schld_date apt_date date9.;

day_diff = (apt_date - schld_date);

run;


/*7. After changing the dataset by steps mentioned above, I changed the date

to weekday to checon which dates the patients missed most of their

appointments,

I also changed the numeric weekdays to weekday names using if then statement*/


data weekdays;

set eda;

apt_day =weekday(apt_date);

if apt_day = 1 then week_day = 'Sun';

else if apt_day = 2 then week_day = 'Mon';

else if apt_day = 3 then week_day = 'Tues';

else if apt_day = 4 then week_day = 'Wednes';

else if apt_day = 5 then week_day = 'Thurs';

else if apt_day = 6 then week_day = 'Fri';

else week_day = 'Satr';

run;


*/Bar char*/


title "Weekdays On Which Most of the Appointments Were Missed ";

proc sgplot data = weekdays;

   vbar week_day/ group=show groupdisplay=cluster stat=freq  ;

run;




We can see that most of the appointments were missed on Tuesdays and Wednesdays. Surprisingly, Most appointments for which the patient showed up also fall on Tuesday and Wednesdays.

** SAS shows days of week as digits where 1 = Sunday, 2 = Mondays and so on where 7 = Saturday.



CONCLUSIONS


After analyzing the dataset here are some findings:

  1. Percentage of patients who didn’t show up for their appointment is 20.19%.

  2. The percentage of females missing their appointment is nearly two times the number of males. So females are more likely to miss their appointment.

  3. It appears that the longer the period between the scheduling and appointment the more likely the patient won’t show up.

  4. It seems that patients with scholarships are actually more likely to miss their appointment.

  5. A strange finding here suggests that patients who received an SMS are more likely to miss their appointment.

  6. There is no clear relation between the age and whether the patients show up or not but patients between 6 and 42 years are more likely to miss their appointments.

  7. Tuesday and Wednesdays are the days when most of the appointments were missed.


Analysis Shortcoming & Data Limitations

  1. The data doesn’t state the exact hour of the appointment which would have been very useful to try to find out which hours have the most missing appointments and which doesn’t. It could also be very useful to know the difference between scheduling and the appointment since many of the scheduling are on the same day.

  2. The data doesn’t state if any day is a vacation or not which can indicate if people tend to miss their appointments more on working days.

  3. The age column had a negative value but according to the data creator, it means a baby not born yet (a pregnant woman).

  4. When calculating the day difference between the scheduling and appointment days we had some negative value which makes no sense and might mean that the records of questions have wrong data.





Power in Numbers

Project Gallery

bottom of page