Fraud Detection in Healthcare using Machine Learning

Saarathi Anbuazhagan
14 min read · Dec 25, 2020


This article describes the different machine learning models used to detect fraudulent claims in healthcare insurance, and highlights the important features that help in detecting those claims.

Table of contents:

  1. Business Problem
  2. Source of Data
  3. Mapping to Machine Learning Problem
  4. Performance Metrics
  5. Exploratory Data Analysis
  6. Feature Engineering
  7. Machine Learning Models
  8. Feature Importance
  9. Comparison of Different Models
  10. Deployment of model using Heroku
  11. Future Work
  12. GitHub repository and LinkedIn
  13. References

1. Business Problem

Health insurance companies provide coverage for medical expenses to the policy holder, depending on the health insurance plan the patient has chosen and is eligible to claim under. The amounts vary with the diagnosis, the treatment the patient undergoes, and the doctor and hospital involved. Many healthcare providers settle huge amounts for patients. But some insured individuals or health service providers attempt to make fake claims by giving false claim details: showing fake bills, submitting the same bill repeatedly, undergoing treatments that were not actually necessary for the diagnosed disease, and so on. This is considered a medical crime.

So insurance companies unknowingly end up paying benefits for fake claims, face problems in paying benefits for genuine claims, and are heavily impacted by these bad practices.

In 2016, the US spent $3.4 trillion on health care. Of this, fraud, waste and abuse accounted for an estimated $102 billion to $340 billion. This share has remained on the rise, up from 6 percent in 2014 to 10 percent in 2017, and experts project that health care expenditures will soar as high as $5.5 trillion by 2025.

This has become a serious issue, and insurance providers have started finding ways to overcome this global problem. The goal of this case study is to accurately classify claims as fraudulent or genuine, which can save a huge amount of money from fraud and help those in real need.

2. Source of Data

The dataset for this problem statement is taken from Kaggle. Follow the link to download the dataset.

3. Mapping to Machine Learning Problem

This is a binary classification problem. We need to classify each claim as fraudulent or genuine, and also find the important features that help in predicting fraudulent claims.

Business objectives and constraints

1. The rate of misclassification should be very low

2. No low latency requirement

4. Performance Metrics

1. Precision, Recall and F1 Score

Precision: of all the points predicted positive, how many are actually positive. This tells how precise/accurate the model's positive predictions are.

Recall: of all the actual positives, how many are predicted positive. This can be used to select our best model when there is a high cost associated with a False Negative.

For instance, in our case, if a fraudulent claim (an actual positive) is predicted as non-fraudulent (a False Negative), the consequence can be very bad. So we should aim for high recall (a low number of FNs).

For this, we will select the best threshold, which can be found using sklearn.metrics.roc_curve, to classify a point as positive or negative, and compute the F1 score, which is a function of precision and recall, at this threshold. The formula for the F1 score is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
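As a minimal sketch of this threshold-selection step (the labels and scores below are made up purely for illustration, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import roc_curve, f1_score

# Hypothetical ground-truth labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3])

# roc_curve returns one candidate threshold per operating point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Pick the threshold that maximizes F1 on this (validation) data
f1_per_threshold = [f1_score(y_true, (y_scores >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1_per_threshold))]

# Classify points using the chosen threshold
y_pred = (y_scores >= best_threshold).astype(int)
```

In practice the threshold would be chosen on validation data and then applied unchanged to the test set.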

2. Macro F1 score

As there is a slight class imbalance (64:36) in the dataset, using the macro average is useful. It is calculated by giving equal importance to both classes.

‘Macro’ calculates the F1 per class and averages them without class weights: (F1class1 + F1class2) / 2. This results in a bigger penalization when your model does not perform well on the minority class.

Precision and recall are calculated for each class, an F1 score is computed per class, and the macro F1 score is the unweighted average of these per-class F1 scores.
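A quick sketch with sklearn (toy labels for illustration) shows how the macro average treats both classes equally regardless of their frequency:

```python
from sklearn.metrics import f1_score

# Toy labels for a slightly imbalanced binary problem
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

# Macro averaging computes F1 per class, then takes the unweighted mean
macro_f1 = f1_score(y_true, y_pred, average='macro')

# Equivalent to averaging the per-class F1 scores by hand
f1_class1 = f1_score(y_true, y_pred, pos_label=1)
f1_class0 = f1_score(y_true, y_pred, pos_label=0)
```

Here the minority class (0) drags the macro score down even though the majority class is predicted well.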

5. Exploratory Data Analysis

1. Dataset analysis:

Outpatient data:

❖ Bene ID, Claim ID, Provider: unique ID of the beneficiary, unique ID of each claim, unique ID of the healthcare provider
❖ Claim Start Date and Claim End Date: date the claim was submitted and date the claim was reimbursed
❖ Attending physician, Operating physician, Other physician: IDs of the doctors who attended, operated on, or otherwise treated the patient
❖ Claim Diagnosis Code 1–10, Claim Admit Diagnosis Code: each code specifies particular symptoms, a disease, etc. (codes 4–10 are mostly NA)
❖ Claim Procedure Code 1–6: each code specifies a procedure conducted on the patient (codes 5 and 6 are all NA)
❖ Insurance Claim Amt Reimbursed, Deductible Amt Paid: money settled to the beneficiary, and money the patient must pay the provider before applying for the claim

Inpatient data:

❖ Bene ID, Claim ID, ​Claim Start Date and Claim End Date, Provider, Insurance Claim Amt Reimbursed, Attending physician, Operating physician,
Other physician, Claim Diagnosis Code , Claim Admit Diagnosis Code, Claim Procedure Code, Deductible Amt Paid
❖ Admission date and Discharge date: dates the patient was admitted and discharged
❖ Diagnosis Group Code: classifies hospital cases into certain groups

Beneficiary data:

❖ DOB, DOD: date of birth and date of death of the beneficiaries
❖ Gender, State, County, Race: gender, state, county and race of the beneficiaries
❖ No of Months Part A Coverage, No of Months Part B Coverage: number of months of coverage for treatment
❖ RenalDiseaseIndicator, ChronicCond_Alzheimer, ChronicCond_Heartfailure, ChronicCond_KidneyDisease, ChronicCond_Cancer, ChronicCond_ObstrPulmonary, ChronicCond_IschemicHeart, ChronicCond_Osteoporasis, ChronicCond_rheumatoidarthritis, ChronicCond_stroke, ChronicCond_Depression, ChronicCond_Diabetes:
whether the patient was previously diagnosed with each of these diseases or not
❖ IP Annual Reimbursement Amt, OP Annual Reimbursement Amt, IP Annual Deductible Amt, OP Annual Deductible Amt: annual amounts reimbursed to the beneficiary as an inpatient and as an outpatient, and the annual deductible amounts the patient must pay the provider before applying for a claim

Target:

❖ Provider: healthcare provider unique ID
❖ Potential Fraud: whether claim is fraud or not

2. Univariate analysis — Analyzing Attending Physician feature with bar plot for outpatient/inpatient

Attending Physician feature — outpatient/inpatient

Conclusion:

  1. The most frequent attending physician for outpatients is PHY330576 with 2534 cases, and for inpatients PHY422134 with 386 cases.
  2. The more cases a physician has attended, the more likely that physician is involved in fraudulent claims.
  3. Physicians who attended only very few cases are mostly associated with non-fraudulent claims.
  4. The top 5 attending physicians for outpatients are PHY330576, PHY350277, PHY412132, PHY423534, PHY314027, and the top 5 for inpatients are PHY422134, PHY341560, PHY315112, PHY411541, PHY362864.

3. Univariate analysis — Analyzing Provider feature with bar plot for outpatient/inpatient

Provider feature — outpatient/inpatient

Conclusion:

  1. The top provider for outpatients is PRV51459 with 8240 cases, and for inpatients PRV52019 with 516 cases.
  2. The top 93 outpatient providers and top 83 inpatient providers have made fraudulent claims, indicating that top providers have a higher probability of committing fraud.
  3. Providers who made only non-fraudulent claims cover only a few cases each.
  4. The top 5 providers for outpatients are PRV51459, PRV53797, PRV51574, PRV53918, PRV54895. The top 5 providers for inpatients are PRV52019, PRV55462, PRV54367, PRV53706, PRV55209.

4. Bivariate analysis — pair plot on Attending Physician and Provider

Pair Plot on Attending Physician and Provider

Conclusion:

  1. The top 5 physician–provider pairs that occur together, with their class labels, are: ((‘PHY330576’, ‘PRV53918’), ‘Yes’), ((‘PHY350277’, ‘PRV51567’), ‘Yes’), ((‘PHY412132’, ‘PRV53797’), ‘Yes’), ((‘PHY423534’, ‘PRV51459’), ‘Yes’), ((‘PHY314027’, ‘PRV51459’), ‘Yes’)
  2. Provider PRV51459 even occurs twice in the top 5, so that provider may well be making fraudulent claims.
  3. This shows that when a top attending physician and a top provider occur together, the claim is likely fraudulent. It is good to examine such claims in more depth.

5. Univariate analysis — Analyzing Bene ID feature with bar plot for outpatient/inpatient

Bene ID feature — outpatient/inpatient

Conclusion:

  1. The most frequent outpatient is BENE42721 with 29 claims made; for inpatients it is BENE134170 with 8 claims made.
  2. Most outpatients made between 1 and 9 claims, while most inpatients made only 1 or 2.
  3. Outpatients who have already made between 2 and 9 claims tend to make more fraudulent claims.
  4. The top 5 outpatients are BENE42721, BENE118316, BENE59303, BENE63544, BENE63504, and the top 5 inpatients are BENE134170, BENE121796, BENE117116, BENE119457, BENE62091.

6. Bivariate analysis — pair plot on Attending Physician and Bene ID

Pair Plot on Attending Physician and Bene ID

Conclusion:

  1. Number of unique physicians: 82063; number of unique beneficiaries: 138556.
  2. The top 5 physician–beneficiary pairs that occur together, with their class labels, are: ((‘PHY339042’, ‘BENE66093’), ‘Yes’), ((‘PHY313322’, ‘BENE118316’), ‘Yes’), ((‘PHY385072’, ‘BENE41087’), ‘Yes’), ((‘PHY385072’, ‘BENE26003’), ‘Yes’), ((‘PHY344367’, ‘BENE155227’), ‘Yes’)
  3. Physician PHY385072 even occurs twice in the top 5, so that physician may well be making fraudulent claims.
  4. This shows that when a top attending physician and a top-claiming patient occur together, the claim is likely fraudulent.

7. Bivariate analysis — Analyzing Attending Physician or State feature with pair plot

Claim with either Attending Physician or State feature using pair plot

Conclusion:

  1. Count of claims involving both a top-10 attending physician and a top-10 state: {‘Yes’: 6578}
  2. Count of claims involving either a top-10 attending physician or a top-10 state: {‘Yes’: 120186, ‘No’: 167821}
  3. This shows that when either a top physician or a top state (one with more cases) is involved, fraudulent claims occur almost as often as non-fraudulent ones.
  4. Around 40% of these claims are fraudulent.

8. Univariate analysis — Analyzing Diagnosis Code feature with bar plot

Top Diagnosis Code with bar plot

Conclusion:

  1. The top 10 diagnosis codes and their occurrence counts are:

{ 4019 : 77056 , 25000 : 37356 , 2724 : 35763 , V5869 : 24904 , 4011 : 23773 , 42731 : 20138 , V5861 : 20001 , 2720 : 18268 , 2449 : 17600 , 4280 : 15507 }

  2. Diagnosis code ‘4019’ has the highest occurrence count, 77056.
  3. 4.4% of patients were given diagnosis code ‘4019’.
  4. These top 10 codes are frequent, so they may be important in detecting fraudulent claims.

9. Race feature

Race vs Potential Fraud

Conclusion:

  1. The plot shows that people belonging to race category 1 occur more often than the others.
  2. This suggests that race 1 is involved in more fraudulent claims than the others.
  3. Claims from the other categories occur only rarely and follow the same distribution for fraud and non-fraud.

10. Bivariate analysis — pair plot on Insurance Claim Amt Reimbursed and IP Annual Reimbursement Amt

Pair Plot on Insurance Claim Amt Reimbursed and IP Annual Reimbursement Amt

Conclusion:

  1. Insurance Claim Amt Reimbursed and IP Annual Reimbursement Amt overlap heavily, but some fraudulent claims are made with higher amounts than non-fraudulent ones.
  2. This shows these two features can help in classifying fraudulent claims made with higher amounts.

11. number of Days Admitted vs Potential Fraud using count plot

Bivariate analysis — number of Days Admitted and Attending Physician using pair plot

number of Days Admitted vs Potential Fraud using count plot · number of Days Admitted and Attending Physician using pair plot

Conclusion:

  1. The number of days a patient was admitted is obtained from discharge date − admission date.
  2. From the count plot, patients admitted for between 2 and 7 days tend to show more fraudulent claims.
  3. From the pair plot, top attending physicians tend to make more fraudulent claims across all numbers of days admitted.
  4. All the analyses show that top attending physicians tend to be involved in fraudulent claims.

Before going into feature engineering, let us merge the outpatient, inpatient, beneficiary and target files.

Data Preprocessing

Let us do some data preprocessing and make our data clean and understandable.

1. Checking null value

After checking for null values, “Procedure Code5” and “Procedure Code6” have all values NA. Also, “Number of Months Part A Coverage” and “Number of Months Part B Coverage” have all values equal to 12, which may be due to faulty data collection. These columns can be dropped because they carry no information.

Some other columns also have NAs; we will handle them in two different ways when preparing our models: first by filling missing values with 0, and second by filling with the mode.
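A minimal sketch of the two filling strategies with pandas (the column name and values below are illustrative, not taken from the actual merged data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged claims data
df = pd.DataFrame({'DeductibleAmtPaid': [1068.0, np.nan, 10.0, np.nan, 10.0]})

# Strategy 1: fill missing values with 0
filled_zero = df['DeductibleAmtPaid'].fillna(0)

# Strategy 2: fill missing values with the column mode (most frequent value)
filled_mode = df['DeductibleAmtPaid'].fillna(df['DeductibleAmtPaid'].mode()[0])
```

Each strategy produces its own version of the dataset, and the models are trained on both to compare performance.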

2. Checking for duplicate row

The total number of claims and the number of unique claims are the same, showing there are no duplicate rows in the data.

print('num of claim :',final_data.shape[0])
print('num of unique claim:',len(final_data['ClaimID'].value_counts()))
Output: num of claim : 558211
Output: num of unique claim : 558211

3. Data handling

Now we will change some feature values into numeric.

  1. Potential Fraud has Yes and No — replaced with 1 and 0
  2. Renal Disease Indicator has Y and 0 — replaced with 1 and 0
  3. The Chronic Cond columns have 2 and 1 — replaced with 0 and 1

6. Feature Engineering

“Coming up with features is difficult, time-consuming, and requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”

— Prof. Andrew Ng.

The great thing about feature engineering is that it can boost your model's performance. Here are the features that were built; some were found to improve model performance drastically.

Frequency encoding

  1. Bene Count — number of times the patient made a claim
  2. Provider Count — number of times the provider covered a claim
  3. Attending Physician Count — number of times the physician attended any patient
  4. number of days patient admitted — [ Discharge Dt - Admission Dt ]
  5. number of days for the claim to be reimbursed — [ Claim End Dt - Claim Start Dt ]
  6. IP OP total amount — total IP and OP amount reimbursed: [ (IP Annual Reimbursement Amt + OP Annual Reimbursement Amt) - (IP Annual Deductible Amt + OP Annual Deductible Amt) ]
  7. number of chronic — total number of diseases the patient was previously diagnosed with (sum of all chronic-disease flags)
  8. number of diagnosis procedures — number of diagnosis procedures undergone by the patient
  9. number of physicians — number of physicians treating the patient [ Attending Physician + Operating Physician + Other Physician ]
  10. [ ‘DiagnosisCode_1_count’, ‘DiagnosisCode_2_count’, ‘DiagnosisCode_3_count’, ‘DiagnosisCode_4_count’, ‘DiagnosisCode_5_count’, ‘DiagnosisCode_6_count’, ‘DiagnosisCode_7_count’, ‘DiagnosisCode_8_count’, ‘DiagnosisCode_9_count’, ‘DiagnosisCode_10_count’, ‘Claim Admit Diagnosis Code count’, ‘Diagnosis Group Code count’ ] — each code is replaced by the number of times it occurs
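The count features above can all be built the same way; a minimal sketch of frequency encoding with a pandas groupby-transform (toy IDs for illustration):

```python
import pandas as pd

# Toy column standing in for any of the ID columns above
df = pd.DataFrame({'AttendingPhysician': ['PHY1', 'PHY2', 'PHY1', 'PHY1', 'PHY3']})

# Frequency encoding: replace each ID with how many times it occurs overall
df['AttendingPhysicianCount'] = (
    df.groupby('AttendingPhysician')['AttendingPhysician'].transform('count')
)
```

`transform('count')` broadcasts the group size back to every row, so frequent physicians get large values while rare ones get small values.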

One hot encoding

  1. diagnosis, procedure code — each diagnosis column has more than 1000 categories, so the top ten diagnosis codes and the top five procedure codes were taken, and one hot encoding was performed on those codes only:
    ‘diagnosis_4019’, ‘diagnosis_25000’, ‘diagnosis_2724’, ‘diagnosis_V5869’, ‘diagnosis_4011’, ‘diagnosis_42731’, ‘diagnosis_V5861’, ‘diagnosis_2720’, ‘diagnosis_2449’, ‘diagnosis_4280’, ‘procedure_4019.0’, ‘procedure_9904.0’, ‘procedure_2724.0’, ‘procedure_8154.0’, ‘procedure_66.0’
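A minimal sketch of one-hot encoding only the most frequent codes (toy data; here the top 2 codes stand in for the top 10 used in the article):

```python
import pandas as pd

# Toy diagnosis column with a long tail of codes
df = pd.DataFrame({'DiagnosisCode_1': ['4019', '25000', '4019', '25000', '2724', '4019']})

# Keep only the most frequent codes and one-hot encode just those
top_codes = df['DiagnosisCode_1'].value_counts().nlargest(2).index
for code in top_codes:
    df[f'diagnosis_{code}'] = (df['DiagnosisCode_1'] == code).astype(int)
```

Rare codes simply get 0 in every indicator column, which keeps the feature count small despite the 1000+ categories.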

After feature engineering is complete, we create two different datasets: one with NAs filled with 0, the other with NAs filled with the mode.

Finally, the string columns whose information has already been extracted through feature engineering are dropped.

Final dataset with 59 columns:

## dividing data and class label
Y = final_data['PotentialFraud']
X = final_data.drop('PotentialFraud',axis=1)

Min Max Normalization:

For every feature, the minimum value gets transformed into 0, the maximum value into 1, and every other value into a decimal between 0 and 1. The scaler is fit on the train set only and then applied to the test set, to avoid data leakage.
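A minimal sketch of this leak-free scaling with sklearn (toy values for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature train and test splits
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[15.0], [40.0]])

# Fit on train only, then apply the same min/max to test, to avoid leakage
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Note that test values outside the training range can scale beyond [0, 1] (here 40.0 becomes 1.5); that is expected, because the test set must never influence the scaling parameters.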

7. Machine Learning Models

We will train four different classification models: Logistic Regression, Decision Tree, Random Forest and XG_Boost.

To find the best parameters, hyperparameter tuning is done using RandomizedSearchCV.

High recall is needed, so that the number of points that are actually fraudulent but predicted as non-fraudulent decreases. To achieve this, the best threshold is found using roc_curve and the F1 score is built on it.
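As a hedged sketch of the tuning step (synthetic data and an illustrative parameter grid, since the exact search space is not shown in the article), RandomizedSearchCV can be used like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data in place of the claims features, purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Illustrative search space, not the article's exact grid
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
}

# Sample 5 random parameter combinations, scoring each by macro F1
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    scoring='f1_macro',
    cv=3,
    random_state=42,
)
search.fit(X, y)
```

`search.best_params_` and `search.best_estimator_` then give the tuned configuration and the refit model.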

We experiment with different approaches to find the best model:

  1. Training classification model with null values filled with 0.
  2. Training classification model with null values filled with mode.
  3. Training using custom ensemble model.

Train test split:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, stratify = Y)

1. Training classification model with null values filled with 0

We have null values in the columns Diagnosis Code, Procedure Code, Group Code, Attending Physician Count, number of days admitted, number of days for claim, number of physicians and Deductible Amt Paid. The null values are replaced with 0.

Model Performance:

Confusion Matrix from best model XG_Boost

Here we can see a precision of 97.96 percent and a recall of 97.28 percent on the test data.

2. Training classification model with null values filled with mode

We have null values in the columns Diagnosis Code, Procedure Code, Group Code, Attending Physician Count, number of days admitted, number of days for claim, number of physicians and Deductible Amt Paid. The null values are replaced with the mode, except for [ number of days admitted, number of days for claim ], which are filled with the mean because their mode is 0.

Model Performance:

3. Training using custom ensemble model

The model with nulls filled with 0 seems to perform better, so we will use that dataset for our ensemble model.

Procedure followed:
1. split the data into 80:20 train and test
2. split train into D1 and D2, 50:50
3. sample D1 with replacement and take k samples
4. train k models on these k samples (DT as the base model)
5. pass D2 to the k models and get k predictions
6. we now have the actual y of D2 (from step 2) and the k predictions for D2 (from step 5)
7. train a meta model on the data from step 6
8. for evaluation, take the test data (from step 1)
9. pass the test data through the k models (from step 4) and get k predictions
10. we now have the actual y of test (from step 1) and the k predictions for test (from step 9)
11. for performance evaluation, train on (pred_for_D2, D2_Y) and evaluate on (pred_for_test, test_Y)
12. the number of base models was tuned over 30, 45 and 40; 40 gave the best accuracy
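The steps above can be sketched as follows; this is a minimal illustration on synthetic data, with a small k and Logistic Regression as the meta model, not the exact implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the claims data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Steps 1-2: 80:20 train/test, then split train into D1 and D2 (50:50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

# Steps 3-4: k bootstrap samples of D1, one decision tree per sample
k = 10
rng = np.random.default_rng(42)
base_models = []
for _ in range(k):
    idx = rng.integers(0, len(X_d1), len(X_d1))  # sampling with replacement
    base_models.append(DecisionTreeClassifier(random_state=0).fit(X_d1[idx], y_d1[idx]))

# Steps 5-7: the k predictions on D2 become the meta model's training features
meta_train = np.column_stack([m.predict(X_d2) for m in base_models])
meta_model = LogisticRegression().fit(meta_train, y_d2)

# Steps 8-11: pass test data through the base models, then the meta model
meta_test = np.column_stack([m.predict(X_test) for m in base_models])
y_pred = meta_model.predict(meta_test)
score = f1_score(y_test, y_pred, average='macro')
```

Training the meta model on D2 rather than D1 keeps the base models' training data separate from the meta model's, which reduces overfitting in the stack.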

Different meta models trained:
1. LR
2. DT
3. RF
4. XG_Boost

Model Performance:

8. Feature Importance

The features that best help predict fraud across all models are: provider count, attending physician count, county, state and diagnosis code 1 count.

The important features from the best model, XG_Boost, are given below:

+-----------------------------+---------+
| feature                     | weight  |
+-----------------------------+---------+
| ProviderCount               | 0.11823 |
| County                      | 0.07264 |
| OPAnnualReimbursementAmt    | 0.06902 |
| AttendingPhysicianCount     | 0.06714 |
| DiagnosisCode_1_count       | 0.06436 |
| State                       | 0.0629  |
| OPAnnualDeductibleAmt       | 0.06165 |
| ip_op_total_amount          | 0.06164 |
| InscClaimAmtReimbursed      | 0.04316 |
| DiagnosisCode_2_count       | 0.04158 |
| DiagnosisCode_3_count       | 0.03246 |
| BeneCount                   | 0.02967 |
| num_of_chronic              | 0.02331 |
| IPAnnualReimbursementAmt    | 0.02187 |
| ClmAdmitDiagnosisCode_count | 0.02084 |
| DiagnosisCode_4_count       | 0.02018 |
| num_of_diag_proc            | 0.0171  |
| DiagnosisCode_5_count       | 0.01254 |
| DiagnosisCode_6_count       | 0.0108  |
| num_of_phy                  | 0.01079 |
+-----------------------------+---------+

9. Comparison of Different Models

Model with null filled with 0

XG_Boost with a 0.981 macro F1 score and DT with 0.975 perform better than the other models.

Model with null filled with mode

Performance doesn’t show any big change in any model.

Custom Ensemble Model with null filled with 0

  1. The LR, DT, RF and XG_Boost meta models all perform well, and about the same.

“Overall, the performance of XG_Boost is better than all the other models.”

10. Deployment of model using Heroku

The XG_Boost model trained with null values filled with 0 and the best tuned parameters is deployed using Heroku.

Live working of this deployment can be found here:

Deployment using Heroku

11. Future work

A more complex deep learning model such as a Multi-Layer Perceptron (MLP) can be trained in the future.

12. GitHub repository and LinkedIn

https://github.com/Paarthasaarathi

https://www.linkedin.com/in/paarthasaarathi-a-650840172/

13. References

appliedaicourse.com

https://www.researchgate.net/profile/Jianjun_Shi/publication/23290716_A_survey_on_statistical_methods_for_health_care_fraud_detection/links/553e4b5b0cf210c0bda937cf/A-survey-on-statistical-methods-for-health-care-fraud-detection.pdf

http://scholar.google.co.in/scholar_url?url=https://www.researchgate.net/profile/Hoi_Ht/publication/343888670_Proceedings_of_2020_International_Conference_on_Management_of_e-Commerce_and_e-Government_ICMECG_2020_Advertising_Vietnam%2527s_Tourism_Products_in_the_Technology_Age/links/5f466a18299bf13c5033cb34/Proceedings-of-2020-International-Conference-on-Management-of-e-Commerce-and-e-Government-ICMECG-2020-Advertising-Vietnams-Tourism-Products-in-the-Technology-Age.pdf%23page%3D103&hl=en&sa=X&ei=Yzi6X9efA6XEywTdhrKQDw&scisig=AAGBfm0dvBmfO0EkxW9bB5RpEFXJ0x0V_g&nossl=1&oi=scholarr

https://towardsdatascience.com/2-features-for-healthcare-fraud-waste-and-abuse-7c262ac59859
