Bank Fraud Detection

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Data Set Description
Exploratory Data Analysis
Split the Data into Training and Test
Modelling Part
Presentation Layer

## Python: 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)]

## Numpy: 1.14.0

## Pandas: 0.20.3

## Matplotlib: 2.1.0

## Seaborn: 0.8.1

## Scipy: 1.0.0

Data Set Description

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
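To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above and compares it with the accuracy of a hypothetical "model" that labels every transaction as valid. The variable names are illustrative only and are not part of the analysis code.

```python
# Counts quoted in the data set description
n_transactions = 284807
n_frauds = 492

# Share of fraudulent transactions (the positive class)
fraud_rate = n_frauds / n_transactions

# Hypothetical baseline: predict "valid" for every transaction
baseline_accuracy = (n_transactions - n_frauds) / n_transactions

print("Fraud rate: {:.4%}".format(fraud_rate))
print("Accuracy of always predicting 'valid': {:.4%}".format(baseline_accuracy))
```

Because such a baseline is already correct for well over 99% of transactions, accuracy alone says little about how well frauds are detected; precision and recall on the fraud class are more informative.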

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise. The entire data set is available at link.

We will demonstrate the application of machine learning to a classification problem: predicting the ‘Class’ feature from the remaining independent variables in the ‘creditcardfraud.csv’ data set.

Exploratory Data Analysis

import pandas as pd

# Load the dataset from the csv file using pandas (the raw string keeps the Windows backslashes literal)
data = pd.read_csv(r'D:\Python\AlphaIT\CreditCardFraudDetection\Credit Card Fraud Detection\creditcard.csv')
print("Data set consists of {0} rows and {1} columns".format(data.shape[0], data.shape[1]))

## Data set consists of 284807 rows and 31 columns

print(data.head())

##    Time        V1        V2        V3        V4        V5        V6        V7  \
## 0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
## 1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
## 2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
## 3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
## 4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   
## 
##          V8        V9  ...         V21       V22       V23       V24  \
## 0  0.098698  0.363787  ...   -0.018307  0.277838 -0.110474  0.066928   
## 1  0.085102 -0.255425  ...   -0.225775 -0.638672  0.101288 -0.339846   
## 2  0.247676 -1.514654  ...    0.247998  0.771679  0.909412 -0.689281   
## 3  0.377436 -1.387024  ...   -0.108300  0.005274 -0.190321 -1.175575   
## 4 -0.270533  0.817739  ...   -0.009431  0.798278 -0.137458  0.141267   
## 
##         V25       V26       V27       V28  Amount  Class  
## 0  0.128539 -0.189115  0.133558 -0.021053  149.62      0  
## 1  0.167170  0.125895 -0.008983  0.014724    2.69      0  
## 2 -0.327642 -0.139097 -0.055353 -0.059752  378.66      0  
## 3  0.647376 -0.221929  0.062723  0.061458  123.50      0  
## 4 -0.206010  0.502292  0.219422  0.215153   69.99      0  
## 
## [5 rows x 31 columns]

print(data['Class'].describe())

## count    284807.000000
## mean          0.001727
## std           0.041527
## min           0.000000
## 25%           0.000000
## 50%           0.000000
## 75%           0.000000
## max           1.000000
## Name: Class, dtype: float64

Split the Data into Training and Test

import matplotlib.pyplot as plt
import seaborn as sns

# Draw a reproducible 10% random sample of the data to work with
train_data = data.sample(frac=0.1, random_state=1)
# Determine dependent and independent variables
target = 'Class'
columns = [c for c in train_data.columns if c != target]
X = train_data[columns]
Y = train_data[target]
# Plot the correlation matrix of the sample (including 'Class') as a heatmap
cormat = train_data.corr()
sns.heatmap(cormat, vmax=.8, square=True)
plt.title('Correlation matrix')
plt.show()
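The code above draws a 10% sample but does not hold out a separate test set. As a minimal sketch of how a held-out split could be made for data this unbalanced, the example below uses scikit-learn's `train_test_split` with `stratify` so that both parts keep roughly the same share of the rare class. The synthetic arrays `X_demo` and `y_demo`, the 1% positive rate and the 80/20 split are assumptions chosen for illustration; they are not taken from the credit card data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 10,000 rows, 5 features, 1% positives
rng = np.random.RandomState(1)
n_samples = 10000
X_demo = rng.normal(size=(n_samples, 5))
y_demo = np.zeros(n_samples, dtype=int)
y_demo[:100] = 1

# stratify=y_demo keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

print("Positive share overall: {:.4f}".format(y_demo.mean()))
print("Positive share in train: {:.4f}".format(y_train.mean()))
print("Positive share in test:  {:.4f}".format(y_test.mean()))
```

Without stratification, a random split of such rare positives could leave the test part with very few frauds, which makes precision and recall estimates unstable.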

Modelling Part

from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
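# define a random state
state = 1
# assumed definition (`fraction` is not defined elsewhere in this document):
# ratio of fraud to valid transactions in the sample, used as the expected share of outliers
fraction = (Y == 1).sum() / float((Y == 0).sum())
# define the outlier detection tools and classifiers to be compared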
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=fraction),
    "K-NN": KNeighborsClassifier(n_neighbors=20),
    "Logistic Regression": LogisticRegression(random_state=state, solver='lbfgs',
                                              multi_class='multinomial'),
    "Naive Bayes": GaussianNB()
}

Fit the models:

# number of frauds in the full data set (not used below)
n_outliers = len(data[data['Class'] == 1])
for clf_name, clf in classifiers.items():
    # fit each model and predict on the same sample it was fitted on
    if clf_name == "Local Outlier Factor":
        # fit_predict labels the fitted data directly
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    elif clf_name == "Isolation Forest":
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    else:
        # supervised classifiers (K-NN, Logistic Regression, Naive Bayes) also need the labels
        clf.fit(X, Y)
        y_pred = clf.predict(X)

    # Map the outlier detectors' output (1 = inlier, -1 = outlier) to 0 for valid, 1 for fraud.
    if clf_name in ("Local Outlier Factor", "Isolation Forest"):
        y_pred[y_pred == 1] = 0
        y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # Run classification metrics
    print('{}: Incorrectly classified: {}'.format(clf_name, n_errors))
    print('Accuracy Score: {}'.format(accuracy_score(Y, y_pred)))
    print(classification_report(Y, y_pred))

Isolation Forest: Incorrectly classified: 71; Accuracy Score: 0.99750711000316

Class	Precision	Recall	F1-Score	Support
0	1.00	1.00	1.00	28432
1	0.28	0.29	0.28	49

Local Outlier Factor: Incorrectly classified: 97; Accuracy Score: 0.9965942207085425

Class	Precision	Recall	F1-Score	Support
0	1.00	1.00	1.00	28432
1	0.02	0.02	0.02	49

K-NN: Incorrectly classified: 49; Accuracy Score: 0.9982795547909132

Class	Precision	Recall	F1-Score	Support
0	1.00	1.00	1.00	28432
1	0.00	0.00	0.00	49

Logistic Regression: Incorrectly classified: 36; Accuracy Score: 0.998735999438222

Class	Precision	Recall	F1-Score	Support
0	1.00	1.00	1.00	28432
1	0.68	0.51	0.58	49

Naive Bayes: Incorrectly classified: 241; Accuracy Score: 0.9915382184614304

Class	Precision	Recall	F1-Score	Support
0	1.00	0.99	1.00	28432
1	0.12	0.63	0.20	49
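The tables above show accuracy scores above 0.99 even where recall on the fraud class is 0.00. The sketch below shows how precision, recall and F1 follow from confusion-matrix counts. The helper functions are hypothetical, and the counts are not printed in the source: they are inferred from the reported error totals and rounded scores (for K-NN, 49 errors with zero recall implies all 49 frauds were missed; for Logistic Regression, 25 true positives, 12 false positives and 24 false negatives reproduce the reported figures).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Inferred counts (assumption, see the paragraph above); 28432 valid and 49 fraud rows
knn = dict(tp=0, fp=0, fn=49, tn=28432)
logreg = dict(tp=25, fp=12, fn=24, tn=28420)

for name, c in [("K-NN", knn), ("Logistic Regression", logreg)]:
    p, r, f1 = precision_recall_f1(c["tp"], c["fp"], c["fn"])
    print("{}: accuracy={:.6f} precision={:.2f} recall={:.2f} f1={:.2f}".format(
        name, accuracy(**c), p, r, f1))
```

A model can therefore miss every fraud and still reach an accuracy of about 0.998, which is why the per-class rows are the ones to compare.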

We could of course get the best score from K-NN by setting K = 1 (that is, 1-NN), but this is misleading; see the note below.

import numpy as np

error = []
# Calculate the training error for K values from 1 to 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X, Y)
    # predictions are made on the same data the model was fitted on
    pred_i = knn.predict(X)
    error.append(np.mean(pred_i != Y))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()

Note on 1-NN: The variance is high in this case, because relying on only the single nearest point makes it very likely that we model the noise in our data. With 1-NN, the prediction for a point depends on just one other point. For example, suppose we want to split our samples into two groups (classification), red and blue, and consider a point p whose four nearest neighbors are red, blue, blue, blue (in ascending order of distance to p). A 4-NN model would classify p as blue (three blue votes against one red), but a 1-NN model classifies it as red, because the nearest point is red. This means that the 1-NN model follows the training data very closely, and therefore its bias is low: when predictions are made on the training data itself, each point is its own nearest neighbor, so the RSS (residual sum of squares) between the model and the training data is close to 0. In contrast, the variance of the model is high, because it is extremely sensitive to individual points and its decision boundary is wiggly.
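The red/blue example from the note can be reproduced with a few lines of plain Python. This is only an illustrative sketch: `knn_predict` is a hypothetical helper (not scikit-learn's implementation), and the one-dimensional positions of the training points are made up so that the four nearest neighbors of p are red, blue, blue, blue in order of distance.

```python
from collections import Counter


def knn_predict(train_points, query, k):
    """Classify `query` by majority vote among its k nearest 1-D training points."""
    nearest = sorted(train_points, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (position, colour); p sits at 0.0
train_points = [(0.1, "red"), (0.2, "blue"), (0.3, "blue"), (0.4, "blue"), (5.0, "red")]
p = 0.0

one_nn = knn_predict(train_points, p, k=1)   # only the nearest (red) point votes
four_nn = knn_predict(train_points, p, k=4)  # three blue votes outweigh one red

print("1-NN:", one_nn)
print("4-NN:", four_nn)
```

Moving the single red point slightly farther away than the blue ones would flip the 1-NN prediction but leave the 4-NN prediction unchanged, which is the sensitivity the note describes.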

Presentation Layer

markdown
html
javascript