It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
## Python: 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)]
## Numpy: 1.14.0
## Pandas: 0.20.3
## Matplotlib: 2.1.0
## Seaborn: 0.8.1
## Scipy: 1.0.0
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; it can be used for example-dependent cost-sensitive learning. The feature ‘Class’ is the response variable and takes the value 1 in case of fraud and 0 otherwise. The entire data set is available at link.
We will demonstrate the application of machine learning to a classification problem in which we want to predict the ‘Class’ feature from the remaining independent variables in the ‘creditcard.csv’ data set.
# Load the required libraries and the dataset from the csv file using pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv(r'D:\Python\AlphaIT\CreditCardFraudDetection\Credit Card Fraud Detection\creditcard.csv')
print("The data set contains {0} rows and {1} columns".format(data.shape[0], data.shape[1]))
## The data set contains 284807 rows and 31 columns
print(data.head())
## Time V1 V2 V3 V4 V5 V6 V7 \
## 0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
## 1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
## 2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
## 3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
## 4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
##
## V8 V9 ... V21 V22 V23 V24 \
## 0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928
## 1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846
## 2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281
## 3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575
## 4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267
##
## V25 V26 V27 V28 Amount Class
## 0 0.128539 -0.189115 0.133558 -0.021053 149.62 0
## 1 0.167170 0.125895 -0.008983 0.014724 2.69 0
## 2 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
## 3 0.647376 -0.221929 0.062723 0.061458 123.50 0
## 4 -0.206010 0.502292 0.219422 0.215153 69.99 0
##
## [5 rows x 31 columns]
print(data['Class'].describe())
## count 284807.000000
## mean 0.001727
## std 0.041527
## min 0.000000
## 25% 0.000000
## 50% 0.000000
## 75% 0.000000
## max 1.000000
## Name: Class, dtype: float64
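The mean of ‘Class’ already hints at the imbalance. As a quick sanity check (a minimal sketch using the `data` frame loaded above), we can make the class counts and the fraud share explicit:
# Count transactions per class and express the fraud share as a percentage
class_counts = data['Class'].value_counts()
print(class_counts)
print("Fraudulent transactions: {:.3f}% of all".format(class_counts[1] / len(data) * 100))
# This should come out close to the 0.172% quoted in the dataset description.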
train_data = data.sample(frac = 0.1, random_state = 1)
# Determine dependent and independent variables
columns = train_data.columns.tolist()
columns = [c for c in columns if c not in ['Class']]
target = 'Class'
X = train_data[columns]
Y = train_data[target]
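Since the models are fitted on a 10% random sample rather than the full data set, the class counts in `Y` determine the support values reported below. A quick check (sketch):
# Class distribution within the 10% training sample
print(Y.value_counts())
# With random_state = 1 the sample used here contains 28,432 legitimate
# transactions and 49 frauds; exact counts depend on the sampled rows.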
cormat = train_data.corr()
sns.heatmap(cormat, vmax = .8, square = True)
plt.title('Correlation matrix')
plt.show()
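The heatmap gives a visual overview; to see which features co-vary most with the response, the correlations with ‘Class’ can also be ranked directly (a small sketch reusing the `cormat` computed above):
# Features ordered by absolute correlation with the response variable
corr_with_class = cormat['Class'].drop('Class').abs().sort_values(ascending=False)
print(corr_with_class.head(10))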
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
# define a random state
state = 1
# fraction of outliers (frauds) in the training sample, used as the
# contamination parameter of the outlier detection methods
fraction = len(Y[Y == 1]) / float(len(Y))
# define the outlier detection tools and classifiers to be compared
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20,
                                               contamination=fraction),
    "K-NN": KNeighborsClassifier(n_neighbors=20),
    "Logistic Regression": LogisticRegression(random_state=state, solver='lbfgs',
                                              multi_class='multinomial'),
    "Naive Bayes": GaussianNB()
}
Fit the models:
n_outliers = len(data[data['Class'] == 1])

for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    elif clf_name == "Isolation Forest":
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    elif clf_name == 'K-NN':
        clf.fit(X, Y)
        y_pred = clf.predict(X)
    elif clf_name == 'Logistic Regression':
        clf.fit(X, Y)
        y_pred = clf.predict(X)
    elif clf_name == 'Naive Bayes':
        clf.fit(X, Y)
        y_pred = clf.predict(X)
    # Reshape the prediction values to 0 for valid, 1 for fraud
    # (the outlier detectors return +1 for inliers and -1 for outliers).
    if clf_name == "Local Outlier Factor" or clf_name == "Isolation Forest":
        y_pred[y_pred == 1] = 0
        y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # Run classification metrics
    print('{}: Incorrectly classified: {} Accuracy Score: {}'.format(
        clf_name, n_errors, accuracy_score(Y, y_pred)))
    print(classification_report(Y, y_pred))
Isolation Forest: Incorrectly classified: 71, Accuracy Score: 0.99750711000316

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 28432 |
| 1 | 0.28 | 0.29 | 0.28 | 49 |

Local Outlier Factor: Incorrectly classified: 97, Accuracy Score: 0.9965942207085425

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 28432 |
| 1 | 0.02 | 0.02 | 0.02 | 49 |

K-NN: Incorrectly classified: 49, Accuracy Score: 0.9982795547909132

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 28432 |
| 1 | 0.00 | 0.00 | 0.00 | 49 |

Logistic Regression: Incorrectly classified: 36, Accuracy Score: 0.998735999438222

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 28432 |
| 1 | 0.68 | 0.51 | 0.58 | 49 |

Naive Bayes: Incorrectly classified: 241, Accuracy Score: 0.9915382184614304

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 0.99 | 1.00 | 28432 |
| 1 | 0.12 | 0.63 | 0.20 | 49 |
We could of course obtain the best score with K-NN by setting K = 1 (1-NN). But this is misleading; see the note below.
error = []
# Calculate the training error for K values from 1 to 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X, Y)
    pred_i = knn.predict(X)
    error.append(np.mean(pred_i != Y))

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()
Note on 1-NN: The variance is high in this case because optimizing on only the single nearest point makes it very likely that we model the noise in our data; for 1-NN the prediction depends on one single other point. For example, suppose we want to split our samples into two groups (classification), red and blue, and we consider a certain point p whose four nearest neighbours are red, blue, blue, blue (in ascending order of distance to p). A 4-NN model would classify the point as blue (three blue votes against one red), but a 1-NN model classifies it as red, because red is the nearest point. This means that our model follows the training data very closely, and therefore the bias is low: the RSS between our model and our training data is close to 0. In contrast, the variance of the model is high, because the model is extremely sensitive and wiggly.
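As a small illustration of this note (a toy sketch with made-up points, not part of the analysis above), the same query point is classified differently by a 1-NN and a 4-NN model:
from sklearn.neighbors import KNeighborsClassifier

# Toy data: the four nearest neighbours of p are red, blue, blue, blue
# (labels: 0 = red, 1 = blue), ordered by increasing distance to p = [0.0].
X_toy = [[0.1], [0.3], [0.5], [0.7]]
y_toy = [0, 1, 1, 1]
p = [[0.0]]

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)
knn_4 = KNeighborsClassifier(n_neighbors=4).fit(X_toy, y_toy)
print(knn_1.predict(p))  # [0] -> red, driven entirely by the single nearest point
print(knn_4.predict(p))  # [1] -> blue, the majority vote of the four neighbours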