DT Fraud Detection

You can find the original Code here.

Data Understanding

The dataset we are dealing with contains multiple features, such as:
  • step: Represents the unit of time (each step equals 1 hour).
  • type: The type of online transaction.
  • amount: The transaction amount.
  • nameOrig, nameDest: The initiator and recipient of the transaction.
  • oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest: The account balances of both parties before and after the transaction.
  • isFraud: Indicates whether the transaction is fraudulent.

Exploring Null Values and Categories

  • Ensure data quality by checking for null values.
  • The distribution of transaction types (type) may reveal which types of transactions are more susceptible to fraud (see the sketch below).
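As a quick illustration, a minimal pandas sketch of these checks might look as follows (the file name `onlinefraud.csv` is a placeholder for whatever dataset file the original code uses):

```python
import pandas as pd

# Placeholder path; substitute the actual dataset file from the original code.
data = pd.read_csv("onlinefraud.csv")

# Check every column for null values.
print(data.isnull().sum())

# Distribution of transaction types, and how many of each type are fraudulent.
print(data["type"].value_counts())
print(data.groupby("type")["isFraud"].sum())
```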

Correlation Analysis

Function: Calculate the correlation between each numeric feature in the dataset and "isFraud" (whether the transaction is fraudulent), and print the results in descending order.
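Continuing from the `data` DataFrame loaded above, a minimal sketch of this step could be:

```python
# Correlate every numeric column with the target and sort in descending order.
correlation = data.corr(numeric_only=True)
print(correlation["isFraud"].sort_values(ascending=False))
```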

Data Preprocessing

Before performing machine learning tasks, it is important to ensure data quality and select appropriate features. In the original example, the transaction type column ("type") was mapped to integers because most machine learning algorithms require numerical inputs. The target variable 'isFraud' was also mapped to a more readable string, although this is not strictly necessary in practice.
Function: This part of the code converts transaction types and fraud labels from raw values into numerical or categorical labels that can be used for model training, as sketched below.
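A sketch of the mapping described above; the integer codes and label strings shown here are illustrative assumptions, not necessarily the values used in the original code:

```python
# Encode the categorical transaction type as integers (codes are arbitrary).
data["type"] = data["type"].map({
    "CASH_OUT": 1, "PAYMENT": 2, "CASH_IN": 3, "TRANSFER": 4, "DEBIT": 5,
})

# Optionally map the 0/1 target to readable strings (not required in practice).
data["isFraud"] = data["isFraud"].map({0: "No Fraud", 1: "Fraud"})
```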

Model Evaluation

The performance of the model is evaluated using multiple metrics. In this example, we are using accuracy, but in imbalanced datasets such as fraud detection, other metrics such as precision, recall, or F1 score may be more relevant.
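For instance, assuming a fitted model has produced predictions `y_pred` for held-out labels `y_test` (with the string labels from the preprocessing step above), these metrics can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("accuracy :", accuracy_score(y_test, y_pred))
# On imbalanced fraud data, the positive-class precision/recall/F1 are more telling.
print("precision:", precision_score(y_test, y_pred, pos_label="Fraud"))
print("recall   :", recall_score(y_test, y_pred, pos_label="Fraud"))
print("F1 score :", f1_score(y_test, y_pred, pos_label="Fraud"))
```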

Model Analysis

The decision tree is a supervised learning algorithm that makes predictions by learning decision rules from data features. It is represented as a tree-like structure in which each internal node tests an attribute or feature, each branch corresponds to a decision rule, and each leaf node represents an output. In fraud detection, the data is typically highly imbalanced, because fraudulent transactions are rare among a large number of legitimate ones. Decision trees are a reasonable choice for such problems: they can model nonlinear feature relationships and, with appropriate settings such as class weights, cope with imbalanced data. In addition, they provide clear decision paths and rules, which is especially valuable for interpreting predictions and making decisions. Key advantages include:
  1. Easy to understand and interpret:
      • The results of a decision tree are easy to understand and can be explained without complex statistical knowledge.
      • Through graphical tools, decision paths can be visually displayed, making it easy to share understanding with non-technical experts.
  2. Requires less data preprocessing:
      • No need to normalize or standardize the data.
      • Since decision trees can handle nonlinear feature relationships and imbalanced datasets, they are often used for classification problems like fraud detection.
  3. Can handle multiple data types:
      • Can handle both numerical and categorical data simultaneously.
      • Can handle missing values.
  4. Feature importance:
      • Can intuitively show the importance of each feature for predicting the target.
  5. Suitable for nonlinear problems:
      • Can fit nonlinear relationships well.

Limitations

  1. Overfitting problem:
      • Decision trees are prone to generating complex tree structures that closely fit the data, thereby capturing noise in the data.
  2. Low stability:
      • Small changes in the data can lead to the generation of a completely different tree.
  3. Local optima problem:
      • Optimizing through greedy algorithms may result in solutions that are not globally optimal.

Knowledge Required for Decision Trees - Derivation of Information Entropy

Information entropy is a measure of the amount of information in data and represents the level of disorder in the data. For a given dataset D, its information entropy H(D) is defined as:

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

Here, $p_k$ represents the proportion of samples of the k-th class in the dataset D, and K is the number of classes.
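As a concrete check of this definition, entropy can be computed directly from the class proportions; the following is a minimal NumPy sketch (the helper name `entropy` is my own):

```python
import numpy as np

def entropy(labels):
    """H(D) = -sum_k p_k * log2(p_k), where p_k are the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A heavily imbalanced label vector has low entropy (little disorder),
# while a 50/50 split gives the maximum of 1 bit for a binary label.
print(entropy([0, 0, 0, 0, 0, 0, 0, 1]))  # ~0.544
print(entropy([0, 0, 1, 1]))              # 1.0
```

For the fraud dataset, the `isFraud` column would play the role of `labels`.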

Why is Information Entropy Defined in This Way?

Entropy originates in thermodynamics as a measure of the disorder of a system, and information theory borrows the concept of information entropy to measure the uncertainty of information. If all messages are equally probable, the entropy is at its maximum, indicating maximum uncertainty. Starting from a few axiomatic requirements, such as non-negativity, continuity, and additivity for independent events, the definition of entropy given above can be derived.

Working Principle of Decision Trees

Decision trees construct a tree of decision splits by optimizing one or more evaluation criteria to select the feature and threshold for splitting nodes. The goal is to generate a decision tree that is structurally simple (e.g., with small depth) yet has strong predictive ability. To achieve this goal, we need mathematical methods to quantify the "goodness" of feature selection.
 

What metrics does “node splitting” rely on?

Information Gain (IG)

Information gain measures the reduction in uncertainty (or information entropy) by splitting nodes. If the uncertainty (or disorder) of the resulting data subsets decreases after splitting, it is considered a good split. The feature with the highest information gain is selected for splitting.
Given a dataset D and a feature A, we use information gain to assess the quality of splitting the dataset D using feature A. Information gain is defined as the difference between the entropy of dataset D and the conditional entropy of D given feature A:

$$IG(D, A) = H(D) - H(D \mid A)$$

where

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k, \qquad H(D \mid A) = \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v)$$

Here, $p_k$ is the probability (proportion) of the k-th class, $H(D)$ is the entropy of dataset D, $\mathrm{Values}(A)$ represents all possible values of feature A, $D_v$ is the subset of samples where A takes value v, and $|D_v|/|D|$ is its proportion.
The conditional entropy $H(D \mid A)$ represents the remaining uncertainty of D given feature A. We want this uncertainty to be as small as possible, so that observing the value of feature A provides as much information about D as possible.
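A self-contained sketch of information gain for a single categorical feature, following the definitions above (the function and column names are illustrative):

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """H(D) = -sum_k p_k * log2(p_k) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(df, feature, target="isFraud"):
    """IG(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    total = entropy(df[target])
    conditional = sum(
        len(subset) / len(df) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return total - conditional

# Tiny made-up example, not the real dataset: knowing "type" reduces
# the uncertainty about "isFraud" by about 0.31 bits here.
toy = pd.DataFrame({"type": ["TRANSFER", "TRANSFER", "PAYMENT", "PAYMENT"],
                    "isFraud": [1, 0, 0, 0]})
print(information_gain(toy, "type"))  # ~0.311
```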

Gini Impurity

Gini impurity measures the disorder in a dataset. A smaller Gini impurity indicates higher purity of the data subset, meaning a more homogeneous class. The feature with the lowest Gini impurity is selected for splitting.
Gini impurity originates from the Gini coefficient in economics, which measures wealth inequality in an economy. In decision trees, we use Gini impurity to quantify data purity:

$$\mathrm{Gini}(D) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

and, for a split on feature A, the weighted impurity of the resulting subsets:

$$\mathrm{Gini}(D, A) = \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, \mathrm{Gini}(D_v)$$

Here, $p_k$ is the probability of selecting the k-th class (within the subset $D_v$ when A takes value v).
 
Why do we use this form of Gini impurity?
Gini impurity sums, over all classes, the product of the probability of choosing a sample of that class and the probability of misclassifying it. We want this value to be as small as possible, which means the data subset should be as "pure" as possible, i.e. dominated by a single class. If all samples in a subset belong to the same class, the Gini impurity is 0, which is the ideal case.
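A corresponding minimal sketch for Gini impurity (again, the helper name is my own):

```python
import numpy as np

def gini_impurity(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))        # 0.0   -> perfectly pure node
print(gini_impurity([0, 0, 1, 1]))        # 0.5   -> maximally mixed binary node
print(gini_impurity([0, 0, 0, 0, 0, 1]))  # ~0.278
```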

Stopping criteria for splitting

  • All instances belong to the same class;
  • All features have the same value;
  • Splitting does not improve prediction accuracy/reduce impurity.
In the decision tree implementation of scikit-learn, these criteria all aim to reduce the uncertainty of the data after splitting (equivalently, to increase its purity). During tree construction, the algorithm traverses the candidate split points and chooses the one that minimizes impurity (or maximizes information gain). After calculating the impurity reduction or information gain for each feature, the algorithm selects the "best" feature for splitting and recursively repeats this process until the tree reaches the predetermined maximum depth or can no longer reduce node impurity further.
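Putting the pieces together, here is a hedged end-to-end sketch using scikit-learn's `DecisionTreeClassifier`; the chosen feature columns, test split, and hyperparameters are illustrative assumptions rather than the original code's exact settings, and `data` is assumed to have been preprocessed as above:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Illustrative feature subset; the original code may use different columns.
features = ["type", "amount", "oldbalanceOrg", "newbalanceOrig"]
X = data[features]
y = data["isFraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# criterion can be "gini" (default) or "entropy"; max_depth curbs overfitting.
model = DecisionTreeClassifier(criterion="gini", max_depth=6, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Feature importances, as discussed in the "Feature importance" point above.
for name, importance in zip(features, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Limiting `max_depth` (or pruning) is one simple way to mitigate the overfitting problem listed under Limitations.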
 