
Synthetic data generation applied to fraud detection

12 min read · Jun 22, 2023

Creating synthetic data can unlock value across a wide range of situations. In the first article of this series by Heka, we explained how synthetic data allows companies to preserve privacy when sharing their data and to reduce the risks and costs related to the GDPR. Synthetic data can also be used to enrich and complement datasets, for example through data augmentation. We chose the prism of fraud detection to illustrate how we use our library both to augment data and to anonymize it. This article walks you through the different steps of augmenting and synthesizing a dataset containing fraudulent claims.

The need for fraud detection models

Fraud detection models are an essential tool for companies in various industries to detect and prevent fraudulent activities such as credit card fraud or false claims in insurance.

These models aim to identify fraudulent transactions or activities by analyzing large amounts of data and detecting patterns that deviate from normal behavior.

Failing to detect and prevent fraudulent activities can lead to significant financial losses and damage to a company’s reputation: in 2022 alone, 70% of financial institutions lost over $500K to fraud. It is therefore crucial for companies to have access to detection models that are as accurate and effective as possible.

Modeling limitations and traditional solutions

In fraud detection, fraud instances (the minority class) are far less frequent than non-fraudulent ones (the majority class). Because of this class imbalance, models struggle to distinguish between the two classes and are therefore not properly capable of detecting fraud. To overcome this issue, different techniques have been used to adjust the class distribution. Here we provide a short overview and outline their limitations¹:

More information on SMOTE can be found in [1]

Data Augmentation using GAN

Generative methods for data augmentation

Given the limitations of traditional augmentation techniques, there is a need for a method that can handle large amounts of high-dimensional data and create sufficiently diverse synthetic data. Generative AI, and in particular Generative Adversarial Networks (GANs), known for the diversity of the synthetic samples they can generate, looks like a viable method for rebalancing classes in datasets.

In the following section, we present how we used the toolbox we’ve developed to address the fraud detection use case using GANs. Our method augments the minority class — fraud — while providing full data anonymization.

The Generative Methods Library

The library we developed was inspired by the Synthetic Data Vault project and can be used to generate synthetic single-table data with both numerical and categorical columns.

First, the generator can be used to fit a generative model to the real data. The user can choose between two models: CTGAN² and REaLTabFormer³. CTGAN was developed to deal with multimodal and non-Gaussian distributions, and corrects for underrepresented values in categorical columns. REaLTabFormer is a transformer-based model that performs better in some cases. In either case, the user specifies the number of synthetic samples to generate. Additionally, the library includes similarity and privacy analyses that we explain later in this article.

Data augmentation through GANs: a nice complement to traditional data augmentation methods.

To augment data using our generative method, we first extract the minority class instances. Then, we use the Generator class to create a synthetic sample that is similar to the minority class. To do so, we need to specify some parameters, such as the desired probability of fraud, the architecture (the model) and the number of epochs for training. Lastly, we merge the newly generated fraud samples with the original data.

Note: from our experiments, it is best not to choose a fraud sample size equal to the majority class, but rather 10–30% of the size of the majority class. When augmenting all the way to class balance, the models lost accuracy as they over-predicted fraud. The optimal proportion, however, necessarily depends on the dataset and model used.

Example of how to call the augmentation function and which parameters to pass
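The screenshots from the original article are not reproduced here. As an illustration only, here is a rough sketch of what such a call could look like, using the open-source ctgan package as a stand-in for our library's generator; the function name `augment_fraud`, its parameters, and the file path are hypothetical, not the library's actual API.

```python
import pandas as pd
from ctgan import CTGAN  # open-source CTGAN, standing in for the library's generator

def augment_fraud(df: pd.DataFrame, target: str = "fraud",
                  fraud_proportion: float = 0.2, epochs: int = 300) -> pd.DataFrame:
    """Augment the minority (fraud) class with synthetic samples.
    Hypothetical helper, not the library's actual API."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]

    # Target a fraud class of roughly fraud_proportion * majority size
    # (the 10-30% range recommended above), not full class balance.
    n_synthetic = max(int(fraud_proportion * len(majority)) - len(minority), 0)

    discrete_cols = [c for c in minority.select_dtypes(include=["object", "category"]).columns
                     if c != target]
    generator = CTGAN(epochs=epochs)
    generator.fit(minority.drop(columns=[target]), discrete_columns=discrete_cols)

    synthetic_fraud = generator.sample(n_synthetic)
    synthetic_fraud[target] = 1

    # Merge the newly generated fraud samples with the original data
    return pd.concat([df, synthetic_fraud], ignore_index=True)

claims = pd.read_csv("insurance_claims.csv")  # illustrative path
augmented = augment_fraud(claims, target="fraud", fraud_proportion=0.2, epochs=300)
```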

Testing methodology⁴

We test the quality of the generated fraud samples in two ways:

  • First, we check how similar the newly generated fraud samples are to the original fraudulent claims. This is important because it guarantees that the generated samples are consistent with the original data; if they were imprecise, they would end up biasing the model, making it incapable of accurately predicting fraudulent behavior.
  • Second, we test if the model performance is improved by the data augmentation and whether it performs better than other data augmentation techniques. This assessment is at the core of our experimentation as it would determine whether generation for augmentation can actually enhance the fraud detection capabilities of our models or not.

Similarity check

We use the similarity check tool from our library, which enables the examination of different similarity aspects between real and synthetic data. It first estimates the distance between the real and synthetic marginal distributions for each variable. It also shows the differences between the correlations of these variables in the real and synthetic datasets. Lastly, it enables a visual comparison of every column, allowing the user to quickly identify the columns that were not well replicated.
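The similarity check itself is part of our library, but the underlying ideas can be sketched with standard tools. Below is a minimal illustration using pandas and scipy (per-column distribution distances, correlation differences, and a crude aggregate score); it is not our library's actual implementation.

```python
import pandas as pd
from scipy.stats import ks_2samp

def similarity_report(real: pd.DataFrame, synthetic: pd.DataFrame):
    """Rough stand-in for the library's similarity check (illustrative only)."""
    # 1. Distance between the marginal distributions of each numerical column
    #    (Kolmogorov-Smirnov statistic: 0 = identical, 1 = completely different)
    num_cols = real.select_dtypes("number").columns
    marginal_dist = {c: ks_2samp(real[c], synthetic[c]).statistic for c in num_cols}

    # 2. Absolute difference between the correlation matrices of the two datasets
    corr_diff = (real[num_cols].corr() - synthetic[num_cols].corr()).abs()

    # 3. A crude "overall quality score": 1 minus the average marginal distance
    overall = 1 - sum(marginal_dist.values()) / max(len(marginal_dist), 1)
    return marginal_dist, corr_diff, overall
```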

We performed the similarity check between the minority class and the synthetic fraud samples. Here are some of the graphs showing the visual comparison of the variables’ distributions.

Visual comparison of real and synthetic features’ probability distributions

While not perfect, the generator did a pretty good job of replicating the distributions of the variables. This is confirmed by the “Overall quality score”, which summarizes the similarity scores of all the column shapes and column pair trends.

The ‘Overall Quality Score’ gives a sense of how well the generator synthesized the original data

Impact on Model performance

To compare model performances, we use three versions of the training set: the original dataset used as a baseline, the same dataset augmented with SMOTE, and finally, the same data again, but augmented with our generative library. We used the three datasets to train the same classifier (XGBoost Classifier) and tested their performance on a portion of the real data, the validation set. Since the models are tested on the exact same data, any difference in performance can be attributed to the data augmentation technique.
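A minimal sketch of this comparison is shown below; the feature matrix is assumed to be already numerical (as SMOTE requires), and `generate_fraud` is a placeholder for the generative augmentation step, for instance the hypothetical `augment_fraud` helper sketched earlier.

```python
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def compare_augmentations(X, y, generate_fraud):
    """Train the same XGBoost classifier on three training sets and score each
    on the same held-out slice of real data. `generate_fraud(X, y)` is a
    placeholder returning a generatively augmented training set."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                stratify=y, random_state=0)

    def score(X_fit, y_fit):
        clf = XGBClassifier(random_state=0)
        clf.fit(X_fit, y_fit)
        return f1_score(y_val, clf.predict(X_val))

    results = {"baseline": score(X_tr, y_tr)}

    # SMOTE: oversample fraud up to 20% of the majority class (numerical features only)
    X_sm, y_sm = SMOTE(sampling_strategy=0.2, random_state=0).fit_resample(X_tr, y_tr)
    results["smote"] = score(X_sm, y_sm)

    # Generative augmentation (e.g. the CTGAN-based sketch shown earlier)
    X_gan, y_gan = generate_fraud(X_tr, y_tr)
    results["generative"] = score(X_gan, y_gan)
    return results
```

Because all three models are scored on the same untouched validation split, any difference in the resulting metrics can be attributed to the augmentation technique.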

Predicting Fraud — Comparative model performance on augmented vs original data

As we can see from the graphs, the model trained on the data augmented through our generation method (in this instance we used the REaLTabFormer model) performs slightly better than the model trained on the data augmented with SMOTE. However, we did not see any performance improvement over the baseline, which is the model trained on the original dataset.

Throughout our tests, we found that performing data augmentation with synthetic data before classic preprocessing tasks, such as one-hot encoding of categorical variables, increased model performance. This may be because generation quality is hindered by sparse datasets like those that have gone through one-hot encoding. This ordering is not possible with SMOTE, which requires numerical inputs, whereas the generator can handle unprocessed categorical variables internally.
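In practice, this simply means swapping the order of two pipeline steps: the generator sees the raw, non-encoded table, and one-hot encoding happens only afterwards, right before the classifier. A short sketch, reusing the hypothetical `augment_fraud` helper and `claims` dataframe from the earlier example:

```python
import pandas as pd

# Augment first, on the raw table with its original categorical columns...
augmented = augment_fraud(claims, target="fraud", fraud_proportion=0.2)

# ...then one-hot encode for the classifier (SMOTE would instead require
# this encoding step *before* augmentation).
X = pd.get_dummies(augmented.drop(columns=["fraud"]))
y = augmented["fraud"]
```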

The conclusion of these experiments must be considered carefully: it is uncommon for a dataset as seriously imbalanced as the one we used to lead to such high performance, and we strongly encourage testing on other datasets and in domains other than fraud detection. However, we showed that data augmentation through synthetic data generation performs as well as SMOTE, if not better, and that it is therefore a strong complementary tool for enriching datasets.

Another complementary approach would be to test this method on different types of models. Indeed, XGBoost tends to be relatively good at handling imbalanced datasets, so data augmentation could be more useful for models that do not handle minority classes as well.

Fully synthetic and augmented datasets using GAN

Introducing anonymization into the pipeline

One additional issue that can arise in fraud detection modeling is the handling of private, sensitive information. For most use cases, fraud detection models are fed data that can contain personally identifiable information (PII) that must be protected (for example, insurance claims). As a consequence, it may not always be possible to use real data to train models.

Especially when data is transferred between stakeholders, the dataset needs to be fully anonymized to protect any personal data it may hold and to respect GDPR constraints. In this part, we explain how we can anonymize the dataset used for fraud detection models while augmenting it to solve the class imbalance issue.

Workflow

We considered several approaches. The first is to fully anonymize the dataset and then augment it: in a B2B setting, the data controller would anonymize the dataset before handing it over to the data services provider, who would then augment it.

The downside of this approach is the following: since both anonymization and augmentation involve creating synthetic data, there is a risk of error propagation. Because the similarity of the generated synthetic data to the original data is never perfect (see our first article on using GANs for synthetic data generation, in which we discuss the similarity-privacy trade-off), we can consider that there is a small error ε between the original data and the synthetic data. If we then decide to augment the data, for example to generate more fraudulent data points, the GAN will again introduce an error ε in the new fraud samples. Since the augmentation is based on already-synthetic data, the error between the original fraud data f and the final (anonymized and augmented) fraud subset risks accumulating to 2ε.

So we designed a different way of combining anonymization with augmentation, to keep the error as small as possible: a function that augments and anonymizes the data simultaneously. The generation method allows us to select the number of samples we want to generate. By splitting the data into two parts corresponding to the two classes, fraud and non-fraud, and running the generator on each subset individually, we can control the number of synthetic data points being created. This way, we prevent the error between the original data and the final synthetic dataset from exceeding ε. Below is a visual explanation of both workflows:

Workflow #1: anonymization followed by data augmentation (with the risk of error propagation)
Workflow #2: augmentation and anonymization done simultaneously (leading to a maximum error of ε)
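As a rough illustration of workflow #2, the sketch below synthesizes each class separately so the sample counts (and therefore the final class balance) can be chosen directly, and no generator is ever trained on already-synthetic data. The `fit_and_sample` helper is a placeholder, not our library's API.

```python
import pandas as pd

def synthesize_and_augment(df, fit_and_sample, target="fraud", fraud_proportion=0.2):
    """Sketch of workflow #2: anonymize and augment in a single generation pass.
    `fit_and_sample(data, n)` is a placeholder that trains a generator on `data`
    and returns `n` synthetic rows (e.g. CTGAN or REaLTabFormer under the hood)."""
    fraud = df[df[target] == 1]
    non_fraud = df[df[target] == 0]

    # The non-fraud class is replaced by the same number of synthetic rows.
    synthetic_non_fraud = fit_and_sample(non_fraud, len(non_fraud))

    # The fraud class is generated directly at the desired (larger) size, so the
    # deviation from the original data stays bounded by a single generation pass.
    n_fraud = int(fraud_proportion * len(non_fraud))
    synthetic_fraud = fit_and_sample(fraud, n_fraud)

    return pd.concat([synthetic_non_fraud, synthetic_fraud], ignore_index=True)
```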

Workflow #2 requires a little extra work on the data controller’s side, but we believe that with the right guidance, the process should be straightforward to put in place.

Model Performance

As in the previous section, we compared the performance of different predictive models trained on the original dataset augmented with generative methods and on the fully synthesized, augmented dataset, by testing the models on a portion of the original data. In our case, while random forest showed stable performance across the two datasets, XGBoost performed worse when trained on the fully synthetic (and augmented) dataset than when trained on the augmented original data.


Model performance on the original augmented and fully synthetic datasets. Random forest shows greater stability in performance when trained on the fully synthetic data relative to the original augmented dataset; XGBoost shows a slightly larger difference. These gaps in performance suggest that synthetic data is more prone to overfitting, especially with the second model.

We draw several insights from these tests. The synthetic data we produced is close to the real data, as shown by the similar model performances; a model trained on synthetic data is therefore likely to work well on the real client data, meaning the client does not need to share its data for this purpose. We also noted a possible overfitting issue when training models on the fully synthetic and augmented data, especially with XGBoost. It would be worthwhile to explore what differentiates the original data from the synthetic data, since our similarity check shows satisfactory results.

Similarity

In the same way, we checked how close the synthetic and augmented data is to the original dataset using our library. We also visualized the differences in selected features’ correlation.

Our library’s similarity check makes it possible to visualize the difference in feature correlations between the original and synthetic datasets.

The most effective way to control the degree of similarity is to change the number of epochs; alternatively, it is possible to introduce some noise into the data before synthesizing it.
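One simple way to introduce such noise, shown below as an illustrative sketch rather than a built-in feature of the library, is to perturb the numerical columns before fitting the generator:

```python
import numpy as np
import pandas as pd

def add_noise(df: pd.DataFrame, scale: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Add Gaussian noise (scale = fraction of each column's std) to numerical
    columns before training the generator, to reduce similarity to the real data."""
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    for col in noisy.select_dtypes("number").columns:
        noisy[col] = noisy[col] + rng.normal(0, scale * noisy[col].std(), size=len(noisy))
    return noisy
```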

Privacy-Preservation

We also tested our library’s privacy metrics. The first type of privacy report comes from the SDV library and is concise while giving a sense of how well the synthetic data covers the real data. As the generator learns the distribution of the real data over the epochs, some rows from the real dataset may occasionally appear in the synthetic dataset, but this event is rare and can be addressed by decreasing the number of epochs.

Example of the privacy report that the SDV library can output

As we wanted a clearer grasp of how much privacy is compromised, we developed a customized privacy metric. The nearest-neighbour-based privacy check calculates a distance between each row in the real and synthetic datasets: for numerical columns, the absolute difference between normalized values; for categorical columns, whether the values are equal. Based on the obtained distance, the nearest synthetic neighbour of every real sample is found, and the closest samples from both datasets can be visualized. This gives the user a chance to decide whether a synthetic sample resembles a real sample too closely, and to retrain the model accordingly or simply remove the closest samples.

Example output of the privacy metric based on nearest neighbours
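Below is a minimal sketch of this kind of check, using the mixed distance described above (absolute difference on min-max-normalized numerical columns, a 0/1 mismatch indicator on categorical columns); the helper names are illustrative and the implementation is not our library's.

```python
import numpy as np
import pandas as pd

def nearest_synthetic_neighbors(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """For every real row, find its closest synthetic row under a simple mixed
    distance. Illustrative sketch; builds a full (n_real x n_synthetic) matrix,
    so it is only suitable for modest dataset sizes."""
    num_cols = real.select_dtypes("number").columns
    cat_cols = [c for c in real.columns if c not in num_cols]

    # Min-max normalize numerical columns using the real data's range
    lo, hi = real[num_cols].min(), real[num_cols].max()
    real_num = (real[num_cols] - lo) / (hi - lo + 1e-9)
    synth_num = (synthetic[num_cols] - lo) / (hi - lo + 1e-9)

    distances = np.zeros((len(real), len(synthetic)))
    for col in num_cols:
        distances += np.abs(real_num[col].values[:, None] - synth_num[col].values[None, :])
    for col in cat_cols:
        distances += (real[col].values[:, None] != synthetic[col].values[None, :]).astype(float)

    nearest = distances.argmin(axis=1)  # index of the closest synthetic row per real row
    return pd.DataFrame({"real_index": real.index,
                         "synthetic_index": synthetic.index[nearest],
                         "distance": distances.min(axis=1)})
```

Sorting the returned frame by distance surfaces the real/synthetic pairs that resemble each other the most, which is exactly what the user needs to review before deciding to retrain or drop samples.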

General Insights and Limitations

From our work on fraud detection using generative methods, we retain several insights as well as some limitations we encountered.

We showed that our generative method creates high-quality and diverse synthetic samples that closely mimic the real data. For data augmentation, we showed that the method performed similarly to a traditional data augmentation method on a class imbalance problem. Model performance did not increase compared to training on the original data, but we believe the method should be tested further on different, more imperfect datasets. From our experiments, model performance increases when data augmentation is done before basic data preprocessing, likely because of reduced sparsity (data sparsity increases with one-hot encoding).

For full data anonymization, we showed that models trained on the original augmented data and on the fully synthetic data have very similar performance across several models. For XGBoost on synthetic data, performance is slightly lower, suggesting that the synthetic data is more prone to overfitting. This could also be due to the fact that the synthetic data never perfectly resembles the real data, which can be controlled according to the desired privacy vs. similarity level. Going further, more research on the relationship between generated data and the models trained on it should be carried out.

Additionally, generative methods can be computationally expensive on high-dimensional data, so this aspect should also be considered when choosing the right data augmentation and anonymization approach.

Finally, we encourage more research and experimentation around our generative method, especially on different use cases. For example, it could be interesting to apply the method to a regression or clustering task where access to training data is limited, to keep exploring the behaviour of the library and improving its capabilities.

Stay tuned for the next article, on synthetic data generation for time series!

[1] Goswami, S. Class Imbalance, SMOTE, Borderline SMOTE, ADASYN. https://towardsdatascience.com/class-imbalance-smote-borderline-smote-adasyn-6e36c78d804

[2] Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K. Modeling Tabular Data Using Conditional GAN. Available at: https://proceedings.neurips.cc/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf

[3] Solatorio, A. & Dupriez, O. REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers. Available at: https://arxiv.org/pdf/2302.02041.pdf

[4] To carry out these tests we use the car insurance fraud data available on Kaggle at the following link: https://www.kaggle.com/datasets/incarnyx/car-insurance-fraud

Written by Sia AI

Solutions to help the enterprise navigate change
