Fraud detection in car insurance: the problem of unbalanced sampling

Kononova K.; Havrylenko A.

doi:10.33111/nfmte.2020.138

Neuro-Fuzzy Modeling Techniques in Economics

ISSN 2415-3516

Kateryna Kononova

Anna Havrylenko

Abstract: When solving classification problems with machine learning methods, data scientists often face the problem of imbalanced data. Class imbalance is common in financial sector data, in particular in the task of fraud detection in car insurance. Training models on unbalanced data can lead to misclassification and a large number of erroneous predictions, because the classifier tends to assign observed cases to the majority class.

This paper studies ways to address class imbalance in the task of classifying insurance claims. For this purpose, a car insurance database was used that records the presence or absence of fraud in customer claims. The class of fraudulent cases, which is of most interest, is represented in the database by one third as many records as the class of legitimate claims. To avoid the problems of modeling on unbalanced data, oversampling techniques were applied, namely random oversampling and SMOTE. Evaluation of the results obtained on the different samples indicates that balancing methods can significantly improve classification quality.
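The two oversampling ideas named above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy feature values, the function names `random_oversample` and `smote_like_point`, and the fixed seed are assumptions made for illustration. Only the 3:1 class ratio follows the abstract. Random oversampling duplicates existing minority rows. The SMOTE-style step is simplified here to interpolating between two given minority points, whereas SMOTE itself picks the second point among the nearest minority neighbors.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumes a binary problem: every label other than minority_label is majority.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = len(labels) - len(minority)
    extra = [rng.choice(minority) for _ in range(majority_count - len(minority))]
    return list(rows) + extra, list(labels) + [minority_label] * len(extra)


def smote_like_point(a, b, rng):
    """Return a synthetic point on the segment between minority points a and b."""
    t = rng.random()  # interpolation weight in [0, 1)
    return [x + t * (y - x) for x, y in zip(a, b)]


# Hypothetical toy data: 6 legitimate claims (0) and 2 fraudulent ones (1).
rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.1], [1.8, 1.9], [2.1, 2.0],
        [5.0, 6.0], [5.5, 6.5]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)
synthetic = smote_like_point(rows[6], rows[7], random.Random(0))

print(balanced_labels.count(0), balanced_labels.count(1))
print(synthetic)
```

In practice, oversampling is usually applied to the training split only, so that duplicated or synthetic records do not leak into the evaluation data.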

Classifiers based on logistic regression, support vector machine, k-nearest neighbors, Bayesian classifier, decision tree, random forest and a perceptron-type neural network were built on the resulting datasets. A comparative analysis of the models' quality metrics made it possible to identify the best methods for detecting fraudulent claims. For both datasets, these were logistic regression and the neural network, which combine a high level of fraud detection with good overall predictive power.
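The distinction between fraud detection level and overall predictive power can be made concrete with a small worked example. This is a sketch under stated assumptions: the counts (75 legitimate and 25 fraudulent claims) are hypothetical and chosen only to mirror the 3:1 ratio described above, and the helper names are illustrative. It shows why accuracy alone can mislead on unbalanced data: a model that labels every claim as legitimate still scores well on accuracy while detecting no fraud at all.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Share of actual positive (fraud) cases that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels: 0 = legitimate claim, 1 = fraudulent claim.
y_true = [0] * 75 + [1] * 25
always_legitimate = [0] * 100  # a degenerate majority-class "classifier"

tp, fp, fn, tn = confusion_counts(y_true, always_legitimate)
baseline_accuracy = accuracy(tp, fp, fn, tn)
baseline_recall = recall(tp, fn)

print(baseline_accuracy)  # 0.75
print(baseline_recall)    # 0.0
```

Reporting recall on the fraud class alongside an overall measure is one way to compare classifiers on both criteria at once.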

Key words: machine learning, neural network, logistic regression, decision tree, classification, unbalanced data, oversampling, random oversampling, SMOTE

UDC: 519.2:368

JEL: C52 C55 G22

To cite paper

In APA style

Kononova, K., & Havrylenko, A. (2020). Fraud detection in car insurance: the problem of unbalanced sampling. Neuro-Fuzzy Modeling Techniques in Economics, 9, 138-155. http://doi.org/10.33111/nfmte.2020.138

In MON style

Кононова К., Гавриленко А. Виявлення шахрайства в автострахуванні: проблема незбалансованої вибірки. Нейро-нечіткі технології моделювання в економіці. 2020. № 9. С. 138-155. http://doi.org/10.33111/nfmte.2020.138 (дата звернення: 05.01.2026).

With transliteration

Kononova, K., Havrylenko, A. (2020) Vyiavlennia shakhraistva v avtostrakhuvanni: problema nezbalansovanoi vybirky [Fraud detection in car insurance: the problem of unbalanced sampling]. Neuro-Fuzzy Modeling Techniques in Economics, no. 9. pp. 138-155. http://doi.org/10.33111/nfmte.2020.138 [in Ukrainian] (accessed 05 Jan 2026).

# 9 / 2020
