Fraud Detection and Text Mining
“Data Mining” is the refinement of data using statistical algorithms to
learn patterns and correlations in the data. Data mining searches through and
extracts information from corporate data warehouses, or from details that users
have left on a website, which can help to advance and enhance the knowledge
and usage of the data.
Data mining uncovers patterns and relationships in data that were previously
unknown and unexplained. It is part of a larger process called “knowledge
discovery”, which sets out the steps that produce meaningful results. Data
mining technology has developed out of research areas such as machine learning,
statistics, and artificial intelligence. Data mining tools take data and
construct a representation of reality in the form of a model. The resulting
model sets out patterns and relationships in the data that were previously unknown.
Fraud refers to the use of an organization’s system for personal enrichment
through the intentional misuse of the employing organization’s resources or
assets, without necessarily leading to direct legal consequences.
Fraudulent activities have recurred in many professional systems, including
mobile communications, online transactions, e-commerce, the telecommunication
industry, credit cards, and insurance. It is essential to recognize potentially
fraudulent activities and irregular use of data, to determine the practices used
to gain fraudulent access to customer accounts, and to identify uncommon
activities that require recognition and handling, e.g. repetitive busy-hour
call attempts, switch and router jamming patterns, and repeated calls from
automatic dial-out devices (such as fax machines) that have been programmed
incorrectly.
Fraud detection methodologies keep developing in order to stay ahead of
criminals who adapt to these techniques. Advancing fraud detection methods is
difficult because the exchange of ideas is severely restricted. There are many
approaches to fraud detection, including artificial immune systems, parallel and
distributed computing, statistics, image processing and pattern recognition,
machine learning, artificial intelligence, fuzzy logic, databases, expert
systems and many others.
Fraud is identified from inconsistencies in the data and in its patterns.
Categories of Fraud
The different types of fraud are:
A. Credit-Card Fraud Detection – Details of credit card fraud detection are
confidential and are not readily disclosed to the public.
1. Outlier Detection – Outliers are observations that do not follow the regular
behaviour or model of the data. In outlier analysis, these rare events are of
more interest than commonly occurring ones. Unsupervised methods have no prior
knowledge of which transactions are fraudulent or non-fraudulent; they flag
transactions that differ from the regular ones and can therefore detect
previously undisclosed frauds as well. Supervised methods use information from
historical databases to distinguish between non-fraudulent and fraudulent
behaviour, but they can only identify and classify previously identified types
of fraud.
2. Neural Networks – These tolerate changes in the data and can identify
changes about which they have no information and on which they were not
trained, so they can be used for both unsupervised and supervised learning.
They can work with very little knowledge of classes and attributes. A neural
network is formed of a set of nodes, and each node has weighted connections to
nodes in subsequent layers. Each node takes the inputs received from adjacent
nodes, combines them with the connection weights, and applies a function to
derive its output value.
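The node computation described above can be illustrated with a small sketch. It assumes a sigmoid as the function applied at each node, and the weights and input features are made-up values chosen only for illustration; a trained network would learn its weights from data.

```python
import math

def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights):
    """Compute the outputs of one layer of nodes.

    weights[j][i] is the (hypothetical) weight on the connection from
    input i to node j of this layer.
    """
    outputs = []
    for node_weights in weights:
        # Weighted sum of the inputs arriving at this node.
        total = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(sigmoid(total))
    return outputs

# Two made-up transaction features feed two hidden nodes and one output node.
features = [0.5, -1.0]
hidden = layer_output(features, [[1.0, 2.0], [-1.0, 0.5]])
score = layer_output(hidden, [[1.0, -1.0]])[0]
print(hidden, score)
```

The final `score` lies between 0 and 1 and could be read as a fraud score once the weights have been fitted to labelled or unlabelled transactions.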
B. Computer Intrusion Detection – An
intrusion is a set of actions that might compromise the availability,
confidentiality, or integrity of network components such as file systems and
user accounts. “Misuse detection” looks for user behaviour and program
sequences that match previously detected intrusion cases, which are stored in
the form of signatures. These signatures are recorded by experts, based on
their knowledge and study of intrusion techniques. If a sequence matches, an
alert is generated. However, only previously identified signatures are
detected, and new or unknown signatures are not identified or
1. Expert Systems –
Computing systems that can represent and reason about some information-rich
domain in order to analyse and solve problems and provide solutions are known
as Expert Systems. They combine statistical analysis for detecting anomalies
with rule-based analysis to detect misuse of
2. Neural Networks – “NNID
(Neural Network Intrusion Detector)” is a system for detecting misuse and
anomaly intrusions, implemented with a “back propagation neural network”.
3. Model-based Reasoning – A
misuse detection process that observes activities and detects attacks through
C. Telecommunication Networks – Call
detail records are used to create customer profiles, and deviations and
irregularities are detected using these profiles.
1. Rule-based Approach: Differential
and absolute usage is tested against a set of rules mapped to the data. The
approach works on user profiles with explicit information, where the rules
express the fraud criteria. It combines customer data and behaviour data.
2. Neural Network: It
analyses user profiles independently and adapts gracefully to the behaviour of different users.
3. Visualization methods: These
detect anomalies by relying on human pattern recognition and provide real-time
data feeds. They combine human detection with machines to increase
computational capacity, and display the number of calls between different users
geographically in graphical form to identify frauds in
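The rule-based approach for telecommunication networks described above can be sketched as two simple checks against a customer profile. The function name `check_usage`, the thresholds and the sample history are all hypothetical and chosen only for illustration; real systems would derive their rules from the operator's own fraud criteria.

```python
def check_usage(profile_daily_minutes, todays_minutes,
                absolute_limit=600.0, differential_factor=3.0):
    """Return the names of the rules that today's usage violates.

    absolute_limit and differential_factor are illustrative thresholds,
    not values taken from any real system.
    """
    alerts = []
    # Absolute rule: usage exceeds a fixed ceiling regardless of history.
    if todays_minutes > absolute_limit:
        alerts.append("absolute")
    # Differential rule: usage deviates sharply from this customer's own average.
    if profile_daily_minutes:
        average = sum(profile_daily_minutes) / len(profile_daily_minutes)
        if average > 0 and todays_minutes > differential_factor * average:
            alerts.append("differential")
    return alerts

history = [30.0, 45.0, 40.0, 35.0]   # made-up daily call minutes for one customer
print(check_usage(history, 50.0))    # []
print(check_usage(history, 200.0))   # ['differential']
print(check_usage(history, 700.0))   # ['absolute', 'differential']
```

The differential rule catches a customer whose usage jumps relative to their own profile even when the absolute ceiling is not reached.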
Text Mining refers to the
process of deriving meaningful numeric indices, i.e. structured data, from
unstructured data. It is used to analyse a cluster of words and to find the
relation among the variables
under observation, such as non-fraud or fraud.
The text information
obtained from financial statements is in unstructured form, i.e.
amorphous, and needs to be converted into
structured data before data mining techniques like classification or
clustering can be used to identify fraudulent financial reporting.
The variables under
examination were obtained from financial statements such as income statements,
balance sheets etc.
To uncover important, previously unknown
information in textual financial documents, we use a text mining approach to
detect fraud in an increasing volume of data.
Text Mining: Detecting Frauds in Financial Statements
As input we provide copies
of financial statements. As the initial step of the process, these documents
include financial statements from both non-fraudulent and fraudulent
organizations. For each fraudulent organization's statement, the input must
include the financial statements of a corresponding non-fraudulent organization
of the same size in terms of
sales and assets.
The next step is to
extract the qualitative content from the financial documents and arrange it
into documents, as the document is the basic unit of text mining
analysis. During pre-processing, the words of all financial statements are
converted to lower case to prevent the same words from being treated differently.
All punctuation and numbers
are removed from the collection of statements, as the input data must be in
textual form only. Articles (the, an, a, etc.), conjunctions (and, but, etc.)
and prepositions (in, on, etc.) must also be removed, as they do not help to
distinguish documents.
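The pre-processing steps above can be sketched in a few lines. The stop-word list here is a small illustrative sample rather than a complete one, the sample sentence is invented, and the code assumes plain English text without accented letters.

```python
import re

# A small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "an", "a", "and", "but", "or", "in", "on", "of", "to"}

def preprocess(text):
    """Lower-case the text, drop punctuation and numbers, remove stop words."""
    lowered = text.lower()
    # Keep letters and whitespace only; everything else becomes a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [word for word in letters_only.split() if word not in STOP_WORDS]

sample = "The company reported Revenue of 12 million in 2019, but costs rose."
tokens = preprocess(sample)
print(tokens)
# ['company', 'reported', 'revenue', 'million', 'costs', 'rose']
```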
In text mining, the
syntactical structure of the sentence is ignored, since changes in the grouping
and order of words have no appreciable impact on the analysis result.
However, the count of occurrences of each word should be maintained. This
unordered collection of textual data is used to represent the financial
document as a vector of occurrence counts for each word in the document. This
vector is compared against the vectors obtained from the given non-fraudulent
and fraudulent documents. Documents that yield dissimilar vectors are regarded
as different, whereas documents that yield similar vectors are regarded as
similar.
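The word-count representation described above can be sketched as follows. The text does not name a particular similarity measure, so cosine similarity is used here purely as an assumption, and the three tiny documents are invented for illustration.

```python
import math
from collections import Counter

def count_vector(tokens, vocabulary):
    """Represent a document as occurrence counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_a = "revenue growth strong revenue".split()
doc_b = "strong revenue growth revenue".split()   # same words, different order
doc_c = "restatement litigation loss".split()     # no words in common with doc_a

vocabulary = sorted(set(doc_a) | set(doc_b) | set(doc_c))
vec_a, vec_b, vec_c = (count_vector(d, vocabulary) for d in (doc_a, doc_b, doc_c))

print(cosine_similarity(vec_a, vec_b))   # approximately 1.0: word order is ignored
print(cosine_similarity(vec_a, vec_c))   # 0.0: no shared words
```

Because only the counts are kept, `doc_a` and `doc_b` produce identical vectors even though their word order differs.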
The vectors obtained are
used to classify documents as non-fraud or as fraud. The “Support Vector
Machine”, a classification method, is used to detect fraudulent financial
reporting, as it is considered the most suitable method for distinguishing
between non-fraudulent and fraudulent financial statements.
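As a rough illustration of this classification step, the sketch below trains a simplified linear Support Vector Machine on word-count vectors using subgradient descent on the hinge loss. This is a minimal stand-in under stated assumptions, not the procedure used in any particular study: the vocabulary, the count vectors, the labels and the hyperparameters are all invented, and in practice an established library implementation would normally be used.

```python
def train_linear_svm(vectors, labels, epochs=100, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the regularised hinge loss.

    labels must be +1 (fraud) or -1 (non-fraud).
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: shrink w and move towards x.
                w = [wi - learning_rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Correct with enough margin: apply only the regularisation shrinkage.
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 (fraud) or -1 (non-fraud) for a count vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical counts over the vocabulary [growth, litigation, restatement, revenue].
vectors = [[2, 0, 0, 3], [1, 0, 0, 2], [3, 0, 0, 1],   # non-fraudulent statements
           [0, 2, 1, 0], [0, 1, 2, 1], [1, 3, 1, 0]]   # fraudulent statements
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(vectors, labels)
predictions = [predict(w, b, x) for x in vectors]
print(predictions)
```

A new statement is classified by building its count vector over the same vocabulary and passing it to `predict`.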