More than ever we are faced with huge amounts of data about a particular business domain or its customers. Machine Learning is a very powerful tool for making sense of this data and turning it into something useful for the business, e.g. by providing insight into customer trends and preferences, or even helping to make better business decisions. In this post we are going to explore and implement an Anomaly Detection algorithm. No specific specialization or deep knowledge of the field is required, as the topics are explored from as high-level a perspective as possible, and full working code is provided for playing around with the data (feel free to comment with new ideas). In this post the algorithm is implemented using Octave (MATLAB can be used as well). The next post will be about implementing the algorithm in Java using Spark with a larger amount of data, as Octave is limited in this respect.
Anomaly Detection Problem
We have a significant amount of data and want to know whether a newly arriving example is an anomaly. Most of the data are similar in structure and content, but among them there is usually a very small proportion that does not follow the common trend.
Applications
Aircraft engines
In this kind of application we constantly monitor aircraft engine measurements like heat generated, vibration, fuel consumption, power generated, and so on, and we want to detect when the engine state looks different from what we have seen in the past, so it can be flagged and investigated as a possible defect.
Monitoring computers in data centers
Similarly to aircraft engines, we monitor computers in data centers for memory use, disk access, CPU load, and network traffic, and want to detect when a particular computer is behaving strangely compared to what we usually see. For example, we may detect a high CPU load together with very low memory use and want to check whether it could be a virus.
Fraud detection
This is the most common application of Anomaly Detection algorithms. We may have a web application where user behavior is monitored in order to detect whether any user is acting strangely or with unfriendly intentions. It can happen that the currently logged-in user is not the real owner of the account, and we want to investigate whether the account has been hacked. In this case a behavior profile may be kept for each user.
The application implemented in this post is fraud detection on finance transactions. We will apply the algorithm to a large amount of financial data and detect transactions which may be fraudulent. Fraudulent transactions have a big impact on companies, especially banks, considering the wide and still growing usage of online transactions.
Anomaly Detection Algorithm
The Anomaly Detection problem has two important properties. The first is that we usually do not have many anomalous examples compared to regular data. The second is that we do not know in advance what regular or anomalous data look like; instead we hope the algorithm will find that out for us. To better understand the impact of these two properties we can compare Anomaly Detection to a Classification Problem. At first glance Anomaly Detection looks like a Classification Problem, and in a way it is, because after all it classifies data into regular examples and anomalies. But in reality it differs in two essential ways:
- Classification Problems have a very precise definition of the classification categories; for example, emails are SPAM or NOT SPAM. In fraud detection we do not really know the nature of the anomalies. Of course we may have an idea of what an obvious fraudulent transaction looks like, but there is no easy way to know the more personalized kinds of fraud per user. For example, user A has for a considerable amount of time made only small or medium transfers and is now transferring a big amount, or user B buys online only on weekends, and so on. In short, we cannot know or predict all anomalies, and as a consequence we do not have all the categories.
- Classification Problems usually have a lot of positive examples; with spam emails we have plenty of SPAM examples, or we can generate them because we know fairly precisely what they look like (at least most of them). In Anomaly Detection we have only a very small number of anomalies compared to the regular data. Classification algorithms like Logistic Regression require a reasonably large number of positive examples in order to train well and predict reasonably accurately.
Gaussian Distribution
Anomaly Detection is based on the Gaussian Distribution. A proper explanation of the Gaussian Distribution is beyond the scope of this post, so only the basics are explained here, just enough to understand the algorithm from a high-level perspective.
µ = (1/m) Σᵢ xᵢ — the average of the values on the horizontal axis.
σ² = (1/m) Σᵢ (xᵢ − µ)² — the average squared difference between the values on the horizontal axis and the µ value calculated above.
As the figure shows, the values on the horizontal axis become denser as we move closer to µ and sparser as we move farther from it. So when a new example comes in, depending on its position on the graph (its value), we can tell whether it lies in the dense part or the sparse part. If it lies in the dense part, where most of the red points are, we can say it is a normal occurrence. If it lies in the sparse part, we can say it is an anomaly. Once µ and σ² are found, there is a standard formula for the probability density of a newly arriving example (basically its position): f(x) = (1/√(2πσ²)) · e^(−(x − µ)² / (2σ²)).
For example, if we plug µ = 0 and σ² = 1 into the equation above and take x = 0, we get f(x) = (1/√(2π)) · e^(−(0 − 0)²/(2·1)) ≈ 0.399 · e⁰ ≈ 0.4, the same value the figure shows.
Depending on this probability we can then decide whether the example is an anomaly or not. If the value of the function is smaller than a threshold ε we say it is an anomaly, and normal if it is greater. We can control how aggressive our algorithm is by playing with the ε value. With a big value of ε the algorithm flags a lot of anomalies, and as a consequence we increase the number of normal occurrences wrongly classified as anomalies. On the other hand, with a small value of ε we may miss anomalies, as the algorithm becomes more tolerant. So finding a balanced value of ε is crucial to the algorithm's success. Later in this post we will show one way to do just that.
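To make this concrete, here is a minimal one-dimensional sketch in Octave; the historical values, the new examples, and the ε value are made up purely for illustration and are not part of the fraud data set.

```octave
% Illustrative one-dimensional example: estimate mu and sigma^2 from "normal"
% historical values, then flag a new value whose density falls below epsilon.
x = [9.8 10.1 9.9 10.3 10.0 9.7 10.2];    % made-up historical (normal) values
mu = mean(x);                              % average of the values
sigma2 = mean((x - mu) .^ 2);              % average squared difference from mu
f = @(v) (1 ./ sqrt(2 * pi * sigma2)) .* exp(-(v - mu) .^ 2 ./ (2 * sigma2));
epsilon = 0.05;                            % threshold; later chosen via cross validation
f(10.05) < epsilon                         % 0: dense region, normal occurrence
f(14.20) < epsilon                         % 1: sparse region, flagged as anomaly
```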
Multidimensional Data
The example above was one-dimensional: we had just a one-dimensional vector of values, for example X = [1, 2, 3, 4, ...]. Real examples are multidimensional. For example, to detect fraud in finance transactions we need several pieces of information, as below:
| type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PAYMENT | 9839.64 | C1231006815 | 170136 | 160296.36 | M1979787155 | 0 | 0 |
| PAYMENT | 1864.28 | C1666544295 | 21249 | 19384.72 | M2044282225 | 0 | 0 |
| TRANSFER | 181 | C1305486145 | 181 | 0 | C553264065 | 0 | 0 |
| CASH_OUT | 181 | C840083671 | 181 | 0 | C38997010 | 21182 | 0 |
| PAYMENT | 11668.14 | C2048537720 | 41554 | 29885.86 | M1230701703 | 0 | 0 |
With multidimensional data we need to make some small, non-essential changes to the way µ, σ², and the probability f(x) are calculated. Let's address the data as follows:
- Xᵢⱼ, where i refers to the row and j refers to the column; for example X₂₂ is 1864.28.
- µⱼ refers to the µ of column j; for example µ₄ is the µ of the column oldbalanceOrg.
- σ²ⱼ refers to the σ² of column j; for example σ²₅ is the σ² of the column newbalanceOrig.
Then, more formally:
µⱼ = (1/m) Σᵢ Xᵢⱼ — the average of the m row values in column j.
σ²ⱼ = (1/m) Σᵢ (Xᵢⱼ − µⱼ)² — the average squared difference between the rows of column j and the µ of column j.
As we can see, µ and σ² are now vectors with dimension equal to the number of columns of the data, rather than single values. This is the difference compared with the simple example we saw before: each entry of the vector now represents the value for one column only. The columns of the data are also called Features, so µ and σ² are vectors of the same size as the number of Features, and each entry of those vectors represents the value for one particular feature.
Now that we have µ and σ², we need to calculate the probability, the f(x) function. Also in this case there is a small modification:
We compute f(xⱼ) for each column j as if it were a one-dimensional vector (same as in the simple example), and then multiply the f(xⱼ) of all features together: p(x) = Πⱼ f(xⱼ; µⱼ, σ²ⱼ).
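As a reference, a minimal Octave sketch of these vectorized calculations might look like the following; the column indices are an assumption about which columns of prototypeData.csv are used as features.

```octave
% mu and sigma2 as per-feature (per-column) vectors, and the overall probability
% of each row as the product of its per-feature Gaussian densities.
data = csvread('data/prototypeData.csv');
X = data(:, 1:9);                                % assumed feature columns (labels excluded)
mu = mean(X);                                    % 1 x n vector: one mu per feature
sigma2 = mean(bsxfun(@minus, X, mu) .^ 2);       % 1 x n vector: one sigma^2 per feature
d = bsxfun(@minus, X, mu);                       % deviation of every value from its column mu
f = bsxfun(@rdivide, exp(bsxfun(@rdivide, -d .^ 2, 2 * sigma2)), sqrt(2 * pi * sigma2));
p = prod(f, 2);                                  % m x 1: one probability per row/example
```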
Prototyping
Before digging into the code it is usually a good idea to start playing around with the data so we can get an idea of what to expect. A prototype may show us that the data we have are not enough, that we need to add or remove Features (columns), or that the data and the algorithm are not going to give us what we want, in which case it is better to stop at this phase rather than waste a lot of time and effort on well-written code.
Data
In finance it is very hard to find data because of privacy concerns, but fortunately there is a generator available online that can provide enough data:
“This work is part of the research project ‘Scalable resource-efficient systems for big data analytics’ funded by the Knowledge Foundation (grant: 20140032) in Sweden.
PaySim first paper of the simulator:
E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. ‘PaySim: A financial mobile money simulator for fraud detection’. In: The 28th European Modeling and Simulation Symposium (EMSS), Larnaca, Cyprus, 2016.”
Feel free to download original data from here.
Data Preparation
The file has more than we need for a prototype, so before using it we need to make some small, non-essential changes.
First, we are going to work only with the TRANSFER type; as a consequence, from roughly 7 million rows in total we keep only about 532,000 rows to work with.
After that we convert nameOrig and nameDest to numbers. Most of each value is already a number, but it starts with a character like C or D, so we simply replace C with 1 and D with 2.
Finally, we also remove the type column/feature, since it is TRANSFER for all of our rows. Feel free to download prototypeData.csv.
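As a reference, a minimal Octave sketch of this preparation could look like the one below; the input file name and the exact column layout are assumptions, and the replacements simply mirror the steps described above.

```octave
% Filter TRANSFER rows, drop the constant type column, and turn the C/D
% prefixes of nameOrig/nameDest into digits so csvread can parse the file.
fin  = fopen('data/rawPaySim.csv', 'r');    % hypothetical input file name
fout = fopen('data/prototypeData.csv', 'w');
fgetl(fin);                                 % skip the header line
line = fgetl(fin);
while ischar(line)
  if ~isempty(strfind(line, 'TRANSFER'))
    line = strrep(line, 'TRANSFER,', '');   % remove the type column (same for all kept rows)
    line = strrep(line, ',C', ',1');        % replace the C prefix with 1
    line = strrep(line, ',D', ',2');        % replace the D prefix with 2
    fprintf(fout, '%s\n', line);
  endif
  line = fgetl(fin);
endwhile
fclose(fin); fclose(fout);
```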
OCTAVE
Since most of the time we are going to work with matrices, a good tool is MATLAB or Octave, or any other tool that is good at manipulating matrices. I chose Octave because it is free to use. Besides the Octave page, you can find a good tutorial in the famous Coursera course on Machine Learning by Andrew Ng here. Even if you do not feel like learning either of them, I will provide full working scripts so you can still play easily with the data. But for more flexible experiments at least some basics have to be learned, which I recommend, as the basics of Octave are really easy to master.
Installing Octave is easy: just download it from here and install it, or extract one of the zip packages.
To run Octave, just start octave-gui.exe in the \bin folder; then you need to download the source code here. After that you need to point Octave to the folder with the scripts. This can be done easily at the top, in the field named “Current Directory” (which usually defaults to the bin folder). Now you are ready to run the prototype and play with the data.
Visualize Data
So now we have 532,909 transactions with 10 Features (columns), as below:
Step, Type, Amount, NameOrig, OldbalanceOrg, NewbalanceOrig, NameDest, OldbalanceDest, NewbalanceDest, isFraud, isFlaggedFraud
Our data are labeled as fraud by the 10th feature, isFraud. So the first thing to do is query how many of the rows are fraud: the result is 4,097 (0.77%) of the 532,909. The data are really unbalanced, which meets the condition for using Anomaly Detection rather than a Classification algorithm.
Since our data are multidimensional (10 columns) we cannot really visualize or plot them directly. Fortunately there is a way to reduce the data to lower dimensions using Principal Component Analysis (PCA). Basically, we can think of it as an algorithm that reduces the dimension of the data (data compression) without essentially losing their content. Since PCA reduces the number of columns of our data matrices, it can really speed up machine learning algorithms in general, especially when you have data with hundreds of features. But it can also be used to plot the data in 3D or 2D, since you can reduce the features (columns) from n to, say, 2 or 3. In Computer Vision you may have, say, 100x100 images, which makes 10,000 pixels in total, and since all pixels are features it really speeds up the algorithms if we can compress the images (reduce pixels/features) to some degree. In our data we are going to reduce the dimension from 10 to 2 and plot it. This can be done by executing the following (a sketch of the projection itself follows the list):
- loadDataAndApplyPCA()
- Or you can do it in two steps:
- data = csvread('data/prototypeData.csv') -> loads the data so you can use and manipulate them
- applyPCA(data) -> applies PCA to the just-loaded data
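The applyPCA script is included in the downloadable source code; as a rough idea of what such a projection does, a minimal sketch (with assumed column indices for the features and the isFraud label) could look like this:

```octave
% Project the feature columns onto their first two principal components and
% plot regular rows in red and fraud rows in blue.
data = csvread('data/prototypeData.csv');
X = data(:, 1:9);                         % assumed feature columns
y = data(:, 10);                          % assumed isFraud label column
Xc = bsxfun(@minus, X, mean(X));          % center every feature around zero
Sigma = (Xc' * Xc) / size(Xc, 1);         % covariance matrix
[U, S, V] = svd(Sigma);                   % columns of U are the principal directions
Z = Xc * U(:, 1:2);                       % 2-D projection of every row
plot(Z(y == 0, 1), Z(y == 0, 2), 'r.'); hold on;
plot(Z(y == 1, 1), Z(y == 1, 2), 'b.'); hold off;
```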
Then we will be able to see the graph below:
The regular cases are in red and the fraud cases (please zoom in) in blue.
The data are grouped in islands and it is hard to tell what trend the regular data follow and what trend the frauds follow. Even when we zoom in, we can only see that in each island the frauds (blue) are located at the edges.
In order to get a better view we need to normalize the data and see if we have better luck. Basically, what normalization does is transform the values of each feature into a more uniform range. For example, the Amount feature may have values like 100, then 10, then 1, then 104,000, then 1,000,000, then 123. As we can see these values are within very different ranges and we want to make them look more or less the same. One way to do that is by applying the formula below:
xᵢ := (xᵢ − μᵢ) / sᵢ, where μᵢ is the average of all the values of feature i and sᵢ is the range of values (max − min), or the standard deviation.
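A minimal sketch of this normalization in Octave (using the standard deviation as sᵢ, with assumed feature columns) could be:

```octave
% Normalize every feature column: subtract its mean and divide by its spread.
data = csvread('data/prototypeData.csv');
X = data(:, 1:9);                                  % assumed feature columns
mu = mean(X);                                      % per-feature average
s = std(X);                                        % per-feature standard deviation (max - min also works)
s(s == 0) = 1;                                     % guard against constant columns
Xnorm = bsxfun(@rdivide, bsxfun(@minus, X, mu), s);
```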
This can be achieved by executing:
- loadDataAndApplyPCAWithNormalization()
- Or you can do it step by step:
- load pay.mat -> loads the data so you can use and manipulate them
- who -> shows the variables currently in memory
- applyPCAWithNormalization(data)
Then the graph looks like below:
As we can see, this graph looks better because it clearly shows the trend of the regular data (red) and of the fraud data (blue).
Please find below a zoomed-in view:
From the graph we can see that our data make sense, and Anomaly Detection with the Gaussian Distribution will most probably give good results. We say the data make sense because we can clearly see what trend the regular and the fraud data follow. Even when the graph of a particular data set does not look this good, it is still worth trying the algorithm and seeing how many frauds it detects. Visualizing the data with the help of PCA is just a step that helps us get an idea of what our data look like and what trend they follow.
Applying the Anomaly Detection Algorithm
Before applying the Gaussian Anomaly Detection algorithm to the data and looking at the results, there are a few steps we need to clarify.
Training
Like most Machine Learning algorithms, Anomaly Detection has a training phase, which basically defines how the algorithm will behave in the future. The data used to train the algorithm are called the Training Set, and we are going to pick randomly 60% of the data that are not labeled as anomalies (we need the 4,097 anomalies for later phases, since there are so few of them).
Cross Validation
At this point we have the μ average vector and the σ² vector ready, so for each new example x we can apply the Gaussian formula f(xⱼ) for each feature (column) and then multiply them to get the final F(x). Even with F(x) ready, we do not yet know how to judge whether an example is an anomaly or not. For this purpose we need to choose a balanced value of ε. ε is just a threshold that decides whether a newly arriving example is fraud or not. We do not want this value to be too large (a lot of non-frauds are flagged) or too small (we may miss real frauds). In order to choose ε we use a Cross Validation Set, which is picked randomly as 20% (about 100,000 rows) of the data plus 2,000 random anomalies (almost half of them). For each example in the Cross Validation Set we calculate F(x). We pick an initial ε equal to the smallest calculated F(x), and then keep changing ε in the range between the smallest and the biggest F(x). For every new value of ε we keep track of the ε which maximizes the F-Score (more below), so in the end we have the ε which produces the biggest F-Score. The intuition behind the F-Score is to make sure we are not missing a lot of anomalies while at the same time not flagging a lot of non-frauds (if you flag everything as fraud you catch 100% of the frauds, but that is not what we want). More formally:
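The standard definitions are Precision = tp / (tp + fp), Recall = tp / (tp + fn), and F-Score = 2 · Precision · Recall / (Precision + Recall), where tp, fp, and fn are the true positives, false positives, and false negatives on the Cross Validation Set. A minimal Octave sketch of the ε search could look like this (pCV and yCV are hypothetical names for the cross validation probabilities and their fraud labels):

```octave
% Scan epsilon between the smallest and biggest F(x) on the cross validation
% set and keep the value that maximizes the F-Score.
bestEpsilon = 0;
bestF1 = 0;
stepSize = (max(pCV) - min(pCV)) / 1000;
for epsilon = min(pCV):stepSize:max(pCV)
  predictions = pCV < epsilon;                 % examples flagged as fraud
  tp = sum((predictions == 1) & (yCV == 1));   % real frauds that were flagged
  fp = sum((predictions == 1) & (yCV == 0));   % regular examples wrongly flagged
  fn = sum((predictions == 0) & (yCV == 1));   % real frauds that were missed
  precision = tp / (tp + fp);
  recall = tp / (tp + fn);
  F1 = 2 * precision * recall / (precision + recall);
  if F1 > bestF1                               % NaN results are simply skipped here
    bestF1 = F1;
    bestEpsilon = epsilon;
  end
end
```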
Test Validation
Now we have the μ average vector, the σ² variance vector, and the best ε, so it is time to test how accurately our algorithm works. To do this we pick the remaining 20% (about 100,000 rows) of the data plus the remaining 2,097 frauds. For each of them we calculate F(x); if it is smaller than ε we flag it as fraud, otherwise not. Since we choose our data randomly, it is always a good idea to run the algorithm several times, keeping track of the best result and registering the best μ, σ², and ε. At the end of this phase we check how well the algorithm behaves and whether the results satisfy our expectations. If the results are not satisfactory, then we probably need to work more on the Training phase, or more likely on the Cross Validation phase.
Execute algorithm and Results
To execute the algorithm we simply do the following:
- loadDataAndApplyAnomalyDetection() -> executes the cross validation phase with 2,000 frauds and 100,000 (approx. 20%) of the data, and the test phase with 2,097 frauds and 100,000 (approx. 20%) of the data.
- Or you can do it in two steps:
- data = csvread('data/prototypeData.csv')
- applyAnomalyDetection(data, 2000, 100000, 2097, 100000)
After that we will be able to see two graphs, as below:
Regular data are marked in red, and all the anomalies before applying the anomaly detection algorithm are marked in magenta.
Below we will see how many of those magenta points are detected by our algorithm.
Regular data are marked in red; anomalies that were not detected by the algorithm are marked in magenta; successfully detected anomalies are marked in green; and data flagged as anomalies that in reality are not are marked in blue.
If we zoom in with Octave we get a better view of how well our algorithm did:
As we can see, a lot of anomalies were not detected (magenta). More precisely, in my current run:
| Metric | Count |
| --- | --- |
| Flagged As Fraud | 6,775 |
| Found Fraud | 2,033 |
| Missed Fraud on Test Validation | 1,048 |
| Total Missed Fraud | 2,064 |
As we can see, the results are not satisfactory, since about half of the frauds were not detected. Can we do better?
When the results are not as expected, we can usually investigate:
- Do we need more data? -> From our visualization it looks like the data are good, but it is usually worth investigating this, as described well in this course.
- Can we add more features to the data? Maybe the 10 features are not enough for detecting frauds. -> We can add more features if we think they can help detect more frauds. In our case we cannot add features, as we have no control over the data.
- Is there anything specific about the algorithm we can optimize? In this case, the Gaussian Anomaly Detection algorithm can work better (though it is not guaranteed) if all the features look more like a Gaussian bell-shaped curve.
Making the Features Look More Like a Gaussian Bell-Shaped Curve
We can check our features by plotting a histogram and seeing whether they look more or less bell-shaped. In Octave this is really easy to do:
- data = csvread('data/prototypeData.csv')
- hist(data(:, [3]), 50) -> plots a histogram of all rows for column/feature 3.
The histogram looks like this:
Obviously this does not look like a Gaussian bell-shaped curve, so we can try several transformations like the ones below:
- log(dataWithSpecificFeature)
- log(dataWithSpecificFeature+1)
- log(dataWithSpecificFeature + c) for some constant c
- dataWithSpecificFeature .^ (1/3)
After a few tries we find that what works for feature 3 is dataWithThirdFeature .^ 0.1, and after plotting it (hist(data(:, [3]) .^ 0.1, 50)) it looks like this:
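For reference, a hypothetical sketch of applying such transforms before training; the transforms for the other features are placeholders, and each one has to be found per feature as described above.

```octave
% Apply the chosen per-feature transforms, then re-check the histograms.
data = csvread('data/prototypeData.csv');
data(:, 3) = data(:, 3) .^ 0.1;           % exponent found for feature 3 above
data(:, 4) = log(data(:, 4) + 1);         % placeholder transform for another feature
data(:, 5) = log(data(:, 5) + 1);         % placeholder transform for another feature
hist(data(:, 3), 50);                     % should now look closer to a bell curve
```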
We do this for all features (for some it may not be possible; just do your best) and then try executing again as below:
- loadDataAndApplyAnomalyDetectionWithGaussianFeatures() -> executes the cross validation phase with 2,000 frauds and 100,000 (approx. 20%) of the data, and the test phase with 2,097 frauds and 100,000 (approx. 20%) of the data.
- Or you can do it in two steps:
- data = csvread('data/prototypeData.csv')
- applyAnomalyDetectionWithGaussianFeatures(data, 2000, 100000, 2097, 100000)
After that, let's see how the graphs and the results look:
Regular data are marked in red, and all the anomalies before applying the anomaly detection algorithm are marked in magenta.
Let's see how the picture looks after executing the algorithm on the data:
Regular data are marked in red; anomalies that were not detected by the algorithm are marked in magenta; successfully detected anomalies are marked in green; and data flagged as anomalies that in reality are not are marked in blue.
As we can see, many more anomalies were detected now, and the results look like below:
| Metric | Count |
| --- | --- |
| Flagged As Fraud | 12,139 |
| Found Fraud | 4,086 |
| Missed Fraud on Test Validation | 8 |
| Total Missed Fraud | 11 |
Indeed, we were able to detect 99.7% of the anomalies, but in contrast to the previous run we also flagged more transactions: 12,139 compared to 6,775, so the number basically doubled.
Conclusion
From the results we can see that, first, it is always a good idea to normalize the data, as the visualization gets better and the algorithm also tends to perform better.
Making our features look more bell-shaped caused the algorithm to perform better, missing only a few anomalies (11 out of 4,097), but it also flagged more transactions that were actually not frauds.
For production purposes I would suggest experimenting with ε to find something that satisfies our needs. Maybe we can be more tolerant and lower the number of false frauds a bit.
Now we are ready to apply the algorithm to a larger amount of data using a more production-ready language like Java. In the next post we will implement Fraud Detection with the Gaussian function using Java and Spark.