email spam classifier

In this post we are going to develop a Spark based Java Application which detects spam emails. In the previous post ‘Logistic Regression‘ algorithm was used and in this post we are going to use SVM(Support Vector Machines) algorithm. Full code and working application are provided together with results including a comparison with logistic regression. Generally no deep knowledge on the field is required and topics are kept as high level as possible.

Support Vector Machine

SVM is an algorithm used for classification problems similar to Logistic Regression. Labeled data are given(spam , not spam) and we want to detect in which category new coming data belongs to. Most part of labeled data are used for training our algorithm and based on the training we predict in which category new examples belong to.

Application of SVM are very similar to logistic regression so generally classification problems.

How it works

SVM tends to be a bit complex to understand as a lot of math,algebra is needed to fully explain it.This post is not going to give a deep math explanation(this post does a great job) but rather a more high level explanation from a practical perspective.

SVM Hypothesis

SVM hypothesis is a discriminator function producing 1 or 0 rather than probability value like Logistic Regression.

SVM Hypothesis producing either 0 or 1

Below is Logistic Regression hypothesis from previous post:

Logistic Regression Hypothesis producing value close to 1 or 0 as probability like 0.23,0.76,0.89,0.99

SVM Cost Function

SVM does a small modification to the Logistic Regression cost function. To understand the modification and the difference we will need to plot Cost Functions in relation to Z instead on the hypothesis h(X) as we did in the previous post (because SVM hypothesis produces either 1 or 0 instead of continues probability values like Logistic Regression 0.7,0.3,0.9 see figure). Similar also here we have two hypothesis and as consequence two costs functions cost for 1 and for 0 : cost1 and cost0 respectively.

Lets look at cost1 : Z in relation to Cost Function for 1 category

Instead of having low Cost Function value(meaning is 1 category) when Z>0 we modified SVM to have low Cost Function value when Z>1.And Instead of having continues values like Logistic Regression we have discrete value 0.

Z > 1 => 1 (spam ,cancer)

Lets look at cost0 : Z in relation to Cost Function for 0 category

Instead of having low Cost Function value(meaning is 0 category) when Z<0 we modified SVM to have low Cost Function value when Z<-1.And Instead of having continues values like Logistic Regression we have discrete value 0.

Z < -1 => 0 (not spam , no cancer)

More formally we can calculate costs as below:

k i just a coefficient that defines the slope of the line.
If we replace Z greater than -1 at cost0 we get 0.
And if we replace Z greater than 1 and cost1 we get 0.

What really does? Differently?

Personally while I was first learning SVM I got confused of what it does differently from Logistic Regression because of their similarity. This section is going to give a more visual and intuitive explanation of what SVM really doing and what it does differently from Logistic Regression. Going back to our hypothesis function we have :

Lets simplify and refer only to a two dimensional data so we have :

or if we simplify further(X¹=1) we have:

Z=b+a*X

Geometrically is simply a line.

Recalling from section above Cost Function, Z is defined as below:

Z > 1 => 1, (b + a * X > 1)=>1 (spam ,cancer)

Z < -1 => 0, (b + a * X < -1)=>1 (not spam , no cancer)

So graphically we can present as below:

Two hypothesis separating the data into a two dimensional data.

So what SVM is doing? Is just try to maximize as much as possible the gab between two lines without passing the points of the group(keeping the space in between clear of points). So basically the two lines are trying to be as much as possible fare from each other and respecting the constraints so every point of the group must be <-1 or >1. Graphically it will look like below:

In the end SVM is giving us only one set of coefficient or only one hypothesis so the question is why here we have two lines? Basically in the end SVM is giving the line in the middle of the two lines so the line equally far from both of the lines.

This line beside that divides the examples has the nice opportunity to be as fare as possible from each of the data categories and that makes SVM a Large Margin Classifier.

Why is different from Logistic Regression(LR)? Because their Cost Function are different , SVM Cost Function is ax+b<-1(0 case) and ax+b>1(1 case) where LR Cost Function is log(1-h(x))(0 case) and log(h(x))(1 case) where h(x) is hypothesis representing Sigmoid Function. What that means is that LR is trying to minimize average values of log(1-h(x))(0 case) and log(h(x))(1 case) cost functions to be as low as possible. So the line separating data in case of LR has different intention comparing to SVM.

Why we care about Margin? Although usually SVM and LR give same results margins can help algorithm to generalize better. That is if the boundary is as fare as possible from the data it means that we have some space there left for both of the categories in case any data comes in. Below is case when line still separates correctly but leaves little space for green cases. So when a new green example is coming we fail to detected as green and flag as red incorrectly.

So basically is good to have some distance from the groups or tolerance as it helps our algorithm to generalize better at new coming data examples.

Kernels

So fare we have seen data that are linearly separable, so data can be separated by a line or in multidimensional data by a hyperplane. But there are also cases when data is not linearly separable like for example in below picture:

Data when when there is not line which can clearly separate both of the cases

In order to make SVM work we can transform our data to higher dimension and make them more linearly separable. The functions which can do that are called kernels. Most common function is ‘Similarity Function’ or ‘Gaussian Kernel’. What it does in principle is just calculating the similarity of one point in comparison to all other points.If two data points are far from each other the function return almost 0 while if the points are similar it returns almost 1.

So basically if you have 10.000 of data rows you end up with a matrix of 10.000 X 10.000 since each row is compared to all other rows and for each of the comparison we get a number(close to 0(not similar) or close 1(similar) ). More formally Gaussian Kernel can be defined as below:

x is current row and l is one of other rows we are comparing to

So lets try to apply Gaussian Kernel to above picture and than apply PCA to reduce dimensions since Gaussian Kernel increases the dimension to the size of the data. The picture we get is as below:

As we can see our data are more separable even if not perfectly. This can be also due to the fact we applied PCA to the data.

One more thing to note is that Kernels are not bounded to SVM but rather they can be applied also to Logistic Regression since after all they just transform the data.The reason Kernels are found to be used with SVM is because of their execution optimization. So Kernels are much faster with SVM rather than other algorithms like Logistic Regression. Gaussian Kernel cannot be applied to all the data(Mercer’s Theorem must be satisfied) so generally we can apply other kernels like : string kernel,chi square-kernel,histogram intersection kernel.

SVM VS Logistic Regression

Usually both algorithm Logistic Regression and SVM with linear Kernel have similar results and the results usually depends mostly on how skilled we are in understanding the data or the data quality. Because of faster execution SVM with Gaussian Kernel outperforms Logistic Regression for medium size of data.

The first difference we already mention in the above topic so SVM find a solution which is as fare as possible for the two categories while LR has not this property. Please find below visualization in OCTAVE of two algorithms execution:

Please notice how SVM has a more balanced boundary between two categories.

LR is more sensitive to outliers than SVM because Logistic function diverges faster than hinge lost. So putting an outlier on above picture would give below picture:Notice how in LR the boundary change to include also new example. In reality this may not be a good idea so being not sensitive like SVM tends to be better.
Because of SVM execution optimization we can use Kernels like Gaussian Kernel and make our data automatically linearly separable.
Logistic Regression produces probabilistic values while SVM produces 1 or 0. So in a few words LR makes not absolute prediction and it does not assume data is enough to give a final decision. This maybe be good property when what we want is an estimation or we do not have high confidence into data also it may fit well also with other probabilistic existing systems.
SVM usually is better text classification like Emails Spam because of the high number of feature this applications have.
When we have big number of features like 10.000 and maybe low amount of data like 1000 or so either of algorithm can be used. and SVM is suggested with linear kernel since we do not have enough data to fit a complex non linear function.
When we have relatively small number of feature like 1000 and intermediate number of data like 50.000 maybe 500.000 than SVM with Gaussian Kernel fist good. Algorithm will be enough fast to run and Gaussian Kernel will separate classes well even with non linearly separable data.
When we have really big amount of data like greater than 500.000 than SVM with Gaussian Kernel will be very slow to run. So what is suggested is to run Logistic Regression or SVN with linear kernel and manually create/remove new features that makes sense until data gets linearly separable or we get good results from tests.

Application

Application can be downloaded and executed without any knowledge of java beside JAVA has to be installed on your computer. You can test with your own email the algorithm.

We can run the application from source by simply executing the RUN class or if you do not fill to open it with IDE just run mvn clean install exec:java..

After that you should be able to see something like this:

Please train first the algorithm by clicking train(it may take 1-2 minutes), after finishing a pop up will tell the precision achieved. After that please copy paste an email of your choice on the white area and hit test. After that a pop window will tell the algorithm prediction.

The precision achieved during my execution was approximately 95.6% using random 80% of the date for training and 20% for testing. No cross validation test just training and test(for accuracy measure) set was used in this example for more about dividing the data please refer here.

To my surprise Logistic Regression performs better with 97% accuracy comparing to SVM and also it runs twice faster(46 seconds VS 80 seconds).

But than I found out that this is because of Spark MLib implementation of two algorithms. Spark MLib has :

LogisticRegressionWithLBFGS
LogisticRegressionWithSGD
SVMWithSGD

and *SGD implementations perform similar with each other int terms of accuracy(LR 95,6% and SVM 95.7%) and time(LR 91 seconds and SVM 80 seconds). So this results show a more realistic view since we know that SVM and Logistic Regression perform almost the same and SVM is faster. Feel free to download application and experiment further. Anyway I can clearly see that Spark MLib is not the best library to use in the case of SVM so maybe trying other implementation of SVM will bring different results or bigger gap between them.