UC3 : Brand protection, fight against typo-squatting
Development of the educational use case "UC 3: Fight against typosquatting" as part of the GT IA et Cyber (AI and Cyber working group)
Category: Common. Status: Production (scale: 1: Idea - 2: Prototype - 3: Validation - 4: Production)
Overview
Business Objective
Typosquatting is a form of cybersquatting that relies mainly on typing and spelling errors made by Internet users when entering a web address in a browser.
Concretely, the typosquatter buys domain names whose spelling or pronunciation is close to that of a heavily visited site or a well-known brand, so that a user who makes a spelling error or an unintentional typo is directed to a site owned by the typosquatter.
This practice creates risks both for users (malware, ransomware) and for the well-known brand (loss of customers, damaged image).
This project proposes a solution for detecting suspicious typosquatting websites.
Project Objective
To detect typosquatting websites, this set of notebooks computes, for each candidate website, a similarity score relative to the reference website (a high-traffic website).
This score is a value between 0 and 1, where 1 means the two websites are identical. The closer a website's score is to 0, the more suspicious the website is.
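As an illustration of a score in the range 0 to 1, here is a minimal sketch of Jaccard similarity, the measure named in this page's keywords and notebook list. The tokenization rule, the function names, and the sample texts are assumptions made for this example. They are not taken from the notebooks.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(text_a, text_b):
    """Return (size of intersection) / (size of union) of the two word sets."""
    a, b = tokenize(text_a), tokenize(text_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty pages count as identical.
        return 1.0
    return len(a & b) / len(a | b)

# Invented sample texts standing in for the content of two scraped pages.
reference = "Welcome to Example, the official store"
candidate = "Welcome to Exampel, the official shop"
score = jaccard_similarity(reference, candidate)
print(score)  # 0.5: 4 shared words out of 8 distinct words
```

A score of 1.0 is returned only when both pages yield exactly the same set of words. Word order and repetition are ignored by this measure.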
Results
The algorithm creates a blacklist: a list of suspicious URLs considered to be typosquatting websites.
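A blacklist of this kind could be derived from the similarity scores roughly as follows. This sketch follows the rule stated on this page, that scores closer to 0 are more suspicious. The `build_blacklist` helper, the 0.5 threshold, and the sample scores are all hypothetical. The cutoff actually used by the notebooks is not stated here.

```python
def build_blacklist(scores, threshold=0.5):
    """Return URLs scoring strictly below the threshold, most suspicious first."""
    flagged = [(url, s) for url, s in scores.items() if s < threshold]
    flagged.sort(key=lambda pair: pair[1])  # lowest similarity first
    return [url for url, _ in flagged]

# Invented scores, for illustration only.
scores = {"exarnple.com": 0.48, "examp1e.com": 0.12, "example.org": 0.93}
print(build_blacklist(scores))  # ['examp1e.com', 'exarnple.com']
```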
- Authors : Nicolas Stucki & Ugo Biancone & Mathieu Lacroix
- Keywords: Permutations, Lines of Threat, WebScraping, WordEmbedding, Jaccard Similarity
Data
A CSV file containing only the target website URLs, e.g. lvmh.com, facebook.com, ...
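Loading such a file might look like the sketch below. It assumes one URL per row in the first column and no header row, since the exact layout of the file is not described here. It uses the standard-library `csv` module to stay dependency-free, although the requirements list suggests the notebooks rely on Pandas. The `load_targets` name is hypothetical.

```python
import csv
import io

def load_targets(csv_file):
    """Read one target domain per row, skipping blank rows and normalising case."""
    targets = []
    for row in csv.reader(csv_file):
        if row and row[0].strip():
            targets.append(row[0].strip().lower())
    return targets

# An in-memory file stands in for the real CSV on disk.
sample = io.StringIO("lvmh.com\nfacebook.com\n\n")
print(load_targets(sample))  # ['lvmh.com', 'facebook.com']
```

With a real file, the same function can be called on `open(path, newline="")`.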
Notebooks
Notebook | Data science step
---|---
Typo_p1_DNSTwist.ipynb | Generates child URLs from permutations of the parent URL
Typo_p2_Webscraping.ipynb | Retrieves the content of the child and parent web pages
Typo_p3_webscrapingjaccard | Computes a similarity measure between the child and parent pages (Jaccard)
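To make the first step more concrete, the sketch below generates a few typo variants of a domain name by hand. It is a simplified stand-in and does not use DNSTwist or reproduce its API. DNSTwist covers many more permutation families. The `generate_variants` function and its three rules (omission, repetition, transposition) are assumptions chosen for illustration.

```python
def generate_variants(domain):
    """Return typo variants of the name part of a domain, sorted alphabetically."""
    # Split "lvmh.com" into the name ("lvmh") and the rest ("com").
    name, dot, tld = domain.partition(".")
    variants = set()
    for i in range(len(name)):
        # Omission: drop one character.
        variants.add(name[:i] + name[i + 1:])
        # Repetition: double one character.
        variants.add(name[:i] + name[i] + name[i:])
        # Transposition: swap two adjacent characters.
        if i < len(name) - 1:
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    variants.discard(name)  # swapping two identical letters reproduces the original
    variants.discard("")    # a one-letter name would otherwise yield an empty label
    return sorted(v + dot + tld for v in variants)

print(generate_variants("lvmh.com"))
```

Each variant would then be resolved and scraped in the second step, and compared with the reference page in the third.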
Requirements
- Python (3.6 or later)
- Pandas (1.1 or later)
- Numpy (1.19 or later)
- DNSTwist (20220131)
- Requests (2.27)
- BeautifulSoup (4 or later)
- Gensim (4.2.0)
- Scipy (1.6.2)
Use case notebooks
All the materials for this use case are available on the Campus Cyber GitLab: https://gitlab.com/campuscyber/gt-ia-et-cyber/-/tree/main/UC3%20Brand%20protection,%20fight%20against%20typo-squatting?ref_type=heads