Brand protection - fighting typo-squatting

From the Campus Cyber Wiki

This page is a translated version of the page Protection de la marque - lutte contre le typo-squatting; the translation is 100% complete.

Use case description

The purpose of this use case is to detect domain names similar to those of a site you wish to protect, which could be used for a variety of malicious purposes: damage to brand image, phishing campaigns aimed at the general public or at employees, use of the domain for MITM (man-in-the-middle) fraud, sale of counterfeit products, etc. The objectives are to:

- Provide detection tools for copied malicious sites
- Provide AI-based classification tools that measure how close a copied malicious site is to the initial site and how dangerous it is, giving the SOC a way to prioritize typo-squatted domains

The particularity of this use case is that it makes little use of traditional cyber corpus data (network frames, etc.) and relies more on standard natural language processing tools.

Approach

The selected approach involves three steps:

  • Step 1: generation of target domain names from a master domain name via a Levenshtein distance (1, 2 or 3), in particular via the dnstwist tool https://github.com/elceef/dnstwist (or its online implementation https://dnstwist.it/)
  • Step 2: classify a site as suspect if at least one of the following conditions holds:
- The name server is different from the initial ("original") name server
- The MX (Mail eXchanger) record is not defined or is not the "original" MX
- The IP does not correspond to the initial site, or is not white-listed (including for the MX)
- If there is no IP and no MX: the site is classified as "to be monitored later"
- If there is no IP but there is an MX: a separate segment to be dealt with elsewhere, potentially dangerous
- Note: registrar analysis is an option to be dealt with later
- A lookup on VirusTotal may help with classification
  • Step 3: for each suspect site, generate a representation using an NLP tool (themes/entities/sentiment), then compute the similarity (distance) between this representation and that of the initial site. High similarity indicates a greater phishing danger.
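Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the dnstwist algorithm: the variant generator below only covers Levenshtein distance 1 on the domain label (dnstwist covers far more permutation models), and the `triage` function is an assumed pure-function encoding of the rules above, with illustrative verdict labels.

```python
# Sketch of Steps 1 and 2, assuming a simple in-house variant generator and
# pure triage rules; in practice dnstwist covers Step 1 far more thoroughly.
from typing import Optional

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

def variants_distance_1(domain: str) -> set:
    """Generate candidate typo-squats at Levenshtein distance 1 on the label."""
    label, _, tld = domain.partition(".")
    out = set()
    for i in range(len(label)):
        out.add(label[:i] + label[i + 1:])            # deletion
        for c in ALPHABET:
            out.add(label[:i] + c + label[i + 1:])    # substitution
    for i in range(len(label) + 1):
        for c in ALPHABET:
            out.add(label[:i] + c + label[i:])        # insertion
    out.discard(label)
    return {"%s.%s" % (v, tld) for v in out if v}

def triage(ns: Optional[str], mx: Optional[str], ip: Optional[str],
           orig_ns: str, orig_mx: str, whitelist: set) -> str:
    """Apply the Step 2 rules and return a triage verdict."""
    if ip is None and mx is None:
        return "monitor-later"          # no IP and no MX: watch later
    if ip is None:
        return "separate-segment"       # no IP but an MX: potentially dangerous
    if ns != orig_ns or mx is None or mx != orig_mx:
        return "suspect"                # wrong name server or MX
    if ip not in whitelist:
        return "suspect"                # IP not white-listed
    return "benign"
```

For example, `variants_distance_1("acme.com")` contains `"acne.com"` (substitution) and `"acmee.com"` (insertion), which would then be resolved and passed through `triage`.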

Data

Identified Kaggle datasets:

- https://www.kaggle.com/xwolf12/malicious-and-benign-websites
- https://www.kaggle.com/dmrickert3/malicious-and-benign-websites-learning/notebook

An article using domain generation and the VirusTotal site explains part of the process and provides its methodology:

- https://towardsdatascience.com/predicting-the-maliciousness-of-urls-24e12067be5
- https://eneyi.github.io/2020/04/01/predicting-the-maliciousness-of-urls.html
- The data is here: https://github.com/eneyi/dataarchive/tree/master/pmurls/data/featurized

The datasets will be regenerated in real time with each new run of the algorithm, but parsing can potentially be very CPU-intensive.

Categories of algorithms used

Natural language processing.

Computation time required

Medium if there is no re-training of NLP models. In cases where the text corpora of the initial site are far from common language (highly technical domain), fine-tuning may be necessary.
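When no retraining is involved, Step 3's similarity check can be approximated cheaply. The sketch below uses a plain bag-of-words representation with cosine similarity as a stand-in for the richer NLP representation (themes/entities/sentiment) mentioned above; the function names and the 0.8 threshold are illustrative assumptions, not part of the original methodology.

```python
# Minimal sketch of the Step 3 similarity check, assuming a bag-of-words
# representation as a cheap stand-in for richer NLP features; no model
# training or fine-tuning is needed.
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Lowercased bag-of-words representation of a page's text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def phishing_danger(original_text: str, suspect_text: str,
                    threshold: float = 0.8) -> str:
    """High similarity to the original site suggests a phishing copy."""
    sim = cosine_similarity(bow(original_text), bow(suspect_text))
    return "high" if sim >= threshold else "low"
```

A suspect site whose text closely mirrors the protected site's pages would score near 1.0 and be flagged as a high phishing danger; unrelated parked pages would score near 0.0.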

Cloud or On Premise

Both are possible.

Other necessary software

ETL, quality, security, crypto, anonymization.

Two local tools are required:

- A crawler (e.g. Apache Nutch, https://nutch.apache.org/)
- Storage of sites and indicators: mainly Elasticsearch
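As a hedged illustration, the triage indicators could be stored as documents like the one built below. The field names, verdict labels, and index name are assumptions for illustration, not a fixed schema from the original methodology.

```python
# Sketch of an indicator document for Elasticsearch storage; field names
# and the index name are illustrative assumptions, not a fixed schema.
from datetime import datetime, timezone
from typing import Optional

def indicator_doc(domain: str, verdict: str, ns: Optional[str],
                  mx: Optional[str], ip: Optional[str]) -> dict:
    """Build a JSON-serializable indicator document for one candidate domain."""
    return {
        "domain": domain,
        "verdict": verdict,          # e.g. "suspect", "monitor-later"
        "name_server": ns,
        "mx": mx,
        "ip": ip,
        "first_seen": datetime.now(timezone.utc).isoformat(),
    }

# Indexing with the official client would then look roughly like:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   es.index(index="typosquat-indicators",
#            document=indicator_doc("acne.com", "suspect",
#                                   "ns.evil.com", None, "5.6.7.8"))
```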

MITRE ATT&CK

  • Reconnaissance:
- T1593: Search Open Websites / Domains
- T1598: Phishing for Information
  • Initial Access:
- T1566: Phishing

MITRE D3FEND

N/A

Cyber Kill Chain

Removal of access for compromised customers (already in place); reliance on the ability to take down "copy" sites.