G Katz, Y Elovici, B Shapira
Information sciences 262, 137-158, 2014
A new context-based model (CoBAn) for accidental and intentional data leakage prevention (DLP) is proposed. Existing methods attempt to prevent data leakage by either looking for specific keywords and phrases or by using various statistical methods. Keyword-based methods are not sufficiently accurate since they ignore the context of the keyword, while statistical methods ignore the content of the analyzed text. The context-based approach we propose leverages the advantages of both these approaches. The new model consists of two phases: training and detection. During the training phase, clusters of documents are generated and a graph representation of the confidential content of each cluster is created. This representation consists of key terms and the context in which they need to appear in order to be considered confidential. During the detection phase, each tested document is assigned to several clusters and its contents are then matched to each cluster’s respective graph in an attempt to determine the confidentiality of the document. Extensive experiments have shown that the model is superior to other methods in detecting leakage attempts, where the confidential information is rephrased or is different from the original examples provided in the learning set.