CFL Software



Computational Forensic Linguistics

The illustrations below show the GUI, which we use both for demonstration purposes and as a product for smaller users.  The nature of our business means that we rarely have access to the actual data used by our customers, and then only for development purposes, so the illustrations use plagiarised essay material.  All the features of any sentence-based comparison are nevertheless shown in the screenshots.



Work Pairs

The files to be checked.  There can be hundreds of these. 


A)    The Current index to be used.  The index can contain hundreds or thousands of documents.  The optimum size is decided empirically during setup and can vary from task to task.
B)    The parameters that can be set and changed interactively by a user.

  1. The minimum number of sentences in a Work file that must satisfy the two criteria below before a match is reported.
  2. The minimum number of words that a sentence in a Work file must share with a sentence in an indexed file before the program registers a match internally.  This setting allows users to ignore short sentences or headings if desired.
  3. The minimum percentage of the words in the shorter of the two sentences (one from a Work file, one from an indexed file) that must be matched before a match is registered.  This is interactive and allows a user to restrict matches to only those with considerable overlap or to investigate much lower levels of overlap.  This flexibility is distinctive of the program and much valued by our users.
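
As an illustration only, the three settings above can be sketched in Python as follows. This is not Investigator's code: the function names, the default values and the simple tokeniser are all assumptions made for the example.

```python
import re

def content_words(sentence):
    """Lower-case word set for a sentence (a deliberately simple tokeniser)."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def sentence_matches(work_sentence, indexed_sentence, min_words=3, min_percent=50.0):
    """Settings 2 and 3: shared-word count, then overlap within the shorter sentence."""
    a = content_words(work_sentence)
    b = content_words(indexed_sentence)
    shared = a & b
    if not shared or len(shared) < min_words:
        return False
    shorter = min(len(a), len(b))
    return 100.0 * len(shared) / shorter >= min_percent

def file_matches(work_sentences, indexed_sentences,
                 min_sentences=2, min_words=3, min_percent=50.0):
    """Setting 1: report the pair only if enough Work sentences matched."""
    matched = sum(
        1 for ws in work_sentences
        if any(sentence_matches(ws, s, min_words, min_percent) for s in indexed_sentences)
    )
    return matched >= min_sentences

work = [
    "The quick brown fox jumps over the lazy dog.",
    "Introduction",
    "Plagiarism detection relies on shared rare vocabulary.",
]
indexed = [
    "A quick brown fox jumped over a lazy dog.",
    "Detection of plagiarism relies on rare shared vocabulary items.",
]
print(file_matches(work, indexed, min_sentences=2))
```

Raising `min_percent` restricts the results to closely overlapping sentences, while lowering it lets looser overlap through, which mirrors the interactive use described in point 3.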

Pairs with related sentences

  • The matches are recorded between each pair.
  • They are ranked in descending order of similarity.
  • The first number is the approximate percentage of sentences in the Work file found to match sentences in the indexed file.
  • The next pair of numbers gives the sequence numbers of the two matched files.
  • Finally the actual file pairs are shown.
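
The ranking described in this list might be computed along the following lines. This is a sketch under assumptions: the input dictionaries and the name `rank_pairs` are invented for the example, and the way the program itself derives its approximate percentage is not documented here.

```python
def rank_pairs(match_counts, sentence_totals):
    """Rank file pairs by the share of Work-file sentences that matched.

    match_counts:    {(work_no, indexed_no): number of matched sentences}
    sentence_totals: {work_no: total sentences in that Work file}
    Returns rows of (percent, work_no, indexed_no), highest percentage first.
    """
    rows = []
    for (work_no, indexed_no), matched in match_counts.items():
        percent = round(100 * matched / sentence_totals[work_no])
        rows.append((percent, work_no, indexed_no))
    return sorted(rows, key=lambda row: row[0], reverse=True)

# Hypothetical counts for two Work files compared against indexed files 4 and 7.
counts = {(1, 7): 3, (2, 7): 9, (1, 4): 5}
totals = {1: 10, 2: 12}
for percent, work_no, indexed_no in rank_pairs(counts, totals):
    print(f"{percent:3d}%  {work_no} {indexed_no}")
```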

Markup view selectors

There are different ways users can view the marked-up text.  The screenshot shows just the sentences that match.  This is the most direct means of review, but the actual physical location within the file can be seen by using the File-based options.

Markup in both files

  • Bold red means exact matching.
  • Bold blue means partial matching has been found.
  • Bold black indicates the material that is not present in both sentences.
  • In either of the File views, sentences not identified as matching are shown in regular black text.


All sentences in both files are fully cross-referenced to the corresponding sentence in the matched file.  
When viewing the whole file, this is particularly useful where material has been moved around between documents, as it is simple to navigate to the appropriate location in the matched text.






Words to ignore

It is possible to create a list of words that you want the program to ignore during indexing.  This is particularly useful when working with plain text versions of XML and HTML files, as the formatting instructions can be automatically ignored.  There are also frequently other words which appear in many or even most documents and are not indicative of copying.  In patents, for example, 'preferred' and 'embodiment' are almost certain to appear.


Investigator uses a built-in list of English grammar words which it ignores by default.  You can replace this list with a longer or shorter version, supplied as a plain text file.  You can also replace it with the grammar words of a different language, saved as plain text in UTF-8 format, and the program will use that instead.  You can even have a file containing the grammar words of several languages, as they rarely overlap.  This allows either multilingual operation within a single text or seamless operation across monolingual texts in several languages.
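
A minimal sketch of how such an ignore list could be read and applied is given below. The file layout (whitespace-separated words in a UTF-8 text file) and the function names are assumptions for illustration; the format Investigator actually expects may differ.

```python
from pathlib import Path

def parse_ignore_list(text):
    """Turn the contents of an ignore-list file into a lower-cased set of words."""
    return {word.lower() for word in text.split()}

def load_ignore_list(path):
    """Read a UTF-8 plain text ignore list from disk (assumed layout, see above)."""
    return parse_ignore_list(Path(path).read_text(encoding="utf-8"))

def strip_ignored(words, ignore):
    """Drop every word that appears in the ignore list, keeping the original order."""
    return [word for word in words if word.lower() not in ignore]

# English and French grammar words kept in a single list.
ignore = parse_ignore_list("the of and\nle la et\n")
print(strip_ignored(["The", "preferred", "embodiment", "et", "la", "revendication"], ignore))
```

Because the grammar words of different languages rarely overlap, one combined set is enough for this sketch, and no language detection step is needed.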

Standard Function Words

Clicking this button restores the internal default to English.


Indexes can be built to contain only the level of matching required for detection.

Set maximum in-document frequency

This allows you to reduce the size of the indexes by including only words that occur relatively infrequently in a single document.  In essays the majority of the words used occur just once or twice, and it is the appearance of these in substantial numbers in two texts that indicates potential plagiarism.  So it is possible to put only those words into the index and perform the detection phase on such an index with little loss of accuracy.  For detailed investigation, indexing all words ensures maximum identification.
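
The idea of a maximum in-document frequency can be illustrated with a short Python sketch. The function name and the default limit of 2 are assumptions for the example, not values taken from the program.

```python
from collections import Counter

def rare_words(words, max_frequency=2):
    """Keep only the words that occur at most max_frequency times in one document."""
    counts = Counter(words)
    return {word for word, n in counts.items() if n <= max_frequency}

document = "the cat sat on the mat the end cat".split()
print(sorted(rare_words(document, max_frequency=2)))
```

Here 'the' occurs three times and is left out, while the words that occur once or twice remain available for indexing.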

Set minimum words to index

The minimum refers to the number of non-grammar words in a sentence before it is included in the index.  This allows you to omit headers and other short sentences so that arbitrary matching of only one or two words does not create an overload of redundant information.
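
As a rough sketch, the check described above amounts to counting the non-grammar words in a sentence and comparing the count with the minimum. The name `is_indexable`, the tiny grammar-word list and the default minimum are hypothetical.

```python
def is_indexable(sentence_words, grammar_words, min_words=3):
    """True if the sentence has at least min_words words outside the grammar list."""
    content = [w for w in sentence_words if w.lower() not in grammar_words]
    return len(content) >= min_words

grammar_words = {"the", "of", "a", "in", "is", "and"}
print(is_indexable("Summary of the Invention".split(), grammar_words))
print(is_indexable("The valve is mounted in a steel housing".split(), grammar_words))
```

The heading has only two non-grammar words and is skipped, whereas the full sentence has four and would be indexed.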

Index Creation

  • Files selected for indexing are shown in the right hand box.
  • Once created they appear in the left hand box.
  • File creation is multi-processor, multi-thread aware.  In the API the number of threads to use is set as a parameter.  That facility can also be supplied in the GUI version.
  • Index creation is generally very rapid and not memory-intensive.
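
To show what a thread-count parameter looks like in practice, here is a generic Python sketch using the standard library's thread pool. It does not reproduce the CFL API: `build_index`, `index_document` and the `threads` parameter are hypothetical names, and the "index" is simply a word set for each document.

```python
from concurrent.futures import ThreadPoolExecutor

def index_document(item):
    """Build a toy index entry (a set of lower-cased words) for one document."""
    name, text = item
    return name, set(text.lower().split())

def build_index(documents, threads=4):
    """Index every document, spreading the work over the given number of threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(index_document, documents.items()))

docs = {"a.txt": "One two", "b.txt": "two three"}
print(build_index(docs, threads=2))
```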

Get in touch

Contact us for information about any of our products and services, or to discuss your requirements.