PIUG Annual Conference 2017 Full Abstract Proposal

Speaker name: David Woolls, CFL Software Limited

Co-author: David Goodchild, David Goodchild Limited

Single Pass Numerical Matching - A linguistic solution to numerical searching.


This presentation describes a method of analysing the abstracts and claims of patent documents to extract the numeric values linked to specific items, such as percentage composition or mechanical properties. This is a complex problem because natural language offers so many ways of connecting a numerical value to the item it refers to. With patents there is the additional problem of setting the numerical information in the context of the patent claim. The authors have collaborated to develop a language-based approach to solving this problem. The initial focus was to identify the numeric ranges for a number of elements in metal alloys and compare them with the ranges being searched for. We will briefly describe the collaborative process, then present the results and explain how this is a generic solution rather than one specific to the original requirement.

An example of the problem

  • A search is to be made for alloys containing 0.25-2.5% magnesium along with other elements, e.g. specific ranges of amounts of Mn, Fe and Si.

  • A patent searcher needs to know whether there are claims in a patent set that cover these ranges. 
  • So what is needed is a program that will find claims whose range (in the case of Mg) does one of the following:
    1. Contains the search range, e.g. 0.1-3.5%Mg
    2. Lies inside the search range, e.g. 1.0-1.8%Mg
    3. Overlaps the search range at either end, e.g. 0.1-1.5% or 2.0-5.0%Mg
    4. Lies outside the search range at either end
  • The four checks are repeated for the other elements and their ranges.
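
The four relationships listed above can be expressed as a simple interval comparison. The following is a minimal sketch in Python, not the authors' implementation: the names Relation and classify are hypothetical, ranges are assumed to be (low, high) pairs in percent, identical ranges are assumed to count as "contains", and ranges that merely touch at an endpoint are assumed to count as overlapping.

```python
from enum import Enum


class Relation(Enum):
    CONTAINS = "claim range contains the search range"
    INSIDE = "claim range lies inside the search range"
    OVERLAPS = "claim range overlaps one end of the search range"
    OUTSIDE = "claim range lies outside the search range"


def classify(search, claim):
    """Compare a claimed (low, high) range with the searched (low, high) range."""
    s_lo, s_hi = search
    c_lo, c_hi = claim
    # No shared values at all: the claim sits wholly below or above the search.
    if c_hi < s_lo or c_lo > s_hi:
        return Relation.OUTSIDE
    # The claim spans the whole search range (assumption: equal ranges land here).
    if c_lo <= s_lo and c_hi >= s_hi:
        return Relation.CONTAINS
    # The claim sits entirely within the search range.
    if c_lo >= s_lo and c_hi <= s_hi:
        return Relation.INSIDE
    # Otherwise exactly one end of the claim extends past the search range.
    return Relation.OVERLAPS


# The magnesium example from the text: search range 0.25-2.5%.
search_mg = (0.25, 2.5)
for claim in [(0.1, 3.5), (1.0, 1.8), (0.1, 1.5), (2.0, 5.0)]:
    print(claim, classify(search_mg, claim).name)
```

The same function would be called once per element in the search set, each with its own search range.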

Examples of the issues to be addressed:

  • The variation in the form of the element name (Mg, Magnesium, magnesium).
  • The number of ways in which a range can be specified, from the simple to the verbose.
  • The fact that the range can appear on either side of the element name, either adjacent to it or separated from it.
  • The number of distractors present in a claim (e.g. Claim, Figure and Table numbers).
  • The data quality problems introduced by OCR or machine translation.
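
To make some of these issues concrete, the sketch below uses regular expressions to pick up a range written on either side of an element name and to normalise the name to its symbol. It is an illustration only and is not the linguistic single-pass method the presentation describes: the name table, the patterns and the function extract_ranges are all hypothetical, only a handful of phrasings are covered, and OCR or translation noise is not handled.

```python
import re

# Hypothetical lookup table mapping name variants to a normalised symbol.
ELEMENT_NAMES = {
    "mg": "Mg", "magnesium": "Mg",
    "mn": "Mn", "manganese": "Mn",
    "fe": "Fe", "iron": "Fe",
    "si": "Si", "silicon": "Si",
}

NUM = r"\d+(?:\.\d+)?"
# A percentage range such as "0.1-3.5%", "from 0.2 to 0.8 %" or "between 1 and 2%".
RANGE = rf"(?:from\s+|between\s+)?({NUM})\s*%?\s*(?:-|\u2013|to|and)\s*({NUM})\s*%"
# Longest names first so that "silicon" is tried before "si".
NAME = "(" + "|".join(sorted(ELEMENT_NAMES, key=len, reverse=True)) + ")"

RANGE_THEN_NAME = re.compile(rf"{RANGE}\s*(?:of\s+)?\b{NAME}\b", re.IGNORECASE)
NAME_THEN_RANGE = re.compile(
    rf"\b{NAME}\b\s*(?:in\s+an\s+amount\s+of\s+|of\s+|:\s*)?{RANGE}", re.IGNORECASE
)


def extract_ranges(text):
    """Return (symbol, low, high) tuples; range-before-name matches come first."""
    found = []
    for m in RANGE_THEN_NAME.finditer(text):
        lo, hi, name = m.groups()
        found.append((ELEMENT_NAMES[name.lower()], float(lo), float(hi)))
    for m in NAME_THEN_RANGE.finditer(text):
        name, lo, hi = m.groups()
        found.append((ELEMENT_NAMES[name.lower()], float(lo), float(hi)))
    return found


sample = ("An alloy according to claim 1 containing 0.1-3.5%Mg "
          "and manganese from 0.2 to 0.8 %.")
print(extract_ranges(sample))
```

Because each pattern requires a percent sign and an adjacent element name, the claim number in the sample is not picked up as a value. The brittleness of such patterns against the more verbose phrasings is part of what motivates a language-based approach.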

Some of the outputs

  • Clear presentation of the relationship between the search match and the source match.
  • Identification of the full range description in the relevant claim.
  • Ranking by the number of search matches found, which focuses attention on potential problem areas.
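
The ranking output can be pictured as counting, for each document, how many of the searched elements have a claimed range that intersects the search range, then sorting by that count. The sketch below is one possible reading under stated assumptions: rank_documents and ranges_intersect are hypothetical names, a "match" is assumed to mean any intersection of the two ranges, and the document identifiers and the Mn figures are invented for illustration.

```python
def ranges_intersect(a, b):
    """True if the (low, high) ranges a and b share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def rank_documents(search, documents):
    """Return (doc_id, hit_count) pairs, highest hit count first.

    search:    {element: (low, high)} ranges being searched for
    documents: {doc_id: {element: (low, high)}} ranges found in each document
    """
    scored = []
    for doc_id, claimed in documents.items():
        hits = sum(
            1
            for element, rng in search.items()
            if element in claimed and ranges_intersect(rng, claimed[element])
        )
        scored.append((doc_id, hits))
    # Sort by descending hit count, then by identifier for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Invented example data; only the Mg search range comes from the text.
search = {"Mg": (0.25, 2.5), "Mn": (0.2, 0.8)}
documents = {
    "DOC-A": {"Mg": (0.1, 3.5), "Mn": (0.5, 1.0)},
    "DOC-B": {"Mg": (3.0, 5.0), "Mn": (0.1, 0.3)},
    "DOC-C": {"Mg": (5.0, 6.0)},
}
print(rank_documents(search, documents))
```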

Outcome of collaboration

  • Single-pass document reading with high accuracy and very high speed.
  • Extension to numeric ranges for mechanical or physical properties.
  • Identification of every element and property rather than simply the ones in the search set.
  • Extension to CJK languages.
  • Potential for saving the normalised interim index for rapid searching of full or large datasets.
  • Distributed and/or federated search using multiprocessor computers, clouds or clusters.