Organization and Management Theory OMT


Wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

  • 1.  Wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-07-2009 23:32

    I have several large datasets containing names of companies and individual people. The companies or people can and do appear multiple times (e.g., in different years) and I want to link all instances of the same name. This is easy when the match is exact.

    However, for a variety of reasons, such as typos or 'nicknames', there are also many "close" matches – where the text does not match exactly but is very likely to refer to the same entity (e.g., "Jhon Smith" vs. "John Smith" or "Merrill Lynch" vs. "Merrill Lynch Fenner Smith").

    My goal is to identify these close matches systematically, without manually going over the data. I presume the main function of such a program or algorithm would be to identify "all but 1 character" matches, then "all but 2 characters" matches, and so on. Preferably, the program would suggest close matches and let me decide whether they refer to the same entity.
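
    To make the "suggest close matches and let me decide" workflow concrete, here is a minimal sketch in Python. It uses the standard library's difflib similarity ratio rather than a strict "all but N characters" count, and the function name `suggest_matches` and the `cutoff` and `limit` defaults are illustrative assumptions, not settings from any particular tool.

```python
import difflib


def suggest_matches(names, cutoff=0.8, limit=3):
    """Return {name: [similar names]} as candidates for manual review.

    Assumption: `cutoff` (0..1 similarity) and `limit` are illustrative
    defaults and would need tuning on real data.
    """
    unique = sorted(set(names))  # exact duplicates are already linked
    suggestions = {}
    for name in unique:
        others = [n for n in unique if n != name]
        # get_close_matches ranks candidates by SequenceMatcher ratio
        close = difflib.get_close_matches(name, others, n=limit, cutoff=cutoff)
        if close:
            suggestions[name] = close
    return suggestions


if __name__ == "__main__":
    sample = ["John Smith", "Jhon Smith", "Merrill Lynch", "Acme Corp"]
    for name, candidates in suggest_matches(sample).items():
        print(name, "->", candidates)
```

    This compares every name against every other name, so it is only practical for modest lists; very large datasets would first need to be grouped (for example by first letter) so that comparisons stay within a group. A pair such as "Merrill Lynch" vs. "Merrill Lynch Fenner Smith" differs by many characters, so it would need a lower cutoff or a separate prefix rule.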

    Any ideas on useful software for this task would be appreciated.

     

    Andrew von Nordenflycht

    Assistant Professor, Strategy

    Simon Fraser University

    vonetc@sfu.ca

     

     

    View my research on my SSRN Author page:
    http://ssrn.com/author=100363

     



  • 2.  Wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

    Posted 10-08-2009 15:43
    I was working on something like this recently (for entity resolution
    in records of free software projects). It's a surprisingly hard
    problem, particularly if you want to deal with variations of names
    (e.g. match "Howison, J", "J Howison", "J L Howison" and "James Linton
    Howison", but not "J K Howison").
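
    As a rough illustration of the name-variation problem, the sketch below tests whether two name strings are compatible: same surname, and given names that agree wherever both supply one, with an initial counting as agreeing with a full name. This is not the method from the paper mentioned below; `parse_name` and `compatible` are hypothetical helpers, and the parsing assumes a single-word surname written either last or first with a comma.

```python
def parse_name(raw):
    """Split a name into (surname, [given-name tokens]), lower-cased.

    Assumption: the surname is one word, written either last
    ("J Howison") or first followed by a comma ("Howison, J").
    """
    cleaned = raw.replace(".", " ").strip().lower()
    if "," in cleaned:
        surname, _, given = cleaned.partition(",")
        return surname.strip(), given.split()
    tokens = cleaned.split()
    return tokens[-1], tokens[:-1]


def compatible(a, b):
    """True if the two names could plausibly refer to the same person."""
    surname_a, given_a = parse_name(a)
    surname_b, given_b = parse_name(b)
    if surname_a != surname_b:
        return False
    # zip stops at the shorter list, so a missing middle name is allowed;
    # an initial matches any full name that starts with it.
    return all(x.startswith(y) or y.startswith(x)
               for x, y in zip(given_a, given_b))


if __name__ == "__main__":
    print(compatible("Howison, J", "J Howison"))
    print(compatible("J L Howison", "James Linton Howison"))
    print(compatible("J K Howison", "J L Howison"))
```

    Note that this is only a pairwise test: "J Howison" is compatible with both "J L Howison" and "J K Howison", so grouping records into entities still needs extra evidence or a manual decision.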

    I found this paper quite helpful (and the author was happy to share
    his Perl code):

    D. G. Feitelson, “On identifying name equivalences in digital
    libraries”. Information Research 9(4) paper 192, Jul 2004.
    http://InformationR.net/ir/9-4/paper192.html

    The 'typos' aspect of the matching is easier. The usual measure is
    the Levenshtein distance:

    http://en.wikipedia.org/wiki/Approximate_string_matching
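
    For reference, here is a minimal sketch of the Levenshtein distance in Python, using the standard dynamic-programming formulation with two rows. The function name `levenshtein` is just a label chosen for this example; it counts the insertions, deletions and substitutions needed to turn one string into the other.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Jhon Smith", "John Smith"))  # 2
    print(levenshtein("Merrill Lynch", "Merrill Lynch Fenner Smith"))  # 13
```

    Two things stand out on the examples from the original question: a swapped pair of letters ("Jhon") counts as two edits under plain Levenshtein (the Damerau-Levenshtein variant counts a transposition as one), and a name with extra words appended is far away in raw edit distance, so a threshold of one or two characters would not catch it.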

    It's implemented in many languages. I don't know of a version with a
    GUI, but perhaps the keywords will help you. DDupe might also
    help, especially if you are working with network data, although I
    haven't used it.

    http://www.cs.umd.edu/projects/linqs/ddupe/

    Let us know if you find a better tool.

    --J
