Organization and Management Theory OMT

Wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

1. Wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

Andrew von Nordenflycht
Posted 10-07-2009 23:32
I have several large datasets containing names of companies and individual people. The companies or people can and do appear multiple times (e.g., in different years) and I want to link all instances of the same name. This is easy when the match is exact.

However, for a variety of reasons, such as typos or 'nicknames', there are also many "close" matches – where the text does not match exactly but is very likely to refer to the same entity (e.g., "Jhon Smith" vs. "John Smith" or "Merrill Lynch" vs. "Merrill Lynch Fenner Smith").

My goal is to identify these close matches in a systematic way without manually going over the data. I presume the main function of such a program or algorithm would be to identify "all but 1 character" matches, then "all but 2 characters" matches, etc. Preferably the program would suggest close matches and let me decide whether they refer to the same entity.

Any ideas on useful software for this task would be appreciated.

Andrew von Nordenflycht

Assistant Professor, Strategy

Simon Fraser University

vonetc@sfu.ca

View my research on my SSRN Author page:
http://ssrn.com/author=100363

2. Wanted: software to identify "close" matches in a dataset of names (either individuals or companies).

James Howison
Posted 10-08-2009 15:43
I was working on something like this recently (for entity resolution
in records of free software projects). It's a surprisingly hard
problem, particularly if you want to deal with variation of names
(e.g. match "Howison, J", "J Howison", "J L Howison" and "James Linton
Howison" but not "J K Howison").
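
To make the name-variation problem concrete, here is a minimal Python sketch of one possible compatibility check. It is not the method from the paper cited below; it is a hypothetical heuristic that assumes names come as either "Surname, Given ..." or "Given ... Surname", and that an initial may stand for any given name starting with that letter. The function names (parse_name, tokens_compatible, names_compatible) are made up for illustration.

```python
def parse_name(raw: str):
    """Split a name into (surname, [given-name tokens]), lower-cased.

    Assumption: the input is "Surname, Given ..." or "Given ... Surname".
    Periods after initials are ignored.
    """
    cleaned = raw.replace(".", " ").strip().lower()
    if "," in cleaned:
        surname, _, given = cleaned.partition(",")
        return surname.strip(), given.split()
    tokens = cleaned.split()
    if not tokens:
        return "", []
    return tokens[-1], tokens[:-1]


def tokens_compatible(a: str, b: str) -> bool:
    """An initial matches any name with the same first letter;
    two full names must be identical."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def names_compatible(name1: str, name2: str) -> bool:
    """True if the two names could plausibly refer to the same person."""
    surname1, given1 = parse_name(name1)
    surname2, given2 = parse_name(name2)
    if surname1 != surname2:
        return False
    # zip stops at the shorter list, so a missing middle name is not a conflict
    return all(tokens_compatible(a, b) for a, b in zip(given1, given2))


full = "James Linton Howison"
for variant in ["Howison, J", "J Howison", "J L Howison", "J K Howison"]:
    print(variant, "->", names_compatible(variant, full))
```

With this sketch only "J K Howison" is rejected against the full name. Two caveats show why the problem is hard: "J Howison" is compatible with both "J L Howison" and "J K Howison", so compatibility is not transitive and cannot be used directly to merge records; and multi-word surnames such as "von Nordenflycht" would be split incorrectly in the "Given ... Surname" form.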

I found this paper quite helpful (and the author was happy to share
his Perl code):

D. G. Feitelson, “On identifying name equivalences in digital
libraries”. Information Research 9(4) paper 192, Jul 2004.
http://InformationR.net/ir/9-4/paper192.html

The 'typos' aspect of the matching is easier; the usual measure is
the Levenshtein distance:

http://en.wikipedia.org/wiki/Approximate_string_matching
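
For the 'typos' case, a minimal Python sketch of the Levenshtein distance, plus a helper that lists candidate pairs for manual review, might look like the following. The helper name suggest_close_pairs and the max_distance threshold are assumptions chosen for illustration, not part of any particular tool.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn a into b (standard dynamic-programming formulation)."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def suggest_close_pairs(names, max_distance=2):
    """Return (distance, name1, name2) for distinct names within
    max_distance edits, closest first, so a person can accept or reject each."""
    unique = sorted(set(names))
    pairs = []
    for x, y in combinations(unique, 2):
        d = levenshtein(x.lower(), y.lower())
        if d <= max_distance:
            pairs.append((d, x, y))
    return sorted(pairs)


names = ["John Smith", "Jhon Smith", "Merrill Lynch",
         "Merrill Lynch Fenner Smith", "John Smith"]
for d, x, y in suggest_close_pairs(names):
    print(d, x, "|", y)
```

Note that "Jhon Smith" and "John Smith" are 2 edits apart, because a swapped pair of letters counts as two substitutions. "Merrill Lynch" and "Merrill Lynch Fenner Smith" are 13 edits apart, so plain edit distance will not catch that kind of match. Comparing every pair is also quadratic in the number of distinct names, which may be slow on large datasets.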

It's implemented in many languages; I'm not sure of a GUI-fied
version, but perhaps the keywords will help you. Perhaps DDupe might
help, especially if you are working with network data, although I
haven't used it.

http://www.cs.umd.edu/projects/linqs/ddupe/

Let us know if you find a better tool.

--J
