Sunday, December 03, 2006

The Address Matching Problem

I have a list of organizations in my company database.  We are working with a customer who has data on some of those organizations.  They will provide us, multiple times per week, a list of their database.  So I have to say "For any given organization in my list, is variable X true from their list?"

Do you have any idea how complex that problem is?  You see, there's no linking value betwen the tables.  No common identifier that I can rely on.  So I have to do a natural match on whatever information I do have, which is basically name, address, city, state, zip.

Stop and consider all the different ways that this information can be different - abbreviations, word choices, even just plain typos and misspellings. That happens more in such data than you'd think.  What constitutes a match?  If one list says, I dunno, "Paul Smith College" and the other says "Paul Smith College of New York", is that a match?  Do you always have to validate against every single field you have, or can you be confident enough in one of the fields to match it and move on? 

You can't match all the fields perfectly.  First throw out zip code because some entries will have zip5 and some will have zip5+4.  Experience shows that even if you just look at the first 5 you'll find that many entries are off by one digit, since people often know a common zip code for a given city and don't necessarily realize that there could be several.  So maybe you can get by with matching on 4 digits of the 5, but then you don't have an exact match.

Forget about address as well.  Entry one has "1 Main Street" listed for address line 1, and "Suite 220" for line 2.  But the other entry from the other list has "Main Office" for line 1 and "1 Main Street Suite 220" for line 2.  It's a match, but not an easy one.

Worst of all, the data is coming in at a rate of several updates a week.  So it's not even like somebody can sit and go through the list (about 5000 records) once to clean it up and move on.  Every week there could be hundreds of new entries that will exhibit all the same behaviors I've described above.

I realize that there are companies built up around this exact problem.  But when you're a single programmer whose been handed the problem and told to solve it in a week, what do you do?

Goals: Achieve a confident, 'exact' match as frequently as possible (ideally 70-80%, given what is known about the intersection between the two lists).  Ambiguity (where an entry from list 1 might match more than one entry from list 2) is allowed, to an extent, in that the customer will be asked to narrow down the selection.

At least I'm doing it in Ruby :).

 

Technorati tags: , , ,

2 comments:

dom said...

You might want to use or port this perl library to ruby. http://search.cpan.org/~sderle/Geo-StreetAddress-US-0.99/US.pm


It does a good job of parsing addresses

Duane said...

Thanks Dom, I'll check it out. I hadn't thought about Perl -- I'm stuck in Windows and that's not usually my first choice. But an interesting idea, maybe I can find something alreayd ported to Java or .Net or something.