Building Intelligent and Performing Enterprises
 Building Intelligence and Execution Quotient
  

Data Searching and Matching

Data searching and matching is the first step you take before you cleanse and augment the data. The searching and matching capabilities of a tool should include N-gram indexing, pattern matching, match-codes and similar techniques.

Data searching and matching is essentially done to identify duplicates as well as household groups (for example, people from the same family, the same company or the same association).

Searching and matching through parsing:

Parsing is a technique that splits long strings of customer data into individual components, which are then fed into the data searching and matching routines. For example, one parsing rule tells the parsing program that wherever it finds a word matching an entry in the 'first name' reference list, it should assign that word to the first name. Another rule tells it that wherever it finds a character string that is five characters long and is not in the reference lists of names or addresses, it should assign that string to the ZIP. A good DQ tool will have pre-defined parsing algorithms. One should be able to change those parsing algorithms, though generally this is not done, because a good DQ tool has sophisticated and statistically well-tested parsing algorithms.
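The two parsing rules described above can be sketched as follows. This is only a minimal illustration, not how any particular DQ tool works: the reference list FIRST_NAMES, the function parse_record and the sample record are hypothetical, and the ZIP rule is narrowed to "five digits" as a simplifying assumption.

```python
import re

# Hypothetical reference list; a real DQ tool ships far larger, tested lists.
FIRST_NAMES = {"john", "mary", "anita"}

def parse_record(raw):
    """Assign each token of a free-text customer string to a component."""
    parsed = {"first_name": None, "zip": None, "other": []}
    for token in re.split(r"[\s,]+", raw.strip()):
        if not token:
            continue
        if parsed["first_name"] is None and token.lower() in FIRST_NAMES:
            # Rule 1: a word found in the first-name reference list.
            parsed["first_name"] = token
        elif parsed["zip"] is None and re.fullmatch(r"\d{5}", token):
            # Rule 2 (assumption): a five-digit string is treated as the ZIP.
            parsed["zip"] = token
        else:
            # Everything else is left for other parsing rules.
            parsed["other"].append(token)
    return parsed

record = parse_record("John Smith, 12 Oak Street, Springfield 62704")
print(record)
```

The parsed components, rather than the raw string, would then be passed on to the searching and matching routines.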

Searching and matching through pattern matching:

In conjunction with parsing, you can feed the tool all the possible patterns in which the data could be stored. The tool, or the queries you run, should then scan the data for each pattern and produce the output in one standard pattern. For example, the sequence will be:

  • Different date patterns are fed into the tool.
  • The tool scans the database and picks up the dates that match one of the fed patterns.
  • The matched dates are fed into the data correction module, which standardizes dates that follow different patterns into one standard pattern.
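The sequence above could look roughly like this in code. It is a sketch under assumptions: the three date patterns, the name standardize_dates and the ISO-style target format are illustrative choices, and slash-separated dates are assumed to be day/month/year.

```python
import re
from datetime import datetime

# Hypothetical patterns "fed into the tool": a regex that finds the date
# and the strptime format used to read it.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "%d/%m/%Y"),  # assumed day first
    (re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"), "%d.%m.%Y"),
]

def standardize_dates(text, target="%Y-%m-%d"):
    """Scan text for each fed pattern and return the dates in one format."""
    found = []
    for regex, fmt in DATE_PATTERNS:
        for match in regex.finditer(text):
            try:
                parsed = datetime.strptime(match.group(), fmt)
            except ValueError:
                continue  # has the right shape but is not a valid date
            found.append(parsed.strftime(target))
    return found

dates = standardize_dates("Born 03/11/1980, joined 2005-07-01, left 15.02.2010")
print(dates)
```

Results are grouped by pattern rather than by position in the text, because each pattern is scanned in turn, as in the steps above.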

Searching and matching through N-gram indexing:

An n-gram is a sequence of 'n' consecutive characters extracted from a word or code. Typical values for 'n' are 2 or 3. The extracted n-grams are indexed for all names or addresses in the database. At search time, the idea is that words or codes that are similar between the search string and the file data will have a high proportion of n-grams in common. N-gram index based searching is used for string or text matching. This is a fairly standard algorithm, which comes pre-defined with good data quality tools. One should be able to change the following parameters:

  • The number of characters for N-gram indexing. For example, you can specify that you want to create a 2-character, 3-character and/or 4-character N-gram index.
  • The level of match: you can define what percentage of n-gram overlap should be considered a match. For example, you can say that records with a 90% 2-character N-gram match and a 75% 3-character N-gram match will be considered match candidates.
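A minimal sketch of N-gram indexing and searching, with both parameters (the n-gram length and the match threshold) exposed as arguments. The function names and sample names are hypothetical, and the match level is computed here as the share of the search string's distinct n-grams found in a record, which is one possible definition among several.

```python
from collections import defaultdict

def ngrams(text, n):
    """Return the set of n-character substrings, ignoring case and spaces."""
    cleaned = text.lower().replace(" ", "")
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}

def build_index(names, n):
    """Map every n-gram to the positions of the names that contain it."""
    index = defaultdict(set)
    for record_id, name in enumerate(names):
        for gram in ngrams(name, n):
            index[gram].add(record_id)
    return index

def search(index, names, query, n, threshold):
    """Return names sharing at least `threshold` of the query's n-grams."""
    query_grams = ngrams(query, n)
    hits = defaultdict(int)
    for gram in query_grams:
        for record_id in index.get(gram, ()):
            hits[record_id] += 1
    return [names[r] for r, count in sorted(hits.items())
            if count / len(query_grams) >= threshold]

names = ["Jonathan Smith", "Jonathon Smith", "Mary Jones"]
index = build_index(names, n=2)
candidates = search(index, names, "Jonathan Smith", n=2, threshold=0.75)
print(candidates)
```

To combine levels, as in the 90% and 75% example above, one could build one index per value of n and keep only the records returned by every search.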

Searching and matching through match-codes:

Match-codes can be regarded as a form of pattern matching, with one difference. A match-code defines the sequence in which the data components may reside. For example, you can have a match-code of first name + last name + address + ZIP. Unlike a pattern, the match-code does not fix a standard length for each component. Instead, each component refers to domain rules (or a list of possible values).
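One way to read this description is as an ordered list of components, each checked against a domain rule or a reference list rather than a fixed length. The sketch below follows that reading. The reference lists, the rules and the name apply_match_code are all hypothetical, and commercial tools may define match-codes differently.

```python
import re

# Hypothetical domain rules: each component is validated by a reference
# list or a rule, never by a fixed length.
FIRST_NAMES = {"john", "mary"}
LAST_NAMES = {"smith", "jones"}

MATCH_CODE = [
    ("first_name", lambda v: v.lower() in FIRST_NAMES),
    ("last_name", lambda v: v.lower() in LAST_NAMES),
    ("address", lambda v: bool(re.fullmatch(r"\d+ .+", v))),
    ("zip", lambda v: bool(re.fullmatch(r"\d{5}", v))),
]

def apply_match_code(fields):
    """Return labelled components if the fields fit the code, else None."""
    if len(fields) != len(MATCH_CODE):
        return None
    result = {}
    for value, (name, rule) in zip(fields, MATCH_CODE):
        value = value.strip()
        if not rule(value):
            return None  # component violates its domain rule
        result[name] = value
    return result

good = apply_match_code(["John", "Smith", "12 Oak Street", "62704"])
bad = apply_match_code(["12 Oak Street", "John", "Smith", "62704"])
print(good, bad)
```

The second record contains the same values in a different sequence, so it does not fit this match-code and would need a second code with its own component order.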

Searching through wild-cards:

This is a brute-force tactic, whereby you search using wild-card symbols such as ? (which typically stands for any single character) or * (which typically stands for any run of characters).
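As an illustration, Python's standard fnmatch module supports the ? and * symbols with the usual meaning. The helper wildcard_search and the sample values are hypothetical, and the search is made case-insensitive as an assumption.

```python
from fnmatch import fnmatchcase

def wildcard_search(values, pattern):
    """Return the values that match a pattern containing ? or * symbols."""
    pattern = pattern.lower()
    return [v for v in values if fnmatchcase(v.lower(), pattern)]

surnames = ["Smith", "Smyth", "Smithson", "Jones"]
one_char = wildcard_search(surnames, "Sm?th")   # ? stands for one character
any_tail = wildcard_search(surnames, "Sm*")     # * stands for any run
print(one_char, any_tail)
```

Because every value is compared with the pattern in turn, this approach scans the whole list, which is why it suits ad hoc lookups more than bulk matching.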
