Pentaho® Cleanser: User Flexibility Re-Invented

by Sherin Mathew,
Sales Engineer

Melissa Data recently released its newest component for Pentaho®, the Generalized Cleanser. It gives users greater flexibility by letting them tailor their own rules to meet their data quality needs. The Generalized Cleanser supports the following rule types:

  1. Regular expression: This rule may be used to find patterns within data and perform an action when a pattern is recognized
  2. Case: This rule may be used to transform data into lower, upper, or mixed case
  3. Punctuation: This rule may be used to either add or remove specified punctuation within a selected record
  4. Expression: This rule may be used to evaluate a field and operate on it, such as checking whether two strings are equal
  5. Abbreviation: This rule may be used to abbreviate a field or to expand an abbreviation, such as CA to California or vice versa
  6. Search and Replace: This rule may be used to search and replace text within a field, either within a substring of a record or across the full record

All the rules mentioned above may be applied to one or many fields. The Generalized Cleanser recognizes most of the standard Field Data Types, such as First Name, Last Name, City, and State. It also has a general field type for data that does not fit the standard types.

The user can apply many cleansing rules to one field, and the rules run in the order they are defined. If a field has three rules, the Generalized Cleanser performs the 1st rule, passes that result to the 2nd rule, and then passes the 2nd result to the 3rd.

The Generalized Cleanser may be obtained from the Pentaho Marketplace or directly from within Spoon, the Pentaho GUI.
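
To illustrate how several rules on one field feed into each other, here is a minimal Python sketch. It is not the Generalized Cleanser's actual implementation or API; the function names (`case_rule`, `punctuation_remove_rule`, `abbreviation_rule`, `regex_rule`, `apply_rules`) and the sample data are hypothetical and only model the idea that each rule receives the previous rule's output.

```python
import re
from typing import Callable, Dict, List

# A rule is modeled here as any function that maps a field value to a new value.
# All names in this sketch are hypothetical, not part of the Pentaho component.
Rule = Callable[[str], str]


def case_rule(mode: str) -> Rule:
    """Return a rule that converts a value to lower, upper, or mixed case."""
    modes: Dict[str, Rule] = {
        "lower": str.lower,
        "upper": str.upper,
        "mixed": str.title,
    }
    return modes[mode]


def punctuation_remove_rule(chars: str) -> Rule:
    """Return a rule that strips every character listed in `chars`."""
    table = str.maketrans("", "", chars)
    return lambda value: value.translate(table)


def abbreviation_rule(mapping: Dict[str, str]) -> Rule:
    """Return a rule that swaps a whole value using a lookup table."""
    return lambda value: mapping.get(value, value)


def regex_rule(pattern: str, replacement: str) -> Rule:
    """Return a rule that replaces every match of `pattern`."""
    compiled = re.compile(pattern)
    return lambda value: compiled.sub(replacement, value)


def apply_rules(value: str, rules: List[Rule]) -> str:
    """Run the rules in their defined order, feeding each result to the next."""
    for rule in rules:
        value = rule(value)
    return value


# Three rules on a State field: strip periods, upper-case, then expand.
state_rules = [
    punctuation_remove_rule("."),
    case_rule("upper"),
    abbreviation_rule({"CA": "California"}),
]
cleaned = apply_rules("c.a.", state_rules)  # "c.a." -> "ca" -> "CA" -> "California"

# The same rules in a different order: the lookup runs before the value is normalized.
reordered = apply_rules("c.a.", [state_rules[2], state_rules[0], state_rules[1]])
```

In this sketch `cleaned` ends up as `California`, while `reordered` ends up as `CA`, because the abbreviation lookup ran before the value had been normalized. The order in which rules are defined on a field therefore matters.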

Records example