News
Thoughts on Data Quality
Craig S. Mullins, corporate technologist, NEON
Enterprise Software
Data quality is an elusive goal. The more we try to
keep accurate records in database systems, the more
it seems that at least some of that data is just
plain wrong. If you work with your organization’s
data, you know how important it is for that data to
be accurate... and you know how often
data
inaccuracies conspire to make your job more
difficult.
With that in mind, I’d like to ruminate on some of
the causes of data quality problems.
There are many barriers and inhibitors to data
quality. Speed-to-market issues tend to hinder data
quality improvement in today’s world of web-enabled,
rapidly deployed systems. Instead of operating
within a well-defined data architecture that
mandates data modeling and logical database design,
we fall prey to the opportunistic techniques of the
day.
The rapid-fire nature of project planning and
application development is one inhibitor, but it is
not the only one. Information overload and rapidly
expanding data storage exacerbates the problem. As
we add more and more data without necessarily adding
more people and processes and policies to ensure the
accuracy of the data, data quality will continue to
degrade.
The propensity to copy data all over the place is
another big problem. Instead of using database
management systems to share data among multiple
applications, we copy data instead… to files and
spreadsheets and emails and other database systems
and on and on and on.
With copies of the (supposedly) same data swimming
all over the place, who can guarantee its accuracy?
When it changes in one system, does it get changed
everywhere? What about that user who has his
spreadsheet up in one window and an application up
in another window: who can ensure that s/he does not
wreak havoc on the application data because the
information in the spreadsheet is outdated?
Think about maintaining
duplicate data
implementations like having multiple watches that do
not tell the same time. Will they ever be exactly
the same? Synchronizing the data can be a Herculean
task.
Automation can help. ETL tools that automatically
populate data changes from one database out to
others can help to keep multiple copies of the same
data synchronized. But the results will only be as
good as the automation you put in place. It is
doubtful that you will automate the modification of
every copy of the data throughout your organization.
A much better approach would be to stop the wild
copying of data and instead share data using a DBMS.
After all that is what a DBMS is designed for, you
know?
Assuring data quality can be a struggle even in a
centralized and homogeneous environment. Data
stewards are needed to implement programs for data
quality improvement. Such programs need to include
both technology and business tasks. Without
executive-level business folks intimately involved
in the data quality effort, it will surely fail.
Implementing data quality programs is costly and
without upper level management buy-in, human nature
will kick in; and we’ll start copying data hither
and yon all over again.
OK, so let's try to summarize what is required to
improve data quality? At a high level, there are six
components: accuracy, integrity, security,
accessibility, non-redundancy, and flexibility.
Accuracy involves data modeling and implementing
physical databases that correlate to those models.
Integrity has a twofold meaning: placing rules on
the data into the DBMS that define its allowable
values and assuring the backup and recoverability of
the database in the event of hardware or software
failures. Security involves the protection of the
data from unauthorized access – as well as being
able to track who did what to which data when.
Accessibility means that the data is available in an
optimal manner for those who are authorized users.
Non-redundancy means that data is not stored
inconsistently in multiple places. And flexibility
refers to the ability to modify database structures
as changes are required.
Both database administration and data administration
need to be involved. Data administration sets the
policies and rules for data quality and database
administration helps to implement them using
existing database technology.
DBA tools for database
change management, backup and recovery, and
performance assurance can help to achieve these
goals from a physical database perspective. From a
logical perspective, there are tools to automate and
facilitate data modeling, metadata management, and
data profiling. Of course, tools aren't the only
answer, but they sometimes can make an impossible
task at least possible.
And then there are the needed tools that do not
exist. Wouldn’t it be nice to have a data movement
auditing tool that told you whenever data moved
anywhere within the organization? At least that way
we’d know what we were dealing with. But such a tool
does not exist (could it even exist given our
current complex infrastructures?). So we struggle
along, doing the best we can do… and that means data
quality problems will be a sad fact of everyday
life.
---Source: The Data Administration Newsletter May
2009 (www.Tdan.com). Craig S. Mullins is a data
management strategist, researcher, and consultant.
He is also a corporate technologist for NEON
Enterprise Software (http://www.neonesoft.com).
|
|
|