The Grouparoo Blog


What is Data Integrity?

By Stephen Mash on 2021-12-08

Organizations collect and leverage data on an ever-expanding basis to inform business intelligence and optimize practices. Data allows businesses to gain a greater understanding of their suppliers, customers, and internal processes. Extracting and maximizing the value of the information contained within data can boost productivity, revenues, and profitability.

However, this leveraging of information will not be effective unless the organization can preserve the integrity of the underlying data over its lifetime. Integrity is a critical aspect of data processing; if the integrity of the data is unknown, the trustworthiness of the information it contains is unknown.

What is Data Integrity?

Data integrity is the accuracy and consistency of a data item's content and format over its lifetime. Maintaining data integrity means controlling every change to the data, whether through full configuration control with an audit trail for every modification or simply a process that ensures each change is valid.

This is distinct from data quality. If poor-quality data enters a process, integrity controls will not improve it; they only ensure that the data stays as it was entered. Ensuring good data quality is a separate topic from maintaining good data integrity.

Why is Data Integrity Important?

Data integrity is one of the three elements of the data security triad. Along with confidentiality and availability, maintaining integrity is critical. Unauthorized or malicious changes made to data can undermine the business purposes that use the data.

If undetected, corruption of data and its information will compromise the processes that utilize that data. If detected, investigation and correction will consume resources.

Personal Data

Collecting and managing data carries regulatory responsibilities regarding data protection and evidence required for regulatory compliance.

An organization responsible for collecting, processing, or storing personal data must preserve its integrity. Where data contains personal information, data corruption can result in financial penalties and, in extreme cases, imprisonment of those legally responsible for the personal data under data protection laws in some countries.

For example, where the data refers to residents of California, the California Consumer Privacy Act (CCPA) requires data integrity protection. Also, where the data relates to living individuals within the European Union, the General Data Protection Regulation (GDPR) requires similar assurance, and non-compliance carries severe financial penalties and potential prosecution.

Regulatory Data

Where data supports regulatory compliance processes, such as those that govern food or pharmaceutical manufacturing, data corruption can render the affected products unfit for use and require either additional testing to prove compliance or, more likely, their destruction. In either case, the business will waste raw materials and manufacturing resources and experience a delay in product availability while manufacturing processes wait on the corrective actions necessary to prevent a recurrence.

Threats to Data Integrity

The most common threat to data integrity comes from inadvertent actions that change data in a way that is not immediately apparent. Such errors typically come to light only when downstream use of the data triggers erroneous actions further along in the business processes. Investigating such events and identifying their root cause is time-consuming. Other threats come from malicious acts by attackers, whether a disgruntled employee or an external hacker.

Typical threats include:

  • Code errors in data transformation processes change values
  • Data storage processes overwrite data with different data
  • Operators enter the wrong data
  • Human error results in updates to the wrong data
  • Hardware failures alter data in storage or transit
  • Malicious attackers change commercial data to compromise business operations
  • Mischievous attackers alter manufacturing data to sabotage production lines
  • Disgruntled insiders corrupt data as a vengeful act

Types of Data Integrity

Physical integrity protects the underlying storage infrastructure that holds the data. This can be hardware storage devices or magnetic-based removable media on-premises or in a data center. It can be data in transit on network cabling or transmitted using electromagnetic radiation.

Logical integrity protects the zeros and ones of data in the processing environment, ensuring the data remains correct and accurate during computational and transformational processing.

Protecting Data Integrity

A range of data integrity tools and techniques are available to protect data from unauthorized modification and detect the occurrence of an unexpected change.

  • Access controls are the standard method of protecting integrity by restricting who has access to data and who can alter the data. This mechanism provides a layer of protection, but it will not prevent accidental compromise of integrity due to user error, nor will it stop an attacker who compromises the authentication process or the access credentials of a legitimate user.
  • Data backups can provide recovery mechanisms if backups exist that predate the compromising event and the backups themselves are protected against alteration.
  • Logging can provide an audit trail of alterations made to data and by whom to support the investigation of any integrity issue. However, capable malicious attackers are proficient in covering their tracks by altering standard event logs.
  • Transmitting data across multiple paths can identify the compromise of one path or a path exhibiting erroneous behavior and corrupting data.
  • Comparing results from parallel processing paths that use independent transformation methods can flag the generation of different outcomes.
  • Data validation rules can identify gross errors and inconsistencies within the data set. The application of machine learning techniques can also identify suspect data by modeling expected behaviors and recognizing abnormal or unexpected data points.
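  • Data encryption techniques ensure that only authorized users with access to the key can read data or make meaningful changes to it, provided the key is secure and the encryption algorithm is sufficiently robust. Encryption alone does not always reveal that encrypted data has been altered, so it is commonly paired with an integrity check such as an authenticated encryption mode. However, data encryption places a processing overhead on each access to the data.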
  • Data masking is a faster but less secure alternative to encryption, where scrambling of data protects against eavesdropping if the algorithm used to mask the data remains secure.
  • Data tokenization techniques allow the storage of critical data in secure locations while data warehouses store a token that points to the secure copy. This enables the application of security controls and protection techniques to a subset of data, transparent to processes accessing the data warehouse.
  • Data hashing techniques protect data against undetected modification by computing a hash value from the data and storing it for later comparison. To alter the data undetected, an attacker would also need to replace the stored hash with a matching one. Mixing a secret value into the hash, as in salting or a keyed hash, can prevent attackers from quickly achieving this type of data compromise.
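
The hashing item at the end of the list above can be sketched in a few lines of Python. This is a minimal illustration and not a production design: it uses a keyed hash (HMAC) from the standard library as one way of mixing a secret into the hash, and the key, the record layout, and the function names `sign` and `is_unmodified` are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only; a real key would come
# from a key store, never from source code.
SECRET_KEY = b"example-key-not-for-production"


def sign(record: bytes, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of the record as a hex string."""
    return hmac.new(key, record, hashlib.sha256).hexdigest()


def is_unmodified(record: bytes, stored_digest: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the digest and compare it to the stored one in constant time."""
    return hmac.compare_digest(sign(record, key), stored_digest)


# Store the digest alongside the record when it is written.
original = b"customer_id=42;credit_limit=5000"
digest = sign(original)

# Later, a changed record no longer matches the stored digest.
tampered = b"customer_id=42;credit_limit=50000"
print(is_unmodified(original, digest))  # True
print(is_unmodified(tampered, digest))  # False
```

Because the digest depends on the secret key, someone who can change the record but does not hold the key cannot compute a new matching digest. The check only detects the change; recovery still depends on backups or another trusted copy.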

Data Integrity Challenges

Maintaining the integrity of dynamic data is a considerable challenge, where frequent updates can make monitoring processes impractical. However, where dynamic data feeds business-critical processes, defensive techniques can ensure integrity.

Where data propagates through suites of business applications, any undetected data integrity error will adversely impact the outputs of all processes that use information derived from that data. For example, integrity issues can affect Sales and Operations Planning (S&OP), Enterprise Resource Planning (ERP), and Customer Relationship Management (CRM) systems that all leverage the same dataset. This has the potential to impact the entire business operation. Also, where an organization is part of an integrated supply chain, data integrity errors can affect all information users up and down the supply chain.

For example, a data integrity error in raw customer demand information that leads to a false but believable elevated demand signal can cause an increase in production rates to fulfill the believed demand. This situation will generate orders for excessive raw materials to meet production demand. The impact for the business will be:

  • Financial costs of procuring, transporting, and storing additional raw materials
  • Economic costs of manufacture of the additional products to meet demand
  • The business impact of ramping up manufacturing for additional products by increasing throughput or diverting resources from other products
  • Financial costs of transporting and storing additional unwanted products
  • Economic costs of writing off unwanted products that exceed their shelf life

Processes can capture significant errors, allowing revalidation of any data of suspect integrity before use. However, this will require additional resources to investigate and remediate issues. In time-critical business processes, prevention is the preferred option.
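
As a rough sketch of how a process might capture significant errors before use, the following Python applies two simple validation rules to incoming demand readings: no negative values, and no values far outside the historical range. The function name `suspect_points`, the three-standard-deviation threshold, and the sample figures are assumptions chosen for illustration. A rule like this catches only gross errors; a false but believable value would pass.

```python
from statistics import mean, stdev


def suspect_points(history: list[float], incoming: list[float],
                   max_sigma: float = 3.0) -> list[float]:
    """Return incoming values that are negative or far from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two points
    flagged = []
    for value in incoming:
        if value < 0 or abs(value - mu) > max_sigma * sigma:
            flagged.append(value)
    return flagged


# Hypothetical weekly demand figures and a new batch of readings.
weekly_demand_history = [100, 104, 98, 101, 97, 103, 99, 102]
new_readings = [101, 1010, -5, 96]

print(suspect_points(weekly_demand_history, new_readings))  # [1010, -5]
```

Flagged values would then go for revalidation instead of flowing straight into production planning.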

For big data, the principle of data integrity is unchanged. It simply needs to be maintained on a much larger scale, in an environment where subtle, unintended changes may be impractical to detect unless the organization implements specific automated processes that monitor integrity.
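
One possible shape for such an automated check, sketched here under simplifying assumptions, is to compute a fingerprint of a dataset at its source and again at its destination and compare the two. The `fingerprint` function and the sample rows below are hypothetical, and the sketch holds everything in memory, which a real large-scale pipeline would likely replace with per-partition processing.

```python
import hashlib
import json


def fingerprint(rows: list[dict]) -> tuple[int, str]:
    """Return (row count, order-independent SHA-256 digest) for a set of rows."""
    # Hash each row from a canonical JSON form so key order does not matter.
    row_digests = sorted(
        hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()
        for row in rows
    )
    # Sorting the row digests makes the result independent of row order.
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return len(rows), combined


source = [{"id": 1, "qty": 10}, {"id": 2, "qty": 7}]
copy_ok = [{"qty": 7, "id": 2}, {"id": 1, "qty": 10}]    # same rows, different order
copy_bad = [{"id": 1, "qty": 10}, {"id": 2, "qty": 70}]  # one value changed

print(fingerprint(source) == fingerprint(copy_ok))   # True
print(fingerprint(source) == fingerprint(copy_bad))  # False
```

A scheduled job could run this comparison after each load and raise an alert on any mismatch, so a subtle change is caught without anyone inspecting individual rows.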

Conclusion

Maintaining data integrity means ensuring the data remains complete and correct over its lifetime. In the world of data warehousing and data lakes, where business processes both feed and draw from the data pool, maintaining data integrity is essential.

Errors that go undetected can have consequences along the supply chain of an integrated network of organizations before anyone notices them. Furthermore, any organization that shares data with compromised integrity will quickly suffer a reputational hit. Therefore, it’s critical to manage data integrity and protect against all credible threats.

The Grouparoo reverse Extract, Transform, and Load (ETL) tool takes data from a data warehouse and sends the data to different destinations or tools, empowering business teams to act with verified and trustworthy data. Read more about our Reverse ETL Tools.
