Technology · February 23, 2026 · 5 min read

Data Readiness: The Real Bottleneck to Enterprise AI

By Emily Friedman


In 2023, noting a serious operations oversight problem, BMW Group’s Hams Hall plant began centralizing its data gathering infrastructure with the ultimate goal of implementing digital twin technology. 

The plant was producing approximately 1.4 million components and assembling around 400,000 engines per year, generating vast amounts of data that typically ended up siloed. At one point, internal teams were reportedly using more than 400 custom dashboards across 15 different IT systems! 

The situation in Hams Hall reveals a significant challenge facing not only digital twins but digitalization in general: company data. A foundational step in adopting advanced digital technologies involves locating, formatting, and integrating multiple, disparate, and often unstructured sources of data—a complex, resource-intensive, and frequently underestimated process. It’s not surprising that poor data quality is estimated to contribute to 40% of failed business initiatives.

WHAT’S REALLY PREVENTING AI ROI?

What if AI isn’t overhyped? What if we’re just not ready for it? When it comes to AI’s promise in enterprise, data readiness – encompassing data standardization, data quality, data management, etc. – is the elephant in the room. According to a 2025 MIT study of 300+ public AI deployments, 95% of AI projects fail. It’s not that the AI models aren’t good enough; initiatives fail because the data isn’t ready. The transformative potential of AI will remain out of reach until data readiness is achieved.

TRAINING AI MODELS

Despite billions invested in AI solutions, cloud infrastructure, hiring talent, etc., AI systems are only as good as the data they’re trained on. (You know the saying, “garbage in, garbage out.” Bad inputs lead to bad outcomes.) The reality is that most enterprise data isn’t AI-ready, making it impossible to train effective and reliable AI systems. 

AI training data must be accurate, consistent, complete, and relevant. Datasets must closely resemble the real-world problems with which the AI model will be tasked to ensure it can make accurate predictions or decisions on new, unseen data.

COMMON DATA ISSUES

Here are some of the most common data issues and their impact on AI model training and performance:

Siloed data: In today’s sprawling digital environments, company data is often trapped in silos within disconnected systems, making it difficult to access. When AI models are trained on partial, fragmented data, they lack the full picture or context and can make decisions that fail to reflect reality. 

Inconsistent data: Picture the disparate databases, documents/files, chat transcripts, payment records, images, and other structured and unstructured data scattered across an organization. AI requires annotated, standardized data to learn and recognize patterns. Inconsistent formatting, missing values, and duplicates will lead to unreliable, misleading, and inaccurate outputs. 

Inaccurate or outdated data: Manual data processes create inconsistencies and are prone to human error, which can mislead AI models, while stale data drives decisions based on conditions that no longer apply. 

Incomplete or biased data: Insufficient or unrepresentative data distorts AI predictions, causing systems to behave unfairly, hallucinate, or fail to generalize to new, unseen data. AI trained on biased data will only amplify and exacerbate existing prejudices. AI-generated synthetic data can be used to fill gaps, but it can also create feedback loops that degrade model performance over time. 

Excessive (noisy) data: While training AI requires large datasets, ‘quality over quantity’ still applies. Smaller, curated datasets frequently outperform large, noisy ones, reducing the risk of overfitting. (Overfitting occurs when the model captures noise and specific details rather than general, underlying patterns.) 

Poor governance: Without clear data provenance (knowing the data’s origin) and data lineage (knowing how it moved through different apps, was modified, etc.), AI systems become black boxes that are impossible to audit or trust.
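The formatting, missing-value, and duplication issues above can be illustrated with a minimal cleaning sketch. Everything here is hypothetical: the record fields, the accepted date formats, and the dedup key are illustrative choices, not a prescribed pipeline.

```python
# Minimal sketch: normalizing inconsistent records before they reach training.
# Field names, date formats, and the dedup key are hypothetical examples.
from __future__ import annotations
from datetime import datetime

def normalize_date(raw: str) -> str | None:
    """Try several common date formats; return ISO-8601, or None if unparseable."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # don't guess: flag for human review instead

def clean(records: list[dict]) -> list[dict]:
    seen, cleaned = set(), []
    for rec in records:
        date = normalize_date(rec.get("date", ""))
        if date is None:          # incomplete record: exclude from training
            continue
        key = (rec["id"], date)   # dedupe on a stable key after normalization
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**rec, "date": date})
    return cleaned

raw = [
    {"id": 1, "date": "2024-03-01"},
    {"id": 1, "date": "01/03/2024"},   # same record in a different format
    {"id": 2, "date": "Mar 02, 2024"},
    {"id": 3, "date": "unknown"},      # missing/invalid value
]
print(clean(raw))  # two clean records survive; the duplicate and the bad row do not
```

Note that deduplication only works after normalization: the two formats of the same date would otherwise slip past a naive exact-match check.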

AI DATA BEST PRACTICES

Getting accurate, structured, complete, and timely data for training AI typically requires data cleansing, preprocessing, and augmentation. The work is time-consuming, labor-intensive (occupying much of a data scientist’s time), and expensive. Moreover, it doesn’t end once AI is deployed. Ongoing maintenance and regular validation are essential to prevent model drift and performance degradation. 

Addressing data debt requires sustained investment in the resources needed to produce consistent, format-ready data. This includes coordinating efforts across departments and fostering a culture of data sharing; establishing standardized data-cleaning and governance protocols; and possibly building synthetic data pipelines to fill gaps.
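One concrete form a governance protocol can take is lineage tracking: recording where data came from and every transformation applied to it, so outputs can be audited later. The sketch below is a simplified illustration; the class, step names, and source file are all hypothetical.

```python
# Sketch: attaching provenance and lineage metadata to a dataset as it moves
# through a pipeline. Class design, step names, and source are illustrative.
from datetime import datetime, timezone

class TrackedDataset:
    def __init__(self, data, source: str):
        self.data = data
        # provenance: where the data originated
        self.lineage = [{"step": "ingest", "source": source,
                         "at": datetime.now(timezone.utc).isoformat()}]

    def apply(self, step_name: str, fn):
        """Run a transformation and append it to the lineage log."""
        self.data = fn(self.data)
        self.lineage.append({"step": step_name,
                             "at": datetime.now(timezone.utc).isoformat()})
        return self

ds = TrackedDataset([" 42 ", "17", "17"], source="crm_export.csv")
ds.apply("strip_whitespace", lambda rows: [r.strip() for r in rows])
ds.apply("deduplicate", lambda rows: sorted(set(rows)))
print([entry["step"] for entry in ds.lineage])
```

With a log like this, an auditor can answer the two questions the article raises: where did this data originate (provenance), and how was it modified along the way (lineage).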

4 ACTION STEPS

Break down silos: AI requires access to cross-functional data, which requires removing barriers between departments. This involves collaborating with data owners, enabling efficient and secure data sharing, and investing in enterprise-wide data integration such as data lakes, cloud-based architectures, and API-driven integrations to ensure AI models can access complete datasets.  

Invest in data: AI is facing a shortage of high-quality, human-generated data, driving increased reliance on synthetic data, data partnerships (including purchasing access to proprietary, high-quality datasets), and models designed to learn from less data. Meanwhile, unstructured data, comprising 80-90% of enterprise data, remains a largely untapped resource. Unstructured data, of course, must be extracted, cleansed, standardized, and labeled to be usable for AI. Another opportunity lies in leveraging multimodal data like video, audio and sensor (IoT) data to increase the volume and variety of training data.

Develop a data quality team: Establish a team dedicated to assessing the organization’s data readiness across systems and domains, with a specific focus on AI and advanced analytics use cases. Key responsibilities could include defining and enforcing data quality standards, educating employees on the importance of data quality, partnering with engineering, IT, and business to embed best practices into day-to-day workflows and continuously improve data pipelines, and proactively identifying risks (bias, drift, data decay) that could undermine AI initiatives. 

Validation and transparency: AI requires an iterative development approach and continuous human oversight. Models must be versioned and systematically tested at each stage to detect drift, bias, and performance degradation as data, environments, and business realities change. Rigorous validation, version control, and human-in-the-loop oversight are essential to maintaining trust in AI, while strong governance frameworks provide the security and accountability that adoption depends on. 
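A systematic drift test can start very simply: compare a statistic of live feature values against the training-time baseline and flag when it shifts too far. The feature values and threshold below are made-up numbers for illustration; real monitoring would track many features and use more robust tests.

```python
# Sketch: a simple statistical drift check comparing live feature values
# against the training baseline. Values and threshold are illustrative.
import statistics

def drift_score(baseline: list, live: list) -> float:
    """Standardized shift of the live mean, in baseline standard deviations."""
    mu, sigma = statistics.mean(baseline), statistics.stdev(baseline)
    return abs(statistics.mean(live) - mu) / sigma if sigma else 0.0

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]    # feature values at training time
stable   = [10.2, 9.8, 10.1, 10.4, 9.9]    # similar distribution in production
shifted  = [14.0, 15.2, 13.8, 14.9, 14.4]  # the distribution has moved

THRESHOLD = 3.0  # flag when the mean shifts by more than 3 baseline std devs
print(drift_score(baseline, stable) > THRESHOLD)   # no drift: keep serving
print(drift_score(baseline, shifted) > THRESHOLD)  # drift: review or retrain
```

A check like this runs on every batch of production data; when it fires, the human-in-the-loop step the article calls for decides whether to retrain, roll back to an earlier model version, or investigate the data source.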


Data readiness isn’t only a technical matter. It’s a company culture issue, too. Companies are trying to treat AI or digital twins like any other software installation when the technology demands a complete reimagining of workflows, including a new paradigm for collaboration between humans and machines. Buying technology before building the capability is how we accumulated the massive data debt bottlenecking today’s AI efforts. So, pause. AI initiatives are not quick efficiency projects. Invest in data readiness as if AI success depends on it, because it does.


