
This post was written by a human, alongside an AI-written version available for comparison. We believe it is important to highlight the differences between these two writing styles. While AI is increasingly integrated into how we communicate, we feel strongly that true communication still requires a human voice. We’d value your feedback on this comparison and your thoughts on the process moving forward.

Summary

All modern drug discovery and pharmaceutical companies rely on CROs for at least some of their experimental results, with data typically delivered in spreadsheets, reviewed by scientists, and then buried deep in a shared file system. Though there is often a quiet assumption that CRO data will eventually “get organized”, in practice the opposite happens. What begins as a manageable flow of spreadsheets quickly compounds into a massive backlog that becomes a crisis as soon as historical data is needed.

I recently worked with a team that had over two years of data spread across spreadsheets from multiple vendors. Every analysis required reconciling conflicting formats and patching together a version of the truth. Each CRO delivered data in its own format, and even within a single vendor, reports varied by experiment or template. While extracting data from any individual spreadsheet wasn’t a challenge, the sheer number of files and report formats in the backlog conspired against quickly accessing the desired data.

“CRO data backlogs don’t form because teams ignore data; they form because data systems are designed too late.”

The Failure Mode of Spreadsheet Data

There is nothing inherently wrong with delivering data in spreadsheets. As we’ve previously discussed, spreadsheets are a universally accepted, easy to use format for sharing semi-structured data. Unlike with proprietary formats (cough mass spec cough), any scientist can open a spreadsheet and review its contents.

However, when you dig into these datasets, you consistently find:

  • Layout changes that break automated parsing 
  • Misspelled species, control, or column names that corrupt analyses 
  • Shifting units between reports 
  • Calculation errors

“A purely mechanical transformation will preserve every error perfectly.”

To solve this, you need a systematic way to normalize data across contexts and a rigorous curation process that surfaces anomalies before the data reaches your scientists.
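What does "surfacing anomalies" look like in practice? Here is a minimal sketch of an automated curation pass using only the Python standard library. The column names (`species`, `t_half`, `unit`) and the controlled vocabularies are hypothetical placeholders, not a standard; a real process would check against your own data model.

```python
import csv
import io

# Hypothetical controlled vocabularies; yours come from your data model.
ALLOWED_SPECIES = {"mouse", "rat", "dog", "human"}
ALLOWED_UNITS = {"min", "h"}

def curate(rows):
    """Return (row_number, problem) tuples for a human to review."""
    problems = []
    for i, row in enumerate(rows, start=1):
        if row["species"].strip().lower() not in ALLOWED_SPECIES:
            problems.append((i, f"unknown species: {row['species']!r}"))
        if row["unit"] not in ALLOWED_UNITS:
            problems.append((i, f"unexpected unit: {row['unit']!r}"))
        try:
            float(row["t_half"])
        except ValueError:
            problems.append((i, f"non-numeric t_half: {row['t_half']!r}"))
    return problems

# A toy CRO report with two of the classic errors baked in:
report = io.StringIO(
    "species,t_half,unit\n"
    "mouse,42,min\n"
    "Mosue,13,min\n"   # misspelled species
    "rat,1.5,hours\n"  # unit not in the controlled vocabulary
)
flagged = curate(list(csv.DictReader(report)))
# flagged -> [(2, "unknown species: 'Mosue'"), (3, "unexpected unit: 'hours'")]
```

The point is not the specific checks but the shape of the process: every incoming report passes through the same gate, and anything the gate can't account for goes to a human rather than silently into an analysis.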

The Cost of Reactive Management

Repeatedly going back to CRO reports to manually extract and clean experimental results is a thankless task all too familiar to most scientists. The same errors that were present the first time are still present, the cost of poor backlog management grows more frustrating with each pass through the same data, and it’s easy to forget which subtle errors were identified the last time through.

I’ve seen companies attempt to “brute-force” this by paying CROs to manually structure historical reports, sometimes at costs exceeding $100 per report, only to find that up to 25% of the “cleaned” data still contained errors. You end up paying twice: once to organize the data, and again to fix what was missed.

I’ve also seen attempts to extract and clean data in bulk with fully automated approaches. These pipelines are deterministic, but the sheer variation between reports can demand a separate script for each file, defeating the benefits of automation. What’s worse, a purely mechanical transformation, no matter how elegant, will faithfully preserve every error it failed to detect.

This highlights a core failure mode of spreadsheet-based CRO data: errors are easy to introduce, hard to detect, and always present in the original spreadsheets. Attempts to clean them up at scale are time consuming and often still let errors slip through.

“Manual data cleanup does not guarantee data quality.”

Proactive Management for Fun and Profit!

Scientists should have ready access to all the results relevant to their study, not only when they arrive but any time they’re needed. They shouldn’t have to rummage through old files of dubious quality.

Instead of reactively extracting and cleaning data well after it’s been collected, proactively capturing and curating data as it arrives sidesteps the backlog problem entirely while providing confidence in the quality of legacy data.

Getting to this point and staying ahead of your data backlog isn’t difficult, but it does require a little up-front planning and ongoing discipline:

  • First, you need a systematic way to capture and normalize the data. This is not just reformatting, but understanding what the data means across experiments, vendors, and contexts. Develop a “master data model” or ontology for your experiments. Standardize terminology and units. Is it “T1/2” or “half life” measured in “seconds”, “minutes”, or “hours”? 
  • Next, set up a rigorous curation process that is applied as soon as new CRO reports arrive. Surface inconsistencies, flag anomalies, catch formatting errors. You can and should automate the process, but always have a human in the loop when something looks off.
  • Finally, put the cleaned data where it can be used, whether that’s through a dashboarding tool, a fully integrated software system, or just well-maintained spreadsheets that collate results. You know that FAIR acronym your CIO loves to put in slides? This is it!
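To make the first step concrete, here is a toy “master data model” for one measurement: a synonym table that maps vendor spellings onto one canonical term, and a unit table that converts every value into hours. Both tables are illustrative, not a standard; a real ontology would cover every assay your program orders.

```python
# Hypothetical synonym and unit tables; real ones come from your
# "ontological discussions", not from this sketch.
SYNONYMS = {"t1/2": "half_life", "half life": "half_life", "half-life": "half_life"}
TO_HOURS = {"s": 1 / 3600, "seconds": 1 / 3600,
            "min": 1 / 60, "minutes": 1 / 60,
            "h": 1.0, "hours": 1.0}

def normalize(name, value, unit):
    """Map a vendor-specific (name, value, unit) onto the canonical model."""
    canonical = SYNONYMS.get(name.strip().lower())
    if canonical is None:
        raise ValueError(f"unmapped term: {name!r}")  # surface, don't guess
    factor = TO_HOURS.get(unit.strip().lower())
    if factor is None:
        raise ValueError(f"unmapped unit: {unit!r}")
    return canonical, value * factor  # every value stored in hours

# Three vendor spellings, one canonical record:
assert normalize("T1/2", 90, "minutes") == ("half_life", 1.5)
assert normalize("half life", 2, "hours") == ("half_life", 2.0)
```

Note the design choice: anything outside the tables raises rather than passing through, which is exactly the “flag it for a human” behavior the curation step depends on.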

None of these steps need to be difficult. At BBC, we regularly guide clients through their “ontological discussions” to standardize terms, units, and relationships. These meetings take a few hours spread across a few days and provide the framework for curation. 

Automated curation is best implemented on-demand. Rather than build a robust set of tools for all possible reports, simply add new tools as you start ordering new studies. Most programs start off with a few ADME assays. Automate those. Once you start doing PK studies, you just need to add an incremental automation. Pace yourself and stay disciplined with maintenance and you’ll have a robust system before you know it.

Instead of paying twice (once to organize the data, again to fix what was missed), you simply pay a little as you go, knowing your data is in good shape.

“By the time you’re cleaning up your CRO data, you’re already late.”

How to Break the Cycle

I’m sure we’ve all been in meetings where this exact strategy has been discussed and then promptly put onto the “wouldn’t-that-be-nice” shelf. How do we actually break the cycle of reactive data management?

At BBC, we’ve been dealing with CRO data internally and for our clients for almost three decades. The hardest part of getting ahead of the backlog is taking the first step. We’ve consistently followed a few guidelines that have stood the test of time:

1. Have Data Strategy on Day One

Technical debt begins the moment you generate data. An evolving but grounded platform is a necessity for science at scale, not a luxury. (When is Day One? Today is Day One!)

2. Hire Informatics Expertise

If this expertise isn’t in-house, bring it in before the backlog forms. Your scientists are experts on their data, but they are not data experts. And good data strategy can’t be vibe coded (some automations, sure, but leave strategy and architecture to humans).

3. Acknowledge the Gap

Know that data coming in from CROs will need some cleaning and organization. Their systems are optimized for their workflows, not yours. Bridging this gap is essential for long-term usability and traceability. 

Once you’ve made the transition from reactive to proactive data management for your CRO data, the only question you’ll ask is “Why didn’t I always do it this way?”