April 02, 2024
Biotech Matters: Great Data Competition and Interoperability with Allies and Partners
In the U.S.-China competition over biotechnology, America’s most powerful asset with untapped potential is data. Data are the foundation of discovery; data of adequate size, type, and diversity are necessary to realize the potential of artificial intelligence and machine learning (AI/ML), and to support the growing bioeconomy.1 In a world where natural resources are dwindling and current agricultural practices are at risk, using biotechnology to do things such as develop crops that can survive in a changing environment, and to make things that cannot otherwise be manufactured, will soon be essential for survival.2 Understanding and strengthening America’s biotechnology leadership requires exploring the limits of existing data relevant to the U.S. bioeconomy so that policymakers and the biotechnology community can promote data policies and practices that drive sustainability and interoperability, while preserving U.S. values, privacy, and security interests.
Bioeconomy-Relevant Data in the United States
Data that describe the biology of microbes, pathogens, plants, animals, and humans—data derived from any living thing—are all relevant to the bioeconomy. These data include genomes, gene expression, proteins, metabolites, images, and structural information about how the pieces fit together. They are essential for both discovery and streamlining biomanufacturing. Technology has enabled the massive collection of data, such as genetic data, at scale. The U.S.-led Human Genome Project took 13 years, thousands of researchers from the United States and allied nations, and $2.7 billion to achieve a 90 percent complete genome.3 Today, scientists can sequence a human genome in mere hours, at much higher accuracy, for a cost of about $600. Demonstrating this possibility took a huge initial U.S. government investment and a cooperative research effort with allies.
The massive amounts of bio-data that can be collected must be saved, organized, and analyzed to unleash the bioeconomy. A challenge is that data are often spread across different databases and organizations, each with its own way of organizing the data, with few industry-wide standards.4 Large tech companies are organizing data for text, image, and audio; the task is more challenging for complex bio-data. A lack of coordinated effort to build interoperability tools for bio-data makes it exceedingly hard for researchers to assemble and analyze data, and it costs a significant amount of human time and money to curate data from different sources. Creating data anew is also costly, sometimes impossible for data derived from rare biosamples.
There is also a problem with sustainability for bio-data. Public and private data both face limitations to long-term data saving and sharing. Some of the publicly funded databases, such as those housed at the National Institutes of Health’s (NIH’s) National Center for Biotechnology Information (NCBI), have regular funding.5 But not all data fit neatly into these structured databases, and without permanent infrastructure to handle the complexity, data are siloed or deleted. There are also issues with who has access to data. This is perhaps one of the greatest differences between the United States and China. The People’s Republic of China (PRC) has access to government-funded and private-sector data, whereas in the United States, much of the useful bio-data exist in the private sector and are often withheld for intellectual property, privacy, or other reasons. As a result, there are many fewer examples of cross-sector data sharing in the United States. Finding ways to unlock and scale private company data, while preserving America’s commitment to privacy and civil liberties, is critical for U.S. global leadership in the 21st-century bioeconomy.
Bioeconomy-Relevant Data Policies
NIH is the largest funder of biomedical research in the world and is a global leader in open data to support scientific discovery.6
Since the founding of the National Library of Medicine (NLM) nearly 200 years ago, and more recently the NCBI 35 years ago, NIH has made scientific information open and accessible to the research community and public via databases.7 One way NIH encourages the research community to move to open access has been to partner with scientific journal publishers to make data sharing a requirement for highly prized publications. Building on its open science leadership, NIH expanded its data-sharing policy, and the White House Office of Science and Technology Policy recently released government-wide data-sharing guidance.8
Open data in the United States is largely driven by a desire to accelerate discovery. This model works well when global partners are also invested in open science. But what happens when not everyone is committed to a model of open data? For example, China benefits from the U.S. and allied stance on open data, but does not holistically participate in this model.9 Most of the bio-data resources in China are not publicly available, and China is also collecting data from open sources in the United States and globally, and from U.S. private companies.10 China’s coordinated effort to collect and organize data means its bio-data assets may be AI-ready before those of the United States, unless the latter acts to improve coordination of sustainable and interoperable data infrastructure internally and with allies. Policies should be considered to preserve open data in the United States and with partner nations, while restricting outbound sharing of U.S. data unless partner nations also share data, in a way similar to balancing import/export in order to maintain fair trade.
Of course, not all data should be shared publicly, especially when it would endanger individuals or national security. There are safeguards in place for data with security or privacy concerns, such as human genomic data and health information.11 Some of the most valuable data sets require access approvals of some kind, such as the NIH’s All of Us Research Program and the UK Biobank.12 Both resources contain health information, genomic data, digital health information, and more on hundreds of thousands of people from the United States and UK. Access controls for both resources allow verified researchers to use the data while protecting the privacy and security of individual participants.
Interoperability and Sustainability to Support U.S. Bioeconomy Data Goals
The open data model drives discovery when every nation works together, but given that unlikely condition, what does the United States’ stance on open data mean, and what can the country do to preserve open science appropriately while also securing America’s competitive advantage? Open data should not be abandoned if it does not pose security or privacy risks, as it aligns with U.S. ideals in a free market economy. Instead of endorsing a closed model, the United States should support open data while championing sustainability and interoperability. Sustainability is essential, because without it, valuable data risk deletion. Data become more valuable over time, as additional context about the data comes to light. The United States must establish persistent infrastructure for the complexity of bio-data, similar to the infrastructure built and maintained for electricity and water. The goal of dedicated, long-term funding for persistent data and compute infrastructure is highlighted as the first proposed core action in the recent report of the interagency working group on data for the bioeconomy.13 The United States should also consider coordinated efforts with allies to organize and collate open data sources globally to maximize value.
Investing in solutions for interoperability across data sets could also help unleash the bioeconomy. Researchers recently performed a cross-cohort analysis of All of Us and the UK Biobank, which showed that, although these two databases are built using modern cloud resources, this is not enough without tools to enable analysis across databases.14 The cross-cohort analysis took significant human time and effort to re-curate the data, both together and across databases. Tools that describe each database’s data elements and structure would let researchers skip this time- and resource-intensive curation step and fast-track new analyses. The process is dynamic: as understanding improves, baseline interoperability frameworks will emerge, which in turn will drive the community toward common standards.
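As a rough illustration of what such a descriptive tool might look like, the sketch below declares, for two hypothetical cohorts, how each one stores a few shared data elements, and then uses those declarations to produce records in one common layout. Every field name, code value, and unit here is an assumption made up for the example; none is taken from the actual All of Us or UK Biobank schemas, and real harmonization involves far more elements, consent rules, and quality checks.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


def identity(value: Any) -> Any:
    """Pass a value through unchanged (no unit or code conversion needed)."""
    return value


@dataclass(frozen=True)
class FieldMap:
    source: str                       # field name in the source database
    convert: Callable[[Any], Any]     # converts source units/codes to the common form


# Hypothetical data dictionaries: common element -> how each cohort stores it.
COHORT_A_MAP: Dict[str, FieldMap] = {
    "participant_id": FieldMap("person_id", str),
    "height_cm": FieldMap("height_in", lambda v: round(v * 2.54, 1)),  # inches -> cm
    "sex": FieldMap("sex_at_birth", lambda v: {"F": "female", "M": "male"}[v]),
}
COHORT_B_MAP: Dict[str, FieldMap] = {
    "participant_id": FieldMap("participant", str),
    "height_cm": FieldMap("standing_height", identity),                # already in cm
    "sex": FieldMap("sex_code", lambda v: {0: "female", 1: "male"}[v]),
}


def harmonize(records: List[dict], mapping: Dict[str, FieldMap], cohort: str) -> List[dict]:
    """Rewrite each source record into the common layout described by `mapping`."""
    harmonized = []
    for record in records:
        row = {name: fm.convert(record[fm.source]) for name, fm in mapping.items()}
        row["cohort"] = cohort  # keep provenance so analyses can adjust per cohort
        harmonized.append(row)
    return harmonized


# Invented sample rows, one per cohort.
cohort_a_rows = [{"person_id": 101, "height_in": 70, "sex_at_birth": "F"}]
cohort_b_rows = [{"participant": 2017, "standing_height": 165.0, "sex_code": 1}]

pooled = (harmonize(cohort_a_rows, COHORT_A_MAP, "A")
          + harmonize(cohort_b_rows, COHORT_B_MAP, "B"))
```

The point of the design is that the curation knowledge lives in the mapping tables rather than in one-off analysis scripts, so a mapping written once can be reused by every later study and compared across databases when the community negotiates common standards.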
Data are the foundation of a successful bioeconomy and are essential to maintain America’s competitive advantage. The United States should actively support interoperability and sustainability of data assets. First, it should immediately plan for sustainable infrastructure for bio-data. This must include policy actions to support U.S. government agencies that fund research to maintain and manage data assets holistically and persistently. Second, it must urgently fund and foster the development of open-source tools to improve interoperability—the ability to connect data in different locations and analyze across databases. Taking these important steps now will drive scientific and economic progress in the U.S. bioeconomy for many years to come.
About the Author
Michelle Holko, PhD, is a scientist and strategic innovator working at the intersection of biology, technology, and security. She has experience in academia, government, and private companies working on solutions for pandemic preparedness and prevention, data stewardship, digital health technology, health equity, and security related to new and emerging technology. She has served as a White House Presidential Innovation Fellow and has led projects with NIH, the Department of Homeland Security’s Cybersecurity and Infrastructure Security Agency, the Department of Health and Human Services’ Biomedical Advanced Research and Development Authority, and the Department of Defense’s Chemical and Biological Defense Program, and was most recently a strategic business executive and scientist at Google. She currently serves as vice president of Biorisk for the disease forecasting company Airfinity, and as an Adjunct Senior Fellow at CNAS.
Acknowledgements
The author is grateful to Bryan Ware, Dr. Alexander Titus, Vivek Chilukuri, Hannah Kelley, and Maura McCarthy for their valuable feedback and suggestions on earlier drafts of this commentary, as well as to Melody Cook and Rin Rothback for their design support.
This commentary series was made possible with general support to CNAS.
As a research and policy institution committed to the highest standards of organizational, intellectual, and personal integrity, CNAS maintains strict intellectual independence and sole editorial direction and control over its ideas, projects, publications, events, and other research activities. CNAS does not take institutional positions on policy issues, and the content of CNAS publications reflects the views of their authors alone. In keeping with its mission and values, CNAS does not engage in lobbying activity and complies fully with all applicable federal, state, and local laws. CNAS will not engage in any representational activities or advocacy on behalf of any entities or interests, and, to the extent that the Center accepts funding from non-U.S. sources, its activities will be limited to bona fide scholastic, academic, and research-related activities, consistent with applicable
1. “Executive Order on Advancing Biotechnology and Biomanufacturing Innovation for a Sustainable, Safe, and Secure American Bioeconomy,” The White House, September 12, 2022, https://www.whitehouse.gov/briefing-room/presidential-actions/2022/09/12/executive-order-on-advancing-biotechnology-and-biomanufacturing-innovation-for-a-sustainable-safe-and-secure-american-bioeconomy.
2. “Biotechnology and Climate Change,” U.S. Department of Agriculture, accessed December 5, 2023, https://www.usda.gov/topics/biotechnology/climate-change.
3. Ewan Birney, “The International Human Genome Project,” Human Molecular Genetics 30, no. R2 (October 1, 2021): R161–63, https://doi.org/10.1093/hmg/ddab198.
4. R. D. Kush et al., “FAIR Data Sharing: The Roles of Common Data Elements and Harmonization,” Journal of Biomedical Informatics 107 (July 2020): 103421, https://doi.org/10.1016/j.jbi.2020.103421.
5. “Welcome to NCBI,” National Center for Biotechnology Information, National Library of Medicine, accessed December 5, 2023, https://www.ncbi.nlm.nih.gov.
6. “Budget,” National Institutes of Health, October 24, 2023, https://www.nih.gov/about-nih/what-we-do/budget.
7. “All Resources,” National Center for Biotechnology Information, National Library of Medicine, accessed December 5, 2023, https://www.ncbi.nlm.nih.gov/guide/all.
8. “Final NIH Policy for Data Management and Sharing,” National Institutes of Health, Office of Extramural Research, notice number NOT-OD-21-013, accessed December 5, 2023, https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html; “OSTP Issues Guidance to Make Federally Funded Research Freely Available without Delay,” The White House, August 25, 2022, https://www.whitehouse.gov/ostp/news-updates/2022/08/25/ostp-issues-guidance-to-make-federally-funded-research-freely-available-without-delay.
9. Dewey Murdick and Owen J. Daniels, “The Perils of China’s Great Information Wall,” Time, June 25, 2023, https://time.com/6289567/china-restricts-open-source-data-research-backfire.
10. Julian E. Barnes, “U.S. Warns of Efforts by China to Collect Genetic Data,” The New York Times, October 22, 2021, https://www.nytimes.com/2021/10/22/us/politics/china-genetic-data-collection.html; “China’s Quest for Human Genetic Data Spurs Fears of a DNA Arms Race,” The Washington Post, September 23, 2023, https://www.washingtonpost.com/world/interactive/2023/china-dna-sequencing-bgi-covid; Carrie B. Dolan et al., “Chinese Health Funding in Africa: The Untold Story,” PLOS Global Public Health 3, no. 6 (June 28, 2023): e0001637, https://doi.org/10.1371/journal.pgph.0001637.
11. Barbara J. Evans and Gail P. Jarvik, “Impact of HIPAA’s Minimum Necessary Standard on Genomic Data Sharing,” Genetics in Medicine: Official Journal of the American College of Medical Genetics 20, no. 5 (April 2018): 531–35, https://doi.org/10.1038/gim.2017.141.
12. “The Future of Health Begins with You,” All of Us Research Program, National Institutes of Health, accessed December 5, 2023, https://allofus.nih.gov/; UK Biobank, accessed December 5, 2023, https://www.ukbiobank.ac.uk.
13. “The Vision, Needs, and Proposed Actions for Data for the Bioeconomy Initiative,” The White House, December 20, 2023, https://www.whitehouse.gov/wp-content/uploads/2023/12/FINAL-Data-for-the-Bioeconomy-Initiative-Report.pdf.
14. Nicole Deflaux et al., “Demonstrating Paths for Unlocking the Value of Cloud Genomics through Cross Cohort Analysis,” Nature Communications 14, no. 1 (September 5, 2023): 5419, https://doi.org/10.1038/s41467-023-41185-x.