Catch the Replay: Meet A Data Scientist with Dr. Ken Goodman

January 21, 2021 | May 16, 2024

Dr. Ken Goodman opened the fourth session in the Meet A Data Scientist lecture series on January 13, 2021 from 4:00-5:00 PM with a talk on how data science works to reduce ignorance, uncertainty, and the harboring of false beliefs, and to correct previous mistakes.

TALK TITLE: “Data Science + Ethics – UM’s Role in the New Research”

He shared this epistemological quote emphasizing the problem with false beliefs: “It isn’t what we don’t know that gives us trouble, it’s what we know that ain’t so.” -Will Rogers

Dr. Goodman illustrated the Law of the Excluded Middle with the example of a dog chasing a rabbit to a fork in the road: when the first path it sniffs carries no scent, the dog does not need to sniff the other path before pursuing. “For any proposition, either that proposition is true or its negation is true” (there is no third option, so if the dog doesn’t smell rabbit in one direction, it concludes the rabbit must have gone in the other). This segued into the traditional methods of science, such as framing and testing hypotheses, and how data science is changing these methods and raising ethical issues.

History of Data Science

Dr. Goodman then discussed a brief history of data science to shed light on data analysis:

–Abu al-Kindi (801-873) looked at how the frequency of characters in a language could be applied to coded documents to decode them (origins of cryptography/inference by recurrence).

–Thomas Beddoes (1760-1808) understood the importance of observation (facts generated by patient-clinician interactions), record keeping, and creating a national bank of medical knowledge. Beddoes could see that observational data was being lost (origins of Archie Cochrane and evidence-based practice). Today, this illustrates that we are still muddling through methodological challenges: we are generating a lot of data, but we’re losing it.

–Pierre C. A. Louis (1787-1872) realized the importance of collecting statistics on the success of methods of treatment.

How We Run Experiments Now

Another important factor is that the tools used to analyze data are changing rapidly: for example, a 5 MB device was as large as a reach-in closet in 1956, but 32 GB fits inside a wristwatch today. Newer methods are being used. Instead of testing hypotheses, we now ‘run experiments’ by seeking patterns in data repositories (relational databases, varieties of statistical analysis including meta-analysis, and machine learning, which used to be called “knowledge discovery in databases”).

Secondary Use of Data

These new methods have produced a number of challenges across many fields, including regulatory reliance on downstream data. We are living in an ethics-regulatory environment that emphasizes secondary use. Dr. Goodman thinks this is an ethics mistake. He feels that the role of consent changes when using de-identified data with consent is weighed against full use of the data with adequate encryption and privacy algorithms. He also feels that this environment creates an artificial distinction between “clinical research data” and “public health data.”

Challenges of Reproducibility

Dr. Goodman asks: In meta-analysis (pooling other people’s experiments), what can explain any failure of replication/reproducibility? How confounding is “missingness” (missing data)? This raises interesting conceptual and epistemological problems. In data science, attention to reproducibility highlights data quality, software engineering, and version control (shout out to Dr. Richard Bookman) as ethical issues, and lapses in these areas may explain failures of reproducibility in large data sets. There is a need for improved confidence measures for corroboration, confirmation, and replication/reproducibility (shout out to Dr. Otavio Bueno in the Philosophy Dept.). Philosophers have been worried about these issues for a very long time. Reproducibility became a big problem in biomedical science, Dr. Goodman said, when researchers realized they were spending a lot of money on studies that were contradicting each other, not confirming each other, and were not guiding us in practical applications (clinical care). He believes the challenges of reproducibility are best seen as an ethical issue related to data science that is under-addressed as such. [Dr. Goodman mentioned session attendees Dr. Joanna Johnson in the University’s writing program and her colleague April Mann. Dr. Johnson received an award from the AAMC (Association of American Medical Colleges).]

“Data Data” (data about data/data quality)

How do we curate our data? Who’s curating the data? Dr. Goodman gave an example of a further use of trial data: a secondary meta-analysis of de-identified data, where privacy fundamentalists would have said permission should extend to all subsequent uses of the data.

In health-care data, patient-reported outcomes (which go back to Beddoes and Louis), real-world data, and all of the stuff that happens in our hospitals contribute to a learning health care system. Beddoes, Louis, and Cochrane were the foundations of evidence-based practice, and they are also the grandparents of the very idea of a learning health care system. He recommends the Learning Health Care Systems Journal, citing a growth in literature about how we do a better job collecting, analyzing, storing, re-curating, reanalyzing, and reproducing data from every clinical encounter there is. “If we don’t get that right,” Dr. Goodman says, “we’re losing a bunch of valuable data!”

Dr. Goodman cited an example of lost data: a patient with prostate cancer asked his physician what his treatment was based on. The answer was a very large randomized controlled trial of 600-750 patients. But the physician had treated tens of thousands of patients in the last ten years. The patient asked: “What happened to the others?” An analogy using cosmology and multiple telescope placement emphasizes how we collect a wider array of data today. How we collect data, and the tools that improve its collection, are giving us information on this aspect of the scientific method. This points to citizen science . . .

Citizen Science

If the scientific community doesn’t lead on capturing ‘better’ data, the technology community and citizen scientists (ordinary people) will. Dr. Goodman draws a line between the tech community, which is interested in marketable products, and the scientific community, whose goal is increasing knowledge. The difference in those missions matters in terms of ethics. With things like automotive safety, patient safety, and environmental protection, leaving data gathering in the hands of a group whose mission is profit could be a conflict of interest. To illustrate the breadth of citizen science, Dr. Goodman suggested we check out how many projects there are at citizenscience.gov/catalog, including, for example, the “Adopt a Pixel” project.

Collaborations Incorporating Data Ethics

In all its collaborations, Dr. Goodman thinks the University of Miami has an opportunity to do something interesting and exciting at the seam of data science, policy, and ethics. He sees the Institute for Data Science and Computing (IDSC) as a kind of ‘universal donor’ fostering things like the evolving graduate degree program (Master of Science in Data Science) and the new IDSC Grants program, which was created to produce innovative data science projects. Everybody is excited about AI now, and universities around the world are creating data science centers, but he doubts that many, if any, had ethics ‘baked in’ at the outset. He’s grateful for the insight that led to the inclusion of ethics in the creation of IDSC. Work in fields like transportation science, environmental science, material science and engineering, behavioral science, and so forth all involves solving problems that are often methodological. IDSC offers ‘ethically optimized’ help to conceptualize problems in innovative ways and to manage “data data” to solve those problems.

Information Management and Research Conduct

Dr. Goodman is concerned, as a matter of policy, ethics, and the history of science, that experiments have changed. Now we do experiments on the data themselves: we don’t analyze the world the way we used to, we analyze other people’s representations of the world. Data and information are not synonymous. On the visualization of data, Dr. Goodman referred to Dr. Alberto Cairo’s latest book, How Charts Lie, and to how the misuse of data visualization can affect people’s confidence in the information.

As the world of data science quickly evolves, the expertise of the University, the Institute for Data Science and Computing, the Institute for Bioethics and Health Policy, and the Ethics Programs will help our colleagues and our communities address societal challenges. As a matter of course, ethics is part of conducting research. In the course of writing the software, we’re worried about version control, annotation, provenance, and so forth. In study design, it may have to do with what reagents you’re using, whether you’re sharing your data, whether you’re providing good lab results, and whether you’ve suppressed anomalous results or not. The core curriculum in responsible conduct of research is about methodological issues. Although it is sometimes framed as compliance, Dr. Goodman believes it’s what good scientists have been doing for a couple thousand years. When we support that, when we link responsible conduct of research and data science, or call it data ethics (the AAMC project linked the responsible conduct of research with reproducibility), that’s a novel idea. The linking of the Institute for Data Science and Computing (IDSC) and the Institute for Bioethics and Health Policy (IBHP) has led to collaboration in ways that are distinctive, positive, and wholesome, and is already producing a ‘return on investment’ that is, ultimately, intellectually compelling and fun.

This lecture is also part of the Institute for Bioethics and Health Policy’s Dialogues in Research Ethics series (#198). This session was followed by Q + A.
