On January 27, 2021, from 4:00 to 5:00 PM, Dr. Yelena Yesha shared her definition of data science and addressed the current state of affairs in the field. She spoke about exciting, innovative projects at CARTA and IDSC that involve data analytics, data science, and blockchain technology, and about her personal journey—how she got to be where she is, and how she picked up the field of data science before it gained prominence.
TALK TITLE: “Data Science: A Paradigm of Actionable Knowledge”
What I want to share with you now are my ideas about data science, because people have different perceptions of data science—that’s why I have a kind of provocative presentation title. I’m talking about the paradigm of actionable knowledge because, in my mind, data science is not about big data and managing big data. It’s actually about extracting actionable knowledge from a vast amount of information.
As we all know, when the Internet became a reality, we had a lot of things happening on the internet every minute and every second, and more and more information coming to fruition there. Then the concept of big data became even more prominent in 2014 and 2015. Not because, all of a sudden, we got more data, but because social media data became a reality. So it’s due to social media, really, that we started to have this overflow of information, this staggering flow of information, and in 2014-2015, big data became almost a field of its own.
Data in Real Time
I was sharing with some of my colleagues the story that, at that time, I was running a different NSF center called the “Center for Hybrid Multi-Core Productivity.” Then the NSF leadership asked me to set up another center on big data, and I refused. And that’s when CARTA was born. Because, I said, it’s not about big data—it’s about actionable data, about actionable knowledge, and about the ability to extract the information from the data. And the idea was to do it in real time, because there are many studies from Deloitte, IBM, and Accenture that tell us that if you don’t work on the data within the first three minutes of its being ingested into the system, it’s basically lost.
Then, instead of having actionable data from which you can get knowledge, you actually create what I call ‘cemetery’ data. Once you have it, store it, and don’t use or process it within the first few minutes of its origination, the value of that data is not very high. But you’re still stuck with the maintenance, storage, and other good things that come with curating the data in the system.
So, since we have so much data, the natural question is: How much of this data is actually relevant? How much of it is knowledge? And how much of it is data that may or may not be used in a century?
In my career, I had the privilege of running and managing the largest unclassified archive in the world, at NASA Goddard (it was the Center of Excellence in Space Data and Information Sciences). Just to give you an idea, 25 years ago we were doing petabytes of data, and now they’re talking about exabytes and zettabytes, so many different things. What I discovered there, as part of my experience—we were serving, at that time, 1,500 astrophysicists around the world, actually serving them the data from NASA—was that 90% of NASA data, the space images, were never used, or used maybe every 20 years. So it’s just 10% of the data that contained the knowledge, the information that astrophysicists used to do their job. This made me start thinking: How do we deal with this amount of information? How do we curate, and how do we clean, the data? And even though we have NASA images in mind in this example—there are medical images and others that we will be talking about today—how do we keep the data in a way that it is there in case we need it? But knowing that we will need it only a very small percentage of the time, we really want to organize our data and mine it in a way that gives us real-time access to the knowledge, or metadata, if you will, that is extracted from this information.
The Human Brain
It always boils down to the way our human brain actually works, which is yet another project that we started at IDSC and CARTA, taking advantage of this novel technology, and one we’ll also be discussing later today.
The human sensory system sends about 11 million bits of information per second, but there is a big disconnect with how much we can actually digest or register. Of these 11 million bits, only 15 or 20 bits per second can actually be digested. It just gives you the idea that we have a limited capacity to absorb and process information with our brain. We have some survival instincts, certain things that nature gave us, but, again, we cannot really process the amount of information that is coming our way.
Obviously, everybody’s heard the term “information overload,” so I’m not going to preach about that. I just want, basically, to get across the idea that the ultimate goal of big data technology is developing filters that will aid in critical decision making. Meaning that we need the 15-20 bits that we can process to be the right type of 15-20 bits, the ones that actually enable us to make the proper decision.
So, this is not just a random kind of selection, or even some sort of educated guessing. We want to work with our probabilities in a way that we are certain, within certain confidence intervals, that the data or information we are extracting from this vast amount of data is actionable, and will assist us in the right decision-making process.
And this is especially important when we deal with industrial Internet of Things decisions, and when we deal with medical devices. As you all know, digital therapeutics have now become a reality, and we have had lots of discussions with Nick and our medical colleagues—how do we make it a reality, and how do we actually assist the doctors? We’re not replacing doctors with software in decision-making processes. These are basically life-dependent procedures that would rely on artificial intelligence, on things coming from the data science domain, and we have to assure that our patients don’t die, that the decisions are adequate, and that they preserve privacy and the other things we would like to have in this type of environment.
AI and Machine Learning Applications
What is happening in typical AI/machine learning applications? In general, we process the data, and the amount of data, for the most part, is big, but it can be small as well; sometimes there are big discussions about how much data you really need to make a decision. While the vast majority of problems have an overload of information, in certain cases we don’t have statistically meaningful sample spaces large enough to actually run our machine learning algorithms, to do the learning, to do certain things. Or, perhaps, we don’t want to process such a big amount of data because we may not have the compute power. Or maybe this compute power, if we have it, will take longer than we want in cases where we need a real-time decision.
Medical Data Mining
Initially, when we collect the data—whether small, or big, or medium size—what happens? We do data pre-processing. Sometimes I call it data janitorial services, right? We collect the data, we do data integration that normalizes the data, we do certain feature selections, we do some dimension reduction, and then we basically stage the data for data mining, or what I usually call pattern discovery or processing. We are looking for certain patterns. Sometimes we’re looking for specific things: if we have a hypothesis, we know what we’re looking for. Then we are finding, if you will, the right hammer to kill a fly, or to kill an elephant, depending on what kind of problem we are trying to solve. Or, perhaps, we are looking for a needle in a haystack. This usually happens in cybersecurity problems, or, let’s say, when we’re looking at drug re-purposing, or looking for certain solutions when we are not sure what the conjecture is. We are looking for certain correlations. This is where data mining tools, or semi-supervised learning tools, play a significant role. We are able to solve specific problems if we know what we are looking for. Or we start to look at correlations and start to learn more, and by learning more and getting more data, perhaps solve certain problems that were not even posed.
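As a sketch of this pre-processing and pattern-discovery pipeline, here is what the stages might look like in code. The dataset (the classic iris measurements) and every parameter choice below are illustrative assumptions, not specifics from the talk:

```python
# A sketch of the pre-processing -> pattern-discovery pipeline described above.
# The dataset and all parameter choices are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler               # normalization
from sklearn.feature_selection import SelectKBest, f_classif   # feature selection
from sklearn.decomposition import PCA                          # dimension reduction
from sklearn.cluster import KMeans                             # pattern discovery

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("normalize", StandardScaler()),
    ("select", SelectKBest(f_classif, k=3)),
    ("reduce", PCA(n_components=2)),
    ("discover", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# Stage the data through each step, then return cluster labels: the
# "patterns" found without ever showing the algorithm the species labels.
clusters = pipeline.fit_predict(X, y)
print(clusters[:10])
```

Swapping the final clustering step for a supervised classifier turns the same staging pipeline into hypothesis-driven prediction rather than open-ended pattern discovery.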
The Role of Data Scientists
And then, of course, once we have all these exercises done, hopefully successfully, we have post-processing. That leads to what is known as patterns, information, and knowledge. We do an evaluation to understand what we found. We may want to select certain patterns that we would like to explore more. Or, perhaps, we want to visualize our results. And this is where there is a complete cycle between the different types of expertise that are needed in order to provide solutions to data science problems. Because a data scientist is not a unique individual who can work alone on a problem of national significance and be able to address everything we need. It’s very much an interdisciplinary field that requires a number of areas of expertise. But what is important is that a good data scientist will be able to understand the problem, parse it, and bring specific expertise to the table, so we’ll be able to solve the problem in a timely fashion and get adequate results.
We do a lot of work in the medical domain. In CARTA, the Center that I am heading, we have five subsites: One of them is the newly created site at the University of Miami, but we also have Tel Aviv University, Rutgers, North Carolina State, and the University of Maryland. We do a lot of work in the medical domain, and with medical imaging.
When you analyze the types of tools primarily involved so far in the medical field, we are all, pretty much, using machine learning as the main tool, if you will, to assess and solve the problem. We are using some of the statistical tools. Then, a lot of work is done in pre-processing. Cleaning the data is, I would say, 60 or 70% of the effort, and if you look at commercial companies, there are even companies that make significant revenues simply out of cleaning the data. What, again, I call data janitorial services. They’re not very exciting services, but they’re very unique, very labor-intensive, and require certain unique expertise, and they are needed, because if you don’t clean the data properly and you make mistakes along the way, you really have a problem in the actual analysis of the results that you’ll be getting.
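To make the “data janitorial services” concrete, here is a minimal sketch of the kinds of cleaning steps involved. The toy patient records and the age threshold are invented for illustration:

```python
# A minimal sketch of "data janitorial services": de-duplication,
# normalization of text fields, and handling missing/impossible values.
# The toy records below are invented for illustration.
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "patient_id": [101, 102, 102, 103, 104],
    "age":        [34, np.nan, np.nan, 290, 51],          # missing and impossible values
    "diagnosis":  ["flu", "Flu ", "Flu ", "covid", None], # inconsistent labels
})

clean = (
    raw.drop_duplicates(subset="patient_id")                         # de-duplicate records
       .assign(diagnosis=lambda d: d["diagnosis"].str.strip().str.lower())  # normalize labels
       .assign(age=lambda d: d["age"].where(d["age"].between(0, 120)))      # null out impossible ages
)
clean["age"] = clean["age"].fillna(clean["age"].median())            # impute missing ages
print(clean)
```

Each step is mundane on its own, but skipping any of them quietly corrupts everything downstream, which is exactly why this unglamorous work dominates the effort.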
If you deal with imagery, you’ll do a lot of classification and clustering, and then, of course, you’ll need to do post-processing for presentation. The data science field is not really new. I did my PhD in distributed databases—it will date me, but I’ll share with you—about 30 years ago. That was the beginning of distributed databases. Now people are saying it’s the demise of databases and we are more into data lakes, but it’s still very distributed. And at that time, we already had the beginnings of doing some sort of analysis in a distributed fashion, which now has fancier names: federated learning and ensemble learning, and many other cool things are happening.
Now we are entering—and I would say solidifying—the role of data science: data science version 2.0. As data science evolved into a mature discipline around 2016-2017, we entered the next generation of data science, where certain specific tools and problems are solved, and people are looking at a kind of synthesis of actionable knowledge from raw data through the complete data-life-cycle process. So, again, we ingest raw data, and nowadays you have companies like Seagate, and we have a number of patents with them through our Center (CARTA), where we do processing of the data as we actually ingest it into the machines. It’s a very exciting situation. The idea is: how do you grab the raw data and process it very fast, so you can extract the knowledge, extract the actionable data, and go from there?
When people talk about data science, depending on who you talk to, people in the department of mathematics—especially applied mathematics—always tell me: Yelena, it’s not about computer science, it’s about our statisticians, right? Or our mathematicians. And the computer scientists say it’s our field, right? Then you have data mining people, who sometimes work in business schools, and they say, no, it’s all about business applications, and we do that. Then, of course, you have people who do creative computing and visualization, and they say, no, it’s all about us. But then you have the infrastructure people, who have been there all along, supporting all of this, right? And they say, no, it’s all about software, and network engineering, and subject-matter expertise.
So what is it all really about? In my opinion it’s nicely depicted in this Venn diagram.
It’s all of the above.
The Team Approach
Depending on the problem you’re trying to solve, you really need this inter/multidisciplinary approach, and you need the right team. That’s why our Institute (UM IDSC) is already so successful, and is poised to be one of the national leaders in this field: because we are able to bring team expertise from different domains to work together on large problems that have human significance, human values if you will, from the medical field, to the data field, to Homeland Security, and things of that nature. It’s not just about a specific field; it’s really about the team. The problems that require data science solutions, in my mind, can very rarely be solved by a single individual or tool. It does require a team. It doesn’t have to be a hundred-person team, but it needs to be a very cohesive team of at least 15 experts that really work together and is able to solve specific application problems.
The American Workforce of Tomorrow
If you take a more academic approach and ask: What do we need? What do we teach? Here I am speaking not so much as an innovator and a person who likes to build products, but as somebody who would educate the American workforce of tomorrow.
So, what kinds of skills would we like to instill in our next generation of students, who would come with either a major in data science, a minor in data science, or some specialization in data science? I personally believe that the approach we should take is that everyone who graduates from university should have a course in data science.
These courses should be taught in a different fashion. If you’re a history major, or you are an art major, you should be taking a data science class. We are living in the digital age, and we need to train two types of individuals. Individuals who are “I”-shaped people, who are deep; these are the computer science and computer engineering types of degrees in data science. And people who are more “T”-shaped, who have other specializations like biology, medicine, and things of that nature, but who have enough depth and understanding of how to use, at least, the data science tools, so they can really solve domain-specific problems in their field.
In order to build and train the right type of individuals, what do we need? We need to teach them data analytics and visualization. We need them to be able to clean the data, which usually comes more from a database background, from database classes and things of that sort. We need them to be able to quantify, digest, and understand what it means to have actionable data. Then we need knowledge mining, meaning transforming knowledge into conclusions, into certain specific results, if you will.
Once we extract the knowledge from the data . . . Believe it or not, it was 27 years ago, almost to the day, that I started the ACM Conference on Information and Knowledge Management (CIKM). I was a young assistant professor, and people were saying, “You’re crazy, Yelena, starting a new conference.” Especially at that time—databases were still a very solid field—and I said, “No, it’s changing.” That is when I started it, and now it has become the premiere conference in the world. And I must confess that sometimes I get my papers rejected from this Conference.
This just gives you an idea that the whole concept of knowledge extraction from information is not new, but it took a long time, really, to go through the iterations, and for us, as scientists, to be able to digest and pin down, in the more applied fields, what exactly we mean by extracting actionable knowledge from a vast amount of information.
Data Science | Infrastructure
Now, infrastructure is important; as I said, this is part of what we’re doing. We are fortunate at IDSC to have a very solid computing piece. That’s also why I called my NSF Center CARTA: the “accelerated” there stands for high-performance computing, quantum computing, neuromorphic computing. All this infrastructure is a must. If we don’t have the proper infrastructure, if we don’t have the layers built on the proper infrastructure, and if we don’t architect our systems in the fashion they need to be architected—or re-architected, if you will—to solve specific problems, then, of course, we are out of luck in terms of finding the proper type of solution.
Data Science | Storage
We also need to work on long-term archival storage. When I worked at NASA, we worked on holographic storage. We even created holographic storage and sent it to space. Because, they said, the storage industry is changing so fast: if we collect the information in space and we don’t have stable storage there, how do we copy it, in almost real time, from one place to another, so that we’ll be able to keep the information alive over a significant part of our lifespan, or at least the life cycle of the need for this data?
Data Science | Data Curation
Data curation is another big area for data science. Once we’ve collected the raw data and processed it, we feel comfortable. But how do we curate it? How do we store it? How do we make sure that we can reuse it and repurpose it? How do we legitimize the data, and how do we keep secure access to the data, or to the metadata that was already created from the existing data?
Data Science Paradigm
In terms of a definition, I promised to talk a little bit about the paradigm. Again, this is just my personal view, one that I basically came to over years and years of working with large amounts of data.
So what is the data science paradigm? In my opinion, it’s the extraction of actionable knowledge collected from the data through a process of discovery, hypothesis, and hypothesis testing. As you see from this definition, in order to go through this life cycle of dealing with and massaging the data, you really need specific expertise that doesn’t come from the single field of computer science. You obviously need statistics, and you need many other interesting types of expertise in order to be able to take care of this type of situation.
And the data life cycle is the set of processes that transform raw data into actionable knowledge. We are after actionable knowledge. So what do we want? We want to extract what we need from the data, and get an action out of it. An action can be a publication. It can be certain pictures or visualizations. It can be a signal—an emergency signal, an alert if it’s a medical system. Or it can be an input to another device, which will be driven by the input that comes from the previous device, if you really have an internet of things where devices are interrelated in terms of passing information and knowledge around.
Analytics, in this type of setting, can be defined as the synthesis of knowledge from information. And with the new big data paradigm, analytics are no longer separable from the data model and the distribution of the data across horizontally scaled resources. Scalability is a big issue; that’s why we are fortunate to have Triton here. We are fortunate to understand how to deal with large amounts of information and how to scale the computation to address different types of problems, and this is extremely important. But for us, the people who also provide the infrastructure, we need to work with application people; we need to grasp their models and exactly what they’re trying to achieve while solving their problems. That will give our computational scientists the ability to solve these problems with high speed and high fidelity.
A Career as a Data Scientist
So, what is needed in the data science field? Data science is probably one of the best, if not the best, professions nowadays. If you Google Money Magazine’s “What is the best profession?”, data science will come in at either number one or number two. However, if you try to dissect the job offerings… Almost every day I have at least five or six calls from people looking for top students, people trying to hire. Even now, in the Covid era, there’s this tremendous demand for people in data science. So, who are these Data Scientists, and what are the specifications?
In general, in my opinion, skills and expertise matter more than titles. However, in our kind of structured society, every advertisement comes with a job title. Very few of them actually say “Data Scientist.” They’ll say “Database Administrator.” They’ll say “Data Engineer.” They’ll say “Data Analyst,” “Data Architect,” “Business Analyst,” right? “Data Analytics Manager.” In reality, it doesn’t really matter, because if you have a well-trained person, he or she can fill most of these positions.
However, if you are more business oriented, and looking more at the ability to optimize the enterprise to do marketing, for example… I remember we had a great talk by Manos Spanos from Danone yogurt. Marketing is 20% of the cost of every product. So, if you really process the data right, and you have the right business model and the right business analyst, then, of course, the sky’s the limit. Then you have people who are more Data Engineers, who look more at putting the systems together. You have Data Architects, who look at the back end. You have Data Analysts, who are more up front, working with the domain specialists. And then you have Data Managers, who actually manage the group of people.
Now, the interesting thing is that, nowadays, if you look at the skill set and how we train our next workforce generation—for better or for worse, there are opportunities for different types of training. I can say, almost with certainty, that we can take a high school graduate and, in six months, train them so they’ll be able to enter the data science world, if you will, and find themselves a place. They’ll be intelligent users and maybe entry-level people, and the spectrum runs all the way up to people who are very senior, have PhDs and maybe zillions of publications, and work in data science. So there is a whole spectrum of expertise that is needed now in the market.
What, really, is more marketable nowadays? From what I see, it’s, again, people somewhere at the mid-level of their career, people who combine not just technical expertise but a certain advantage: they understand certain fields, like biology, business, or medicine. They don’t have to have formal training, but they at least understand the terminology. They understand the science that goes into scientific applications. They understand business, business modeling, and marketing if they go into the business world. They understand finance if they go into the banking industry. This is where I think we really need to concentrate. This is probably where the skills of somebody with a master’s degree (like the Master of Science in Data Science that Professor Ogihara and Dr. Mannino are developing) would play into the picture. There is a tremendous need, in my opinion, for people with advanced degrees, but not necessarily PhDs. They’re not necessarily researchers; they are people who are beyond a basic Bachelor’s degree, but would not necessarily like to do innovation, or do classical academic work in data science.
Why Study Data Science?
For those of you on the call today who are inspired by data science and would like to wander into those fields, and either do a formal degree, some vocational training, or a certification: why study data science?
Different people have different inspirations, or different motivations, for why they go and study data science. Some of them like to identify certain patterns. They’re fascinated by certain problems. I’ll be sharing with you later in today’s presentation my journey—how I came to data science. It was inspired more by the medical field, actually, than anything else. But some people like to look at biological patterns, chemical patterns, or anything else. Some people like to quantify what they know, and go more to the statistical side.
What motivates an individual that goes into the data science field?
- Some really like the very high salaries that this profession comes with in terms of compensation.
- There is almost, I would say, 150% employment, if such a thing exists. Usually, nowadays, there are four to five jobs per graduate in data science. So it’s really a tremendous opportunity for somebody young who has the talent and ambition to pursue a career in this field.
- The promotions are relatively fast, because the field is new.
- Some people like to help individuals. You can work with socioeconomic models. You can try to work on the inequality that exists in society. I have a lot of data scientists who are working for the Labor Department, working with the USA trying to help underrepresented communities, countries, and things of that nature.
- Some like to gain novel data, or access to novel data. These are people who work for NASA or other exploratory agencies and have expertise that, basically, puts them in situations where they work with enormous amounts of data that is novel in nature; they get access to it through their employment.
If you look at value versus difficulty, there is a kind of curve, and I like to do curves, as you can imagine. I’m a mathematician by training, and a computer scientist as well. Primarily a computer scientist, but I do have dual first degrees, so I like to see graphs and I like numbers.
- First of all, you have Descriptive Analytics. It, basically, tells you: What happened?
- Then, you have Diagnostic Analytics. Why did it happen?
- Then, you have Predictive Analytics, which anticipates: What will happen?
- And then, you have Prescriptive Analytics: How can we make it happen? So, in a sense, once we learn what’s going on, what can we change, and how can we optimize the situation?
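These four levels can be illustrated on a toy example; the monthly sales numbers and the “promotion” variable below are invented for illustration:

```python
# Toy illustration of the four analytics levels on invented monthly sales data.
import numpy as np

months = np.arange(1, 13)
noise = np.array([0, 2, -1, 3, 0, -2, 1, 0, 2, -1, 0, 1])
sales = 100 + 5 * months + noise          # invented: upward trend plus noise
promo = (months % 3 == 0)                 # invented: a promotion every third month

# Descriptive: what happened?
print("mean monthly sales:", sales.mean())

# Diagnostic: why did it happen? (compare promo vs non-promo months)
lift = sales[promo].mean() - sales[~promo].mean()
print("promo lift:", lift)

# Predictive: what will happen? (fit a trend line, extrapolate to month 13)
slope, intercept = np.polyfit(months, sales, 1)
print("forecast for month 13:", slope * 13 + intercept)

# Prescriptive: how can we make it happen? (pick the action with the better outcome)
best = "run promo" if lift > 0 else "skip promo"
print("recommended action:", best)
```

Each level reuses the output of the one below it, which is why the difficulty, and the required skill set, grows as you climb.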
The higher the level, the greater, of course, the difficulty. This usually grows in proportion to the skill set that individuals gain while training for this type of profession.
Large Scale Use Cases
When I started my lecture today, I mentioned large scale use cases—sometimes I like to call them problems of national significance, or problems of extreme interest. They usually come with access to big data sets. Not always, but most likely, they will require access to big data sets, and also the processing of these big data sets. Examples of such activities would be:
- Your traditional Google search engine, which is huge in terms of the data it processes and crunches every day.
- Recommendations from Amazon and Netflix.
- Financial institutions, which have tremendous amounts of data requiring a tremendous number of resources, specifically as applied to fraud detection.
- The logistics industry, like DHL or UPS fleet management, is another interesting large-scale problem nowadays. I’ll talk about it again later when I talk about specific projects.
- I’ll talk about the supply chain of PPE as it relates to the current pandemic crisis. How do you optimize the supply chain using this novel data science technology? It should enable us not just to secure our supply chain, the vaccines, or PPE, but also to make sure the logistics are properly done, so we actually get the delivery of vaccines to the different states, in the right amounts, with the right refrigeration, and with all the constraints that the medical community—and, at the same time, the OMB (Office of Management and Budget)—imposes. We are solving a number of optimization problems here: first the budget, then the availability of the vaccine, the acquisition of the vaccine, and then, of course, the distribution. All these cycles need to be interconnected, and we need to secure and optimize the supply chain so that, finally, the vaccine reaches the hands of individuals on time, before the crisis takes an even bigger toll on the lives of our people.
- The other big area is healthcare records for patients. This is a huge undertaking with a huge amount of data. But it’s not even so much huge in terms of the data as it is huge in terms of protecting it, being able to interoperate it, and making sense of it, in terms of trying to build a decision-support system based on the information we have.
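The supply-chain optimization mentioned above can be posed as a small linear program. The depots, costs, capacities, and demands below are invented purely for illustration:

```python
# Minimal vaccine-distribution sketch as a transportation linear program.
# Two depots ship to three states; all numbers are invented.
from scipy.optimize import linprog

# Shipping cost per thousand doses; variables are
# [A->1, A->2, A->3, B->1, B->2, B->3].
cost = [4, 6, 9,    # depot A to states 1..3
        5, 3, 7]    # depot B to states 1..3

# Depot supply limits: A ships at most 80, B at most 70.
A_ub = [[1, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1]]
b_ub = [80, 70]

# State demands that must be met exactly: 40, 50, and 30.
A_eq = [[1, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 1]]
b_eq = [40, 50, 30]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("optimal cost:", res.fun)
print("shipments:", res.x.round(1))
```

Real deployments layer refrigeration, budget, and timing constraints on top of this same skeleton, but the structure — minimize cost subject to supply and demand — is identical.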
Reproducibility and Portability
Since we’re dealing with data—and most of the data, the big data, is used to produce certain scientific discoveries, or results, or scientific products, artifacts, whatever you want to call them—we need to build an infrastructure that enables us to have reproducibility. Those of you who work in the exact sciences, or have a connection to the exact sciences, know that it is very important to be able to reproduce results. And also to have a certain degree of portability, meaning that if we have algorithms, if we have certain solutions to specific scientific or analytics problems, we should be able to reproduce them, or port them to a different infrastructure, to different operating systems, to different computers. As hardware and infrastructure evolve and become more exciting and more mature, we don’t necessarily want to rewrite the algorithms that we’ve developed on, let’s say, Triton. I’m sure Nick and Joel will be looking to upgrade Triton in the next few years to an even more exciting machine, but then everything we’ve developed on Triton shouldn’t go to waste. We should not need to reprogram everything. All these things need to be designed so that we are able to adapt our solutions to the different types of computational and storage infrastructure, which is, basically, invaluable in terms of supporting different types of applications.
Now, before I come to the specific projects that are going on now at IDSC, I want to give you yet another interesting example. In the Center (CARTA) that I’m running, we have a lot of work in:
- Computer vision for self-driving/driverless cars, and also computer vision for medical domains.
- We are doing a lot of work in classification and label prediction. We are working on different models that are built on training examples, or training sets, where we use semi-supervised or supervised learning tools. We are looking at analytics that predict unknown class labels. Again, in terms of technology, there is a lot of work in deep learning, neural networks, rule-based classification, pattern-based classification, and logistic regression. These are the tools deployed by different teams at CARTA that are very effective in the medical imaging domain, but also in driverless cars and many other places where computer vision is now taking the leading role in solving large scale problems.
- We are working a lot with the Centers for Medicare & Medicaid Services (CMS) and Northrop Grumman on fraud detection. Northrop Grumman is the lead agency that works with CMS on Medicaid and Medicare fraud. Now again, they’re looking at ICD-10 kind of data; they’re not looking at individual-doctor kind of fraud. They’re looking at organized networks, and we’re helping them do that. Then we’re looking at classifying diseases, web pages, direct marketing.
- I have personally done a tremendous amount of work in electronic commerce for government agencies, and also for private industry.
- We’ve also done a lot of work with location-based commerce, and many other things as they relate to electronic commerce.
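As one concrete illustration of the supervised classification and logistic regression work mentioned above, here is a minimal from-scratch sketch. The two-cluster dataset is synthetic and the training loop is a generic textbook version, not CARTA’s actual pipeline.

```python
# A minimal sketch of supervised classification: logistic regression
# trained by per-sample gradient descent. The toy dataset is invented
# purely for illustration.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, labels, lr=0.5, epochs=1000):
    """Fit weights w and bias b so sigmoid(w.x + b) approximates the labels."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss for this sample
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Predict the unknown class label of a new point."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Toy training set: two well-separated clusters in 2-D, labeled 0 and 1.
random.seed(0)
points = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(20)] + \
         [(random.gauss(3, 0.5), random.gauss(3, 0.5)) for _ in range(20)]
labels = [0] * 20 + [1] * 20

w, b = train_logistic(points, labels)
print(predict(w, b, (0.1, 0.2)))  # near the first cluster -> 0
print(predict(w, b, (2.9, 3.1)))  # near the second cluster -> 1
```

The same train-then-predict-unknown-labels pattern scales up to the deep learning and neural network models mentioned above; only the model family changes.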
Now we are moving more and more to remote work. This is the next thing. Twenty-five years ago, I wandered into electronic commerce and they said “Yelena, what?” And I said “Yes,” but it took 25 years. Now I’m saying it’s remote work and blockchains. I believe blockchains and remote work are the next things that are really going to boom in the next five years. In fact, I predict that at least 10% of the global economy will be powered by blockchain solutions. And it’s not the same blockchain that you usually think about when you think about Bitcoin. It’s a similar technology to the one used to produce cryptocurrency, but the type of blockchains we are working on are permission-based.
We have a number of extremely exciting results in this area that have been published. I just had a paper accepted two months ago at a top cybersecurity conference. And just now, before I came to this lecture, I was presenting with my colleague at a cybersecurity conference of about 1,500 people at the National Institute of Standards and Technology (NIST) on Zero Trust, where we actually have a cooperative research agreement with the Department of Homeland Security and NIST. We are optimizing the purchase of software licenses for the government. It looks like the government is wasting hundreds of millions of dollars, because it is over-purchasing software licenses, and by using blockchain technology you can automate the process and actually minimize the waste. Obviously, this is not done on purpose; but because the technology is not fully implemented, different government agencies (and there are 15 of them) are buying the same licenses and simply not using them properly, or to the full extent of what they’re paying for.
We are also looking at trends in time series, deviation analysis, sequential pattern mining, periodicity analysis, and you name it. This comes more from the fields of industrial engineering and statistics, but there are a number of techniques that are extremely useful here.
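The deviation analysis mentioned here can be illustrated with a small sketch: flag points in a time series that stray from a trailing baseline by more than a few standard deviations. The window size, threshold, and series below are illustrative choices, not the methods used at CARTA.

```python
# A minimal sketch of time-series deviation analysis: flag points that
# deviate from a trailing rolling window by more than k standard deviations.
# The series below is synthetic, for illustration only.
import statistics

def deviations(series, window=10, k=3.0):
    """Return indices whose value is > k std-devs from the trailing window."""
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mu = statistics.fmean(recent)
        sigma = statistics.stdev(recent)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# A gently oscillating series with one injected spike at index 25.
series = [10.0 + 0.1 * (i % 5) for i in range(50)]
series[25] = 25.0
print(deviations(series))  # -> [25]
```

Periodicity analysis and sequential pattern mining use more sophisticated machinery, but this deviation check is the simplest member of the same family.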
And now I come to probably one of the most exciting parts of this lecture: I want to talk about what we are doing at IDSC. I’m relatively new to IDSC, as you know. I joined about a year ago, but it feels like I’ve been here a decade already, because I’ve gotten to know so many exciting and phenomenal people, and we have started a number of very, very interesting projects here.
- We’re doing a lot of work in medical imaging, where we are using machine learning, statistics, and computer vision, and are working with subject matter experts. We have a visiting fellow at IDSC, Prof. Michael Morris, MD (Mercy Medical Center, Maryland), who also has an appointment in Radiology. We are working with radiology departments in Maryland, starting to work with Miami, and also working with the University of California San Francisco School of Radiology. We built what we call an “artificial intelligence virtual reality assistant,” which is basically an AI-based program that is able to diagnose lung cancer images. Now, for those who don’t know, and some of you may not even want to know, 32% of all radiology reports are read incorrectly. This is a known statistic; I didn’t make it up. It has been reported in the literature. What we are doing is not trying to replace radiologists; we are trying to build AI-based tools that can assist radiologists in making the proper diagnosis. Now, we are working with the FDA to see if these tools can be adopted as part of digital solutions for this type of environment.
- We are also working with the RAD-AID lab, a nonprofit institute that is trying to bring radiology to underserved countries, because the machines themselves are not expensive; what is expensive is the time of the radiologists. So, if we get some sort of green light on what we have, we can share it at no cost as, basically, our philanthropy and charity with countries that don’t have access to this technology to diagnose lung cancer, breast cancer, prostate cancer, and you name it.
- We are also working on quality assessment of clinical CT exams, trying to understand if they’re really up to par, and how to eliminate errors in these types of activities. Building on our roughly decade-long work on AI for radiology, we were fortunate to secure a National Science Foundation (NSF) award called RAPID, which came when the pandemic hit in February-March. NSF contacted me, because they usually contact all the national centers, and said “Yelena, we need some new technology. We need help to fight the pandemic.” At that time, if you remember, we had big difficulties getting correct tests. We didn’t have a sufficient quantity of Covid tests, and some of them were false negatives or false positives, and we couldn’t really calibrate our testing system properly. So, very fast, we came up with the concept of using CT scans, and computer programs that, using the CT scans, can diagnose Corona 100% of the time. Initially, what I heard from emergency medicine doctors and radiologists was that many people got misdiagnosed. People with viral pneumonia were not treated properly because it was thought that they had Corona, or they got a false positive test for Corona. And some of them, unfortunately, died because they didn’t get the right treatment, since the treatment for Corona is different from the treatment for viral pneumonia. So, anyway, I’m happy to report that we got 1,500 images from, of all places, you wouldn’t believe it, Iran—1,500 Covid patient images to train our algorithms. Our paper was just presented at the top computer conference, the IEEE Big Data Conference, in December, of course over Zoom. This is the leading conference in data science and big data, where we were able to share the results, and we continue to refine them and move to the next level.
- We are working with a number of different agencies—it started with IBM, DHS, and NSF—on social media sentiment analysis. We are using certain learning tools for sentiment analysis. More recently (I’m working closely with Nick), we will be submitting a proposal to the FDA. They want us to use social media analysis to detect signals of adverse effects of implantable devices. Because it turns out—and we did these studies for pharmacovigilance for actual medicines—that in the medical device market, the final clinical trial phases happen when the device is already implanted in the human body, and the device manufacturers need to report what happens to the device until the actual approval comes into the picture. However, it’s not always possible for the FDA to see what is happening. And even the device manufacturer does not always know about certain adverse effects that can happen with a specific device. But patients are sharing their information online. They’re sharing it in their portals, or they’re sharing it with other members of the community. So, we’ll be developing crawlers that will look for signals of adverse effects of medical devices.
- We have a tremendous amount of work and research in blockchains for securing medical devices. We’ve developed a novel system for the FDA MD EpiNet (Medical Device Epidemiology Network) that enables patient consent.
- We are also using blockchain for supply distribution.
- The other interesting area where blockchains are used, and where we’re doing some research, is curtailing fake news. We are now waiting for funding from the Institute for Speech Freedom in Iran, which, of all places, is funded by the State Department, where we will be using blockchain and AI algorithms to see what kinds of misinformation people in Iran are getting from their government or other sources.
- We are doing a lot of work with NOAA and NASA. There, we are trying TensorFlow on quantum machines. We have access to three different quantum machines, so we are running TensorFlow algorithms and applying them to some NASA programs. With NOAA, we are working on their operational models for weather prediction and trying to optimize them on their supercomputer, so this is a very exciting project.
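The adverse-event signal detection described above—crawlers scanning patient posts—can be sketched minimally. The keyword list, posts, and device name here are invented examples; real pharmacovigilance-style analysis would use trained language models and curated vocabularies rather than simple keyword matching.

```python
# A toy sketch of mining patient posts for adverse-event signals tied to
# a medical device. All posts, device names, and keywords are invented.
ADVERSE_TERMS = {"pain", "malfunction", "infection", "recall", "failure"}

def adverse_signal(posts, device):
    """Count posts that mention the device alongside an adverse-event term."""
    hits = 0
    for post in posts:
        words = set(post.lower().split())
        if device in words and words & ADVERSE_TERMS:
            hits += 1
    return hits

posts = [
    "my pacemaker works great after two years",
    "pacemaker malfunction sent me back to the hospital",
    "anyone else have pain near their pacemaker site",
]
print(adverse_signal(posts, "pacemaker"))  # -> 2
```

A rising count of such co-mentions over time is the kind of signal a regulator could investigate before formal adverse-event reports arrive.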
Finally, I come to the last part of my lecture, which is my personal journey. It would probably take a long time to describe how I came to this field. Computer science was by no means my first choice. So, for those of you who really want to know more about me: Forbes magazine, about two years ago, maybe a little less, wrote two articles (Part I and Part II) about my life and how I got to where I am now, so they really describe my personal journey.
In a nutshell, I was born in the former Soviet Union, in what is now Ukraine. From what I learned from my parents, between the ages of one and a half and two I discovered the concept of even and odd numbers. When I was two and a half, the concept of zero really excited me, and very early in the game I knew I’d do something with numbers, but I didn’t have a clue what it would be.
And then, by the time I went to high school, I was very much fascinated with physics and wanted to do, probably, a PhD in physics. I was second in Ukraine in the high school physics Olympiad, which is big because, if you can imagine, it’s a country of about 60 million. However, they didn’t send me to the national competitions because I was a girl, and they said, it’s physics, maybe you’ll have to assemble some circuits, and you know. So, even though it was the Communist Party era at that time, they still didn’t quite believe that a woman could do physics. And then, at that time, something came into my head, and I learned about computers, although I didn’t know much, and I said: I want to do computer-based diagnosis. This is what I’m going to do in life.
But then, when I was 17, we became refugees. We left the former Soviet Union, which was lucky for us, and we ended up in Italy. It was me, my parents who didn’t speak any English, my two grandmothers, and $500. We became stateless. I had a USSR exit visa, not valid for return, with citizenship “stateless,” and we spent eight months in Italy waiting for the visa.
When we came to Canada, I could actually read and write English at that time. I was self-taught and had taken lessons. But when I watched TV, I realized that what I had studied back in Ukraine was, I guess, British English; I couldn’t understand very much, but I could read and write. I went to the University of Toronto on foot while we were still staying in this refugee hotel, which was a filthy place, but it doesn’t matter, because they were nice and they took us in. And the University of Toronto people were very nice. They looked at my record and said: “You’re amazing, but, you know, what you need for computer-based diagnosis is a PhD in engineering, and an MD. And, by the way, young lady, in North America, education costs money.” And then I looked at my mother and I realized that I had a problem.
Then I was walking back to the hotel from the University of Toronto—because 50 cents for the bus was too much—talking to my mother. I was very depressed and didn’t know what to do. And somebody overheard us speaking Russian, and he said: “I’m sorry, young lady, but it sounds like you’re very ambitious. Let me give you some advice: study the shallow science, computer science, and you’ll always have a job.” And I said “Okay.” And he said: “Later, you can do whatever you want.”
So, I went to the University. I talked my way in. I had missed all the deadlines, so I took local tests and got myself in, and I took a computer science class, and I didn’t like it. I dropped my first computer science class. And I took physics and mathematics, and I loved it, and I didn’t know what to do. Then, I started to do pre-med. And then I took a computer science class again, and then, of course, I got the open scholarship. My parents still didn’t have jobs, so I convinced the University that I could use my room-and-board money and live at home, so I could pay rent for my parents. And so, in two and a half years I finished a degree in computer science and mathematics. And then, the rest is history.
It wasn’t until 2003, when I was already a full professor at the age of 32 and had finished running, as director, a large NASA scientific center, that the Chair of Surgery at the time walked into my office, uninvited. I didn’t know who he was. And he said: “Yelena, I need somebody like you. I’m building the operating room of the future in the Medical School, and I heard about you. You have to come and help me.” This was the beginning of realizing my dream. I got an appointment in the Medical School and I started to do the things I had been dreaming of. So, it was not a direct journey, although I was doing data science all along—without knowing the name data science; we called it distributed databases. I walked into situations where I started to work on medical applications and many other applications.
So anyway, that was my journey and, you know, it was very exciting. I’ve done a lot of interesting things in my life, and I believe I will do even more exciting things and make a quantum jump to the next level at IDSC, with what we have here and what we’re inspired to build. With that, I’ll stop and entertain a few questions, if you have any.
Questions + Answers
Ken Goodman: Thank you very much. One of the things—especially your sharing of your journey—makes it clear to us that one of IDSC’s missions is to provide more opportunities so that other women do not face the hurdles you did. For the number of women who are going to be able to shape the future of data science, we need to make sure that their path is different, and perhaps easier, than yours. We have just a few minutes for questions. I don’t see one yet, but you’ll forgive me for asking a sort of meta-scientific question, because you made a number of observations that I think are really important for scientific process and, for that matter, epistemology. If you’re talking about knowledge, you’re talking about how we use data to learn, and to know. But you also made reference to reproducibility, which is a great challenge right now in many of the sciences. You’ve emphasized the biomedical sciences, but, in fact, it’s a problem everywhere. Say just a little bit, if you would please, about data science and reproducibility.
Yelena Yesha: I had two PhD students who did their doctorates on scientific reproducibility. One of them is now a Distinguished Information Scientist at NASA. So, scientific reproducibility is a problem in any field. In fact, the statistic is horrible: thirty percent of all scientific results reported in the research literature cannot be reproduced. I’m not claiming that our colleagues in different fields are trying to produce fraudulent results—maybe a few are, but most of them are not—the reason we can’t reproduce them comes from different types of problems. Maybe, as I said, the algorithms can’t be ported to different machines, or the data is not available, and things of that sort.
One way to capture scientific reproducibility is by using blockchains. This is a technology that I’m very fond of, and that I believe is the future, or part of the future, of the evolution of data science. It would enable us to democratize the data on one hand—because this is what we want; we want to share the data as much as possible—but we would also have the knowledge and ability to capture every step of the way using distributed ledgers, and reproduce the data, or reverse engineer the steps showing how the data was produced, if you will. This would enable us to have 100% reproducibility in the field. Without reproducibility in the scientific community, I don’t think we can really advance science. Some results are fundamental and new in nature, and there are very few of them, but many results are actually incremental, as we all know, so we build on our colleagues’ work. And if you cannot reproduce their work, what are the chances we can really advance the state of the science? It’s a huge impediment, and I believe we have a scientific solution for it.
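The distributed-ledger idea Dr. Yesha describes—capturing every step of the way so results can be reverse engineered—can be sketched as a hash-chained provenance log. The `add_step`/`verify` functions and the pipeline steps below are invented for illustration; a real permissioned blockchain would add consensus across multiple institutions on top of this chaining.

```python
# A minimal sketch of ledger-based provenance: record each step of a data
# pipeline as an entry linked to the hash of the previous entry, so any
# later tampering with a recorded step breaks the chain and is detectable.
import hashlib
import json

def add_step(ledger, description, payload):
    """Append a provenance entry linked to the hash of the previous entry."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    body = {"step": description, "payload": payload, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    ledger.append({**body, "hash": digest})

def verify(ledger):
    """Recompute every hash and link; return False if anything was altered."""
    prev_hash = "0" * 64
    for entry in ledger:
        body = {k: entry[k] for k in ("step", "payload", "prev")}
        recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev_hash or recomputed != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

ledger = []
add_step(ledger, "ingest", {"source": "raw.csv", "rows": 1500})
add_step(ledger, "train", {"algorithm": "logistic-regression", "seed": 0})
print(verify(ledger))                 # -> True
ledger[0]["payload"]["rows"] = 9999   # tamper with a recorded step
print(verify(ledger))                 # -> False
```

Because every entry commits to its predecessor, a verified chain is an auditable record of exactly how a result was produced, which is the property reproducibility needs.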
Ken Goodman: A data science scientific solution, exactly—thank you for that.
Yelena Yesha: Right. We have a data science solution, exactly. That’s a good point. Yeah, absolutely.
Kenneth Goodman: So in our 11th hour we’re getting a couple of questions. I think, well . . .
Just a quick shout out: Evelyn [Cruz] and Helen [Gynell], Professors [Alberto] Cairo, [Nick] Tsinoremas, [Ben] Kirtman, and [Tim] Norris, in addition to Michael [Mannino] and me, have the privilege of working with IDSC.
But Alberto has asked: You mentioned Forbes published two articles . . .What a long fascinating journey—it’s more of an observation.
Daniel [Messinger], I’m wondering if you’re related to Esther?
Question [from Daniel]: How should graduate students aspiring to work in multidisciplinary data science, but really having general degrees, go about furthering their careers?
Yelena Yesha: Well, as I mentioned, data science has a whole spectrum of expertise requirements. So, if certain individuals come from, let’s say, a traditional economics degree or social science degree, there is a place for such a person to gain a professional master’s degree, which I advocate, that would not require the kind of traditional data science or computer science training.
Or perhaps, one thing that we should entertain at U Miami, and which is happening at other universities, is to have combined degrees, like minors in data science. For example—and I’m not picking on social sciences—let’s say you take a social science or sociology degree, which is actually a good degree for data science if you do geopolitical or sociological research. Then the type of data science that you need to learn is more about being an intelligent user of the tools.
This is what I want everybody to understand after my lecture: not everybody should really understand and be able to develop new AI algorithms. It’s enough for 80% of data scientists to be intelligent users of the technology, to understand which algorithms exist that can be applicable, and how to apply them properly. This also goes for our solid sciences, for the Medical School, for weather prediction models. We don’t need the people who predict our weather to be AI gurus and develop new algorithms. They need to know the state of the art, and they should be able to apply these models properly to their specific domain. This is not easy at all; even just that is, by itself, a huge challenge. But there is no need—it would be nice—but there is no need for an individual with some other degree to really go and become a computer scientist or AI expert. You need to understand enough to be an intelligent user.
Kenneth Goodman: And on that note, that kind of concluding remark, it makes clear that IDSC is trying to become a kind of universal donor to foster that, both at our institution and in our community. Dr. Yesha, thank you so very much for this. Michael, do you have any final remarks, or Nick, perhaps?
Michael Mannino: No, thank you very much, Dr. Yesha and Dr. Goodman, and everyone at IDSC. Stay tuned, everybody, for our next lineup of wonderful Meet a Data Scientist speakers on our website. We’ll get those announced very shortly. Thank you very much.