A team of graduate students from the Department of Population and Public Health Sciences at Keck School of Medicine of USC received the Best Visualization Award at a November 2019 Data Science Hackathon hosted by the Orange County R Users Group and UCI Paul Merage School of Business.
During the two-day event, the R-Noobs team – which included Temuulen Enebish, Doctor of Philosophy in Epidemiology candidate, Zhuolin Wang, Master of Science in Biostatistics candidate and Tingyu Yang, Master of Science in Applied Biostatistics and Epidemiology candidate – collaborated to analyze data sets and deliver a five-minute presentation on their findings within a 24-hour period. Thirteen teams made up of approximately 80 participants from all levels of expertise in statistics were evaluated for Best Visualization, Best Model, and Best Insight. Each team showcased their work for a chance to win prizes including $125 in Amazon Web Services credit, a 360-degree camera, wireless earbuds, and books about machine learning and data science. We asked Yang and Wang to reflect on their experiences.
Both Wang and Yang participated for the first time after fellow student and PhD in Biostatistics candidate, Zhi Yang, encouraged them to explore the opportunity. Despite initial hesitation, Yang discovered her education thus far had prepared her well for the challenge, saying “I was really nervous that I am not an expert in this field yet. Before, I thought it might be better for students with a computer science background to join this Hackathon competition, but actually I was totally wrong. We need to be confident, and we need get out of this comfortable circle on campus and push ourselves.”
The dataset the team received had to do with a company contacting potential customers. “The data is related to direct marketing campaigns from a Portuguese banking institution,” said Yang. “The marketing campaigns were based on phone calls.” The team found that contacting a potential client more than once was often required to earn their business, and that potential clients who had been previously marketed to were more likely to subscribe than those who had not been previously marketed to. “Our classification goal,” Yang said, “is to predict if the client will subscribe to a term deposit (variable y).”
In working with the dataset, “our team aimed to present both prediction models and well-organized visualization showcase for the dataset,” said Wang, explaining they “used the shortest time to find the three most associated variables with the outcome, and found their trend by time.” Yang concurred, “we wanted to find out whether there are some associations between characteristic variables of clients and whether they would subscribe to the product.” The team soon discovered factors that complicated the scenario. Said Yang, “we created the descriptive analysis and discussed the value of each variable, and found out that there might be some overlapping demographics in this dataset if we just select all the characteristic variables (also known as independent variables) to build our predictive model – we could not get reasonable results.” The team decided to “imply a sales strategy of reconnecting with customers every 3rd, 6th and 12th month,” said Yang, “If you had been marketed to in the past, you would have a high probability of subscribing again. In addition, of people who subscribed, the majority of them spent more time on the phone call than people who did not subscribe.”
To visualize the data and present prediction models, the team separated the dataset into two groups based on whether or not the individuals had been previously contacted, and excluded some unnecessary variables for prediction. “For data in which people had been contacted before, we conclude all variables for prediction,” said Yang. “We find that the random forest has the best performance, and there are several variables that are important for the prediction, such as ‘last contact duration’ and ‘outcome of the previous marketing campaign.’”
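For readers curious what the grouping step might look like in practice, here is a minimal sketch in Python, using only the standard library. It is not the team's code (they worked in R and Tableau), and it does not fit a random forest. The records, the field names `previously_contacted` and `subscribed`, and the helper `subscription_rate_by_group` are all hypothetical, made up purely for illustration. It simply splits toy rows by whether a person had been contacted before and compares the share who subscribed in each group.

```python
from collections import defaultdict

# Hypothetical toy records: field names and values are illustrative only
# and are NOT taken from the dataset used at the hackathon.
records = [
    {"previously_contacted": True, "subscribed": True},
    {"previously_contacted": True, "subscribed": True},
    {"previously_contacted": True, "subscribed": True},
    {"previously_contacted": True, "subscribed": False},
    {"previously_contacted": False, "subscribed": True},
    {"previously_contacted": False, "subscribed": False},
    {"previously_contacted": False, "subscribed": False},
    {"previously_contacted": False, "subscribed": False},
]


def subscription_rate_by_group(rows):
    """Return {previously_contacted: share of rows where subscribed is True}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [subscribed, total]
    for row in rows:
        tally = counts[row["previously_contacted"]]
        tally[0] += int(row["subscribed"])
        tally[1] += 1
    return {group: sub / total for group, (sub, total) in counts.items()}


rates = subscription_rate_by_group(records)
for group, rate in sorted(rates.items(), reverse=True):
    label = "previously contacted" if group else "not previously contacted"
    print(f"{label}: {rate:.0%} subscribed")
```

A comparison like this is only descriptive. A predictive model such as the random forest the team mentions would be fit separately on each group.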
When it came to creating a slideshow presentation, the team found the 24-hour time limit forced them to pare down to the essentials. “We actually didn’t have enough time to make beautiful looking slides after we ran the predictive model,” said Yang. “We wanted to focus more on our statistical analysis process and get reasonable results… on the second day, after Zhuolin and I finished the data visualization and discussed the predictive model results with Temuulen, we started working on PowerPoint slides and waited for the prediction results. We used the Tableau software and ggplot package and tidyverse package in R to make different histograms and line charts to visualize the correlation between variables. Then we selected some meaningful typical information to put into our slides. We added the predictive model results at the end.” Both students enjoyed the visualization component, and Wang especially liked using the Tableau software. Another constraint was the five-minute time limit for the presentation itself. “We didn’t have much time to explain our results from the beginning all the way through the final model,” said Yang, “so we explained the essential part of our analysis, as shown in our slides.”
Both Wang and Yang agreed the event was a valuable learning experience. Wang gained a new appreciation for the process of sharing results with others, saying “data scientists not only do predictions, [they] also need the ability to find some insight, [visualizing] the data to showcase to other people.” Yang stressed her enjoyment of the team experience, saying “teamwork just like an engine, everyone should be included in it. If one of part broke down, the whole engine might even stop. Thus, we are a team. We need not only finish our own task but also connect with each other since the statistical analysis is a chain.” She also urges others to be confident in their abilities. “Everyone needs to be confident. Like me – at first, I could not even think that I could win the prize, since there are so many experts and professionals in this competition… company management, PhD students from engineering schools or even post-doctoral students. I was only a masters student… but in the end, I found out that it doesn’t matter who you are, it does matter what you do. For the data visualization, we stayed until midnight in order to find the correlation between each variable and have some informative results. Finally, our hard work paid off.”
Yang goes on to talk about how the event gave her the opportunity to network and form new friendships, adding, “I made so many friends within different industries, from different schools… These precious experiences could not be obtained from on-campus life. I was very happy to have a chance to exchange my ideas with other professionals – on this project and on my analysis methods and get the feedback. In this society, we need to broaden our horizons beginning with expanding our friend circles.” The teamwork and idea exchange aspects of the competition indeed turned out to be Yang’s favorite part of the experience.
Both Yang and Wang suggest that others hoping to get involved in an event start by talking to others in the community. Wang says professors are a great source of information, and Yang suggests joining the LA R User Group on Slack or GitHub, talking with PhD students or going to an R user group meeting. Ultimately, Yang encourages her fellow students to participate however they can, saying, “I strongly recommend every student take part in it if they have chance in the future.”
Wang and Yang both plan to go on to complete PhDs in biostatistics. Wang looks forward to working as a biostatistician, and Yang sees herself as a professor or researcher in environmental genetics with the goal of improving quality of life.
— by Nicole Mercado and Carolyn Barnes