An interview with Yuanming Shan, SVP of Analytics and Data Science at Smule
It takes someone who really knows the data-driven decision-making process inside and out to understand where it can go wrong. That’s why we sat down with Yuanming Shan, a data analytics executive who has been with live music platform Smule since 2017 (and has held data-focused roles at Coursera and LinkedIn).
At Smule, Yuanming oversees data organization and uses analytics to improve the company in three ways: the analytics side (how to help business partners identify growth opportunities by using data to generate actionable insights); the machine learning side (improving the music experience by optimizing content customized to each user); and the infrastructure side (building stable frameworks that make data accessible). Yuanming says there is never a finish line with these tasks, as the best organizations view data-driven innovation as an ongoing process that seeks constant improvement.
Let’s say an issue is discovered. What does the discovery and resolution process look like from there?
If there’s a product issue, it’s detected by a data organizer or product manager, and from there we use a checklist to decide if it needs immediate attention. Is it a true data issue? If not, there could be multiple influences causing this pattern. Is it due to some A/B tests? Or an engineering performance problem, like the crash rate? We also look at external factors, like recent market changes, new competitors, or news announcements, to identify the potential reasons for a pattern we observe in the data. Kubit calls this benchmark analytics, and we feel this is extremely important to assessing a situation and acting on it the right way.
If the issue is really complicated, wider collaboration becomes crucial, and we will start to involve other functions--like the marketing department, which looks at acquisition efforts, and business development, which looks at competitors and app store activity. The biggest issues may require communicating with the leadership team for awareness and status updates, and potentially with other key stakeholders who can guide us in what they think is the best way to balance ROI.
With so many parties involved--and lots of data to look at--how do businesses typically communicate around data?
Every company has its own communication behaviors. In some styles, the product owner drives everything, or the analytics department does. I’d say email, Slack and face-to-face meetings are the most used means. If it’s really urgent, we book a war room. Meetings can be really effective if you know who should be involved.
With Slack and email, the issue can be that people are often talking about different things on the same thread. So while Slack is good, there are challenges when you want to reference something from days earlier--the search function is hard to use and can be painful. Some people use Google Docs, but it’s the same problem--several weeks later if we reference something, we have to search through our docs and may not remember the exact title of the document.
“One team can spend a lot of time gathering data only to turn around to the requesting team and hear that it’s not exactly what they were looking for.”
We hear a lot of businesses talking about data quality, but their ideas about what that means can differ. How would you define it? Where are teams most likely to stumble?
When one team gathers data for other teams to process, it involves a lot of business logic, aggregations, etc. that can cause data quality issues. Data quality is tied to accuracy--using the right data in a timely manner. Latency can compromise data quality, as can how the data is used; when people look at the data, they should look only at what’s useful and insightful to them.
A big issue is that data quality has many dimensions to it. Start upstream and define it together. When we try to gather information, the first question should focus on the goals. Why do we need this data? How do we get it? Be clear on the goals. Even for common terminology, there could be different understandings of the same term. The business development team could have a different understanding of what something means than the engineering team. This can cause issues after the data is gathered--one team can spend a lot of time gathering data only to turn around to the requesting team and hear that it’s not exactly what they were looking for.
We see this in the advertising issue around impression data. Do we all have a common definition for what qualifies as an impression--does it mean how many times an ad was viewed, or just how many times it popped up on the screen? Those terms need to be clarified before any actual work is done.
“I’d say roughly 20-30% of the time, businesses are making poor data-driven decisions because of data quality issues.”
How does a lack of data quality and data control typically affect a business, product, or consumer, and how widespread is it?
In the decade I’ve been working with data, my experience is that it happens often but doesn’t always cause huge issues. Over time, though, it will be a big cost to the company. There are two major types of issues. First, custom data is not “clean”; analysts and business-side people need to look at history, because meanings in fields change and certain labels are valuable in some time periods but not others. You need a lot of custom work on the data to keep it clean. But often, limited resources are sent toward building up or promoting the product. Over time you will see many companies--including the biggest ones--where the data is usually very messy. This is costly. And when people leave a company, that business is losing customized business logic that is not easily transferable.
The second issue isn’t necessarily catastrophic, but it is time-consuming--and that’s when companies see a drop in metrics. It could take an entire day of investigation to figure out it’s just an implementation issue and doesn’t reflect the truth.
I’d say roughly 20-30% of the time, businesses are making poor data-driven decisions because of data quality issues. Other factors account for the remaining 70-80% of bad data-driven decisions.
First, companies don’t collect enough data, and that can be due to storage restrictions, computing costs or privacy concerns. Data that is highly confidential can also be highly impactful. When there’s no data, there’s no way to measure data quality. On the other hand, we have too much data. About 99% of the raw data most teams collect is not that meaningful to them. We have to distill the insights from it and compress a petabyte of data into a mega- or kilobyte to show the 1% of data that really matters. It requires the ability to connect dots, combine with business sense, and go way beyond the data quality itself--which is why Kubit is so valuable in addressing businesses’ lack of analytical abilities to generate meaningful, actionable insights out of the huge amount of data we already collect.
Communication also plays a role--specifically, small talk. Those top-secret, influential conversations that take place between executives can have a huge impact on a business, and we aren’t able to capture that data.
“I like Kubit because it brings everything needed to ensure comprehensive data quality onto one platform, rather than having to use various platforms for different stages and needs within data analysis.”
What businesses especially need to maintain data quality? Do you feel the data analytics solutions currently on the market do a good job of promoting data quality?
It depends on the industry. Banking’s data, for example, is mostly transactional, so their solutions are strict and rigorous--meaning the data quality is often good, but the cost is huge.
For all companies’ analytics purposes, they need to ask why users are using their products or risk losing them. How can we promote data to increase platform engagement? Those analyses don't necessarily require 100% accuracy. In other industries, 95% data quality is acceptable--things like language and time zones will all change the baseline.
For any industry, Kubit provides really good data quality for day-to-day, data-based decisions. I like Kubit because it brings everything needed to ensure comprehensive data quality onto one platform, rather than having to use various platforms for different stages and needs within data analysis. It makes data analysis accessible and understandable so that businesses focus on the right data, not just any and all data.