Roles in Data Science Teams

AltexSoft
Data-driven organization. You’ve likely heard this buzz phrase hundreds of times. But what does it r...
Video Transcript:
[Music] In 2017, Netflix changed its five-star rating system to a simple thumbs-up/thumbs-down. Now the service was recommending movies based on a match percentage, and people hated it. How can we reduce all the nuance that lives in cinematic art to a primitive binary reaction? In reality, what Netflix found was that people were giving high ratings to those movies that they believed were good, not necessarily those they really enjoyed watching. At least, that's what the data said. So how does data analysis work in organizations like Netflix, and what are the roles in data science teams? This
is Gibson Biddle, a former VP and chief product officer at Netflix. When talking about consumer insights, he explained an unexpected customer behavior that led to changing the whole rating system. In shifting to percentage match, Netflix acknowledged that while you may rate a leave-your-brains-at-the-door Adam Sandler comedy only three stars, you enjoy watching it, and as much as you feel good about watching Schindler's List and giving it five stars, it doesn't increase your overall enjoyment. And keeping subscribers entertained is kind of critical for Netflix, so they simplified the feedback system to
avoid bias. But these insights into customers are impressive by themselves, and they wouldn't be possible without two things: the culture that fosters the use of data and a powerful data infrastructure. In tech jargon, it's called a data-driven organization. You've likely heard this buzz phrase hundreds of times, but what does it really mean? Netflix alone records more than 700 billion events every day, from logins and clicks on movie thumbnails to pausing the video and turning on subtitles. All this data is available to thousands of users inside the organization. Anyone can access it using visualization tools like
Tableau or Jupyter, or they can get to it via a big data portal, an environment that lets users check reports, generate them, or query any information they need. Then this data is used to make business decisions, from smaller ones, like which thumbnails to show you, to really serious ones, like which shows Netflix should invest in next. But Netflix isn't alone. According to some estimates, about 97 percent of Fortune 1000 businesses invest in data initiatives, including artificial intelligence and big data. Buzzwords again, but let's have a look at the real data infrastructure technology and the data engineers that
make it work. To describe how data infrastructure works, technicians borrowed the term from liquid and gas transportation. Similar to physical pipelines, data pipelines have their own origins, destinations, and intermediate stations, so it's a pretty apt metaphor. The origin of data may be anything from clicks on a reserve button and pull-to-refresh to conversation records with customer support, from vehicle tracking devices to turbine vibration sensors at power plants. In today's world, it's actually harder to say what cannot generate data than what can. Even no data can tell us something. Once the data item is generated, it
travels down its pipe to a staging area, right here. This is the place where all raw data is kept. Raw data isn't yet ready to be used; it must be prepared. You have to remove the errors from it, fill in the gaps, change its format, or merge data from different sources to get a more nuanced view. As soon as these operations are done, the data, now structured and clean, can continue on its journey. All these operations happen automatically. They are described in three words: extract, extracting data from its origin and getting it to a staging
area; transform, preparing data for use; and load, pushing prepared data further. ETL for short. All prepared data falls into another storage, a data warehouse. Unlike the staging area, a warehouse is a place where all stored records are structured and prepared for use, just like in a library with its classification system. Finally, you can query, visualize, and download information from a warehouse. To do that, you must have business intelligence, or BI, software. It presents data to final users, data analysts and business analysts, who carry out essential tasks. They access data, explore it, visualize it, and try
to make business sense of it. Did our marketing campaign work out well? What's our worst-performing channel? They act like a sensory system, supporting an organization with historical data and getting insights to management and, ultimately, anyone who makes decisions. Okay, who's in charge of building this whole pipeline? Traditionally, these specialists are called data engineers: mostly tech people adept at what's known as plumbing, moving data from its origins to destinations across the pipeline and transforming it on the way. They design pipeline architecture, set up ETL processes, configure the warehouse, and connect it with reporting tools. Airbnb
for instance has about 50 data engineers. Sometimes you might encounter a more granular approach with several extra roles involved. Data quality engineers, for instance, make sure that data is captured and transformed correctly; having biased or incorrect data is too expensive when trying to derive decisions from it. There may be a separate engineer responsible for ETL only, and also a business intelligence developer focusing solely on integrating reporting and visualization tools. However, reporting tools don't make headlines, and a data engineer wasn't called the sexiest job of the 21st century. But machine learning does, and a data scientist
was. What everybody knows is that data science is particularly good at taking data and answering complex questions about it. How much will the company earn in the next quarter? How soon will your Uber driver arrive? How likely is it that you'll enjoy Schindler's List the same as Uncut Gems? There are actually two ways of answering such questions. Data scientists make use of BI tools and warehouse data, as business analysts and data analysts do, so they would sit here and get the data from the warehouse. Sometimes data scientists would use a data lake, another type of
storage that keeps unstructured raw data. They'll create a predictive model and suggest a forecast that will be used by management: one-time reporting. It works for revenue estimates, but it doesn't help with predicting the Uber arrival time. The real value of machine learning is production models, those that work automatically and generate answers to complex questions regularly, sometimes thousands of times per second. And things are much more complicated with them. To make the model work, you also need an infrastructure, sometimes a big one. Have a look at this dramatic image. Not in the way most
people consider the meaning of this word, obviously, but for data scientists it really is dramatic. Notice this tiny box in the middle, which says... Yeah, let's zoom in, please. It says "ML code." The paper is called "Hidden Technical Debt in Machine Learning Systems," by Google engineers, and the image compares the amount of machine learning code to the rest of the systems that make machine learning code useful. Without them, this tiny box, however brilliant it may be, is just a relatively small piece of code in Python or in Java. But it's actually pretty hard to arrive
at this model. Data scientists explore data from warehouses and lakes, experiment with it, choose algorithms, and train models to come up with the final ML code. It takes a deep understanding of statistics, databases, machine learning algorithms, and a subject field. In his famous tweet, Josh Wills, former head of data engineering at Slack, said that a data scientist is the person who is better at statistics than any software engineer and better at software engineering than any statistician. What about the rest of those boxes? Okay, imagine yourself isolating and ordering food at Uber Eats. Once you confirm
your order, the app must estimate the time of delivery. Your phone sends your location, restaurant, and order data to a server with a delivery prediction ML model deployed. But this data isn't enough. The model also gets additional data from a separate database that contains, say, an average time for your restaurant to prepare a meal and a wealth of other details. Once all the data is here, the model returns a prediction to you. But the process doesn't stop there. The prediction itself gets saved in a separate database. Your delivery person shows up, and the real time of
arrival will also be captured to record the ground truth, check the model performance against it, and explore the model via analysis tools to update it later. All this data will eventually appear in a data lake and a warehouse. In reality, the Uber Eats service alone uses hundreds of different models working simultaneously to score recommendations, rank restaurants in search, and estimate delivery time. If you have that level of complexity, you also need a clever system to update the models, as well as to prioritize some models over others to manage computing resources. That's a lot to process. Usually
this job falls on the shoulders of data engineers or machine learning engineers. ML engineers take charge of the production side of things. They aren't as much into statistics and subject matter as data scientists, but they know how to configure production models, automate extraction of specific data from multiple sources, and verify data quality before use. Finally, if you run machine learning with hundreds of models deployed, you need a data architect role to make the work of the whole data platform consistent. This person would be responsible for the platform itself and its capabilities rather than how specific
models solve real-life problems. These six roles are those you frequently meet today, but things will be changing in the future. Look at how people imagined our time in 1982. If you glanced out of a window in 2019, when Blade Runner takes place, you didn't see the dystopian architecture, flying cars, or multi-story commercial holograms. In fact, the real future looks like this, or this, or even like this. You can't touch data. You'll have a hard time explaining what data means. But that's what defines the real future we're living in today, and data science and business
intelligence will soon be taken for granted. Adam Waxman, head of core technology at Foursquare, believes there won't be data scientists or ML engineers anymore. Since we'll keep automating model training and building production environments, much of the data science work will become a common function inside software development. Thank you for watching. If data is what you deal with every day, tell us more about your work in the comment section below. You may also send meaningful signals to YouTube's machine learning algorithms if you liked the video and want to see more.
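
As a supplement to the transcript, the extract, transform, and load steps it describes can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Netflix, Uber, or any other company mentioned builds its pipelines: the event fields (`user`, `action`, `ts`), the function names, the choice to treat timestamps as UTC, and the in-memory lists standing in for the origin, staging area, and warehouse are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical raw events as they might arrive from an app (the "origin").
RAW_EVENTS = [
    {"user": "u1", "action": "click", "ts": "2020-05-01T10:00:00"},
    {"user": "u2", "action": None, "ts": "2020-05-01T10:00:05"},   # gap to fill
    {"user": "u3", "action": "pause", "ts": "not-a-timestamp"},    # error to remove
]


def extract(origin):
    """Extract: copy raw records from the origin into a staging area."""
    return [dict(record) for record in origin]


def transform(staged):
    """Transform: remove bad records, fill gaps, and unify the format."""
    clean = []
    for record in staged:
        try:
            # Assumption: timestamps are ISO 8601 strings in UTC.
            ts = datetime.fromisoformat(record["ts"]).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # remove records whose timestamp cannot be parsed
        clean.append({
            "user": record["user"],
            "action": record["action"] or "unknown",  # fill in the gap
            "ts": ts,
        })
    return clean


def load(prepared, warehouse):
    """Load: push prepared records into the warehouse (a plain list here)."""
    warehouse.extend(prepared)
    return warehouse


def run_etl(origin, warehouse):
    """Run the three steps in order: extract, transform, load."""
    return load(transform(extract(origin)), warehouse)


warehouse = run_etl(RAW_EVENTS, [])
```

After the run, `warehouse` holds only the structured, cleaned records, which is the state a BI tool would query in the flow the transcript describes. A real pipeline would replace the lists with actual storage and schedule the steps to run automatically.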