So, the sole purpose of data engineering is to take data from the source and save it to make it available for analysis. Frankly, it sounds so simple it's hardly worth talking about: you click on a video, and YouTube saves this event in a database. The exciting part is what happens after: how will YouTube use its machine learning magic to recommend other videos to you? But let's rewind a bit. Was it really that simple to put your click into a database? Let's have a look at how data engineering works.

[Music]

Okay, imagine a team with an application. The application works fine, traffic grows, and sales keep coming in. They track results in Google Analytics, the CRM, an application database, and maybe a couple of extra tools they bought to spice up the quarterly PowerPoint. And of course, there's this one quiet guy who's an absolute beast at Excel spreadsheet analytics. Great. At this point, their analytics data pipeline looks like this: there are several sources of data and a lot of boring manual work to move this data into an Excel spreadsheet.

This gets old pretty fast. First, the amounts of data become larger every month, along with the appetite for it. Maybe the team will add a couple more sources or data fields to track; there's no such thing as too much data when it comes to analytics. And of course, you have to track dynamics and revisit the same metric over and over again to see how it changes month after month. It's so 90s. The days of the analytics guy start resembling the routine of a person passing bricks one at a time. There's a good quote by Carla Geyser from Google: "If a human operator needs to touch your system during normal operations, you have a bug."

So before the guy burns out, the team decides to automate things. First, they print the quote and stick it on the wall. Then they ask a software engineer for help, and this is the point where data engineering begins. It starts with automation, using an ETL pipeline. The starting goal is to automatically pull data from all sources and give the analytics guy a break. To extract data, you would normally set up an API connection, an interface to access data from its sources. Then you have to transform it: remove errors, change formats, map the same types of records to each other, and validate that the data is okay. And finally, you load it into a database, let's say MySQL. Obviously, the process must repeat itself every month or even every week, so the engineer will have to write a script for that. It's still a part-time job for the new data engineer, nothing to write home about, but congratulations: there it is, a simple ETL pipeline.
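In practice, that first ETL job is often just a few functions glued together and run on a schedule. Here is a minimal Python sketch of the idea; the API endpoint, credentials, and table and column names are invented for illustration, not taken from the video:

```python
# A minimal sketch of the weekly ETL job described above.
# The API endpoint, credentials, and table/column names are hypothetical.
import pandas as pd
import requests
from sqlalchemy import create_engine


def extract() -> list[dict]:
    # Extract: pull raw records over an API connection.
    resp = requests.get("https://api.example-crm.com/v1/sales", timeout=30)
    resp.raise_for_status()
    return resp.json()["records"]


def transform(records: list[dict]) -> pd.DataFrame:
    # Transform: remove errors, fix formats, validate.
    df = pd.DataFrame(records)
    df = df.drop_duplicates(subset="order_id")            # collapse duplicate records
    df["created_at"] = pd.to_datetime(df["created_at"])   # normalize date formats
    df = df[df["amount"] > 0]                             # drop obviously broken rows
    return df


def load(df: pd.DataFrame) -> None:
    # Load: append the cleaned batch into MySQL.
    engine = create_engine("mysql+pymysql://etl_user:secret@localhost/analytics")
    df.to_sql("sales", engine, if_exists="append", index=False)


if __name__ == "__main__":
    load(transform(extract()))  # run weekly or monthly, e.g. from cron
```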
To access the data, the team would use so-called BI tools, business intelligence interfaces: those great dashboards with pie charts, horizontal and vertical bars, and of course a map. There's always a map. Normally, BI tools come integrated with popular databases out of the box, and it works great. All those diagrams get populated with fresh new data every week to analyze, iterate, improve, and share. Since there's convenient access to insights, the culture of using data flourishes. Marketing can now track the whole sales funnel, from the first visit to a paid subscription, the product team explores customer behavior, and management can check high-level KPIs. It all feels like the company has just put on glasses after years of blurriness.

The organization starts becoming data-driven. The team can now make decisions, see the results of their actions, and receive insights via business intelligence interfaces. Actions become meaningful: you can now see how your decisions change the way the company works.

And then everything freezes. Reports take minutes to return, some SQL queries get lost, and the current pipeline doesn't seem like a viable option. It's so 90s again. The reason this happens is that the current pipeline uses a standard transactional database. Transactional databases like MySQL are optimized to rapidly fill in tables; they are very resilient and great for running an app's operations, but they aren't optimized for analytics jobs and processing complex queries. At this point, the software engineer must become a full-time data engineer, because the company needs a data warehouse.

Okay, what's a data warehouse?

[Music]

For the team, this is the new place to keep data instead of a standard database: a repository that consolidates data from all sources in a single central place. Now, to centralize this data, you must organize it somehow. Since you're pulling, or ingesting, data from multiple sources, there are multiple types of it: these may be sales reports, your traffic data, or insights on demographics from a third-party service. The idea of a warehouse is to structure the incoming data into tables, and then tables into schemas, the relationships between different data types. The data must be structured in a meaningful way for analytics purposes, so it will take several iterations and interviews with the team before arriving at the best warehouse design. But the main difference between a warehouse and a database is that a warehouse is specifically optimized to run complex analytics queries, as opposed to the simple transactional queries of a regular database.
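To make that difference concrete, here is a rough sketch of the kind of query each system is built for; the table and column names are made up for illustration:

```python
# A transactional database answers many small, targeted queries like this,
# touching a handful of rows at a time:
transactional_query = """
    SELECT status FROM orders WHERE order_id = 48151623;
"""

# A warehouse is optimized for analytical queries that scan and aggregate
# millions of rows across joined tables (a fact table plus dimensions):
analytical_query = """
    SELECT d.year, d.month, c.country, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d     ON f.date_id = d.date_id
    JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY d.year, d.month, c.country
    ORDER BY d.year, d.month, revenue DESC;
"""
```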
With that out of the way, the data pipeline feels complete and well-rounded. No more lost queries and long processing: the data is generated at the sources, then automatically pulled by ETL scripts, transformed and validated on the way, and finally populates the tables inside the warehouse. Now the team, with access to business intelligence interfaces, can interact with this data and get insights. Great. The data engineer can now focus on improvements and procrastinate a bit, right? Well, until the company decides to hire a data scientist.

So let's talk about how data scientists and data engineers work together. A data scientist's job is to find hidden insights in data and make predictive models to forecast the future, and a data warehouse may not be enough for these tasks. It's structured around reporting on metrics that are defined in advance, so the pipeline doesn't process all the data; it uses just those records that the team thought made sense at the moment. Data scientists' tasks are a bit more sophisticated, which means that a data engineer has more work to do. A common scenario sounds like this: a product manager shows up and asks a data scientist, "Can you predict the sales for Q3 in Europe this year?" Data scientists never make bold promises, so her response is, "It depends." It depends on whether they can get quality data. Well, guess who's responsible now. Besides maintaining and improving the existing pipelines, data engineers would commonly design custom pipelines for such one-time requests. They deliver the data to the scientist and call it a day.

Another type of system needed when you work with data scientists is a data lake. Remember that the warehouse stores only structured data aimed at tracking specific metrics. Well, a data lake is the complete opposite: it's another type of storage, one that keeps all the data raw, without pre-processing it or imposing a defined schema. The pipeline with a data lake may look like this: the ETL process now changes into extract, load into the lake, and then transform, because it's the data scientist who defines how to process the data to make it useful. It's a powerful playground for a data scientist to explore new analytics horizons and build machine learning models, so the job of a data engineer is to enable the constant supply of information into the lake.
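A minimal ELT sketch might look like this: raw payloads are dumped into the lake untouched, and the transform step only happens once a data scientist decides what "useful" means. The paths and the event source below are invented, and a local folder stands in for what would usually be cloud object storage such as S3:

```python
# Extract and Load raw data into the lake as-is; Transform comes last.
import json
import pathlib
from datetime import datetime, timezone

import pandas as pd
import requests

LAKE = pathlib.Path("datalake/raw/events")  # stand-in for object storage


def extract_and_load() -> pathlib.Path:
    # Extract + Load: save the raw payload without touching it.
    payload = requests.get("https://api.example-app.com/v1/events", timeout=30).json()
    LAKE.mkdir(parents=True, exist_ok=True)
    path = LAKE / f"{datetime.now(timezone.utc):%Y-%m-%d_%H%M%S}.json"
    path.write_text(json.dumps(payload))
    return path


def transform_for_model(path: pathlib.Path) -> pd.DataFrame:
    # Transform: defined later, by whoever needs the data.
    events = json.loads(path.read_text())["events"]
    df = pd.DataFrame(events)
    return df[df["event_type"] == "click"]


if __name__ == "__main__":
    raw_file = extract_and_load()
    clicks = transform_for_model(raw_file)
    print(len(clicks), "click events ready for modeling")
```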
Lakes are artifacts of the big data era, when we have so much diverse and unstructured information that capturing and analyzing it becomes a challenge in itself. So what is big data? Well, it's an outright buzzword, used mindlessly everywhere, even when somebody hooks a transactional database to a BI interface. But there are more concrete criteria that professionals use to describe big data. Maybe you've heard of the four V's. They stand for volume, obviously; variety, since big data can be either structured and aligned with some schema, or unstructured; veracity, because data must be trusted, and that requires quality control; and velocity, because big data is generated constantly, in real time. So companies dealing with real big data need a whole data engineering team, or even a big data engineering team, and they wouldn't be running some small application. Think of payment systems that process thousands of transactions simultaneously and must run fraud detection on them, or streaming services like Netflix and YouTube that collect millions of records every second.

Being able to run big data means approaching the pipeline in a slightly different manner. The normal pipeline that we have now pulls the data from its sources, processes it with ETL tools, and sends the data into the warehouse to be used by analysts and other employees that have access to BI interfaces. Data scientists use the data available in the warehouse, but they also query a data lake with all the raw and unstructured data; their pipeline would be called ELT, because all transformations happen after the data gets loaded into storage. And then there's some jungle of custom pipelines for ad hoc tasks. But why doesn't this work for big data that constantly streams into the system? Let's talk about data streaming.

Up to this moment, we've only discussed batch data. This means that the system retrieves records on some schedule, every week, every month, or even every hour, via APIs. But what if new data is generated every second and you need to stream it to the analytical systems right away? Data streaming uses a way of communication called pub/sub, or publish and subscribe. A little example here: think of phone calls. When you talk on the phone with someone, it's likely that you're fully occupied by the conversation, and if you're polite, you'll have to wait until the person on the other side finishes their thought before you start talking and responding. This is similar to the way most web communication works over APIs: the system sends a request and waits until the data provider sends a response. This is synchronous communication, and it gets pretty slow if the sources generate thousands of new records and you have multiple sources and multiple data consumers.

Now imagine that you use Twitter. Tweets get added to your timeline independently, and you can consume this information at your own pace. You can stop reading for a while and then come back; you'll just have to scroll more. So you control the flow of information, and several sources can feed you data asynchronously. Pub/sub enables this kind of asynchronous conversation between multiple systems that generate a lot of data simultaneously. Similar to Twitter, it decouples data sources from data consumers: instead, the data is divided into different topics (profiles, in Twitter terms), and data consumers subscribe to these topics. When a new data record, or event, is generated, it's published inside the topic, allowing subscribers to consume this data at their own pace. This way, systems don't have to wait for each other and send synchronous messages; they can now deal with thousands of events generated every second. The most popular pub/sub technology is Kafka. Not this Kafka... yes, this one.
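Here is a minimal pub/sub sketch using the kafka-python client. It assumes a Kafka broker is already running on localhost:9092 and that a "clicks" topic exists; these are illustrative choices, not anything prescribed in the video:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publisher side: the application fires an event and moves on; it does not
# wait for any consumer to read it.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("clicks", {"user_id": 42, "video_id": "abc123", "action": "click"})
producer.flush()

# Subscriber side (typically a separate process): reads events from the topic
# at its own pace, independently of the producers.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics-loader",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. forward it into the lake or a stream processor
```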
Another approach used in big data is distributed storage and distributed computing. What is distributed computing? Well, you can't store petabytes of data that are generated every second on a laptop, and you likely won't store it on a single server either. You need several servers, sometimes thousands, combined into what's called a cluster. A common technology used for distributed storage is called Hadoop, which means... well, it actually means nothing; it's just the way a two-year-old named his toy elephant. But the boy happened to be the son of Doug Cutting, the creator of Hadoop. So Hadoop is a framework that allows for storing data in clusters. It's very scalable, meaning that you can add more and more computers to the cluster as your data gargantua keeps growing. It also has a lot of redundancy for securing information, so even if some computers in the cluster burst into flames, the data won't be lost. And of course, ETL and ELT processes require specific tools to operate on Hadoop clusters. To make the stack feel complete, let's mention Spark, the popular data processing framework capable of this job.
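As a rough idea of what that processing step can look like, here is a small PySpark sketch that reads raw events from distributed storage, aggregates them across the cluster, and writes an analytics-friendly result. The HDFS paths and column names are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-click-rollup").getOrCreate()

# Read raw events straight from the lake on distributed storage.
clicks = spark.read.json("hdfs:///datalake/raw/events/")

# Aggregate clicks per video per day; the work is spread across the cluster.
daily_views = (
    clicks
    .filter(F.col("action") == "click")
    .groupBy("video_id", F.to_date("timestamp").alias("day"))
    .count()
)

# Write a columnar copy where the warehouse or BI layer can pick it up.
daily_views.write.mode("overwrite").parquet("hdfs:///warehouse/daily_video_views/")

spark.stop()
```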
Finally, this is what an advanced pipeline of a company operating big data would look like. You stream thousands of records simultaneously using pub/sub systems like Kafka. This data gets processed with ETL or ELT frameworks like Spark, and then it gets loaded into lakes and warehouses, or travels further down custom pipelines. And all of the data repositories are deployed on clusters of several servers that run tools for distributed storage like Hadoop. But this isn't nearly the end of the story: besides data scientists and analytics users, the data can be consumed by other systems, like machine learning algorithms that generate predictions and new data.

So, the sole purpose of data engineering is to take data from the source and save it to make it available for analysis. Sounds simple, but it all comes down to the system that works under the hood. When you click on a YouTube video, this event travels through a jungle of pipelines and is saved in several different storages, some of which will instantly push it further to suggest your next video recommendations using machine learning magic.

[Music]

Talking about magic: check our previous video, which has more information about data science and the teams that work with data. Thank you for watching.