hello and welcome to this session on data science my name is mohan and today we are going to take a look at what this buzz is all about so what is the agenda for today we will talk about what is the need for data science and then what exactly is data science some definitions and also understand the differences between data science and business intelligence then we'll talk about the prerequisites for learning data science and then what does a data scientist do what are the activities performed by a data scientist as a part of his daily
life and then we will talk about the data science life cycle with a quick example and briefly touch upon the demand or ever increasing demand for data scientists all right so let's get started now you must have already heard about autonomous cars i'm sure you must be excited to have a car driving by itself which will take you from home to office or office to home right and that's one of the examples where data science is used now the car needs to take a lot of decisions in this whole process whether to speed up
whether to apply the brake take a left turn right turn or slow down so all these decisions are basically a part of data science and there is a study that says that self-driving cars will minimize accidents and in fact it will root out more than 2 million deaths caused by car accidents annually self-driving cars right now there's a lot of research and there is a lot of testing going on and not a lot of cars are yet in production in terms of usage but it's going to happen every automotive company worth its name is investing
in self-driving cars so in about 10 to 15 years some of the studies say that most of the cars will be autonomous or self-driving cars where else are there issues for example if we take airlines this is another area where data science contributes in a major way flights get delayed due to weather conditions because the weather is not predicted in time and the demand of passengers is probably not foreseen ahead of time for all these you need data science then there could be improper route planning and some customers might miss some flights that again needs data
science and similarly it could be incorrect decisions in selecting the right equipment so which plane should fly in which route that's the equipment that's being mentioned here if that is not planned properly then you might end up in a situation where the plane is not available whereas you have planned for a flight in a particular route so these are some of the challenges in one of the representative industries we are talking about which is airlines so if we use data science properly all of these or most of these problems can be avoided and that will
help in reducing the pain both for the airlines and also for the passengers a few more examples what else can we do here are some of the other things that we can do and we will stick to the airline industry we can do better route planning so that there are fewer cancellations and fewer frustrated people we can use predictive analytics and predict any delays so that some flights can be rescheduled ahead of time and there are no last minute changes data science can also be used to make promotional offers and the
last but not least is what kind of planes should be used or the different classes of planes that should be used in different routes for better performance so these are some examples of how data science can be used in airlines and another example or another industry where data science can be used and benefited would be in logistics so companies like fedex they use data science models to increase their efficiency drastically to optimize the routes and cut costs and so on so before their delivery truck actually sets out they determine which is the best possible route
to ship their items to the customers and based on various inputs they also predict or come up with the best suited time to deliver and last but not least they determine what is the best mode of transport for this delivery as well so what is data science used for these are some of the main areas where data science is used for better decision making there are always tricky decisions to be made so which is the right decision which way to go so that is one area then for predicting for performing predictive analysis like for example
can we predict delays like in the case of airlines can we predict the demand for certain products let's say in e-commerce that is the second area the third area is pattern discovery or pattern recognition is there a pattern in which people are buying items for example it could be seasonality if you take the sales data for multiple years there may be a pattern in the way people are buying that's a buying pattern certain months probably the sales will increase certain months the sales will come down certain quarters traditionally the sales will be higher certain
quarter so that is a pattern and this pattern discovery is another area where data science is applied so what is data science now let's take an example a real life example on a day-to-day basis we use or we we try to make some decisions let's say we want to buy some furniture online for our new office so how do you go about doing this you need to take a bunch of decisions to actually do the purchase so we start with which website which portal or which website you should use so we try to find out
let's say you want to buy the furniture obviously you don't go to an online grocery store because you need furniture and there are several websites so that is the first decision you need to take which website should i use so once we have multiple websites you kind of discard all the websites which don't sell furniture and you stick to those websites which sell furniture now within that we try to find out what are the ratings of these websites if the ratings are high that means they are reliable the quality probably is good and so on
and so forth so only then you want to buy from that particular website so anything that doesn't satisfy these criteria you close all those websites close in the sense you close the browser right so you still are left with maybe a few of these which satisfy your criteria that is websites that sell furniture and have a rating of four and above and then you look for discount is somebody providing a discount greater than 20 percent then again you filter out some of them which probably are not providing any discount and zero down
to one or two websites which are probably providing those discounts and go ahead and select the furniture and purchase it so this is a very basic example you probably don't always follow this exactly the same way but it is just to illustrate and drive home the point so we can answer a lot of questions using data science for example when we book a cab to go from location a to location b what is the best route that the cab can take to reach in the fastest way or in the least amount of time
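One way to picture the cab question is as a shortest-path search over a road map where every road carries an estimated travel time. The sketch below uses Dijkstra's algorithm, which is just a choice made for illustration and not something this session prescribes. The place names, the `fastest_route` helper and the minutes are all made up.

```python
import heapq

def fastest_route(roads, start, goal):
    """Return (total_minutes, path) for the quickest route, or (None, []) if unreachable."""
    queue = [(0, start, [start])]  # (minutes so far, current place, path taken)
    visited = set()
    while queue:
        minutes, place, path = heapq.heappop(queue)
        if place == goal:
            return minutes, path
        if place in visited:
            continue
        visited.add(place)
        for neighbor, travel_time in roads.get(place, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (minutes + travel_time, neighbor, path + [neighbor]))
    return None, []

# Hypothetical road map: travel time in minutes between places.
roads = {
    "A": {"market": 10, "highway": 4},
    "market": {"B": 5},
    "highway": {"market": 3, "B": 15},
    "B": {},
}

best_minutes, best_path = fastest_route(roads, "A", "B")
print(best_minutes, best_path)
```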
there could be several factors there could be traffic there could be bad road there could be weather now all these come as inputs and a decision needs to be taken as to which is the best route another example is tv shows so netflix and maybe even other a lot of other tv channels they have to perform this analysis to find out what kind of shows people are viewing what kind of shows people are liking and and so on and so forth so that they can then sell this information to advertisers because their main source of
revenue is advertising so this is again major function of data science predictive maintenance we need to find out will my car break down will my refrigerator break down in the next year or two years should i be prepared to buy a new refrigerator you can potentially apply data science here as well and then in politics a lot of data science is applied in politics you must have seen on tv about u.s elections or in uk or even in india nowadays everybody is applying data science in elections and trying to capture the votes or rather the
voters to influence the voters by providing personalized messages and so on and so forth and not only that people use data science to even predict who is going to win the elections it's a different matter that probably not all predictions come out to be true but then yes they use data science to do these predictions so what is the process or what are the various steps in data science the first step is asking the right question and exploring the data basically you want to know what exactly is the problem
you're trying to solve so that is asking the right question so that is the this circle out here then the next step is after exploring the data so as a first step you will ask some questions what exactly is the problem you're trying to solve and then obviously you will have some data for that as input and you perform some exploratory analysis on the data for example you need to clean the data to make sure everything is fine and so on and so forth so that all that is a part of exploratory analysis and then
you need to do the modeling let's say if you have to perform machine learning you need to decide which algorithm to use and which model to use and then you need to train the model and so on and so forth so that's all part of the modeling process and then you run your data through this model and then through this process and then you come out with the final results of this exercise which includes visualizing the results and preparing a way to communicate the results to the concerned people so it could be in the form
of powerpoint slides or it could be in the form of a dashboard which is basically what we call as a visualization and so all the insights that have been gathered through this exercise that has to be communicated in a proper way in an easy to understand way which is again a key part of this whole exercise communicating the results so let's now talk about the difference between business intelligence and data science now business intelligence was one of the initial phases where people started making or wanted to make some sense out of data so for some
of you who may not be aware there were multiple phases of this it revolution so initially there was all automations you had automation of your selling process manufacturing process you had your erp systems your crm systems and and so on uh which are basically enterprise resource planning and customer relationship management erp and crm and all these enterprise applications were generating a lot of data so then people started saying that okay we need to understand or get some information out of this data so that's how business intelligence started so if we take from a data source
perspective let's compare these on each of these criteria the criteria are what is the data source what is the method what are the skills and what is the focus now if we compare business intelligence with data science this is how it looks as far as the data source is concerned business intelligence was primarily using structured data so you had all your enterprise applications like erp crm and so on and they were working out of pretty much rdbms relational database management systems like oracle or mssql mysql and so on so all this data was structured
in the neat form of tables rows and columns and then they were all brought into a centralized place because remember these were still different applications so they were working off different databases in silos so if you wanted to get a combined view you needed to create what is known as a data warehouse and bring all this data together and then look at it in a uniform way so this is what business intelligence was doing pretty much it was structured data and it had reports and dashboards that was pretty much what was there in business intelligence
now with data science in addition to structured data we also use a lot of unstructured data for example web blogs or comments if we are talking about customer feedback there is a structured part and there is an unstructured part where people write free text data science includes that as well and brings everything together and then performs analysis so data source wise that is the difference now method wise business intelligence pretty much is analytical in the sense that okay you have some data we are just trying to present the truth and mostly what has happened historical data that's it
in case of data science we go beyond that we go deeper in terms of finding why a certain behavior has occurred and also go beyond just providing a report there is a deeper statistical analysis that is done that is what is the scientific part and deeper insights are gathered not just reporting so that's from a method perspective from a skills perspective business intelligence needs a little bit of statistics but more of visualization because they primarily consisted of dashboards or they primarily consist of dashboards and reports whereas in data science the visualization of course is there
but there is a lot more of statistics involved because we are looking at things like correlation we are looking for example if we perform machine learning we try to do regression try to predict what will be the sales may be in the future and so on and so forth so it is much more involved in case of data science compared to business intelligence the skills are many more compared to business intelligence and last but not least what is the focus focus of business intelligence is pretty much historical data so the sales have happened based on
the sales till today you try to come up with a report what was my sales maybe for this whole year or maybe for the last five years and so on and so forth in data science you take historical data but you also combine that with maybe some other required information and you also try to predict the future so we try to extrapolate maybe the sales and say okay as of today sales is 5 million and based on the historical information we see that sales increase on a
maybe i don't know monthly basis 10 percent that is what history says so our sales for next month will be this much that is about 5.5 million right so that is the focus of data science it goes beyond just reporting so what are the prerequisites for data science there are three essential traits required to be a data scientist one is curiosity you need to be able to ask questions the first step in data science is asking questions what is the problem we are trying to solve if you ask the right question only then you'll get the right answer very
often this is a very crucial step where a lot of data science projects fail because you're you may be asking the wrong question and then obviously when you get the answer that's not the answer you're looking for so it is very important that you ask the right question needless to say then the second part or the second trait is common sense so you need to be creative you need to come up with ways to use the data that you have and try to solve the business problem on hand in many cases you may not have
all the data that you need in many cases the data may be incomplete so that is where you need to come up with ways what are the best ways to fill these gaps wherever this is missing and that's where common sense comes into play last but not least after doing all this analysis if you are unable to communicate the results in the right way the whole exercise will fail so communication is a key trait for a data scientist maybe technically you may be a genius but then if you are unable to communicate those results in
a proper way once again that will not help so these are the three main traits curiosity common sense and communication skills in a way you can say these are the three c's okay so what are the other prerequisites first one is machine learning machine learning is the backbone of data science data science involves quite a bit of machine learning in addition to the basic statistics that we do so a data scientist needs to have a good grasp of or needs to be very good at machine learning the second part is modeling so modeling is also a
part of machine learning in a way but you need to be good at identifying what are the algorithms that are more suitable to solve a given problem what models can we use and uh how do we train these models and so on and so forth so that is the second component then statistics statistics is like the core foundation of data science so you need to understand statistics and you need to have a good hang of statistics in order to be a good data scientist and this will also help in getting good results programming is to
some extent required at least some program or the other would be required as a part of executing the data science project the most common programming languages are python and r python specially is becoming a very popular programming language in data science because of its ease of learning because of the multiple libraries that it supports for performing data science and machine learning and so on so python is by far one of the most popular languages in data science if any one of you is wanting to learn a new language that should be python and then of
course you need to understand databases how databases work and how to handle databases how to get data out of databases and so on so these are some of the key components of data science now coming to the tools and skills that are used in data science these are some of the skills from a language perspective it is python or r and from a skills perspective in addition to some of the programming languages it would help if you have a good knowledge or good understanding of statistics and what are the tools that are used in data
analysis sas is one of the most popular tools it's been there for a very long time and that's the reason it is very popular however compared to most of the other tools it is proprietary software whereas python and r are mostly open source the other tools are like jupyter notebooks and rstudio these are more development environments and development tools so jupyter notebooks is an interactive development environment similarly rstudio is for writing r code and performing analytics and data analysis and machine learning activities you can perform in
rstudio it has a very nice ui and initially r was not so popular primarily because it did not have a user interface and rstudio is a relatively new addition and after the advent of rstudio r became extremely popular and there are other tools like matlab and of course some people work with excel as well as far as data warehousing is concerned some of the skills that are required are etl so in order to extract data and transform and load it etl stands for extract transform load so you have data in the databases like your erp system or a
crm system you need to extract that and then do some transformations and then load it into your warehouse so that all the data from various sources looks uniform then you need some sql skills which is basically querying the data writing sql queries hadoop is another important skill especially if you are handling large amounts of data and also one of the specialities of hadoop is that it can be used for handling unstructured data as well so it can be used for large amounts of structured and unstructured data then spark is an excellent computing engine for performing data
analysis or machine learning in a distributed mode so if you have a large amount of data the combination of spark and hadoop can be extremely powerful so you store your data in hadoop hdfs and use spark as your computation engine it works in a distributed mode similar to hadoop like a cluster so those are excellent skills for data warehousing and there are some standard tools that are available like informatica datastage talend and also aws redshift if you want to do something on the cloud aws redshift is again a good tool data visualization
tools for data visualization some of the skills that would be required are let's say r which provides some very good visualization capabilities especially during development and then you have python libraries matplotlib and so on which provide very powerful visualization capabilities and that is from a skills perspective whereas tools that can be used are tableau which is a very very popular visualization tool again that's a proprietary tool so it's a little expensive maybe but excellent capabilities from a visualization perspective then there are tools like cognos which is an ibm product which provides very good
visualization capabilities as well and then coming to the machine learning part of it the skills required there are python which is more for the programming part and then you will need some mathematical skills like algebra linear algebra especially and then statistics and maybe a little bit of calculus and so on and the tools that are used for machine learning are spark mllib and apache mahout and on cloud if you want to do something you can use microsoft azure ml studio as well so this is by no means an exhaustive list there are actually many many tools
and probably a few more skills also may be there but this gives a quick overview a summarization of the tools and skills now moving on to the life of a data scientist what does a data scientist do during the course of his work so let's see so typically a data scientist is given a problem a business problem that he needs to solve and in order to do that if you remember from the previous slide he basically asks the question as to what is the problem that he needs to solve so that
is the first thing he has got the problem then the next thing is to gather the data that is required to solve this problem so he goes about looking for data from anywhere it could be the enterprise very often the data is not provided in the nice format that he would like to have it or we would like to have it so first step is to get whatever data that is possible what is known as raw data in whatever format so it could be enterprise data it could be there is a probably a requirement to
go and get some public data in some cases so all that raw data is collected and then that is processed and analyzed and prepared into a format in which it can be used and then it is fed into the analytics system be it a machine learning algorithm or a statistical model and we get the output and then he puts this output in a proper format for presenting it to the stakeholders and communicating those insights or the results to the stakeholders so this is a very high level view of a day in the
life of a data scientist so gathering data raw data performing some quick analysis on that and maybe processing or manipulating this data to bring it into a certain good format so that it can be used for the analysis feeding this into that analysis system that has been designed be it mathematical models machine learning models and then get the results the insights and then present it in a nice way so that the stakeholders can understand how about machine learning algorithms so let's see what are the various machine learning algorithms that would be required for a data
scientist so these are a few of the algorithms again this is not an exhaustive list regression is one of the supervised learning models or techniques so in case of regression you try to let's say come up with a continuous number so the difference between regression and let's say classification is that in case of classification those are discrete values whereas here we are talking about regression where let's say you are trying to predict the temperature which is a continuous value or the share price which is a continuous value so that is regression so
you need to know what is regression how to perform regression and we need to understand clustering so clustering is one of the unsupervised learning techniques in this case there is no labeled data available you get some data and then you want to put this into some shape so that you can analyze it and try to make sense out of it one example is let's say you have a list of cricketers and they have not been marked as bowlers and batsmen or all rounders or whatever right so you just have
their names and maybe how many runs they scored how many wickets they have taken and so on but there is no readily available information saying that okay this person is a batsman this person is a bowler and so on so i'm talking about cricket hopefully most of you are familiar with the game of cricket so how do we find out so then we put this into a clustering mechanism and then the system will say that okay these are the people who have all scored good amount of runs so they belong to one
cluster these are all the people who have taken good amount of wickets so they belong to one cluster and maybe here are some people who have taken good amount of wickets and they have made good amount of runs so they may be belonging to one group and then we take a look at it and then we label them okay people who have scored many runs we label them as batsmen people who have taken a lot of wickets we label them as bowlers and people who have
taken good amount of wickets and also made some good runs we label them as all rounders but the system will just say okay this is cluster one cluster two cluster three the names we give the human beings have to give the names now decision tree is used for what is known as classification primarily it can also be used for regression but by and large it is used for classification and here again it's a very logical way in which the algorithm goes about classifying the various inputs one of the biggest advantages of decision tree is that
it's very easy to understand and it's very easy to explain why a certain object has been classified in a certain way compared to maybe some of the other mechanisms like say support vector machines or logistic regression and so on so that's the advantage of decision tree and it is also a very popular algorithm then we have support vector machines primarily for classification purpose and then we have naive bayes this is a statistical probability based classification method so these are a few algorithms there are a few more that are not listed here but there are some more
algorithms as well and by the way there are more detailed or there are detailed videos about each of these algorithms available you can check in the playlist so now let's talk about the life cycle of a data science project okay the first step is the concept study in this step it involves understanding the business problem asking questions get a good understanding of the business model meet up with all the stakeholders understand what kind of data is available and all that is a part of the first step so here are a few examples we want to
see what are the various specifications and then what is the end goal what is the budget is there an example of this kind of a problem that has been maybe solved earlier so all this is a part of the concept study and another example could be a very specific one to predict the price of a 1.35 carat diamond and there may be relevant information inputs that are available and we want to predict the price the next step in this process is data preparation data gathering and data preparation also known as data munging or sometimes it
is also known as data manipulation so what happens here is the raw data that is available may not be usable in its current format for various reasons so that is why in this step a data scientist would explore the data he will take a look at some sample data maybe there are millions of records pick a few thousand records and see how the data is looking are there any gaps is the structure appropriate to be fed into the system are there some columns which are probably not adding value may not be required for the analysis
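A quick look of this kind can be sketched in plain Python. This is only a rough illustration: the records, the column names and the `count_gaps` helper are invented here, and the idea is simply to pull a random sample and count the empty cells in each column.

```python
import random

# Hypothetical raw records; None or "" marks a gap in the data.
records = [
    {"customer_name": "Asha", "age": 34, "city": "Pune", "spend": 1200.0},
    {"customer_name": "Ben", "age": None, "city": "Leeds", "spend": 450.0},
    {"customer_name": "Chen", "age": 41, "city": "", "spend": None},
    {"customer_name": "Dev", "age": 29, "city": "Delhi", "spend": 980.0},
]

def count_gaps(rows):
    """Count missing cells (None or empty string) for each column."""
    gaps = {}
    for row in rows:
        for column, value in row.items():
            gaps.setdefault(column, 0)
            if value is None or value == "":
                gaps[column] += 1
    return gaps

random.seed(0)  # fixed seed so the sample is repeatable
sample = random.sample(records, k=3)  # with millions of rows, k would be a few thousand
print(count_gaps(sample))

gap_counts = count_gaps(records)  # gaps per column over the full toy data set
print(gap_counts)
```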
very often these are like names of the customers they will probably not add any value or much value from an analysis perspective the structure of the data maybe the data is coming from multiple data sources and the structures may not be matching what are the other problems there may be gaps in the data so the data all the columns all the cells are not filled if you're talking about structured data there are several blank records or blank columns so if you use that data directly you'll get errors or you will get inaccurate results so how
do you either get rid of the data or how do you fill these gaps with something meaningful so all that is a part of data munging or data manipulation so these are some additional sub topics within that so data integration is one of them if there are any conflicts in the data or the data may be redundant data redundancy is another issue let's say you have data coming from two different systems and both of them have a customer table for example customer information so when you merge them there is a duplication
issue so how do we resolve that so that is one data transformation as i said there will be situations where data is coming from multiple sources and then when we merge them together they may not be matching so we need to do some transformations to make sure everything is similar we may have to do some data reduction if the data size is too big you may have to come up with ways to reduce it meaningfully without losing information then data cleaning so there will be either wrong values or null values or there are missing values
so how do you handle all of that a few examples of very specific stuff so there are missing values how do you handle missing values or null values here in this particular slide we are seeing three types of issues one is missing value then you have null value you see the difference between the two right so in the missing value there is nothing it is blank in the null value it says null now the system cannot handle it if there are null values similarly there is improper data so it's supposed to be a numeric value but there is a string or
a non-numeric value so how do we clean and prepare the data so that our system can work flawlessly so there are multiple ways and there is no one common way of doing this it can vary from project to project it can vary from what exactly is the problem we are trying to solve it can vary from data scientist to data scientist organization to organization so these are like some standard practices people come up with and and of course there will be a lot of trial and error somebody would have tried out something and it worked
and they will continue to use that mechanism so that's how we need to take care of data cleaning now what are the various ways of doing it you know if values are missing how do you take care of that now if the data set is very large and only a few records have some missing values then it is okay to just get rid of those entire rows for example so if you have a million records out of which 100 records don't have full data so there are some missing values in about 100 records so it's absolutely fine
because it's a small percentage of the data so you can get rid of the entire records which have missing values but that's not a very common situation very often you will have a large number of records with missing values for example out of a million records you may have 50 000 records which have missing values now that's a significant amount you cannot get rid of all those records your analysis will be inaccurate so how do you handle such situations so there are again multiple ways of doing it one is you can probably
if values are missing in a particular column you can probably take the mean value for that particular column and fill all the missing values with the mean value so that first of all you don't get errors because of missing values and second you don't get results that are way off because these values are completely different from what is there so that is one way then a few others could be either taking the median value or depending on what kind of data we are talking about something meaningful will have to be put in there
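To make the mean and median idea concrete, here is a minimal sketch in plain Python. It is an illustration under simple assumptions, not a standard recipe: the `fill_missing` helper and the sample numbers are made up, and a gap is represented as `None`.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to base a fill value on
    fill = mean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in values]

ages = [30, None, 40, 50, None]  # made-up column with two gaps
filled_mean = fill_missing(ages, "mean")
filled_median = fill_missing([10, None, 20, 90], "median")
print(filled_mean)
print(filled_median)
```

Notice that in the second list the median (20) sits much closer to most of the values than the mean (40) would, which is one reason the median can be the more meaningful fill when a column has a few extreme values.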
if we are doing some machine learning activity then obviously as a part of data preparation you need to split the data into training and test data sets the reason being if you try to test with a data set which the system has already seen as a part of training then it will tend to give reasonably accurate results because it has already seen that data and that is not a good measure of the accuracy of the system so typically you take the entire data set the input data set and split it into two parts and again
the ratio can vary from person to person based on individual preferences some people like to split it 50-50 some people like it as 66.6 and 33.3 which is basically two-thirds and one-third and some people do it as 80-20 80 for training and 20 for testing so you split the data perform the training with the 80 percent and then use the remaining 20 percent for testing all right so that is one more data preparation activity that needs to be done before you start analyzing the data or putting the data through the model then the next step is model planning
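As a rough sketch of the 80-20 split described above, the following Python snippet shuffles the records and cuts them into a training part and a test part. The function name `train_test_split`, the `test_fraction` and `seed` parameters, and the toy `records` list are assumptions for illustration; libraries such as scikit-learn ship their own, more complete version of this utility.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and return (train_set, test_set)."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 100 records.
records = list(range(100))
train_set, test_set = train_test_split(records, test_fraction=0.2)
print(len(train_set), len(test_set))  # 80 20
```

Shuffling before cutting matters because records are often stored in some order (by date, by customer, by price), and an unshuffled split could leave the test set unrepresentative of the data the model was trained on.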
now these models can be statistical models or they could be machine learning models so you need to decide what kind of models you're going to use again it depends on what problem you're trying to solve if it is a regression problem you need to think of a regression algorithm and come up with a regression model so it could be linear regression or if you're talking about classification then you need to pick an appropriate classification algorithm like logistic regression or decision tree or svm and then you need to train that particular model so that
is the model building or model planning process and the cleaned up data has to be fed into the model and apart from cleaning in order to determine what kind of model you will use you may also have to perform some exploratory data analysis to understand the relationship between the various variables and see if the data is appropriate and so on right so that is the additional preparatory step that needs to be done so a little bit of detail about exploratory data analysis so what exactly is exploratory data analysis it is basically as
the name suggests you're just exploring you have just received the data and you're trying to explore and find out what the data types are whether the data is clean in each of the columns and what the maximum and minimum values are so for example there is out of the box functionality available in tools like r so if you just ask for a summary of the table then for each column it will give some details as to what is the mean value what is the maximum value and so on and so forth
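To make the idea of a per-column summary concrete, here is a minimal Python imitation of what such a summary reports. It is a sketch under stated assumptions: the `summarize` helper, the toy `table`, and the convention that `None` marks a missing value are all made up for this example, and real tools report more statistics than these four.

```python
from statistics import mean

def summarize(table):
    """Return min, max, mean and the missing-value count for each numeric column."""
    summary = {}
    for column, values in table.items():
        observed = [v for v in values if v is not None]
        summary[column] = {
            "min": min(observed),
            "max": max(observed),
            "mean": mean(observed),
            "missing": len(values) - len(observed),
        }
    return summary

# Hypothetical table stored as column name -> list of values; None marks a gap.
table = {
    "carat": [0.5, 1.0, 1.5, None],
    "price": [1500, 4000, 9000, 12000],
}
stats = summarize(table)
for column, column_stats in stats.items():
    print(column, column_stats)
```

A summary like this is often the quickest way to spot columns with many missing values or with minimum and maximum values that look implausible.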
this exercise or this exploratory analysis is to get an understanding of your data and then you can take steps for example if during this process you find there are a lot of missing values you need to take steps to fix those you will also get an idea about what kind of model to use and so on and so forth what are the various techniques used for exploratory data analysis typically these would be visualization techniques like histograms box plots and scatter plots so these are very quick ways of identifying the
patterns or a few of the trends of the data and so on and then once your data is ready and you have decided on the model what kind of model what kind of algorithm you're going to use if you're trying to do machine learning you need to pass your 80 percent the training data or rather you use the training data to train your model and the training process itself is iterative so you may have to perform the training process multiple times and once the training is done and you feel it is giving good accuracy then you
move on to test so you take the remaining 20 percent of the data remember we split the data into training and test so the test data is now used to check the accuracy or how well our model is performing and if there are further issues let's say during testing the accuracy is still not good then you may want to retrain your model or use a different model so this whole thing again can be iterative but if the model passes the test then it can go into
production and it will be deployed all right so what are the various tools that we use for model planning r is an excellent tool in a lot of ways whether you're doing regular statistical analysis or machine learning or any of these activities r along with rstudio provides a very powerful environment to do data analysis including visualization it has a very good integrated visualization or plot mechanism which can be used for doing exploratory data analysis and then later on to do detailed analysis and machine learning and so on and so forth then of
course you can write python programs python offers a rich set of libraries for performing data analysis and machine learning and so on matlab is a very popular tool as well especially in education as it is a very easy to learn tool so matlab is another tool that can be used and then last but not least sas sas is again very powerful it is a proprietary tool and it has all the components that are required to perform very good statistical analysis or perform data science so those are the various tools that would be required for or that
can be used for model planning and so the next step is model building so we have done the planning part we said okay what is the algorithm we are going to use what kind of model we are going to use now we need to actually train this model or build the model rather so that it can then be deployed so what are the various ways or what are the various types of model building activities so it could be let's say in this particular example that we have taken you want to find out the
price of a 1.35 carat diamond so this is let's say a linear regression problem you have price data for diamonds of various carats and you use that information to create a linear regression model which can then predict the price for 1.35 carats so this is one example of model building and then a little bit of detail on how linear regression works so linear regression is basically coming up with a relation between an independent variable and a dependent variable so it is pretty much like coming up with the equation of a
straight line which is the best fit for the given data so like for example here y is equal to mx plus c where m is the slope and c is the intercept so y is the dependent variable and x is the independent variable we need to determine the values of m and c for our given data so that is what the training process of this model does at the end of the training process you have certain values of m and c and those are used for predicting the values for any new data that comes all right so the way it works is we use
the training data set to train the model and then validate whether the model is working fine or not using the test data and if it is working fine then it is taken to the next level which is putting it in production if not the model has to be retrained if the accuracy is not good enough then the model is retrained maybe with more data or you come up with a newer model or algorithm and then repeat that process so it is an iterative process once the training and testing is completed then this model
is deployed and we can use this particular model to determine what is the price of a 1.35 carat diamond remember that was our problem statement so now that we have the best fit for this given data we have the price of a 1.35 carat diamond which is 10,000 so this is one example of how this whole process works now how do we build the model there are multiple ways you can use python for example and use libraries like pandas or numpy to build the model and implement it this will be available as a separate tutorial a
separate video in this playlist so stay tuned for that moving on once we have the results the next step is to communicate these results to the appropriate stakeholders which is basically taking these results and preparing something like a presentation or a dashboard and communicating these results to the concerned people so getting the results of the analysis is not the last step as a data scientist you need to take these results and present them to the team that has given you this problem in the first place and explain your findings explain the findings
of this exercise and recommend maybe what steps they need to take in order to overcome this problem or solve this problem so that is pretty much it for communication and once that is accepted the last step is to operationalize so if everything is fine and your presentation as a data scientist is accepted then they put it into practice and thereby they will be able to improve or solve the problem that they stated in step one okay so quick summary of the life cycle you have a concept study which is basically understanding the problem asking the right questions and trying to
see if there is enough data to solve this problem and then even maybe gather the data then data preparation the raw data needs to be manipulated you need to do data munging so that you have the data in a proper format to be used by the model or our analytics system and then you need to do the model planning what kind of a model what algorithm you will use for a given problem and then the model building so the exact execution of that model happens in step four and you implement and execute that model
and put the data through the analysis in this step and then you get the results these results are then packaged presented and communicated to the stakeholders and once that is accepted it is operationalized so that is the final step now in the end let's take a quick look at the demand for data scientists data science is an area of great demand the demand for data scientists is currently huge and the supply is very low so there is a huge gap so what are some of the industries with high demand for data scientists i
think gaming is definitely one area it's a consumer facing industry a lot of people play games it's a growing industry and it requires a lot of data science so that is an area where data scientists are in demand then we have health care for example data science is used for diagnosis and several other activities within healthcare for example predicting a disease so healthcare is definitely one then finance definitely banks insurance companies in all of these there is a huge demand for data scientists marketing is like a horizontal functionality across all industries there's a demand
for data scientists there then of course in the technology area so in pretty much all of these areas there is a lot of demand globally there is a huge demand so this is a very very critical skill that would be required currently as well as in the future so let's summarize what we have seen so far we talked about the need for data science what data science can do and what is data science and what are the prerequisites of data science in terms of the skills and programming languages and tools and so on and so forth we
also talked about the various tools that are available like python and r and we did a comparison also between business intelligence and data science and we did a detailed discussion about the life cycle of a data science project with an example and last but not least we talked about the demand for data scientists the global demand there's a huge demand for data scientists we talked about that as well so with that we come to the end of this session thank you very much for watching this video and if there is any feedback any comments
or any questions please put it below and we will get back to you provide your contact information or email so that we can respond to you and thank you very much once again and have a good day bye bye