hi everyone i welcome you all to the live session on the data analytics full course by intellipaat this session is conducted by multiple experts who will be teaching you all about data analytics from the basics to an advanced level but before we begin the session make sure to hit the subscribe button and also hit the bell icon so that you never miss an update from us now let's see the agenda firstly we will begin with an introduction to data analysis where we will explain what data analysis is why we require it and
the types of data analysis once we finish that we will explain the life cycle of data analysis then we will tell you how to become a data analyst what skills are required and the job and career prospects of a data analyst after that we will learn about pandas and data analysis using pandas then we will learn about exploratory data analysis and numpy followed by a very quick hands-on demo on how to create a numpy array after that we will see the difference between a data analyst and a data
scientist then finally we will be covering data analyst interview questions and answers so this is the agenda of the video without any further delay let's get started we've all heard the term data analytics somewhere and it pretty much rings a bell every time we think about it right so analytics with respect to data well what we are trying to analyze is what we need to know here right so the formal definition is that it's basically extracting meaningful information from raw data it's as simple as that so we have data right so consider anything
to be data if that is not useful to you at that point it is considered as data but then if you perform some operation on this data and make sure that it is useful for your organization this is when the data becomes information and this information is what is usable for you right so the process of pretty much converting raw data into information uh by performing certain analysis on it can be termed as data analytics guys i mean this is just a very rough definition of what goes in the industry to give you a better
insight of what that means basically it is the pursuit of extracting meaning from raw data using specialized computer systems okay and then these systems transform organize and model the data and the end goal is to basically draw conclusions from all of the data so again as i've mentioned with the raw data example our goal here is to draw conclusions to identify patterns and to make use of this raw information in a better way i mean sure data can just be used in its raw form but then not in every case
right to give you an example so think of the data probably you understand so let's say you're looking at a huge data set which contains thousands of values those are just numbers right so having these numbers sure as a data analyst you will understand what you're doing but then let's say you need to explain it to a person who doesn't know what data analytics is or you need to explain it to someone probably uh your peer or someone above you as well let's say you have a business meeting where you have to explain this giving
the numbers will not really be that great right it will not make a very good presentation so converting all of these numerical data making it into graphs making it in a way where everyone understands the data and showing out graphically whatever insights that you can generate from the data that can be considered as data analytics as well guys again uh today basically the field of data analytics is ever so growing rapidly and why is it growing so rapidly well because the market demand for it is that much so we have pretty much every uh you
know startup companies these days have a requirement for big data as well so big data to dumb down the definition pretty much means a huge amount of data that cannot be handled by just one machine if you're stretching out your data across multiple nodes and you have data coming in from various sources then you need a method to handle all of this big data that's coming in to process the data and then to perform analysis on it guys so this has been in demand in the market for a while and for the
last couple of years data analytics has been booming and it provides a very straight answer to all of the questions being raised about how all of this can be handled guys and of course we're going to need people who have the skills needed to work with data queries to translate all these numbers into graphs and to make insightful analysis on it right so that is the main role of a data analyst so what is the need for
data analytics if you have to break it down into three points guys so again we already know that it is on the rise right so i can pretty much step out into the open and tell you that it is not that it will soon be an integral part of an organization it is already an integral part of every organization there is guys so the first important need is again we already know that it is a top priority for all of these organizations ranging from the small organizations all the way
to the big guys and this is needed for very good decision making so let's say you are confused about a decision you have to take then looking at the analytics will make your decision making a little bit easier and the decision will be better validated as well guys and the second need is to acquire new revenue right so let's say you have data which pretty much only a couple of people can make sense
of and understand let's say your market is very niche in that case sure you'll still be making revenue you'll still be making money on it but then at the end of the day if you have to reach out to a broader audience you need to make sure your data is understood by this broader audience right so that forms the second very important need for data analytics guys and the third one is obviously to decrease the operation costs for every organization so it can be a very demanding task if you have to convert raw data into information
process the data and make very good visualizations out of it as well if you have the workforce for it then sure you can do it but until a couple of years ago it surely was a manual task and these days you can implement machine learning and data analytics to pretty much automate it and make your job easier as well it speeds up your work and it keeps you very efficient in such a way that your data is being processed and visualized very effectively so again
time is money when it comes to big organizations right so on that note it pretty much helps decrease all of the operation costs with respect to you know either waiting in the data pipeline or waiting to process waiting to publish waiting to do some analytics waiting for visualization all of these waits you need not consider them anymore because they will pretty much be reduced to nothing guys so on that note you might be wondering who data analysts are right so data analysts are the people who sit at the end
of the chain i'll just explain the chain in the next slide so these guys form a very good workforce for the company where they basically deliver value by taking in all the data that a data scientist or a data engineer might give them and then the data analyst uses all of this to answer a couple of questions and communicate all of the results back and all these results are basically used to take very good business decisions okay
it might include analytics of past trends it may be prediction of what the future looks like and so much more and then the common tasks done by data analysts are data cleaning performing analytics and then creating visualizations well data cleaning is basically simple guys so again data is raw information right so we clean it to make sense out of it and to pick out only the information that we require and again efficiency is the game here if you're processing data that you will not require then it's
just a waste of time and resource so cleaning the data making sure that your data is just perfect enough for processing that's that's a very important first step guys so this is the spearhead of a data analyst's role and then the second one is obviously as soon as your data is clean you need to perform some very good analytics on it and create very good visualizations guys so these four designations you see on the screen right now is basically the designations the data analyst is also known by uh so the first is the business analyst
so data analyst is also a business analyst uh you know he or she can be an operations analyst a business intelligence analyst and a database analyst as well guys so coming to the tasks that they pretty much do the first important task that we already spoke about was cleaning and organizing uh raw unstructured data right so this forms this is again a very overlooked concept in today's world unless you get your hands dirty with the data itself uh when you start to uh to go about doing that you will realize that cleaning and organizing the
raw data is extremely important because at the end of it your data might be unstructured your data might be semi-structured or it might be a structured data this will not matter if the data you're processing is of no use at the end guys so cleaning and organizing raw data is very important and the second thing is the analysis of all of the hidden trends found in the data so making sense of something again maybe predictions in the future or looking at past trends uh you know generating this sort of an information which you cannot figure
out upfront just by looking at the data this is pretty much the umbrella under which the hidden trends part of a data analyst's work falls so you'll be looking at the data and you'll find something very interesting so let's say you find a trend in your data which shows that you can access a new side of the market with respect to your sales in the next five or ten years you were not told this but your data was telling you this so making sense of the data in which this trend was found
which was not up front then this is again a very important skill of a data analyst as well guys and the third important thing is pretty much the big picture view or using descriptive statistics at the end of it sure we know what the past trends of the company are we know how the company is going on now but then if you need to uh just to summarize all of this and say even add some predictions for the next five ten years getting this big picture view of what's happened what's happening now and what will
happen in the future not just using very simple statistics but using descriptive statistics where we'll be picking up each aspect of what's going on and then performing analytics on it to find out whether we are right at this place or wrong at this point how this can be improved how this concept can be checked how we can reach the clients better and so much more so getting this big picture view is again a very important task a data analyst goes about doing guys and the fourth
one which is probably the most important thing a data analyst does is the creation of dashboards and visualizations guys so as i already told you in the introduction part of the video we're starting out with raw information again a lot of numbers but if you have to show these numbers at a business meeting yeah i mean numbers are numbers some people like them but a majority of the people might not understand the numbers so taking these numbers as your input creating very good visualizations giving them a user interface and giving
your customers your peers your superiors and your business meeting members a very good user experience with all of this data will at the end of it add up to very good business methodologies and it will help you take better business-driven decisions and again the same goes for the presentation of results to your clients and your internal teams as well guys so coming to the chain of how the data moves around in a firm i have three important roles that i just want to walk you through guys in a quick gist
so the first person is the one who spearheads the data for an organization by spearheading i mean he's the first one to look at the data and he'll be responsible for bringing in all of the data from various sources guys so data coming in from various sources right so think about all the data you can get from twitter if you're performing any sentiment analysis think of all the data that can come from all of the big data sources you can have data coming from your various nodes from hadoop and so much more you can have your
data coming from your own network away from your network and so much more guys the data engineer is the spearhead who handles bringing in the data and making it understandable by the organization guys and then as soon as the data engineer finishes his part with the data the data moves on to the data scientist the data scientist uh he's responsible for working on all of these raw data converting it into valid information but then we already told this for the data analyst as well well the data scientist pretty much uses machine learning algorithms he uses
deep learning he uses let's say binary classification or naive bayes he makes use of so many concepts here that a data analyst might not know and he uses a lot of machine learning and deep learning as i've already mentioned to convert this raw data into valid information and 99 percent of the time the information it is converted into is numbers guys so after the data scientist goes about doing his magic on the data we have the data analyst who steps in this person again as we've already been mentioning is
responsible for the prediction of what happens in the future that is finding out trends and then the presentation of all this information to your peers to your clients to your superiors and so much more this is the basic gist of how data moves around so the data is first seen by the data engineer he does his job then it comes to the data scientist the data scientist works his magic with all these fancy algorithms and whatnot and then pushes the data to the data analyst the data analyst performs predictions sees trends visualizes
the data makes it presentable for everyone to understand and then eventually goes about analyzing the data guys so on that note we can check out the types of data analytics that can be done in today's businesses okay there are a couple of types of data analytics and i'll be walking you through the same so the first type of data analytics you can go about doing is descriptive analytics with respect to descriptive analytics again the quick one phrase answer to what descriptive analytics is is basically
picking up data from a source and summarizing your data making sure your data can be understood by everyone that is picking out very good insights from your data from all the past events and then keeping it ready for any predictive analytics on top of that as well so your data is descriptive at the end of the day and the question to ask to understand descriptive analytics is what happened in my business so the answer to this question what happened in my business is given by descriptive analytics guys
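To make the "what happened in my business" question a bit more concrete, here is a minimal Python sketch of a descriptive summary. The month names and revenue figures are made up purely for illustration, and the `describe` helper is a hypothetical name, not something from the session.

```python
from statistics import mean, median

# Hypothetical monthly revenue for one course (illustrative numbers only).
monthly_revenue = {
    "jan": 12000,
    "feb": 13500,
    "mar": 12800,
    "apr": 15100,
    "may": 17400,
    "jun": 16900,
}

def describe(revenue_by_month):
    """Summarize past revenue; descriptive analytics only looks backward."""
    values = list(revenue_by_month.values())
    return {
        "total": sum(values),
        "mean": mean(values),
        "median": median(values),
        "best_month": max(revenue_by_month, key=revenue_by_month.get),
        "worst_month": min(revenue_by_month, key=revenue_by_month.get),
    }

summary = describe(monthly_revenue)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A summary like this says nothing about why revenue moved or what will happen next; it only condenses the past into a few numbers that are easier to present than the raw values.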
most of the time uh the data which is generated by descriptive analytics is very comprehensive it is extremely accurate and the visualizations are very effective as well so here's a very simple scenario that you guys can consider so consider a scenario or you know where an e-learning website decides to focus on a trending course according to the analytics of the search volume of the course content uh which pretty much the users go on searching and all of the revenue generated by the course in the past few months as well so there are certain technologies again
we live in a world full of trends right so let's say data science is booming right now so pretty much any e-learning company or any company for that matter is gonna find these trends they're gonna know that hey data science is sitting in the top tier and we need to do something about it if they have students in data science then most of these companies aim at making the students' lives better right so even as a firm here at intellipaat that's exactly what we do as well we get
this amazing insight from all of our learners for any particular course and we use all of those insights all of the feedback that we get and what's trending in the market what's latest in the market and we use all of these to perform analytics and all of this helps us to put out a better course as well guys so at the end of it you can already see how descriptive analytics is already helping our business right and then the next type of data analytics we can check out is
diagnostic analytics guys so we already checked out what happened in my business with respect to descriptive analytics with respect to diagnostic analytics you will check out why this particular thing is happening to my business guys to simplify it again it basically gives you the ability to drill down to the root cause of why something is behaving the way it is guys so drilling down into the data to identify a certain set of problems which are present in your data or in the trends that the data shows is a very vital part of diagnostic analytics
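The drill-down idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the regions and sales numbers are invented, and `drill_down` is a hypothetical helper that breaks an overall change into per-segment changes so the biggest contributor stands out.

```python
# Hypothetical sales by region for two consecutive months (made-up numbers).
last_month = {"north": 500, "south": 420, "east": 610, "west": 380}
this_month = {"north": 495, "south": 410, "east": 350, "west": 385}

def drill_down(previous, current):
    """Return the change per segment, sorted so the biggest drop comes first."""
    changes = {segment: current[segment] - previous[segment] for segment in previous}
    return sorted(changes.items(), key=lambda item: item[1])

total_change = sum(this_month.values()) - sum(last_month.values())
by_region = drill_down(last_month, this_month)
worst_region, worst_change = by_region[0]

print(f"total change in sales: {total_change}")
for region, change in by_region:
    print(f"{region}: {change:+d}")
print(f"biggest contributor to the drop: {worst_region}")
```

The overall number only says that sales fell; splitting it by segment points to where to look next, which is the "why" part of the analysis. In practice the same split could be done by product, by sales team or by any other dimension the data carries.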
guys and then it basically helps in answering the question of why the issue has occurred you know so you take a look at the data and at the analytic results of the data to understand what the cause behind a problem is if one should exist guys so again to give you a very simple scenario of how diagnostic analytics would work let's say consider a company where the sales went down for a month so let's say they weren't doing as well as in the previous months so to diagnose this particular problem you will
let's say consider the situation where a number of employees were quitting their jobs and so they weren't bringing in a lot of sales so this number of people quitting the company could directly impact the sales that they brought into the company so the sales went down this month because the company did not have the people to bring in the sales right so finding out why this is happening and hunting for that root cause which is basically a treasure hunt with your data where you go on finding
something and taking insights from your data finding out that why part of it is extremely important when you're finding out what's happening to the data and why it's behaving this way guys so this again is a very vital part of diagnostic analytics and coming to the third type of data analytics it is predictive analytics guys and as the name suggests the quick question you'll be asking is what will happen in the future based on the past trends guys so hunting out you know historical patterns that are being used
to basically predict all these specific outcomes using let's say machine learning algorithms using deep learning and using so many more concepts to predict something in the future uh based on the data we have now is again a very important niche of data analytics guys so predicting the future trends and the possibilities in the market based on the current trends self-explanatory and then it helps in optimizing all of the business plans for the future because now your business has a direction to head to right so you're predicting certain aspects in the future and you know which
path leads to better business strategies right and this will give you an edge over all of the other businesses as well to give you a very nice example think of the netflix recommendation system guys so this will basically use statistical modeling to analyze all the content that is being watched by the audience across the world so let's say there are a couple of tv shows which are very famous in india but then there might be a couple of tv shows which are you know trending in the united states and let's say
australia united kingdom and so much more making sure that the right users in the right geographical area get the right recommendations is extremely important for netflix as a business right so again predictive analytics will just do them good there guys again it will provide a prediction of the upcoming content and the shows that are likely to be watched by all the different classes of audience as well so let's say someone sitting in the united states is very interested in watching a trending indian television series as well so netflix finding
out that this person exists and to recommend an indian tv show for this person sitting in the united states to watch again this is another very good business opportunity for netflix and then if they go about nailing their recommendation system for everyone it just makes them a better business model and then it just makes a better user environment for the users to go about watching videos there guys so the next one the next type of data analytics is prescriptive data analytics uh so as soon as we think of a prescription again uh we think of
a doctor we think of a medical appointment or something right here it's again something similar the question which will be asked here is what should be done so applying advanced analytical algorithms to ensure that you make very good recommendations and that you're putting out very good strategies that help the business and so much more is a very vital part of prescriptive analytics guys so it basically involves breaking down all of this complex information into a very simple set of steps and these steps are like the prescription we are handed when we visit
the doctor right and these prescriptions are precautionary as well that is the precautions are basically used to remove any future problems that might occur and this also builds on predictive analysis which helps to predict outcomes and eventually that will help us in optimizing the business as well guys so this again demands the use of artificial intelligence and big data and so the data analyst will always be in touch with the data engineer and the data scientist as well because he's going to need
all the help he can get with respect to artificial intelligence from the data scientist and help with respect to big data from the data engineer as well right so all these three guys working in coordination make up the business as i've already told you so again to give you a common scenario of how this would happen consider the prescriptive data analytics that google maps goes about doing right so every day we're commuting from our office to our homes or let's say we're going on a vacation and then we need to get out
of the city as quickly as possible so google maps has a wonderful api where they map out the best possible route considering live traffic conditions weather conditions road closures and so much more it considers your distance and traffic constraints and again as i already told you it even considers if it's raining if you're walking you know what's the best route that you can take to walk what's the best route to take if you're in a car and now they've come up with the bike route as well so if you're on a moped or a bike
you can take a different route compared to the person who'd commute by car and so much more so making this kind of a prediction into the future and giving you this sort of a prescription to ensure you don't run into any problems along the way is a very important part of prescriptive data analytics guys so on that note let's quickly check out what the life cycle of data analytics is like so the data analytics life cycle basically defines the analytics process and all of the best practices which go on
from the discovery of the data or the project till the completion of this project guys so there are a couple of steps involved in this life cycle process and the first one is business understanding at the end of it again understanding the purpose and all of the requirements that come from your business and understanding them from the business viewpoint is vital for the functioning of a business right this also consists of a very good introductory plan it consists of a decision plan and it consists of a formal to-do list
let's say for the business to go on to achieving the target so the first important uh thing about the life cycle is the understanding of what's going on around you and the second one is the data understanding yes with respect to data understanding again this mainly involves the process where we're collecting the data and we're processing the data in a way which leads to analytics and then after the analysis of the data is done we need to uh pick up some insights that we can you know go about using from the data so extracting all
of these meaningful insights from the data again is a very vital uh step in the life cycle of data analytics through data understanding guys uh the third part of it is data preparation so data preparation is the uh let's say converting the data from an unstructured form to a structured form and this involves constructing let's say a data set as well and then this data set will be provided and fed into a model and then this model will be used by a machine learning algorithm to train to understand to see what's going on to perform
predictions and then it will be given to the data analysts to be visualized using tableau or any other business intelligence tools and so much more so data preparation again is a very important phase where the data is actually being transformed and even at the stage the cleansing of the data is pretty much performed as well guys and then comes modeling this step is very important because it involves the selection of various modeling techniques you know applying all of these modeling techniques uh making sure the parameters are right and all the readings here are right to
ensure that the raw data being converted into information is within let's say the optimal tolerance that can be used for your business usage and then as soon as your modeling part of it is done as soon as the techniques are applied and all the parameters are set then comes evaluation guys so evaluation is a very important phase where again the model which has been built will be tested very rigorously based on
what you've built in the initial stages and so many tests will be performed on the data as well so evaluating something that you have generated and performing various tests on it is extremely important most of the time this is overlooked but these days everyone knows the value of evaluating your model guys so this again involves reviewing all of the steps that were carried out to construct this particular model and performing tests on it is very important and with evaluation that's exactly what we do
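As one possible illustration of the evaluation phase, here is a small Python sketch that fits a simple model on part of the data and tests it on points it has never seen. The session does not say which evaluation method is used, so the hold-out split, the straight-line model and the numbers are all assumptions made for the example.

```python
# Hypothetical (x, y) observations, e.g. ad spend vs. sales (made-up numbers).
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1),
        (6, 12.0), (7, 13.8), (8, 16.1), (9, 18.2), (10, 19.9)]

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(points, slope, intercept):
    """Average absolute gap between predicted and actual values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

# Hold out the last 3 points; the model never sees them while being fitted.
train, held_out = data[:7], data[7:]
slope, intercept = fit_line(train)

train_error = mean_absolute_error(train, slope, intercept)
held_out_error = mean_absolute_error(held_out, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MAE={train_error:.3f} held-out MAE={held_out_error:.3f}")
```

If the error on the held-out points is much larger than the error on the training points, that is a signal to go back to the modeling step before deploying anything.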
guys so the next life cycle concept i want to tell you about is deployment deployment is the last step in the data analytics life cycle because deploying is when you're sending out your model into the world and by the world i mean let's say your team your peers or even your clients and customers as well so you're taking your data from just using it yourself all the way to spreading it to your clients to your
peers or anyone else and you know you can perform more tests at this point as well so after deployment you can have post-deployment tests and if something is wrong again you can go back to modeling perform more evaluation and deploy again as well so the thing you need to know here is that deployment is pretty much the final phase of the data analytics life cycle guys and on that note we can check out very quickly what the roles of analytics are in various industries around us guys again data analytics has an
amazing impact when it comes to the telecom industry because again if you've been observing the prices of let's say calls messages or internet packs these days they have been coming down and down and there was a time when they were exorbitantly high as well so again the telecom industry realized that if you keep the prices very high to make more profits then eventually your customers will not come they will not buy the internet packs so keeping this in mind they probably decided let's say we're having a bad
impact right now so let's drop our prices to see if it works and it is working i guess and this has helped the telecom industry in bringing in better business and for us as users this has made it a bit more economical on our side with respect to money as well and then data analytics has a huge impact in the retail banking industry as well to know what their customers want because again like the telecom industry they have a huge amount
of customers to deal with right so they need to understand the view and the requirements of each customer and then find out if all of the customers are actually aligned with what's being supplied by the bank or if the customers are against something that's being offered by the bank so that again is a very important thing that you need to check here as well and then with respect to the e-commerce industry as well you perform again some analytics in the e-commerce industry it might be ad recommendations or there are
very big sales which are run by some of the big names in the industry such as amazon flipkart myntra and so much more so these guys will have to perform extremely heavy analytics on the data that they see based on the products that they sell based on the city in which the product sells or whether people are unhappy with the price again performing analytics has just changed the e-commerce industry if you may guys and the last and most important industry which data analytics has touched is the healthcare industry analytics
has had the most impact in the healthcare industry with respect to so many things mainly with respect to finding out what medication is required by what countries what amount of medication is working for what population and so much more guys i can probably talk about just the roles of analytics in these industries for days together and we could still be going on and have a very good discussion of how important analytics has become these days just a
quick info guys test your knowledge of data analytics by answering this question which of the following methods creates a new array object that looks at the same data a view b copy c paste d all of the above comment your answer in the comment section below subscribe to intellipaat to know the right answer now let's continue with the session again with respect to insurance as well analytics tells you what geographical location requires insurance who the audience is and so much more so again as i've said we can go on talking about this but then to keep it
to the scope of this tutorial uh we can just quickly brief through each of these guys so with respect to the healthcare industry uh this is the formal way of going about analytics now the first one is to again analyze all of these disease patterns analyze all of the disease outbreak that pretty much goes out we had a disease outbreak a couple of years back which is ebola and so much more we had h1n1 and so much more right to keep a track of all of these outbreaks this basically again improves uh the surveillance with
respect to health and then this gives out good responses to the emergency sectors as well guys and then development of better targeted preventive techniques obviously and then the development of vaccines and making sure your vaccines reach your customers or let's say in this case the patients that is very important guys so identifying certain customers or patients who are at the greatest risk is again important you know because they might be developing some adverse health
outcomes and then developing welfare programs to keep a track of their health to track up their health on a daily basis even a weekly basis monthly basis to perform analytics on all of the aspects and the parameters that you're tracking with respect to the patients again that is a very important thing and lastly to ensure that they can reduce readmissions because they might know what the cause of an adverse effect is and if there are 10 patients with very similar symptoms then you can perform analysis and uh you can at least filter out and find
out that all of these 10 patients might have this one common uh symptom associated with them and this might be the cause of that so mapping that for every patient is again very important yes so coming to the uh telecom industry again telecom industry pretty much goes about using predictive analysis to gain all of the insights that they need to make better decisions to make faster decisions and to make more effective decisions again as i was talking about the internet pack example again this was a very key in that and then by learning more and
more about their customers daily and their preferences and needs these telecom companies can be more successful in this extremely competitive industry as well it's good for them with respect to business and it is good for us as customers by bringing down the prices again so it is used for analytical customer relationship management it is used for fraud reduction it is used for bad debt reduction it is used for price optimization call center optimization and so much more so now that you're looking at data analytics in this way you realize that data analytics eventually has
a big play or has a big say when it comes to any of these business models right so again coming to banking as i've already mentioned analytics is making banks become very smart day by day guys so it is managing the plethora of challenges that the bank faces going all the way from basic reporting to descriptive analytics and this is a must for every single bank right even performing let's say advanced prescriptive analytics and so much more and
banks have started to realize that this will help them generate very good insights and this will result in extremely good business impact for the banks as well on that note we need to check out how data analytics is helping in the banking industry as well right so it is used to acquire and retain customers it is used to detect fraud which is again extremely important with respect to banking it is used to improve risk control find new sources of growth for the bank and to optimize all
of the product and generate their portfolio models as well so now that we've checked out the banking sector let's come to the e-commerce industry right so this is the market which has been exploding for the last couple of years i can say even the decade right ebay came up flipkart came up amazon is again taking over everything myntra nykaa guys there are so many e-commerce portals today and making sure we perform very good analytics in the e-commerce industry is very vital so how is data analytics used again it is used to improve
user experiences it is used to enhance customer engagement customize offers and promotions maintain effective supply chain management optimize pricing models minimize the risk of frauds provide very good advertisements that help customers pick up products good recommendation systems where they'll pick up another product after the first product guys and so much more if you've just bought an iphone you'll be recommended a couple of cases that other people picked up as soon as they bought an iphone so you might be that person you might like the case and you might
pick it up at the end of it you have the case to protect your phone the business just made more money out of it right so again the role of analytics in the e-commerce industry is extremely vital let's say the people in the e-commerce industry have known this for a while guys so coming to analytics in the insurance industry again here as well guys this is basically used to enhance your customer engagement acquire new customers retain the existing customers make sure the customers don't leave prevent the frauds at the end of
it reduce the frauds prioritize all the claims you'll have medical insurance you'll have health insurance you'll have life insurance you have so much more that you need to take and all of these directly impact the user right so making sure you take the feedback from the user work on it and then create some analytics out of it is again very vital guys so on that note we can quickly come to our case study which i was talking about and this case study is a very famous one it's basically the
house prices case study and here all we'll be doing is predicting house prices guys so we'll have a certain set of data which we will use to predict the prices of the houses so basically how can we predict the price of a house there are so many things that you need to know right you'll be looking at the locality in which the house is present you'll be looking at the amenities you'll be looking at the number of bedrooms the living space the number of floors in your house or the number of cars that can
fit in your garage the size of the garage the quality of the construction of the house if it has a swimming pool or not if it has a spa or not i mean so many things if we have to list down everything for this particular use case then it will be extremely tough because each one of us has our own judgment of how we value a house right because a house is something that's very personal to us again the materials that were used to build the house the style in which
the house is built if the house has an elevator how convenient the house is for disabled people guys so much more so basically we'll be performing exploratory analysis on this data set so exploratory data analysis again is used to find hidden trends in your data by performing analysis on it and then at the end of it the trends will be shown as numbers but since we already know visualization we're going to be using these numbers to visualize all of the data for us and we'll
be doing it step by step so we'll be finding correlation between the data as well so again correlation is basically to check how one variable is linked to another and how a change in one variable goes along with a change in the other variable as well so how these two variables move together can be known using correlation guys so a couple of steps that are generally followed in the case of performing exploratory analysis first we'll be visualizing all of our data finding
the missing values and we'll be looking for correlations guys and then after this is done we'll be cleaning the data and checking if the issues are fixed and if the data that we have is being used fully or not and then we go about building a model and visualizing our result it will give us the residual diagnostics roc curves you know charts graphs tables and so much more guys i do not want to basically overwhelm
you with the use case so to keep this use case very simple we'll just perform an exploratory data analysis at this stage to find out correlations between the data and as soon as we go about progressing with respect to our data set you will understand how beautiful data analytics is guys so let me quickly jump into google colab which is basically a jupyter notebook hosted on the google cloud and here we can go about performing our data analytics on the use case that i just walked you through so we're going
to need a couple of files to run our use case we'll actually need one file which is our training data set file let me just quickly add the file to our google colab and then we can go about performing our analytics guys it'll just take a second to upload the file give me a second we just actually need one file from out here but then it doesn't harm to upload it and keep it in your runtime but then just take note of the message right so pretty much all of your
files are recycled as soon as the runtime is changed so the first step for our use case is to load all of the necessary files and the libraries that we require guys so first again we'll be using pandas to handle all of our data we'll be using seaborn and matplotlib to perform plotting operations on all of this data to perform visualizations on all of this data and the style that we'll be using is called the bmh style and with respect to bmh again bmh is nothing but
bayesian methods for hackers and this is just a plot style which gives us graphs which look nicer and then it helps us to perform analysis better guys so the second thing we'll be doing is again loading all of the necessary files one of the important files we need is the training data set which is all this and here you have the id of the houses you have the subclass of the house you have the zone in which the house is present
what is the lot frontage that you have what is the lot area of your house what is the street what is the alley again lot shape lot contour what year the house was built and what year it was remodeled in what is the size of the roof you know again so many conditions out here what is the foundation made of how is the quality of the basement condition of the basement the exposure of the basement again finishing type of the basement guys you know just
look at how expansive this data set is this is again a data set which is extremely popular among us analysts where we like to work on this because it consists of everything and you can perform so much on the single data set guys so on that note let's quickly find out the information of all the variables that are present again as i was just walking through the housing data you have heating how is the quality of the heating does it have centralized air conditioning how is the condition of the
electricals what is the square footage of the first floor square footage of the second floor what is the low quality finish and how many square feet of that do we have what's the living space area how many full bathrooms do we have how many half bathrooms do we have kitchen quality guys this will go on right so all this data is what we need to check out again if you see here all of these are the values that are present which will help us map something but then alley again
doesn't have many values you can check out here itself right so for alley most values are nan nan is basically not a number so there are not many alley details which we can make analysis out of so we will not require alley again coming down not much of fireplace quality as well we do not have pool quality at all miscellaneous features are very few again even fence values are very few so most columns have somewhere around 1400 values right so a very important part of cleaning up our data is
how we go about doing it again we'll just remove all of the columns which have values for less than 30 percent of the rows so less than 30 percent of again 1460 and then every column we keep has enough data to give us some accurate results right so it's already checked alley id id is just an identifier so that's removed not every house is mapped to an alley pool quality is not there fence is not there miscellaneous features are very few so we have dropped all of these columns and we will not be
using these to perform our analytics guys and so again to describe the distribution i hope you guys know this concept of the normal distribution and all of the details that we can get out of it when we perform the math operation guys so basically we can count the total number of data points that are present what is the mean sale price of the data what is the standard deviation let's say the mean of the normal
distribution is right at the center somewhere here guys so the mean is somewhere around 18 000 right i'm sorry this is 180 000 so if you just keep tracing from the center point down this is somewhere where 180 000 exists guys so again what is standard deviation it is basically the deviation from the mean so there are house values which deviate from this mean as well again the 25th percentile the 50th percentile the 75th percentile and then what is the maximum sale price of the house as well so all of these
can be found out from this particular graph guys and if you can already observe even if you're not exposed to a normal distribution this starts out as a very steep curve but then it ends with a long tail of data here as well right it goes on until 800 000 so basically probably from 500 000 or let's say even 400 000 all the way to 800 000 we call these outliers these are called outliers because these are very far away from the rest of our distribution and then these actually might not be
useful for us with respect to our mean or whatever and these will impact a lot when we are performing analytics with respect to the mean or the standard deviation or anything for that matter so we will have to actually remove them and not consider them uh to basically perform very accurate analysis guys so on that note we can pretty much go on to finding uh you know the type of the data set and the type of the data that we'll only consider because in this particular case since you're playing with numbers it has to be
the numeric data type right again as you can check out we have integer numbers we have floating point numbers and so much more so let's go on to print what it looks like after we have dropped the values we're not using we're not using id we're not using alley we're not using so much more right so these are all the numerical values that we'll be considering again year built is a numerical value 2003 is a date i mean a year overall condition overall quality all these can be
rated from a particular scale right so again a square footing is a particular number as well so the second floor of this particular house has 854 square feet so much more so on that particular note as soon as we check out all of these numbers are present we can start performing an analysis guys so before that again we need to just plot all of these to just check uh what it look like on graphs because seeing numbers are one thing seeing graphs on the other hand are something else right so pretty much we'll be uh
developing histograms and we can be checking out this so let me just scroll down a little yeah perfect so with respect to first floor square footing the mean is somewhere around or here right so it's around let's say 500 square foot thousand square foot and again look at a second floor or the square footing look at the bedroom average basement finishing qualities uh garage is the number of cars that you can park in the garage and look at the value here too so the majority of the houses here have two spaces to park your cars
the year that the garage was built in again the ground floor living area how many half bathrooms you can see one half bathroom again at this point of time you have certain values at zero as well well sure we can consider values of zero if it's very important but then since we're talking about sale price what is present is more important than what is
absent right we need to have something descriptive for our analytics methods to work so in that case we'll have something called the golden features list and this variable will basically contain all of the features that are strongly associated with why our sale price is as high as it is guys so basically we're creating a variable where we're finding the correlation with the sale price and then we can already check out the top 10 values which are strongly correlated
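A minimal sketch of the golden features list idea, assuming pandas and numpy. The training file is not reproduced here, so the data frame is synthetic: the column names only imitate the housing data, and the coefficients, the `RandomNoise` column and the 0.5 cut-off are assumptions chosen for illustration, not values taken from the session.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric training data (made-up numbers).
rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n)          # rating from 1 to 10
living_area = rng.normal(1500, 400, size=n)    # square feet
random_noise = rng.normal(0, 1, size=n)        # unrelated to the price
sale_price = 15000 * quality + 120 * living_area + rng.normal(0, 20000, size=n)

df_num = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": living_area,
    "RandomNoise": random_noise,
    "SalePrice": sale_price,
})

# Correlation of every numeric column with SalePrice, without SalePrice itself.
corr = df_num.corr()["SalePrice"].drop("SalePrice")

# Keep features whose absolute correlation exceeds the assumed 0.5 cut-off,
# sorted from the strongest to the weakest.
golden_features_list = corr[corr.abs() > 0.5].sort_values(ascending=False)
print(golden_features_list)
```

On the real data set the same two lines at the end would be applied to the cleaned numeric data frame; a correlation value describes how closely a feature moves with the price, not a share of the price it explains.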
by correlated again let me give you a quick idea basically we have sorted this in descending order as you can already check out so the first thing here is overall quality again to put this into literal terms the overall quality of the house whatever rating was given in our data set matters the most for how the house is priced then the living area of it the number of cars you can park in the garage has a correlation of almost 0.64 with the price
garage area has a correlation of 0.62 the basement square footage has a correlation of 0.61 and then the year it was remodeled has a correlation of about 0.5 with the sale price guys so looking at a couple of linear relationships you can find out that a lot of values are zeros as i just walked you through a couple of seconds ago and look at that with respect to the ground floor living area we have a very small number
of zeros with respect to the sale price again check out the total basement surface area even here as well look at all these tiny dots which are stuck at zero with respect to sale price so these have no impact on our sale price because they're not giving us any valid linear data look at this again at zero it goes all the way up to about six hundred thousand dollars right so all these are not adding any meaning to our data
so if x is equal to zero again this might indicate that the house does not have that feature at all so if this is zero i mean a lot of houses do not have pools here so pool area is zero so in that particular case we need to remove all of these zeros so that we can go on to finding more correlated values that we can actually use in this case again here are all of the correlated values that we found as soon as we run this command basically we
are sorting it again in descending order to find out what helps most and you can see that the overall quality of the house has a correlation of almost 0.8 with the price of the house and people are actually looking at the living area the second floor surface area has a correlation of 0.67 and so much more and check out what the golden features list looks like with respect to all of the strongly correlated values we just found again year remodeled year built
so much more so total surface area again number of full bathrooms the first floor surface area the garage area total basement square footage the number of cars you can park in the garage the square footage of the second floor the living area and the overall quality again in this particular order from the least to the highest is exactly what we're trying to find out with respect to exploratory data analysis guys so just looking at the data set you could never figure out why the list price of a house was so much and once we
break it down into simple terms like this we can find out that the overall quality of the house which the buyer checks to decide whether to consider the house or not has a correlation of 0.8 with the price if the quality is very low he will not pick up the house if the quality is higher then sure he will pick up the house so finding the main reasons why the price is set like that is a very important aspect of why the house price case study is important guys again this has been a very important very
nice data set to work with and there is a lot of analytics that can be done using this particular data set as well guys next up i want to discuss the skills of a data analyst with you guys so you know a data analyst is a person who works with data and creates a lot of beautiful visuals so to do this you need to understand the basics of mathematics because if you have to have knowledge of statistics mathematics is the foundation and after you understand the foundation you
will build a step-by-step career path for yourself where learning becomes the most important thing because again having the knowledge of statistics will help you play with a lot of numbers at the same time and keeping the math part aside for a second you need an understanding of languages such as python and the r programming language as well you know guys at this point of time my goal here is to not overwhelm you if you do not know statistics if you do not know python or r or any of these technologies mentioned on your screen well
fear not make sure to stick with me till the end of the video and i will guide you on a fast track path to become a professional data analyst guys and then the next point we have is data wrangling data wrangling again is to play around with data throw away the data which is not needed keep the important parts that you can work with because if you're working with inefficient data your visualizations will be really bad you know it's like again if i bring back the food example adding too
little salt or adding too much salt is bad as well and to give you more clarity on data wrangling let's say you're preparing noodles if you just put the noodles in fine it's working because you're just making use of the part you eat but what if you throw in the noodles along with the packet right it doesn't make sense to try to boil the packet as well here is where data wrangling makes so much sense in the world of data cleansing and data processing
guys so you need to understand your data before you do this and then coming to a little bit of big data concepts you will require a little bit of knowledge about spark you know and a couple of components and tools one of them is pig and another one is hive as well guys well to break it down a little simpler for you guys let us talk about each skill the first thing i want to talk about is analytical skills because you know data analysts will work with large
amounts of data at every point of time so what is this large amount of data you know it can include a lot of facts it can be a lot of numbers it can be a lot of figures and so much more guys this if it is structured data if it is unstructured data they can work with images they can work with audio video and much much more at the same time so basically what they'll do is you know they need to go through this particular data they need to understand the data and analyze it to
pretty much find some sort of conclusion so that is a very important skill to have and then coming to the communication skills well again as i told you at the start of this presentation itself a data analyst will have to present his or her findings to a non-technical person so having the communication skill where you can convey all of this really well is extremely important because at the end of the day you know you'll be translating your data which could be meaningless to someone because they cannot understand it at this point of time into
understandable documents you know very good looking dashboards or again very nice looking reports at the same time guys because again you know having good communication skills will ensure that you can convert a complex idea into you know something which is easily understood by a lot of people and this brings us to the next skill set which is very important and this is one among the important skills to have which is critical thinking skills because at the end of the day again you'll be looking at thousands of numbers
right so you need to know where to look you need to know what to look what your numbers can do for you where are the trends how can you get to those trends what is your process benchmarking how can you get uh to the goal that you set and so much more guys and why is all of this done well all of this is done to make sure that you can have a conclusion to look up to right so to simplify it again to give you an example uh let's say you go shopping for all
the food that you have to pretty much you know cook at your house so you're gonna need vegetables you're going to need seasonings you're going to need so many things to cook with it you're going to need oil for some dishes so you will plan right so most of us pretty much you know go out once a month or twice a month to bring out all the groceries and you know store it so if that's the case then you need to understand how much of groceries you need every single month how much you're consuming right
so you have to set a goal saying you know what this is the amount of vegetables that i'm going to need for this month and then you're going to plan it and you know pick it up weekly daily or whatever your schedule is so how do you set that duration of when you should go pick your vegetables that is exactly why you need to formulate conclusions to understand how you can go from nowhere to there guys and then coming to the most important point about communication skills it'll again work together with the critical
thinking skill which will basically enhance your output guys and then the most important thing i want you guys to contemplate and think about is that data analysis is an art guys because you'll be working with numbers you'll be working with graphs visualizations and everything can look its best its most beautiful if you know how to create it right so contemplate on that for a second guys again if you have any questions about data analysts guys i want you guys to head to the comment section so pandas is an open source python
library which is used for manipulating one dimensional and two dimensional data and the name pandas is derived from panel data which is a common term for multi-dimensional data sets encountered in statistics and econometrics now let's look at the types of data structures in pandas so we've got one dimensional and multi-dimensional data so if you're working on a one-dimensional data set it is known as a series object and the two-dimensional data object is known as the pandas data frame and for higher than two-dimensional data older versions of pandas had a panel object which has since been removed so
let's properly understand what exactly is a series object well a series object in pandas is a one dimensional labeled array which is capable of holding data of any type like integer string floating point number and so on now let's understand the data frame so a data frame is a two-dimensional labeled data structure with columns which can contain data of different types so here we see that we have a two-dimensional data frame where the first and the third columns are of string type and the second column is numerical in nature so now that we've understood what exactly is
pandas and we've also understood the different types of data structures in pandas let's go to jupyter and start working with pandas so i'll start by importing the pandas library so i'll type in import pandas as pd so this pd which you see is just the alias so i am importing pandas with the alias pd so we have successfully imported pandas now i'll go ahead and create a series object from a list so let's see i will name the list as data and i will give in these values so the values inside
this list are 1 2 3 and 4 now to create a series object from a list all we have to do is use this pd dot series function so i'll use pd.series and inside this i will pass in this list and i will store this in s1 now let me print this out right so we have successfully created our series object from this list so we've got these four numbers and these are the index values so what you see over here is the indexing starts from zero so zero one two and three these
are the index values and these are the actual values now let's see how we can change the index of the series object so i'll just copy this over here and paste it over here again so we have got the index parameter over here and inside this what i'll do is i will give in a different set of values now let's say i want the index values to be a b c and d so i'll just pass in these values for the index parameter over here and i'll store it back to s1 now let me print s1 over here
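The two steps just described can be sketched as follows. The names `data` and `s1` and the values come from the session; everything else is plain pandas.

```python
import pandas as pd

data = [1, 2, 3, 4]

# Series from a list: pandas assigns the default index 0, 1, 2, 3.
s1 = pd.Series(data)
print(s1)

# Same values, with custom labels passed through the index parameter.
s1 = pd.Series(data, index=["a", "b", "c", "d"])
print(s1)
```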
so we have successfully created a series object where the index values are a b c and d now if we want we can actually extract individual elements with these index values so let's say i want to extract the value which is present at the index value c so i'll type in s1 i'll put in square brackets and inside this i'll put in c right so i've successfully extracted this value similarly let's say if i wanted to extract the element which is present at index a so i'll type in a over here right now let's say
if i want to extract the first two elements i'll just put in a colon over here and after it i'll put in two right so i've extracted the first two elements similarly if i want to extract the last two elements i'll put in a colon and over here on the left side of the colon i'll put in minus two so this is how we can extract the last two elements from the series object right so we have created a series object out of a list now let's go ahead
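A short sketch of the label lookups and slices walked through above, assuming the same `s1` with labels a to d. The variable names on the left are hypothetical, added only so the results can be inspected.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

value_c = s1["c"]     # element stored under label c
value_a = s1["a"]     # element stored under label a
first_two = s1[:2]    # first two elements (positional slice)
last_two = s1[-2:]    # last two elements

# s1.iloc[:2] and s1.iloc[-2:] are the explicit positional forms
# and return the same elements.
print(value_c, value_a)
print(first_two)
print(last_two)
```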
and create a series object out of a dictionary so let me create my dictionary over here i'll name this as d1 i'll put in curly braces over here now i'll create my
dictionary so i'll just start off by assigning the key value pairs so a is 1 after that b is 2 after that for c that is 3 and after that for d that is 4 all right so i have successfully created my dictionary over here let me also print this out right so these are the key value pairs now again to create a series object out of this i will type pd dot series and inside this i will pass in the dictionary which is d1 and i will store this in s2 now let me print out
s2 so this is my series object over here right so all of the keys have been assigned as the index and all of the values are the actual values in the series object right so a becomes the index value over here b is the index c is the index d is the index and the value corresponding to each key becomes the value in the series object as well now let's say i want to resequence the index values over here let's see how we can do it so i will copy this and i will paste
it over here now again i will use the index parameter and inside this let's say instead of abcd i want the sequence to be dcba so that is what i'll give over here d c b and a now let me run this and let me just print out s2 over here right so we have reversed the sequence of these indices so initially it was a b c and d now i have reversed it and now it is d c b a right so this is how we can create a series object out of a
dictionary and also change the sequence of these indices all right so we have worked with series now let's go ahead and see how to create a data frame out of a list so i'll just write a comment over here creating data frame from a list right so let me again go ahead and create my list data equals one two three and four right so i have created my list now i have to go ahead and create the data frame and to create the data frame this would be the syntax so i'll type in pd dot
data frame so over here you have to keep in mind that d is capital and f is capital right and inside this i will just pass in this list and i will store this in df now let me print this out right so we have successfully created a data frame out of this list now let's also go ahead and create a data frame out of a dictionary so i will type in fruit over here so the name of the dictionary is fruit and over here i will write the key to be equal to fruits and
the values for this key are so we've got all of these fruits we've got apple we've got mango after that we've got banana and finally we've got guava so this was the first key value pair after that i'll give in the second key value pair which would be the count of the fruits so i'll type in count and over here i will just fill in the count so let's say there are 10 apples 20 mangoes 40 bananas and 30 guavas right so we have created a dictionary now let me print this out now let me also
go ahead and create a data frame out of this so you already know we'd have to type in pd dot data frame and inside this i will pass in fruit and i will store this in let's say fruit underscore df now let me print this out fruit underscore df right so what happened over here is these two keys have turned into the column names so fruits has become the column name over here count has become the column name over here and these values come into the rows right so fruits we have over here and apple
mango banana and guava are the row values again count becomes the column name and these values over here come over here right so this is how we can create a data frame out of a dictionary right so now that we've understood the basics of series and data frame let's go ahead and see how to import a data frame and do some sort of data manipulation on top of it so i have this customer churn data frame with me let me go ahead and import it so to import a data frame i'll type pd dot read
csv now inside this i will just give in the name of the file so the name of the file is customerchurn.csv and i will store this in a new object and name that object customer churn now let me go ahead and print the first five rows of it it will be customer churn dot head right so this is our customer churn data frame and it comprises all of these columns so we've got customer id gender senior citizen partner and so on right so head is a function in pandas which gives you the
first five rows right so one two three four five now similarly let's say you want to have a glance at the first 10 rows you just give in the value 10 over here and you can glance at the first 10 rows of this customer churn data frame so an analogous function to head is the tail function which would give you the last few rows so i'll type in tail over here now i'll click on run so this over here gives you the last five rows from this customer churn data frame similarly if you want to have
a glance at the last 10 rows you will just put in the value 10 right so these are the last 10 rows present in the customer churn data frame all right now let's see how we can extract specific rows and columns from a pandas data frame so for this we've got the loc and iloc functions so let's actually start working with the iloc function so let's say i want to extract only the rows from row number five to row number 15 and only the columns from column number two to column number four let's see
how we can do it right so i'll start off by giving the name of the data frame which is customer churn after that i will use the iloc function and i'll put in a comma over here right so whatever is present on the left side of the comma denotes the rows and whatever is present on the right side of the comma denotes the columns so let's say i want to extract all of the rows from 5 to 15 so i'll put in 5 i'll put in a colon
and i'll type in 15 right so i'll be extracting all of the rows starting from row number 5 up to but not including row number 15 similarly if i want all of the columns starting from column number two up to but not including column number five this is how it'll go right so let's actually have a glance at this so this two to five over here so since we already know that in python indexing starts from zero so zero 1 2 so this would be our column number 2 which is senior citizen so 2 3 and 4 senior citizen partner and dependents right
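The head, tail and iloc calls walked through above can be sketched on a small stand-in data frame. This is only an illustration: the 20 rows are made up, and the column names are assumed spellings of the columns mentioned in the session, not the real contents of the customer churn file.

```python
import pandas as pd

# Hypothetical stand-in for the customer churn data: 20 made-up rows.
# Column names are assumed spellings of the columns mentioned in the session.
n = 20
customer_churn = pd.DataFrame({
    "customerID": [f"C{i:03d}" for i in range(n)],
    "gender": ["Female" if i % 2 == 0 else "Male" for i in range(n)],
    "SeniorCitizen": [1 if i % 3 == 0 else 0 for i in range(n)],
    "Partner": ["Yes" if i % 4 == 0 else "No" for i in range(n)],
    "Dependents": ["No"] * n,
    "tenure": list(range(n)),
})

# head() and tail() default to 5 rows; pass a number to change that.
first_five = customer_churn.head()
last_ten = customer_churn.tail(10)

# Rows 5..14 and columns 2..4: the stop value of each slice is excluded.
subset = customer_churn.iloc[5:15, 2:5]

print(first_five.shape, last_ten.shape, subset.shape)
print(list(subset.columns))
```

Because iloc works purely by position, the same slice would select different columns if the column order in the file were different.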
so we have extracted column number 2 column number 3 and column number 4 and we have extracted the rows starting from row number five to row number fourteen right so this is how we can extract specific rows and columns from a data frame now let's go ahead and see how we can perform some sort of data manipulation so let's say from this entire data frame i want only those records where the gender of the customer is female right so for this what i'll do is i'll just start off by typing the name of the data
frame which is customer churn and then inside the square brackets i will give in the name of the column which is gender after that i'll use the double equal to operator and then give in the condition which is the gender of the customer needs to be equal to female right now i'll click on run and let's see what we get so you get a bunch of true and false labels now this bunch of true and false labels basically means that over here at record number 0 it is actually true that is the gender of the customer is
female so false over here indicates that the gender of the customer is not female here again it is false here again it is false and again at record number four it is true that is the gender of the customer is female now what i'll do is i will cut this and i will paste it back inside this right so what is happening over here is from this customer churn data frame i will extract only those records where this condition is true right and i will store this in let's say female underscore customer all right i'll click on
run now let me print out the head of this female underscore customer dot head right so we have successfully extracted a subset from the original data frame where the gender of the customer is only female now similarly let's say you want to do a more complicated operation than this so let's say we want to extract only those records where the tenure of the customer is greater than 50 and the internet service of the customer is equal to dsl all right so now let's start off with the first condition so i'll give in the name
of the data frame which is customer churn and inside the square brackets i will give in the name of the column which is tenure so the tenure of the customer needs to be greater than 50 all right so i'll cut this and put this inside parentheses i'll put the and operator and give in the second condition so the second condition is the internet service of the customer needs to be equal to dsl so i'll type in internet service and this needs to be equal to dsl right again i'll put this condition inside parentheses over here all right
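A minimal sketch of these filtering steps, assuming made-up rows and assumed column names and value spellings (the real file may spell them differently), could look like this. The `is_female` and `mask` names are hypothetical helpers introduced only for readability.

```python
import pandas as pd

# Made-up rows; column names and values are assumed for illustration.
customer_churn = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Male", "Female"],
    "tenure": [62, 58, 10, 72, 45],
    "InternetService": ["DSL", "Fiber optic", "DSL", "DSL", "DSL"],
})

# A comparison on a column returns a Series of True/False values.
is_female = customer_churn["gender"] == "Female"

# Indexing the data frame with that Series keeps only the True rows.
female_customer = customer_churn[is_female]

# Each condition needs its own parentheses because & binds tighter than > and ==.
mask = (customer_churn["tenure"] > 50) & (customer_churn["InternetService"] == "DSL")
c_tenure_internet = customer_churn[mask]

print(female_customer)
print(c_tenure_internet)
```

Note that pandas combines element-wise conditions with `&` rather than the Python keyword `and`, which is why the parentheses around each condition matter.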
so we've got these two conditions and finally i'll put those two conditions inside the data frame so what is happening is from this entire customer churn data frame we'll be extracting only those records where these two conditions are satisfied right and i will store this in let's say c underscore tenure underscore internet now let me print out the head of this so it'll be c underscore tenure underscore internet dot head all right so let me have a glance at the tenure so over here if you see the tenure of the customer is greater than 50
for all of these values so 62 58 72 70 and so on now similarly if i have a glance at the internet service column then you'll see that all of these values are dsl right so this is how we can perform data manipulation operations on top of the pandas data frame all right so we are done with the practical so exploratory data analysis or eda for short is the process of performing initial investigations on data so as to discover patterns spot anomalies and check assumptions with the help of summary statistics and graphical representations basically when
we have some data on which we want to perform data science statistical analysis and machine learning modeling we first need to make sure that we understand what the data represents the shape of the data what are the different kinds of things that are available in our data and the different data types available in our data if we could visualize our data and understand relationships between individual columns or individual features in our data that's really good and all of that and more such as visualization is done in the step called exploratory data analysis that's what eda is
all about now eda allows us to get a better understanding of our data and make important observations on it in order to understand this let's take an example if you have a data set which you have no understanding of you don't understand what data it contains what the columns represent how the columns relate to each other and which columns are important for the particular task that you are trying to solve in cases like this it is very difficult to understand how you are going to solve the problem how you are
going to build the model how you are going to perform statistical analysis so on and so forth so in these situations what happens is we would like to perform exploratory data analysis in order to get a better understanding of our data this also helps us understand whether or not the data that we have needs some sort of boost so for instance if we have some data set that's highly biased towards one result for example we are trying to predict whether or not a particular internship is going to convert into a job let's say
we have a data set and we want to figure out whether or not we can create a predictive model in which we can feed in some data about an intern and figure out whether or not we should give them a job although it's a very specific use case if all we have is data where people did not get converted into full-time employees our data set is completely biased and this is something that we could miss if we don't take time to analyze our data using exploratory data analysis so if that is the problem then we
would have to get more data so that we can balance out both the outcomes and then teach our model about these things so it has a better understanding of what features lead to what results so now we come to why eda why would you want to perform exploratory data analysis we've understood what it is but why exactly would you want to do it well exploratory data analysis is one of the most crucial steps in data science it allows us to achieve certain insights and statistical measures that are essential for the data scientist a good understanding of the exploratory
data analysis process allows us to make some important observations and early decisions that could help us avoid taking steps that are not needed so as we discussed in the previous example if a data set is biased then performing modeling and using multiple algorithms and trying to figure out the accuracy of each model is not going to be a good idea mainly because the data set we have is not good if you don't have a good data set