The rate remains static, whatever the external rate happens to be. What I'm trying to show here is that there is a redundant column in there, and this redundant column should be removed from your data set. That will help you get a better model. Why am I saying you will get a better model when you remove redundant columns from your data? How is that going to impact it? Why would you get a better model? The variance error would be less. Yes, your variance goes down; that's a decent answer, yeah.

So what happens if you have correlation? Why would it overfit anyway, Avinash? Okay, think about linear regression, for example. What does a linear regression equation look like? I'll give you a simple equation, something like 3x + 4y + 5z. If this is the equation, what is x here? Feature number one. What is y here? Feature number two. And there is a constant at the end, which is okay. But the problem is that all the features will have some weight. If you have redundant features, that redundant feature will also have some weight, which means you will get a line that is not the best fit. So by removing that feature you are actually fitting a better line to your data set. Therefore a proper representation of features is required.
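To make the point about redundant features and weights concrete, here is a minimal sketch. It assumes a hypothetical data set where one column is an amount in USD and a second column is the same amount in INR at a fixed rate (the rate of 80 and all the numbers are made up for illustration). It shows that two very different sets of weights give exactly the same predictions, so the weight a model assigns to the redundant column is arbitrary.

```python
# Hypothetical fixed conversion rate; the INR column is fully determined by USD.
RATE = 80.0

usd = [1.0, 2.0, 3.0, 4.0]
inr = [RATE * value for value in usd]  # redundant column


def predict(w_usd, w_inr, bias):
    """Linear model: w_usd * usd + w_inr * inr + bias, for every row."""
    return [w_usd * u + w_inr * i + bias for u, i in zip(usd, inr)]


# Model A puts all the weight on the USD column.
pred_a = predict(w_usd=3.0, w_inr=0.0, bias=5.0)

# Model B uses very different weights, yet -77 + 1 * 80 = 3,
# so it describes exactly the same line.
pred_b = predict(w_usd=-77.0, w_inr=1.0, bias=5.0)

print(pred_a)
print(pred_b)
print(pred_a == pred_b)
```

Because many weight combinations fit equally well, the fitted weights can swing a lot from one sample to the next, which is one way to see the extra variance mentioned earlier. Dropping either column removes that ambiguity.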

Yeah, that's all I am saying as well, Avinash. But why you should not include such features is what I'm explaining. And you don't have to identify them manually. There is something called a correlation graph, available in both R and Python, and you can use it to remove all the features that are highly correlated. It doesn't have to be the same kind of information, either. In the earlier example I mentioned USD and INR, where the same information is represented across two columns. But there could also be a situation where two columns are simply highly correlated, and those columns you can remove as well. An example could be the number of years since a person completed his education and his number of years of experience. They are not the same thing. The number of years since his last education was completed is essentially counted from the day he completed his education until today.

How many years have passed? Okay, in most cases... Yeah, good, that's correct: multicollinearity is what we are talking about. So that is one thing that you will have to test. I am giving an example of it. Say Pitch finished his education in 2013. From 2013 to 2019 it's about six years, but for two of those years he was just searching for a job after his education, so his experience is only about four years. But if you look at both columns, most people nowadays get selected on campus, which essentially means that the number of years since last education and the number of years of experience are going to be the same.
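The check described here can be sketched in a few lines. This is only an illustration, not a standard library routine: the column names, the sample values, and the 0.9 threshold are all assumptions, and `drop_correlated` is a hypothetical helper that keeps a column only if it is not strongly correlated with a column that has already been kept.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already-kept column is below threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up sample: the first person spent two years job hunting,
# everyone else was hired straight from campus.
data = {
    "years_since_education": [6, 3, 10, 1, 8],
    "years_experience": [4, 3, 10, 1, 8],
    "num_certifications": [2, 0, 1, 3, 1],
}

r = pearson(data["years_since_education"], data["years_experience"])
print(round(r, 2))

kept = drop_correlated(data)
print(list(kept))
```

Which of the two correlated columns to keep is a judgment call; this sketch simply keeps whichever comes first.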

If you actually compute the correlation between these two columns, it could be almost perfect. So there is no point in having both of these columns in your data set; you may have to remove one of them. How do you identify that? Again, by testing for multicollinearity. There are commands for it. Go ahead and search Google. Google is your friend: find it on Stack Overflow, copy-paste the code, and do it. All right, so this is again whatever I have already spoken about: informative features, discriminative features, non-discriminative features, chi-square, et cetera. This PowerPoint will be shared with you after this talk, so you can have a look at it.

So, feature construction. Are we also talking about autocorrelation? Can you expand on autocorrelation, Rajah? I am not sure about the concept; that's too technical for me. I don't even know what the test is, I have to be honest with you. I'm really sorry, yeah.

You're taking my exam! Really sorry, Watched. Yeah, so I have not used that. And another thing... yes, yes, that's correct. Measuring the correlation between the target and the features is what you do when you use the chi-square attribute evaluation filter or the gain attribute evaluation filter. These are different ways of measuring how a feature is going to impact your target. We have already spoken about that. Yes, Watched, you should get in touch with funds; it looks like he has some knowledge of autocorrelation, so he can definitely help with that. I'll also brush up my knowledge on it once I go back home. All right.
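As a rough illustration of what a chi-square style filter measures, the sketch below computes the chi-square statistic between one categorical feature and the target from scratch. The feature names and the tiny samples are made up, and this computes only the statistic, not a p-value or a ranking as a full attribute evaluation filter would.

```python
from collections import Counter


def chi_square(feature, target):
    """Chi-square statistic of the feature/target contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # count per (feature, target) pair
    feature_totals = Counter(feature)
    target_totals = Counter(target)

    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Count expected in this cell if feature and target were independent.
            expected = f_count * t_count / n
            statistic += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return statistic


high_salary = [1, 1, 0, 0]

# Hypothetical feature that lines up perfectly with the target.
campus_hire = ["yes", "yes", "no", "no"]

# Hypothetical feature that carries no information about the target.
shirt_colour = ["red", "blue", "red", "blue"]

print(chi_square(campus_hire, high_salary))
print(chi_square(shirt_colour, high_salary))
```

A larger statistic means the feature and the target are further from independent, so the feature is more likely to be useful.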

So, dimensionality reduction: we've already spoken about that, highly correlated features, PCA, autoencoders, and all of it. Next, feature extraction. This is one other feature transformation technique. Say there are two dates. Instead of representing the two dates as two columns in your data set, what you can do is subtract one date from the other. The example I gave you was years. If I have the current date as one column and the last education date as another, then instead of representing them as two columns, I can subtract the last education date from the current date and represent the result as one column. That is feature extraction.
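The date subtraction just described might look like the following sketch. The record layout, the field names, and the dates are assumptions for illustration; `years_between` is a hypothetical helper that counts completed years.

```python
from datetime import date


def years_between(start, end):
    """Number of full years elapsed from start to end."""
    years = end.year - start.year
    # Subtract one if the anniversary has not been reached yet in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


# Made-up records with two date columns each.
records = [
    {"last_education_date": date(2013, 6, 30), "current_date": date(2019, 6, 30)},
    {"last_education_date": date(2016, 9, 1), "current_date": date(2019, 6, 30)},
]

# Replace the two date columns with a single derived column.
features = [
    {
        "years_since_education": years_between(
            row["last_education_date"], row["current_date"]
        )
    }
    for row in records
]

print(features)
```

The two original columns carry little signal on their own for most models, while the single derived column expresses the quantity the problem actually cares about.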

There is absolutely no limit to the kind of features you can create or to the number of features. You can draw trend lines, and you can source data from the outside world, something like stock prices or the market demand for a particular technology. All of this will impact your model, depending on the problem you are currently trying to solve. Say, for example, you are trying to estimate how much money a person will earn. Then there are several things you can source from the outside world, like the number of job openings for that particular technology in the current week and in the last week, and whether the number of job openings for that technology is increasing or decreasing.

How many people are available in that particular technology? Is he certified by a reputed institute? How is the graph for that institution trending, upwards or downwards? How is the sentiment about that institution? What is the reputation of the previous organization he worked at? That data you can source from Glassdoor. How many people have given recommendations for that particular individual on LinkedIn? That is some data that you can source from LinkedIn.

So this is where your domain expertise comes into the picture. And, unfortunately, it is not science; it's more of an art. You need to sit down and think about all those possibilities, and then you need to correlate them. Most of the time, when you're fresh in an organization, you might not have very deep knowledge of the domain. Over a period of time you will develop it, but until then you might have to be friends with the so-called domain experts or business analysts, to tap their knowledge and make the best features possible. Right, now the data is done. Now what? Data modeling. We have already spoken about the data modeling problems that you would face. There are some problems in data modeling, so we will have to go ahead and solve them.
