This is one of the biggest issues facing amateur data scientists, and most of the time they are unaware of it. Since starting your career as a data scientist, you have grasped all the fancy algorithms of machine learning and deep learning. One thing to keep in mind: data plays a crucial role, and ignoring it is just "garbage in, garbage out". Also, once you have analyzed the data, run the algorithm, and obtained fair metrics, you may think the work is done. That is where you ignore all the risks that may surface later. Your eagerness to complete the task can lead you into risks such as data leakage, overfitting, and other biases.
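To make data leakage concrete, here is a minimal sketch (standard library only, with invented toy numbers) of the classic mistake: computing a preprocessing statistic on the full dataset before splitting, so information from the test set leaks into training.

```python
import random
import statistics

# Hypothetical toy data: 100 samples of a single numeric feature.
random.seed(0)
values = [random.gauss(50, 10) for _ in range(100)]

# Hold out the last 20 samples as a test set.
train, test = values[:80], values[80:]

# Leaky: the centering statistic is computed on ALL data,
# so the test set has influenced preprocessing.
leaky_mean = statistics.mean(values)

# Correct: compute the statistic on the training split only,
# then reuse it unchanged for the test split.
train_mean = statistics.mean(train)

scaled_test_leaky = [v - leaky_mean for v in test]
scaled_test_ok = [v - train_mean for v in test]

print("leaky mean:", leaky_mean)
print("train-only mean:", train_mean)
```

The two means differ, so the two scaled test sets differ too; with real pipelines the same pattern (fit preprocessing, then split) quietly inflates your metrics.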
When building a model, most of the work happens at the data and feature level. It is better to concentrate a little more on the data than on the algorithm, because the data and its features will ultimately shape your model. The quality of your final model depends far more on the data than on the algorithm.
Many budding data scientists rely on theoretical knowledge of algorithms while trying to build a model. Knowing all the complex algorithms will not, by itself, help you build well-performing models. If you want to build a good model, it is crucial to know and understand the data you will be using, the purpose behind your model, and the basics of the domain.
Pushing data into a model without exploring it first will introduce biases into your results. Hence, you need to do some exploratory data analysis, which will help you form hypotheses about the model and stay informed about what you are doing.
Most of the time, budding data scientists prefer building a model to visualizing and exploring the data. They miss the fact that spending more time understanding the data gives you deeper insight into what your model's outcome will be. Rushing to finish the model and complete the task while skipping exploration and visualization can cause serious damage to the model.
Exploring and visualizing data are the most basic and important tools of a data scientist. Understanding your dataset is the foremost task an aspiring data scientist should do, as it will later reflect in your model.
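A first pass at exploratory analysis can be as simple as summarizing each column before any modeling. The sketch below uses only the standard library, and the records and column names are invented for illustration:

```python
import statistics
from collections import Counter

# Hypothetical records, as you might load them from a CSV.
rows = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": 29, "income": 48000, "city": "Delhi"},
    {"age": 41, "income": 91000, "city": "Pune"},
    {"age": 35, "income": 61000, "city": "Mumbai"},
]

# Numeric summaries: spot ranges, skew, and suspicious values early.
ages = [r["age"] for r in rows]
print("age  mean=%.2f  min=%d  max=%d" % (statistics.mean(ages), min(ages), max(ages)))

incomes = [r["income"] for r in rows]
print("income  mean=%.0f  stdev=%.0f" % (statistics.mean(incomes), statistics.stdev(incomes)))

# Categorical summary: class balance and unexpected category values.
print("city counts:", Counter(r["city"] for r in rows))
```

Even this crude summary surfaces the questions a model will not answer for you: is income skewed, are some cities over-represented, are any values impossible?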
Amateur data scientists usually communicate too little: they hesitate to ask questions about their difficulties or about issues they are unsure of. They often shy away from putting their views forward for fear of being criticized, forgetting that without drawing feedback one cannot improve much further.
To be a keen data scientist, you have to be a good communicator; it really helps a lot. Always keep in mind that data scientists are meant to solve other people's problems, and without communicating, whether inside the organization or with outside business clients, the task becomes difficult, if not unsolvable.
Many beginners fall into the trap of spending too much time on theory, whether math related (linear algebra, statistics, etc.) or machine learning related (algorithms, derivations, etc.). It is good to grasp the theory behind machine learning techniques, but if you don't apply them, they remain only theoretical concepts. When people start learning, they study lots of books and go through a number of online courses, but they rarely get the chance to apply that theory to practical problems.
It is not a new idea that to better understand what you are learning, there should be a proper balance between theory and practice. Google is the best place to find datasets for practice.
Diving straight into the deep areas of data science is a common mistake among aspiring data scientists; it leaves gaps in the basics and ultimately causes problems when solving practical tasks. If you do code an algorithm from scratch, do so with the intention of learning rather than perfecting your implementation. At the start, you really don't need to code every algorithm from scratch.
You need to be clear on four basic concepts before diving deep: linear algebra, statistics, probability, and calculus. Data science is the sum of these individual parts. Until you have a clear picture of these four concepts, don't even think of diving into the core of data science.
As they say, "Rome wasn't built in a day." The same goes for data science. I understand that you want to build the technology of the future: self-driving cars, robots, and what not! Things like these require techniques such as deep learning and natural language processing. Before getting into such advanced topics, you must first master the fundamentals of machine learning.
First, master the techniques and algorithms of "classical" machine learning, which serve as building blocks for advanced topics. People commonly practice two or three problems and, after solving them, begin to think they have mastered the concepts, but this isn't true. The more you practice, the more skilled you become.
How a predictive model arrives at its prediction is a commonly overlooked part of the data science workflow. Accuracy isn't everything. A model that predicts with 95% accuracy is obviously good, but if you can't explain to another person how the model got there, which features led it there, and what your thinking was when building it, your client will reject it.
The best way to prevent yourself from making this mistake is to speak to people working in the industry; there is no better teacher than experience. Practice building simpler models and explaining them to non-technical people, then slowly add complexity, and keep doing this until you can no longer explain what is going on beneath your model. This will teach you when to stop, and why simple models are usually preferred in real-life applications.
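As one illustration of an explanation a non-technical client can follow, here is a minimal sketch of a linear model's prediction broken down feature by feature. The weights, bias, and feature names are invented for this example; in practice they would come from a fitted model:

```python
# Hypothetical weights from a linear model trained elsewhere
# (e.g. predicting a customer-retention score).
weights = {"tenure_months": 0.8, "support_tickets": -1.2, "monthly_spend": 0.5}
bias = 0.1

# One customer's (already scaled) feature values.
customer = {"tenure_months": 2.0, "support_tickets": 3.0, "monthly_spend": 1.5}

# Each feature's contribution is just weight * value, so the
# prediction decomposes into pieces a stakeholder can read.
contributions = {f: weights[f] * customer[f] for f in weights}
score = bias + sum(contributions.values())

# Report contributions from most to least influential.
for feature, value in sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{feature}: {value:+.2f}")
print(f"total score: {score:+.2f}")
```

This additive story ("support tickets pulled the score down the most") is exactly what deep models lack out of the box, and it is why simple models often win in client-facing work.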
Ever since data science became popular, certifications and degrees have cropped up just about everywhere. A strong degree in a related field can definitely boost your chances, but it is neither sufficient nor usually the most important factor. Getting a degree or certification is not easy, but one should not rely on it alone. In most cases, what's taught in an academic setting is simply too different from the machine learning applied in businesses. In a real working environment you have to handle deadlines, technical roadblocks, clients, and more; these are just some of the things you will need to overcome to become a good data scientist. A certification or degree alone will not qualify you for that.
Certifications are valuable, but only when you apply that knowledge outside the classroom and put it out in the open. Take relevant internships, even part-time ones. Reach out to local data scientists on LinkedIn for coffee chats. Always be open to learning. Go out into the real world and try to learn how the industry works.
Having knowledge of tools and libraries is a very good thing but combining that knowledge with the business problem posed by the domain is where a true data scientist steps in.
When you are applying for a data scientist role in a particular industry, read up on how companies in that domain are using data science. Search for datasets and problems related to that industry and try solving them; this will give you a massive boost.