Data analysis and machine learning – Udacity thoughts

Lesson 3 is about data analysis.  Once you have collected and prepared your data, it’s time to draw conclusions.  How do you use datasets?  How do you predict the future?  This is what I hoped to learn.

Tests

I got to meet Kurt.  He works at Twitter and talked about what mattered when he was analyzing data from the healthcare domain.  He said that it’s important to

Make reasonable inferences from data – Kurt

How do you judge whether you can reason from the data?  Run a statistical test.  If the data is normally distributed (Gaussian), you can run a t-test.  Otherwise the Mann-Whitney test has to be used.  Running a t-test in Python is easy with scipy’s ttest_ind.
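The choice between the two tests can be sketched like this.  It is only a minimal sketch: the helper name compare_samples, the 0.05 threshold, and the use of the Shapiro-Wilk test to check for normality are my assumptions, not something taken from the lesson.  I also pass equal_var=False (Welch's variant of the t-test) so the two samples do not need equal variances.

```python
import numpy as np
from scipy import stats


def compare_samples(a, b, alpha=0.05):
    """Pick a two-sample test based on a normality check (hypothetical helper)."""
    # Shapiro-Wilk: a small p-value suggests the sample is NOT normally distributed.
    looks_normal = all(stats.shapiro(x)[1] > alpha for x in (a, b))
    if looks_normal:
        # Welch's t-test (equal_var=False is an assumption of this sketch).
        statistic, pvalue = stats.ttest_ind(a, b, equal_var=False)
        name = "welch_t"
    else:
        # Non-parametric alternative when normality is doubtful.
        statistic, pvalue = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "mann_whitney_u"
    return name, statistic, pvalue


# Made-up data: two samples whose means are clearly different.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=5.0, scale=1.0, size=200)
test_name, statistic, pvalue = compare_samples(sample_a, sample_b)
print(test_name, statistic, pvalue)
```

A small p-value means the difference between the samples is unlikely to be pure chance.  It does not say how large or how important that difference is.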

Machine learning

Machine learning is the next step.  What does it mean?  Some people cannot see the difference between machine learning and statistics.  Machine learning is more about the future and about building models that make predictions.  There are two types of machine learning: supervised and unsupervised learning.

The best way to learn is by example, so here is a use case for each type.  Supervised learning is built into the mail services you use every day.  If some e-mails were marked as spam, similar ones will be marked as spam too.  This kind of machine learning requires some work up front: items have to be labeled before the model can predict anything.  Unsupervised learning does not.  Take clustering by features, for instance.  Suppose you have pictures, and there is a nice algorithm that can read features out of them.  There are clustering mechanisms that will group similar items together without any input from a human teacher.
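To make the difference concrete, here is a minimal sketch in plain numpy.  Everything in it is my own illustration: the tiny two-feature dataset, the "ham"/"spam" labels, and the function names are made up, and I picked a nearest-centroid classifier and a basic k-means loop only because they are short.  The supervised part needs the labels, the unsupervised part never sees them.

```python
import numpy as np


def fit_centroids(X, y):
    """Supervised: average the labeled examples of each class."""
    labels = np.unique(y)
    centroids = np.array([X[y == label].mean(axis=0) for label in labels])
    return labels, centroids


def predict(X, labels, centroids):
    """Give each row the label of the nearest class centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]


def kmeans(X, k, n_iter=20, seed=0):
    """Unsupervised: group rows into k clusters without using any labels."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random (fancy indexing copies them).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return assign, centroids


# Made-up feature vectors: two well separated groups.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
y = np.array(["ham", "ham", "ham", "spam", "spam", "spam"])

# Supervised: learn from labeled items, then predict labels for new ones.
labels, class_centroids = fit_centroids(X, y)
predicted = predict(np.array([[0.1, 0.1], [5.0, 5.0]]), labels, class_centroids)

# Unsupervised: only X is given, the algorithm finds the groups itself.
assign, cluster_centroids = kmeans(X, k=2)
print(predicted, assign)
```

Note that the cluster numbers returned by kmeans are arbitrary.  The algorithm can tell that there are two groups, but it has no idea which one is "spam".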

Do’s and don’ts

If you are thinking of working with data in the near future, there were some nice hints in this Udacity lesson.  When working with a dataset, remember to reduce dimensionality to the point where you are able to get some insights.  Too many dimensions can make your work really hard.  Moreover, when doing machine learning, choose the features that really matter the most.  These can sometimes be different from the features with the biggest weight.
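One common way to reduce dimensionality is principal component analysis (PCA).  The lesson hint does not name a specific technique, so PCA is my own pick here, and the function name pca_reduce and the toy data are made up for illustration.  A minimal sketch using numpy’s SVD:

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its first n_components principal directions (sketch)."""
    # Center each feature so the directions describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Share of the total variance captured by each direction.
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]


# Made-up data: three features that all move together along one line.
X = np.outer(np.arange(10.0), [1.0, 2.0, 3.0])
reduced, explained = pca_reduce(X, n_components=1)
print(reduced.shape, explained)
```

If the first few components already explain most of the variance, the remaining dimensions add little and can be dropped before plotting or modeling.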

So you know you want to do this?  Be a data scientist?  This domain is very broad, so you may want to choose a specialization.  There are many roles in a data science team to think of.  If you are a coder, you may focus on the programming side of data science.  There is a lot of code to be written, and this code has to be efficient and of good quality.  If you are more of a statistician, you are going to help your colleagues with math equations and statistical tests (this was kind of hard for me, to be honest, when working on the gradient descent tasks).  Maybe you are a planner or strategist.  Then you will have a lot of work with knowledge abstraction and communication between domain experts and developers.

All of my progress is visible, as always, in the GitHub repo here.