Another chapter of the Udacity course is data visualization. What I have learned so far was very important: how to acquire, munge and analyze data. But all of that would be worth nothing if the information were not communicated to others. The easiest way to do that is data visualization. I learned about Napoleon's march to Russia and why you have to think twice before you post a hue-colored diagram.
What is an effective visualization? It is effective when you remember these 3 rules:
A picture is worth more than a thousand words. So you are better off looking at Minard's diagram of Napoleon's army than trying to imagine what it looks like.
It is a great example of clarity, precision and efficiency. The line width clearly shows the number of soldiers in the army. It is drawn on a map, so you can easily see precisely where everything was happening. And you can draw conclusions about the hardest moments of that journey, so the image communicates its message effectively.
To be effective, a visualization has to explain the main idea clearly and lead to conclusions. This was confirmed by the guests of this lesson, Don and Risha from AT&T.
Humans are wired to receive things in story form. So if you can craft a narrative behind what you are doing, it's more compelling – Don
There was also a little biology and anatomy in that lesson. Visualization is all about human perception. Data can be encoded visually using 3 main techniques: position, length and angle. We can also use colours and sizes, and mix the techniques together. I was taught about scientific research done in 1985 which showed that position encoding is the most accurate and hue encoding the least accurate. That doesn't mean we should use a scatter plot every time; rather, it reminds you to keep human perception in mind.
Do it in Python
Doing visualization in Python is easy (like everything that can be done in Python). We can use the very popular matplotlib library. pandas DataFrames also have a very convenient way of plotting themselves. But consider ggplot. Why? It is built according to the rules of a grammar of graphics. And drawing a plot is as easy as shown below:
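import pandas
from ggplot import *  # the ggplot package for Python, which provides ggplot, aes and geom_line
df = pandas.read_csv('hr_by_team_year_sf_la.csv')
print(ggplot(df, aes(x='yearID', y='HR', color='teamID')) + geom_line())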
So it's just a matter of creating a ggplot object and adding functions to it, which are interpreted as data series, titles, axes and so on.
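For comparison, here is a minimal sketch of the same kind of chart drawn with the plotting built into pandas DataFrames, which was mentioned above. I don't reproduce the course CSV here, so the small DataFrame, its numbers, the team codes and the output file name are all made up for illustration. It assumes pandas and matplotlib are installed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import pandas

# Hypothetical stand-in for hr_by_team_year_sf_la.csv; all values are made up.
df = pandas.DataFrame({
    "yearID": [2000, 2001, 2002, 2000, 2001, 2002],
    "teamID": ["LA", "LA", "LA", "SF", "SF", "SF"],
    "HR": [150, 160, 155, 170, 165, 180],
})

# One column per team and one row per year, so each team becomes its own line.
wide = df.pivot(index="yearID", columns="teamID", values="HR")

# DataFrame.plot() draws one line per column and returns a matplotlib Axes.
ax = wide.plot(title="Home runs per year")
ax.set_ylabel("HR")
ax.figure.savefig("hr_by_team.png")  # hypothetical output file name
```

The difference from ggplot is that the data has to be reshaped first, while in ggplot the color='teamID' mapping does the grouping for you.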
Dos and Don'ts
A big no-no is misleading the reader of your chart and leading them to improper conclusions. This was nicely shown with a bar chart whose Y-axis had been customized to suggest that the increase was really big.
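To see how much a customized Y-axis can exaggerate a change, here is a small arithmetic sketch. It assumes the customization is a Y-axis that starts above zero, which is my guess at what the lesson's chart did. The function name and the numbers are made up for illustration.

```python
def bar_height_ratio(old, new, baseline=0.0):
    """Ratio of the drawn bar heights when the Y-axis starts at `baseline`."""
    if baseline >= min(old, new):
        raise ValueError("baseline must lie below both values")
    return (new - baseline) / (old - baseline)


old, new = 100.0, 104.0  # hypothetical values: a real increase of 4%

honest = bar_height_ratio(old, new)              # Y-axis starts at 0
truncated = bar_height_ratio(old, new, 98.0)     # Y-axis starts at 98

print(f"real change: {honest:.2f}x, drawn change: {truncated:.2f}x")
# prints "real change: 1.04x, drawn change: 3.00x"
```

The data did not change at all, only the axis, and yet the second bar is drawn three times as tall as the first one.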
LOESS curves and weighted regression are a nice way to see trends in how data changes. Analyzing data means looking at the same problem from a variety of angles. We cannot focus too much on a time window that is too small. We have to take a broad look.
There is a great number of ways of encoding data. Why, then, use only one of them? For instance, we can stick to line diagrams and forget the great value that comes from scatter plots, colors and sizes. Of course, all of these means can be used in the same diagram, according to the needs. But always remember Risha's last words:
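The idea behind a LOESS curve can be sketched in plain Python: for every point, fit a straight line to its neighbours, giving closer neighbours more weight, and keep the value of that line at the point. This is only a simplified illustration, not the method of any particular library. The function names, the frac parameter and the sample numbers are my own assumptions.

```python
def tricube(u):
    """Tricube kernel: the weight falls from 1 at u = 0 to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess(xs, ys, frac=0.5):
    """Return one smoothed y value per x, using locally weighted lines."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of neighbours in each local fit
    smoothed = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point (itself included).
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0
        w = [tricube((x - x0) / h) for x in xs]
        # Weighted least squares for the local line y = a + b * x.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        b = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + b * (x0 - mx))
    return smoothed


# Hypothetical noisy series, made up for illustration.
years = [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007]
values = [12.0, 15.0, 11.0, 17.0, 16.0, 21.0, 18.0, 24.0]
trend = loess(years, values, frac=0.6)
```

A larger frac uses more neighbours and gives a smoother curve, which is one way of taking the broad look instead of staring at a window that is too small.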
It is not enough to know the tools but to know how to use them, and use them indeed – Risha