Knowledge in Artificial Intelligence

From Data to Diagrams: An introduction to plots and charts

Introduction

The last tutorials were all text and grey, so let's add some glitter to the world of data: data visualization. Data visualization is not just about making what you found look good; often it is a way of gaining insight into the data. We simply understand graphical information better than we understand numbers and tables. Look at the example below: how long does it take to see the trend in the table, and how long in the chart?

Data visualization is a great skill and, if done right, has great value. If done incorrectly, you will lead people astray and plant wrong ideas in their heads. Remember: with great power comes great responsibility.

In this tutorial we have two missions:

- To understand which type of chart is most appropriate to present your data
- To learn the basic workflow for inserting basic charts into a spreadsheet with Google Docs

For this tutorial you will need to have the share settings on your spreadsheet set to "public on the web", otherwise some of the things we cover won't work. Do so by changing the settings with the blue share button on the top right. In case you haven't completed the last module, the spreadsheet we are working with is here.

How to present data?

We have largely quantitative data in our dataset. The question we have to ask ourselves is: do we compare one entity over time, compare multiple entities with each other, or want to know how two variables interact? Depending on the answer, we choose different presentation formats.

What we want to do                          Presentation chosen
Compare values from different categories    bar chart
Follow a value over time (time series)      line chart
Show interaction between two values         scatterplot
Show data related to geography              map

Presenting quantitative data from different categories: bar/column charts

A bar chart is one of the most commonly used forms of presenting quantitative data. It is simple to create and to understand.
It is best used when comparing data from different categories, e.g. public healthcare expenditure in the top 10 countries, with your own country as an 11th column. A typical column chart looks like this:

Reading bar charts is simple. We usually have a few categories, ordered along the x axis for column charts or the y axis for bar charts; in our example the categories are the countries. The values are expressed as columns (vertical) or bars (horizontal), and the length of each bar represents the value.

As simple as it is, there are a few rules to keep in mind:

- Don't overload bar charts. Although you can use multiple colours and pack two categories in there, too many categories make the chart confusing.
- Always label your axes. Whoever is looking at your graphs needs to know what units they are looking at.
- Start your values at 0. Most spreadsheet tools will automatically adjust the range: undo this and set the minimum to 0, which shows contrasts on an appropriate scale. We'll show you why this is important in the next module.

Walkthrough: Create a column chart for the top 10 countries

So let's create a column chart from our dataset. It's not good style to have too many columns in there: the chart becomes very hard to read. So we will limit ourselves to the 10 countries with the highest healthcare expenditure. This is an arbitrary cutoff and you can look at all the countries as well. Doing so might help you discover something that's hidden in the data.

To do so, filter the World Bank dataset for a single year (e.g. 2009).

Sort the filtered World Bank dataset by the column "Health care expenditure total per person (US$)", one of the columns we created in the last challenge. You can avoid having the first row moved by going to View -> Freeze -> 1 row.

Select the top 10 countries (the first 11 rows including the header row) and copy them to another sheet. (For this, press Ctrl+C to copy, insert a new sheet, then press Ctrl+V in the new sheet to paste.)
To select the data we are interested in, first select the country names, then select the health expenditure total per person while holding down Ctrl (Cmd on a Mac). Another option, especially useful if you have a small screen, is to move the second column next to the first. To do so, click on the grey column label to select it. Release the mouse, then click and drag the column until it is in position. Your column A should now be the country names and column B should be "Health care expenditure total per person (US$)". Your sheet should look like this:

Now select the two columns of interest and then open Chart... from the Insert menu. One of the suggested charts should be a column chart. Click on it and you will see a preview.

Did you note the range on the y axis? It starts at 4000, so it looks like Belgium spends only a fraction of what Luxembourg spends on healthcare. Let's change this. Open the Customize tab and scroll down to "Axis". Now select "Left Vertical" from the drop-down. See the max and min boxes? Just enter 0 into the min box and the range will start at 0. This way the contrast between the countries looks more realistic.

Play around with the customization settings. Try to remove or reposition the legend, change the colour of your bars, etc. When you are done, click on Insert and your chart will be there.

If you click on the chart you can move it around. Notice the triangle at the top right? It's the options menu. Select "Edit chart" to change the settings of the chart. Can you change it to a bar chart?

Task: Create a column chart with other data from the World Bank sheet.

So now you know how to create a column chart. Feel free to experiment with other types of chart and use the recipes in the Handbook to guide you.

The following sections deal with when to pick a particular type of chart and what data it is suitable for. We cover the most common charts: line charts, choropleth maps and scatterplots.
For all of these, you can find an accompanying how-to recipe in the Handbook.

Presenting data from categories over time: line charts

Sometimes you do not only have categories (e.g. countries) but also values over time. This is where line charts are quite handy. A line chart looks like this:

On the y axis we still have our values; on the x axis we have the time of measurement. This graph works best if the time intervals between the measurements are equal. (Of course, line charts are not limited to time series.) Again, when comparing multiple categories it's important to start your y axis at 0. Only when displaying a single line is it OK to start somewhere in between, but then give a point of reference: say where your graph starts and where it ends.

Task: Compare Luxembourg to the other top-spending countries. Create a line chart with the different countries on one chart.

Showing geographical data: mapping

In our case we do not only have numerical data, but numerical data that is linked to geographical places. This calls for a map! Whenever you have a large number of countries or regions, displaying data on a map helps. If you have countries or regions, you usually create a choropleth map. This special type of map displays the value for a specific region as a colour on that region. An example of a choropleth map from our data is shown below:

The map shows health care expenditure as a percentage of GDP. It allows us to discover interesting aspects of our dataset. For example, Western Europe spends more on healthcare as a percentage of GDP than Eastern Europe, and Liberia spends more than any other state in Africa.

Some things to be aware of when using choropleth maps:

One shortcoming of choropleth maps is that bigger regions or countries attract the most attention, so smaller regions may get lost.

Pay attention to the colour scale.
The standard red-green colour scale is not well suited for a variety of reasons, such as being difficult to read for colour-blind observers (read more about this in Gregor Aisch's post in the Further Reading section). Single-hued colour scales are in most cases easier to read.

If your range of values becomes too big, it will be hard to single out individual values.

Task: Try another set of data on a choropleth. How does it work?

Researching interaction between variables: scatterplots

What if we are interested not in a single variable but in how different variables depend on each other? In this case we have scatterplots, which are good for looking at the interaction between two variables.

Look at the sample scatterplot above: we have one numerical value on the x axis and another numerical value on the y axis. Each dot is one data point. This plot has certain shortcomings as well: the dots overlap, so if there are a lot of dots you can't really see where they are. This could be solved by adding transparency or by selecting a specific range to show. Nevertheless, one trend becomes clear: above a certain life expectancy, health care costs suddenly increase dramatically. Also notice the three single dots on the lower left? They are interesting outliers; we'll look at them in a later module.

Look Out!: Common Misconceptions and how to avoid them

Introduction

Do you know the popular phrase: "There are three kinds of lies: lies, damned lies and statistics"? It illustrates the common distrust of numerical data and the way it's displayed. And it has some truth: for too long, graphical displays of numerical data have been used to manipulate people's understanding of 'facts'.

There is a basic explanation for this. All information is included in raw data, but before raw data is processed, it's too much for our brains to understand. Any calculation or visualisation, whether as simple as calculating the average or as complex as producing a 3D chart, involves losing a certain amount of data so that we can take it in. Most mistakes get made when people lose data that's really important and then try to make big statements about the whole data set. Often what they say is 'true', but it doesn't give the full story.

In this tutorial we will talk about common misconceptions and pitfalls when people start analysing and visualising. Only if you know the common errors can you avoid making them in your own work and falling for them when they are mistakenly cited in the work of others.

The average trap

Have you ever read a sentence like: "The average European drinks 1 litre of beer per day"? Did you ask yourself who this mysterious "average European" was and where you could meet him? Bad news: you can't. He or she doesn't exist. In some countries, people drink more wine than beer. How about people who don't drink alcohol at all? And children? Do they drink 1 litre per day too? Clearly this statement is misleading.

So how did this number come together? People who make this kind of claim usually get hold of a large number: e.g. every year 109 billion litres of beer are consumed in Europe. They then simply divide that figure by the number of days per year and the total population of Europe, and then blare out the exciting news.
We did the same thing two modules ago when we divided healthcare expenditure by population. Does this mean that all people spend that much money? No. It means that some spend less and some spend more; what we did was to find the average.

The average makes a lot of sense if data is normally distributed. The normal distribution is the classic bell-shaped curve. The image above shows three different normal distributions. They all have the same average, and yet they are clearly different. What the average doesn't tell you is the range of the data.

Most of the time we do not deal with normal distributions either. Take income, for example. The average income (something frequently reported) would suggest that half of the people earn less and half of them earn more than the average. This is wrong. In most countries, many more people earn below the average salary than above it. How? Incomes are not normally distributed. They show a peak around a certain level and then have a long tail towards large salaries.

The chart shows the actual income distribution for households with incomes up to US$200,000, from the 2011 census. You can see that a large number of households have incomes of around US$15,000-65,000, but a long tail skews the average up.

If the average income rises, it could be because most people are earning more. But it could also be that a few people in the top income group are earning way more. Both would move the average.

Task: If you need some figures to help you think about this, try the following: Imagine 10 people. One earns 1€, one earns 2€, one earns 3€, and so on up to 10€. Work out the average salary. Now add 1€ to each of their salaries (2€, 3€, ... 11€). What is the average? Now go back to the original salaries (1€, 2€, 3€ etc.) and add 10€ only to the very top salary (so you have 1€, 2€, 3€, ... 9€, 20€). What's the average now?

Economists recognise this and have added another value. The "Gini coefficient" tells you something about the distribution of income.
The "Gini coefficient" is a little complicated to calculate and beyond the scope of this basic introduction. However, it is worth knowing it exists. A lot of information gets lost when we only calculate an average.

Keep your eyes peeled as you read the news and browse online.

Task: Can you spot examples of where the use of the average is problematic?

More than just your average...

So if we're not to use the average, what should we use? There are various other measures which can be used to give a simple mean figure some more context.

- Combine the average figure with the range, e.g. say "range 20-5000 with an average of 50". Take our beer example: it would be slightly better to say 0-5 litres a day with an average of 1 litre.
- Use the median: the median is the value right in the middle, where 50% of values are above and 50% of values are below. For the median income it holds true that 50% of people earn less and 50% of people earn more.
- Use quartiles or percentiles: quartiles are like the median but for 25%, 50% and 75%. Percentiles are the same but for varying percent ranges (usually in 10% steps). This gives us far more information than the average; it also tells us something about the distribution of data (e.g. do 1% of the people really hold 80% of the wealth?).

Size matters

In data visualization, size actually matters. Look at the two column charts below:

Imagine the headlines for these two graphs. For the graph on the left, you might read "Health Expenditure in Finland Explodes!". The graph on the right might come under the headline "Health Expenditure in Finland remains mainly stable". Now look at the data. It's the same data presented in two different (incorrect) ways.

Task: Can you spot why the data is misleading?

In the graph on the left, the data doesn't start at $0, but somewhere around $3000. This makes the differences appear proportionally much larger: for example, expenditure from 2001-2002 appears to have tripled, at least! In reality, this wasn't the case.
The square aspect ratio of the graph (it is as high as it is wide) further aggravates the effect. The graph on the right starts at $0 but has a range up to $30,000, even though our data only ranges up to $9000. This is more accurate than the graph on the left, but is still confusing. No wonder people think of statistics as lies if they are used to deceive people about data.

This example illustrates how important it is to visualize your data properly. Here are some simple rules:

- Always use a range that is appropriate to your data.
- Note it properly on the respective axis!
- The changes in size we see in a chart should actually reflect the change of size in your data. So if your data shows B is 2 times A, then B should be 2 times bigger in your visualization.

The simple "reflect the size" rule becomes even more difficult in 2 dimensions, when you have to worry about the total area. At one point, news outlets started to replace columns with pictures, and then continued to scale the dimensions of the pictures up in the old way. The problem? If you adjust the height to reflect the change and the width automatically increases with it, the area increases even more and becomes completely wrong! Confused? Look at these bubbles:

Task: We want to show that B is double the size of A. Which representation is correct? Why?

Answer: The diagram on the right. Remember the formula for calculating the area of a circle? (Area = πr². If this doesn't look familiar, see here.) In the left-hand diagram, the radius of A (r) was doubled. This means that the total area goes up by a factor of four! This is wrong. If B is to represent a number twice the size of A, we need the area of B to be double the area of A. To calculate this correctly, we need to multiply the radius by √2. This gives us a realistic change in size.

Time will tell?

Time lines are also critical when displaying data. Look at the chart below:

A clear stable increase in health care costs since 2002?
Not quite. Notice how before 2004 there are 1-year steps. After that, there is a gap between 2004 and 2007, and another between 2007 and 2009. This presentation makes us believe that healthcare expenditure has increased continuously at the same rate since 2002, but actually it hasn't. So if you deal with time lines, make sure that the spacing between the data points is correct! Only then will you be able to see the trends correctly.

Correlation is not causation

[Comic by XKCD]

This misunderstanding is so common and well known that it has its own Wikipedia article. There is nothing more to say about this. Just because two data points show changes that can be correlated, it doesn't mean that one causes the other.

Context, context, context

One thing incredibly important for data is context: a number or quality doesn't mean a thing if you don't give context. So explain what you are showing, explain how it is read, explain where the data comes from and explain what you did with it. If you give the proper context, the conclusion should come right out of the data.

Percent versus percentage point change

This is a common pitfall for many of us. If a value changes from 5% to 10%, how many percent is the change? If you answered 5%, I'm afraid you're wrong! The answer is 100% (10% is 200% of 5%). It's a change of 5 percentage points. So take care the next time people try to report on elections, surveys and the like. Can you spot their errors?

Need a refresher on how to calculate percentage change? Check out the "Maths is Fun" page on it.

Catching the thief: sensitivity and large numbers

Imagine you are a shop owner and you have just installed an electronic theft detection system. The system has a 99% accuracy of detecting theft. The alarm goes off. How likely is it that the person who just passed is a thief?

It's tempting to answer that there is a 99% chance that this person stole something. But actually, that isn't necessarily the case. In your store you'll have honest customers and shoplifters.
However, the honest customers outnumber the thieves: there are 10,000 honest customers and just 1 thief. If all of them pass in front of your alarm, the alarm will sound about 101 times. 1% of the time, it will mistakenly identify an honest customer as a thief, so it will sound 100 times for honest customers. 99% of the time, it will correctly recognise that a shoplifter is a shoplifter, so it will probably sound once when your thief does walk past. But of the 101 times it sounds, only 1 time will there actually be a shoplifter in your store. So the chance that a person is actually a thief when the alarm sounds is just below 1% (0.99%, if you want to be picky).
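The arithmetic of the theft alarm can be checked with a few lines of Python. This is only a sketch under the same assumption the text makes, namely that "99% accuracy" means the alarm catches 99% of thieves and wrongly flags 1% of honest customers. The function name `alarm_precision` and its parameters are made up for this illustration.

```python
def alarm_precision(honest, thieves, accuracy):
    """Return (expected number of alarms, share of alarms caused by real thieves).

    Assumes the same accuracy applies to both groups: thieves are caught
    with probability `accuracy`, honest customers are wrongly flagged
    with probability 1 - accuracy.
    """
    false_alarms = honest * (1 - accuracy)  # honest customers flagged by mistake
    true_alarms = thieves * accuracy        # thieves correctly flagged
    total_alarms = false_alarms + true_alarms
    return total_alarms, true_alarms / total_alarms


total, share = alarm_precision(honest=10_000, thieves=1, accuracy=0.99)
print(f"expected alarms: {total:.2f}")
print(f"chance that an alarm is a real thief: {share:.2%}")
```

Working with expected values (0.99 alarms for the thief rather than exactly one) gives a share of roughly 0.98%, marginally lower than the 0.99% in the text, which rounds the thief's alarm up to one. Either way the answer stays just below 1%.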

Ai

Ai report

Tools & Technique Lab

Artificial Intelligence

solution of artificial intelligence

solution-manual-artificial-intelligence-a-modern-approach

Life Cycle Models

The topics included in the attached documents are the Classical Waterfall Model, Relative Effort for Phases, and Feasibility Study.