Look Out!: Common Misconceptions and how to avoid?
Look Out!: Common Misconceptions and how to avoid them. Introduction Do you know the popular phrase: “There are three kinds of lies: lies, damned lies and statistics”? It illustrates the common distrust of numerical data and the way it’s displayed. And it has some truth: for too long, graphical displays of numerical data have been used to manipulate people’s understanding of ‘facts’. There is a basic explanation for this. All information is included in raw data – but before raw data is processed, it’s too much for our brains to understand. Any calculation or visualisation – whether that’s as simple as calculating the average or as complex as producing a 3D chart – involves losing a certain amount of data, so that we can take it in. It’s when people lose data that’s really important and then try to make big statements about the whole data set that most mistakes get made. Often what they say is ‘true’, but it doesn’t give the full story’ In this tutorial we will talk about common misconceptions and pitfalls when people start analysing and visualising. Only if you know the common errors can you avoid making them in your own work and falling for them when they are mistakenly cited in the work of others. The average trap Have you ever read a sentence like: “The average european drinks 1 litre of beer per day”? Did you ask yourself who this mysterious “average european” was and where you could meet him? Bad news: you can’t. He or she doesn’t exist. In some countries, people drink more wine than beer. How about people who don’t drink alcohol at all? And children? Do they drink 1 litre per day too? Clearly this statement is misleading. So how did this number come together? People who make these kind of claims usually get hold of a large number: e.g. every year 109 billion liters of beer is consumed in Europe. They then simply divide that figure by the number of days per year and the total population of Europe, and then blare out the exciting news. We did the same thing two modules ago when we divided healthcare expenditure by population. Does this mean that all people spend that much money? No. It means that some spend less and some spend more – what we did was to find the average.The average makes a lot of sense – if data is normally distributed. Normal distribution is the classic bell shaped curve. The image above shows three different normal distributions. They all have the same average. And yet they are clearly different.What the average doesn’t tell you is the range of data. Most of the time we do not deal with normal distributions either: take e.g. income. The average income (something frequently reported) would suggest that half of the people would earn less and half of them would earn more than the average. This is wrong. In most countries, many more people earn below the average salary than above it. How? Incomes are not normally distributed. They show a peak around a certain level and then have a long tail towards large salaries. The chart shows actual income distribution in US$ for households up to 200,000 US$ Income from the 2011 census. You can see a large number of households have incomes around 15,000-65,000 US$, but we have a long tail skewing the average up. If the average income rises, it could be because most of the people are earning more. But it could also be that a few people in the top income group are earning way more – both would move the average. Task: If you need some figures to help you think of this, try the following: Imagine 10 people. One earns 1€, one earns 2€, one earns 3€… up to 10€. Work out the average salary. Now add 1€ to each of their salaries (2€, 3€….11€). What is the average? Now go back to the original salaries (1€, 2€, 3€ etc) and add 10€ only to the very top salary (so you have 1€, 2€, 3€… 9€, 20€). What’s the average now? Economists recognise this and have added another value. The “ GINI-Coefficient ” tells you something about the distribution of income. The “GINI-Coefficient”” is a little complicated to calculate and beyond the scope of this basic introduction. However, it is worth knowing it exists. A lot of information gets lost when we only calculate an average. Keep your eyes peeled as you read the news and browse online. Task: Can you spot examples of where the use of the average is problematic? More than just your average… So if we’re not to use the average – what should we use? There are various other measures which can be used to give a simple mean figure some more context. Combine the average figure with the range; e.g say range 20-5000 with an average of 50. Take our beer example: it would be slightly better to say 0-5 litres a day with an average of 1 litre. Use the median: the median is the value right in the middle where 50% of values are above and 50% of values are below. For the median income it holds true that 50% of people earn less and 50% of people earn more. Use quartiles or percentiles: Quartiles are like the median but for 25,50 and 75%. Percentiles are the same but for varying percent ranges (usually 10% steps.) This gives us way more information than the average – it also tells us something about the distribution of data (e.q. do 1% of the people really hold 80% of the wealth?) Size matters In data visualization, size actually matters. Look at the two column charts below: Imagine the headlines for these two graphs. For the graph on the left, you might read “Health Expenditure in Finland Explodes!”. The graph on the right might come under the headline “Health Expenditure in Finland remains mainly stable”. Now look at the data. It’s the same data presented in two different (incorrect) ways. Task: Can you spot why the data is misleading? In the graph on the left, the data doesn’t start at $0, but somewhere around $3000. This makes the differences appear proportionally much larger – for example, expenditure from 2001-2002 appears to have tripled, at least! In reality, this wasn’t the case. The square aspect ratio (the graph is the same height as width) of the graph further aggravates the effect. The graph on the right starts with $0 but has a range up to $30,000, even though our data only ranges to $9000. This is more accurate than the graph on the left, but is still confusing. No wonder people think of statistics as lies if they are used to deceive people about data. This example illustrates how important it is to visualize your data properly. Here are some simple rules: Always use a range that is appropriate to your data Note it properly on the respective axis! The changes in size we see in a chart should actually reflect the change of size in your data. So if your data shows B is 2 times A, then B should be 2 times bigger in your visualization. The simple “reflect the size” rule becomes even more difficult in 2 dimensions, when you have to worry about the total area. At one point, news outlets started to replace columns with pictures, and then continue to scale the dimensions of pictures up in the old way. The problem? If you adjust the height to reflect the change and the width automatically increases with it, the area increases even more and will become completely wrong! Confused? Look at these bubbles: Task: We want to show that B is double the size of A. Which representation is correct? Why? Answer: The diagram on the right. Remember the formula for calculating the area of a circle? (Area = πr² If this doesn’t look familiar, see here). In the left hand diagram, the radius of A (r) was doubled. This means that the total area goes up by a scale factor of four! This is wrong. If B is to represent a number twice the size of A, we need the area of B to be double the area of A. To correctly calculate this, we need to adjust the length of the radius by ⎷2. This gives us a realistic change in size. Time will tell? Time lines are also critical when displaying data. Look at the chart below: A clear stable increase in health care costs since 2002? Not quite. Notice how before 2004, there are 1 year steps. After, there is a gap between 2004 and 2007, and 2007 and 2009. This presentation makes us believe that healthcare expenditure increases continuously at the same rate since 2002 – but actually it doesn’t. So if you deal with time lines: make sure that the spacing between the data points are correct! Only then will you be able to see the trends correctly. Correlation is not causation by XKCD This misunderstanding is so common and well known that it has its own wikipedia article. There is nothing more to say about this. Simply because two data points show changes that can be correlated, it doesn’t mean that one causes the other. Context, context, context One thing incredibly important for data is context: A number or quality doesn’t mean a thing if you don’t give context. So explain what you are showing – explain how it is read, explain where the data comes from and explain what you did with it. If you give the proper context the conclusion should come right out of the data. Percent versus Percentage points change This is a common pitfall for many of us. If a value changes from 5% to 10% how many percent is the change? If you answered 5% – I’m afraid you’re wrong! The answer is 100% (10% is 200% of 5%). It’s a change in 5 percentage points. So take care the next time people try to report on elections, surveys and the like – can you spot their errors? Need a refresher on how to calculate percentage change? Check out the “Maths is Fun” page on it. Catching the thief – sensitivity and large numbers Imagine, you are a shop owner and you just installed and electronic theft detection system. The system has a 99% accuracy of detecting theft. The alarm goes off, how likely is it, that the person who just passed is a thief? It’s tempting to answer that there is a 99% chance that this person stole something. But actually, that isn’t necessarily the case. In your store you’ll have honest customers and shoplifters. However, the honest customers outnumber the thiefs:: there are 10,000 honest customers and just 1 thief. If all of them pass in front of your alarm, the alarm will sound 101 times. 1% of the time, it will mistakenly identify a honest customer as a thief – so it will sound 100 times. 99% of the time, it will correctly recognise that a shoplifter is a shoplifter. So it will probably sound once when your thief does walk past. But of the 101 times it sounds, only 1 time will there actually be a shoplifter in your store. So the chance that a person is actually a thief when it sounds is just below 1% (0.99%, if you want to be picky).
Sort and Filter: The basics of spreadsheets
Sort and Filter: The basics of spreadsheets Introduction The most basic tool used for data wrangling is a spreadsheet. Data contained in a spreadsheet is in a structured, machine-readable format and hence can quickly be sorted and filtered. In other recipes in the handbook, you’ll learn how to use the humble spreadsheet as a power tool for carrying out simple sums (finding the total, the average etc.), applying bulk processes, or pulling out different graphs and charts. By the end of the module, you will have learned how to download data, how to import it into a spreadsheet, and how to begin cleaning and interpreting it using the ‘sort’ and ‘filter’ functions. Spreadsheets: An Overview Nowadays spreadsheets are widespread so a lot of people are familiar with them already. A variety of spreadsheet programs and applications exist. For example Microsoft’s Office package comes with Excel, the OpenOffice package comes with Calc and so on. Not surprisingly, Google decided to add spreadsheets to their documents package. Since it does not require you to purchase or install any additional software, we will be using Google Spreadsheets for this course. Depending on what you want to do you might consider using different spreadsheet software. Here are some of the considerations you might make when picking your weapon of choice: SpreadsheetGoogle SpreadsheetsOpen(Libre)OfficeMicrosoft Excel UsageFree (as in Beer)Free (as in Freedom)Commercial Data StorageGoogle DriveYour hard diskYour hard disk Needs InternetYesNoNo Installation requiredNoYesYes CollaborationYesNoNo Sharing resultsEasyHarderHarder VisualizationsLarge rangeBasic chartsBasic charts Creating a spreadsheet and uploading data In this course we will use Google docs for our data-wrangling – it allows you to start right away without need of installing software. Since the data we are working with is already public we also don’t need to worry about the fact that it is not stored on our local drive. Walktrough: Creating a Spreadsheet and uploading data. Head over to Google docs. If you are not yet logged in to Google docs, you need to login. The first step is going to be creating a new spreadsheet. Do this by clicking the create button to the left and select spreadsheet. Doing so will create a new spreadsheet for you. Let’s upload some data. You will need the file we downloaded from the World Bank in the last tutorial. If you haven’t done the tutorial or lost the file: download it here . In your spreadsheet select import from the file menu. This will open a dialog for you. Select the file you downloaded. Don’t forget to select insert new sheets, and click import Navigating and using the Spreadsheet Now we loaded some data let’s deal with the basics of spreadsheets. A spreadsheet is basically a table of “cells” in which you can input data. The cells are organized in “rows” and “columns”. Typically rows are labeled by numbers, columns by letters. This also means cells can be addressed by their “column” and “row” coordinates. The cell A1 denotes the cell in the first row in the first column, A2 the one in the second row, B1 the one in the second column and so on. To enter or change data in a cell click on it and start typing – this will change the contents of the cell. Basic navigation can be done this way or via keyboard. Find a list of keyboard shortcuts good to know below: Key or CombinationWhat it does TabEnd input on the current cell and jump to the cell right to the current one EnterEnd input and jump to the next row (This will try to be intelligent, so if you’re entering multiple columns, it will jump to the first column you are entering UpMove to the cell one row up DownMove to the cell one row down LeftMove to the cell left RightMove to the cell on the Right Ctrl+<direction>Move to the outermost cell in the direction given Shift+<direction>Select the current cell and the cell in <direction> Ctrl+Shift+<direction>Select all cells from the current to the outermost cell in <direction> Ctrl+cCopy – copies the selected cells into the clipboard Ctrl+vPaste – pastes the clipboard Ctrl+xCut – copies the selected cells into the clipboard and removes them from their original position Ctrl+zUndo – undoes the last change you made Ctrl+yRedo – undoes an undo Tip: Practice a bit, and you will find that you will become a lot faster using the keyboard than the mouse! Locking Rows and Columns The spreadsheet we are working on is quite large. You will notice, that while scrolling the column with the column labels will frequently disappear, leaving you quite lost. The same with the country names. To avoid this you can “lock” rows and columns so they don’t disappear. Walkthrough: Locking the top row Go to the Spreadsheet with our data and scroll to the top. On the top left, where the column and row labels are you’ll see a small striped area. Hover over the striped bar on top of box showing row “1”. A hand shaped cursor should appear, click and drag it down one row. Your result should look like this: Try scrolling – notice how the top row remains fixed? Sorting Data The first thing to do when looking at a new dataset is to orient yourself. This involves at looking at maximum/minimum values and sorting the data so it makes sense. Let’s look at the columns. We have data about the GDP, healthcare expenditure and life expectancy. Now let’s explore the range of data by simply sorting. Walkthrough: Sorting a dataset Select the whole sheet you want to sort. Do this by clicking on the right upper grey field, between the row and column names. Select “Sort Range…” from the “Data” menu – this will open an additional Selection Check the “Data has header row” checkbox Select the column you want to sort by in the dropdown menu Try to sort by GDP – Which country has the lowest? Try again with different values, can you sort ascending and descending? Tip: Be careful! A common mistake is to forget to select all the data. If you sort without selecting all the data, the rows will no longer match up. A version of this recipe can also be found in the Handbook. Filtering Data The next thing commonly done with datasets is to filter out the values you don’t want to see. Did you notice that some “Country Names” are actually not countries? You’ll find things like “World”, “North America” and “Arab World”. Let’s filter them out. Walkthrough: Filtering Data Select the whole table. Select “Filter” from the “Data” menu. You now should see triangles next to the column names in the first row. Click on the triangle next to country name. you should see a long list of country names in the box. Find those that are not a country and click on them (the green check mark will disappear). Now you have successfully filtered your dataset. Go ahead and play with it – the data will not be deleted, it’s just not displayed.
Finding Data Introduction Now we know what data is and the questions we’re interested in, we’re ready to go out and hunt for it online. In this tutorial, you will learn where to start looking for data. In this course, we will then look at different ways of getting hold of data, before setting you loose to find data yourselves! Data Sources There are three basic ways of getting hold of data: Finding data – this involves searching and finding data that has already been released Getting hold of more data – asking for ‘new’ data from official sources e.g. through Freedom of Information requests. Sometimes data is public on a website but there is not a download link to get hold of it in bulk – but don’t give up! This data can be liberated with what datawranglers call scraping. Collecting data yourself – This means gathering data and entering it into a database or a spreadsheet – whether you work alone or collaboratively. In this tutorial we’ll focus on finding data that already has been released. We will deal with getting more data and collecting data yourself in future courses. Step 1: Identify your Data Source Many sources frequently release data for public use. Some examples: Government In recent years governments have begun to release some of their data to the public. Many governments host special (open) government data platforms for the data they create. For example the UK government started data.gov.uk to release their datasets. Similar data portals exist in the US, Brazil and Kenya – and in many other countries! Does your country have an open data portal (Datacatalogs.org is a good starting point)? Organisations Other sources of data are large organisations. The World Bank and the World Health Organization for example regularly release reports and data sets. Science Scientific projects and institutions release data to the scientific community and the general public. Open data is produced by NASA for example, and many specific disciplines have their own data repositories, some of which are open. More and more initiatives exist trying to provide access to already published data (e.g. Dryad) To help people to find data, projects like the Open Access Directory’s data repository listor the Open Knowledge Foundation’s datahub.io have been started. They aim either to collect data sources, or collect together different data sets from various sources. Step 2: Getting data in the format you need it In the “What is Data” course, we talked briefly about the importance of machine-readabledata. You’ll save yourself a lot of trouble and time in working with the data if you get hold of data in the correct format initially. Here’s a handy tip for how to tell Google which format you are looking for. Using data to answer your question Now that you have an overview of some of the key concepts related to data, it’s time to start hunting for your own! Over the next courses in the Data Fundamentals series, we will be further exploring the question we posed ourselves in the What is Data Course? How does healthcare spending influence life expectancy?. To get the data for this course, please see our recipe on Getting Data from the World Bank. Task: If you found your own alternative data to answer this question, congratulations! Take a moment to upload it to the DataHub – and have a browse to see what other School of Data learners have found. Extension Task: Explore the web, and see what open data you can find. If you find something really interesting and think of an exciting question it could help to address, tweet it to @SchoolofData – or write a short post for the School of Data blog