Knowledge in Artificial Intelligence

IOT

CA- GOURAV PATEL SVITS

Artificial intelligence introduction

This knowledge base covers the Artificial Intelligence topics studied by trainees during the industrial training program for AI.

Unstructured vs. Structured data

Data for Humans

A plain sentence such as “we have 5 white used golf balls with a diameter of 43mm at 50 cents each” is easy for a human to understand, but hard for a computer. That sentence is what we call unstructured data. Unstructured data has no fixed underlying structure: the sentence could easily be reworded, and it is not clear which word refers to what. Likewise, PDFs and scanned images may contain information that is nicely laid out and pleasing to the human eye, but they are not machine-readable.

Data for Computers

Computers are inherently different from humans. It can be exceptionally hard to make computers extract information from certain sources, and some tasks that humans find easy are still difficult to automate. For example, interpreting text that is presented as an image is still a challenge for a computer. If you want your computer to process and analyse your data, it has to be able to read the data. This means the data needs to be structured and in a machine-readable form.

One of the most commonly used formats for exchanging data is CSV, which stands for comma-separated values. The same information expressed as CSV can look something like this:

    "quantity","color","condition","item","category","diameter (mm)","price per unit (AUD)"
    5,"white","used","ball","golf",43,0.5

This is much simpler for your computer to understand and can be read directly by spreadsheet software. Note that the words have quotes around them. This marks them as text (string values, in computer speak), whereas numbers do not have quotes. There are many more structured, machine-readable formats besides CSV.

Task: Think of the last book you read. What data is connected to it, and how would you make it structured data?
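To make the idea of machine-readable data concrete, here is a minimal Python sketch that reads the golf ball record with the standard library csv module. The variable and function names (CSV_TEXT, read_records, total_value) are our own choices for illustration and are not part of the course material.

```python
import csv
import io

# The golf ball record from the text: one header row and one data row.
CSV_TEXT = '''"quantity","color","condition","item","category","diameter (mm)","price per unit (AUD)"
5,"white","used","ball","golf",43,0.5
'''


def read_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(CSV_TEXT)
ball = records[0]

# The csv module returns every field as a string,
# so numeric fields have to be converted explicitly.
quantity = int(ball["quantity"])
unit_price = float(ball["price per unit (AUD)"])
total_value = quantity * unit_price

print(ball["color"], ball["item"], total_value)
```

Because every value sits under a named column, the program can pick out the quantity and the price without guessing which word in a sentence refers to what.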

Sort and Filter: The basics of spreadsheets

Introduction

The most basic tool used for data wrangling is a spreadsheet. Data contained in a spreadsheet is in a structured, machine-readable format and hence can quickly be sorted and filtered. In other recipes in the Handbook, you’ll learn how to use the humble spreadsheet as a power tool for carrying out simple sums (finding the total, the average etc.), applying bulk processes, or pulling out different graphs and charts.

By the end of the module, you will have learned how to download data, how to import it into a spreadsheet, and how to begin cleaning and interpreting it using the ‘sort’ and ‘filter’ functions.

Spreadsheets: An Overview

Nowadays spreadsheets are widespread, so a lot of people are familiar with them already. A variety of spreadsheet programs and applications exist. For example, Microsoft’s Office package comes with Excel, the OpenOffice package comes with Calc, and so on. Not surprisingly, Google decided to add spreadsheets to their documents package. Since it does not require you to purchase or install any additional software, we will be using Google Spreadsheets for this course.

Depending on what you want to do, you might consider using different spreadsheet software. Here are some of the considerations you might make when picking your weapon of choice:

| Spreadsheet | Google Spreadsheets | Open(Libre)Office | Microsoft Excel |
|---|---|---|---|
| Usage | Free (as in Beer) | Free (as in Freedom) | Commercial |
| Data Storage | Google Drive | Your hard disk | Your hard disk |
| Needs Internet | Yes | No | No |
| Installation required | No | Yes | Yes |
| Collaboration | Yes | No | No |
| Sharing results | Easy | Harder | Harder |
| Visualizations | Large range | Basic charts | Basic charts |

Creating a spreadsheet and uploading data

In this course we will use Google Docs for our data wrangling, as it allows you to start right away without needing to install software. Since the data we are working with is already public, we also don’t need to worry about the fact that it is not stored on our local drive.
Walkthrough: Creating a Spreadsheet and uploading data

1. Head over to Google Docs. If you are not yet logged in, you need to log in.
2. The first step is to create a new spreadsheet. Do this by clicking the create button on the left and selecting spreadsheet. This creates a new spreadsheet for you.
3. Let’s upload some data. You will need the file we downloaded from the World Bank in the last tutorial. If you haven’t done the tutorial or lost the file: download it here.
4. In your spreadsheet, select import from the file menu. This opens a dialog.
5. Select the file you downloaded.
6. Don’t forget to select insert new sheets, then click import.

Navigating and using the Spreadsheet

Now that we have loaded some data, let’s deal with the basics of spreadsheets. A spreadsheet is basically a table of “cells” in which you can input data. The cells are organized in “rows” and “columns”. Typically rows are labeled by numbers and columns by letters. This also means cells can be addressed by their “column” and “row” coordinates: A1 denotes the cell in the first column and first row, A2 the cell in the first column and second row, B1 the cell in the second column and first row, and so on.

To enter or change data in a cell, click on it and start typing. This changes the contents of the cell. Basic navigation can be done with the mouse in this way, or via the keyboard.
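The column-letter and row-number addressing described above can be sketched in a few lines of Python. The function name parse_cell is hypothetical, and the handling of multi-letter columns (AA following Z) is an assumption based on how common spreadsheet programs label columns, not something stated in this course.

```python
import re


def parse_cell(ref):
    """Convert a cell reference such as 'B1' into 1-based (column, row) numbers."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a valid cell reference: {ref!r}")
    letters, digits = match.group(1).upper(), match.group(2)

    # Column letters work like a base-26 number in which A=1, ..., Z=26.
    column = 0
    for ch in letters:
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return column, int(digits)


print(parse_cell("A1"))  # first column, first row
print(parse_cell("A2"))  # first column, second row
print(parse_cell("B1"))  # second column, first row
```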
Below is a list of keyboard shortcuts that are good to know:

| Key or Combination | What it does |
|---|---|
| Tab | End input on the current cell and jump to the cell to its right |
| Enter | End input and jump to the next row (this will try to be intelligent, so if you’re entering multiple columns, it will jump to the first column you are entering) |
| Up | Move to the cell one row up |
| Down | Move to the cell one row down |
| Left | Move to the cell on the left |
| Right | Move to the cell on the right |
| Ctrl+<direction> | Move to the outermost cell in the direction given |
| Shift+<direction> | Select the current cell and the cell in <direction> |
| Ctrl+Shift+<direction> | Select all cells from the current one to the outermost cell in <direction> |
| Ctrl+c | Copy: copies the selected cells to the clipboard |
| Ctrl+v | Paste: pastes the clipboard |
| Ctrl+x | Cut: copies the selected cells to the clipboard and removes them from their original position |
| Ctrl+z | Undo: undoes the last change you made |
| Ctrl+y | Redo: undoes an undo |

Tip: Practice a bit, and you will find that you become a lot faster using the keyboard than the mouse!

Locking Rows and Columns

The spreadsheet we are working on is quite large. You will notice that, while scrolling, the row with the column labels frequently disappears, leaving you quite lost. The same happens with the column of country names. To avoid this, you can “lock” rows and columns so they don’t disappear.

Walkthrough: Locking the top row

1. Go to the spreadsheet with our data and scroll to the top.
2. On the top left, where the column and row labels are, you’ll see a small striped area.
3. Hover over the striped bar on top of the box showing row “1”. A hand-shaped cursor should appear. Click and drag it down one row.
4. Try scrolling. Notice how the top row remains fixed?

Sorting Data

The first thing to do when looking at a new dataset is to orient yourself. This involves looking at maximum and minimum values and sorting the data so it makes sense. Let’s look at the columns.
We have data about GDP, healthcare expenditure and life expectancy. Now let’s explore the range of the data by simply sorting.

Walkthrough: Sorting a dataset

1. Select the whole sheet you want to sort. Do this by clicking on the grey field in the upper-left corner, between the row and column names.
2. Select “Sort Range…” from the “Data” menu. This opens an additional dialog.
3. Check the “Data has header row” checkbox.
4. Select the column you want to sort by in the dropdown menu.
5. Try to sort by GDP. Which country has the lowest?
6. Try again with different values. Can you sort ascending and descending?

Tip: Be careful! A common mistake is to forget to select all the data. If you sort without selecting all the data, the rows will no longer match up.

A version of this recipe can also be found in the Handbook.

Filtering Data

The next thing commonly done with datasets is to filter out the values you don’t want to see. Did you notice that some “Country Names” are actually not countries? You’ll find things like “World”, “North America” and “Arab World”. Let’s filter them out.

Walkthrough: Filtering Data

1. Select the whole table.
2. Select “Filter” from the “Data” menu.
3. You should now see triangles next to the column names in the first row.
4. Click on the triangle next to country name. You should see a long list of country names in the box.
5. Find those that are not a country and click on them (the green check mark will disappear).
6. Now you have successfully filtered your dataset.
7. Go ahead and play with it. The data will not be deleted, it’s just not displayed.
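The same two operations can be sketched outside a spreadsheet. The Python example below is a minimal illustration, not the course’s method: the country names and GDP figures are invented sample values, and sort_rows and filter_rows are hypothetical helper names.

```python
# Hypothetical sample rows; the names and figures are made up for illustration.
rows = [
    {"Country Name": "Freedonia", "GDP": 120.0},
    {"Country Name": "World", "GDP": 9000.0},
    {"Country Name": "Sylvania", "GDP": 45.0},
    {"Country Name": "Arab World", "GDP": 700.0},
    {"Country Name": "Ruritania", "GDP": 300.0},
]

# Entries in the "Country Name" column that are not actually countries.
NOT_COUNTRIES = {"World", "North America", "Arab World"}


def sort_rows(rows, column, descending=False):
    """Sort whole rows by one column, so the values in each row stay together."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


def filter_rows(rows, column, excluded):
    """Return only the rows whose value in `column` is not in `excluded`."""
    return [row for row in rows if row[column] not in excluded]


countries = filter_rows(rows, "Country Name", NOT_COUNTRIES)
by_gdp = sort_rows(countries, "GDP")
lowest = by_gdp[0]["Country Name"]

print(lowest)
print(len(rows), "rows in the original data,", len(countries), "after filtering")
```

Sorting complete rows rather than a single column mirrors the tip about selecting all the data first. As with the spreadsheet filter, the original list is left untouched: filtering only produces a narrower view of it.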

‘But what does it mean?’: Analyzing data

Introduction

Once you have cleaned and filtered your dataset, it’s time for analysis. Analysing data helps us to learn what our data might mean and helps us to extract answers to our questions from the dataset.

Look at the data we imported. (In case you didn’t finish the previous tutorial, don’t worry. You can copy a sample spreadsheet here.) This is World Bank data containing GDP, population, health expenditure and life expectancy for the years 2000-2011. Take a moment to have a look at the data. It’s pretty interesting. What could it tell us?

Task: Brainstorm ideas. What could you investigate using this data?

Here are some ideas we came up with:

- How much (in USD) is spent on healthcare in total in each country?
- How much (in USD) is spent per capita in each country?
- In which country is the most spent per person? In which country is the least spent? What is the average for each continent? For the world?
- What is the relationship between public and private health expenditure in each country? Where do citizens spend more (private expenditure)? Where does the state spend more (public expenditure)?
- Is there a relationship between expenditure on healthcare and average life expectancy?
- Does it make any difference if the expenditure is public or private?

NOTE: With these last two questions, you have to be really careful. Even if you find a connection, it doesn’t necessarily mean that one caused the other! For example, imagine there was a sudden outbreak of the plague. It’s not always fatal, but many people who contract it will die. Public healthcare expenditure might go up while life expectancy drops right down. That doesn’t mean that your healthcare system has suddenly become less efficient! You always have to be REALLY careful about the conclusions you draw from this kind of data, but it can still be interesting to calculate the figures.

There are many more questions that could be answered using this data.
Many of them relate closely to current policy debates. For example, if my country were debating its healthcare spending right now, I could use this data to explore how spending in my country has changed over time, and begin to understand how my country compares to others.

Formulas

So let’s dive in. The data we have is not entirely complete. At the moment, healthcare expenditure is only shown as a percentage of GDP. In order to compare total expenditure in different countries, we need to have this figure in US Dollars (USD). To calculate this, let’s introduce you to spreadsheet formulas. Formulas are what helped spreadsheets become an important tool. But how do they work? Let’s find out by playing with them…

Tip: Whenever you download a dataset, the very first thing you should do is make a copy of it. Any changes you make should be done in this copy, while the original data remains pure and untouched! This means that you can go back and check it at any time. It’s also good practice to note where you got your data from, and when and how it was retrieved.

Once you have your own copy of the data (try adding ‘working copy’ or similar after the original name), create a new sheet within your spreadsheet. This is for you to mess around with whilst you learn about formulae.

Now move across to the “Total fruits sold” column. Start in the first row. It’s time to write a formula…

Walkthrough: Using spreadsheets to add values

Using this example data, let’s calculate the total of fruits sold.

1. Get the data and create a working copy.
2. To start, move to the first row.
3. Each formula in a spreadsheet starts with =
4. Enter = and select the first cell you want to add. Notice how the cell reference appears in the formula?
5. Now type + and select the second cell you want to add.
6. Press Enter or Tab. The formula disappears and is replaced by the value.
7. Try changing the number in one of the original cells (apples or plums). You should see the value in the total update automatically.
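As a rough analogy to the addition formula above, the Python sketch below computes a “Total fruits sold” value for each row. The daily figures and the helper name with_totals are invented for illustration. Unlike a spreadsheet, Python does not recalculate on its own, so the function has to be called again after a value changes.

```python
# Hypothetical daily sales figures, made up for illustration.
sales = [
    {"day": "Monday", "apples": 12, "plums": 7},
    {"day": "Tuesday", "apples": 9, "plums": 11},
]


def with_totals(rows):
    """Return copies of the rows with a 'Total fruits sold' value added."""
    return [
        {**row, "Total fruits sold": row["apples"] + row["plums"]}
        for row in rows
    ]


totals = with_totals(sales)
print(totals[0]["Total fruits sold"])

# Change one of the original values and recompute, which is what
# the spreadsheet does automatically when a referenced cell changes.
sales[0]["apples"] = 20
updated = with_totals(sales)
print(updated[0]["Total fruits sold"])
```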
You can type each formula individually, but it is also possible to copy and paste or drag formulas across a range of cells.

- Copy the formula you have just written (using Ctrl+c) and paste it into the cell below (using Ctrl+v). You will get the sum of the two numbers on the row below.
- Alternatively, click on the lower right corner of the cell (the blue square) and drag the formula down to the bottom of the column. Watch the ‘total’ column update. Feels like magic!

Task: Create a formula to calculate the total amount of apples and plums sold during the week.

Did you add all of the cells up manually? That’s a lot of clicking. For big spreadsheets, adding each cell manually could take a long time. Take a look at the “spreadsheet formulae” section in the Handbook. Can you see a way to add a range of cells or entire columns simply?

Where Next?

Once you’ve got the hang of building a basic formula, the sky is your limit! The School of Data Handbook will additionally walk you through:

- Multiplication using spreadsheets
- Division using spreadsheets
- Copying formulae sideways
- Calculating minimum and maximum values
- Dealing with empty cells in your data (complex formulae). This stage uses Boolean logic.

You may need to refer to these chapters to complete the following challenges.

Multiplication and division challenge

Task: Use the data from the World Bank (if you don’t have it already, download it here). In the data we have figures for healthcare only as a % of GDP. Calculate the full amount of private health expenditure in Afghanistan in 2001 in USD. If your percentages are rusty, check out the formulae section in the Handbook.

Task: Still using the World Bank data, find out how much money (USD) was spent on healthcare per person in Albania in 2000.

Task: Calculate the mean and median values for all the columns.

Task: What is the formula for healthcare expenditure per capita? Can you modify it so it’s only calculated when both values are present (i.e. neither cell is blank)?
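For readers who want to check their spreadsheet results another way, here is a minimal Python sketch of the calculations behind these challenges: converting a percentage of GDP into USD, dividing by population, skipping blank cells, and taking the mean and median. All function names are hypothetical, the input figures are invented rather than taken from the World Bank file, and blank cells are assumed to arrive as empty strings or None.

```python
from statistics import mean, median


def to_number(cell):
    """Return a float for a filled cell, or None for a blank one."""
    if cell is None or str(cell).strip() == "":
        return None
    return float(cell)


def expenditure_usd(gdp_usd, percent_of_gdp):
    """Total expenditure in USD: GDP multiplied by the percentage, over 100."""
    gdp, pct = to_number(gdp_usd), to_number(percent_of_gdp)
    if gdp is None or pct is None:
        return None  # only calculate when both values are present
    return gdp * pct / 100


def per_capita(total_usd, population):
    """Expenditure per person, or None if either value is missing."""
    total, people = to_number(total_usd), to_number(population)
    if total is None or people is None or people == 0:
        return None
    return total / people


def column_stats(values):
    """Mean and median of the non-blank values in a column."""
    present = [v for v in (to_number(x) for x in values) if v is not None]
    if not present:
        return None
    return mean(present), median(present)


# Invented figures: a GDP of 2 billion USD, 5% of it spent on healthcare,
# and a population of 4 million.
total = expenditure_usd(2_000_000_000, 5)
print(total)
print(per_capita(total, 4_000_000))
print(per_capita(total, ""))           # blank population cell
print(column_stats([1, "", 3, 8]))     # the blank cell is ignored
```

Returning None for a missing input plays the same role as leaving the result cell empty in a spreadsheet, so blank cells are never silently treated as zero.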