data146

1

A package is a distributable bundle that organizes one or more libraries. A library groups a set of related functions that make sense together.

Two steps are needed to install a package and then make its library accessible:

1) find and install the package into the environment
2) use import package_name to load the package, optionally followed by as nickname to give it a shorter, more accessible name

For example, to import pandas:

1) pip install pandas
2) import pandas as pd

To import matplotlib.pyplot:

1) pip install matplotlib
2) import matplotlib.pyplot as plt
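The two steps can be sketched together as follows. This is a minimal example that assumes both packages are already installed; the small `toy` table and the "Agg" backend choice are illustrative assumptions, not part of the original notes.

```python
# Step 1 (run once in a terminal, not inside Python):
#   pip install pandas matplotlib

# Step 2: import each library under a short alias.
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical toy data, only to show the aliases in use.
toy = pd.DataFrame({"year": [1997, 2002, 2007], "value": [1.0, 2.0, 3.0]})

fig, ax = plt.subplots()
ax.plot(toy["year"], toy["value"])
ax.set_xlabel("year")
```

After the import, every pandas function is reached through `pd.` and every pyplot function through `plt.`.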

2

A dataframe is an object that stores data as a table of rows and columns. A library that is useful when working with data frames is pandas. To read a file that is stored on my operating system, I would use one of the pd.read_*() functions, such as pd.read_csv(). The first argument to the function is the path to the file. For example,

data = pd.read_csv('gapminder.tsv', sep = '\t')

Here, we have to specify the sep argument, which is short for separator. read_csv() assumes that the data being imported is separated by commas. The gapminder data set is separated by tabs, so we must specify sep = '\t', where \t stands for tab. If we did not specify the sep argument in this case, each line would be read into a single column instead of being split into separate columns.
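To see what the sep argument changes, here is a small sketch that reads the same tab-separated text with and without it. The text in `tsv_text` is a made-up stand-in for a real file, and `io.StringIO` is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for a tab-separated file (values are made up).
tsv_text = "country\tyear\tpop\nA\t2002\t1\nB\t2007\t2\n"

# With sep='\t' the three fields become three columns.
with_sep = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Without sep, pandas looks for commas, finds none,
# and keeps each whole line as one column.
without_sep = pd.read_csv(io.StringIO(tsv_text))

print(with_sep.shape)     # (2, 3)
print(without_sep.shape)  # (2, 1)
```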

To look at the data frame we can call it by name:

data

To get a general idea of what a data frame contains, you can use the df.describe() method, which reports summary statistics for the numeric columns:

data.describe()

To get the number of rows and columns of a data frame, you can use df.shape, which returns them in (x, y) format, with x for rows and y for columns. shape is an attribute rather than a method, so it takes no parentheses:

data.shape

Another way to describe a row in a data set is an observation, and another way to describe a column is a variable. If you would like to see all the column names, you can use the df.columns attribute.
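The inspection tools above can be tried on a small table. This is only a sketch: the `data` frame here is a made-up four-row example, not the gapminder data.

```python
import pandas as pd

# Hypothetical toy data frame (values are made up).
data = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [70.0, 71.0, 60.0, 62.0],
})

n_rows, n_cols = data.shape        # attribute: (observations, variables)
summary = data.describe()          # method: statistics for numeric columns
column_names = list(data.columns)  # attribute: the variable names

print(n_rows, n_cols)  # 4 3
print(column_names)    # ['country', 'year', 'lifeExp']
```

Note that `describe()` skips the text column `country` by default, so `summary` has only the `year` and `lifeExp` columns.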

3

The ‘year’ variable has regular intervals: it increases by 5 years with each additional row for a country, ending at 2007. If I were to add new outcomes to the raw data in order to update it and make it more current, I would add 2012 and 2017 to keep with the regular intervals.

Stretch goal: There are 142 unique countries, and each country appears once for every year. So, if I were to add 2 years, I would need to add a row for every country in each of those years. 2 x 142 = 284. I would need to add 284 more observations.
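The same count can be worked out in code. The sketch below uses a made-up table with 3 countries and a 5-year step; with the 142 countries described above, the same arithmetic gives 142 x 2 = 284 new rows.

```python
import pandas as pd

# Hypothetical toy data: 3 countries observed in 2002 and 2007.
data = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [2002, 2007, 2002, 2007, 2002, 2007],
})

years = sorted(data["year"].unique())
step = int(years[1] - years[0])   # regular interval between years
last = int(years[-1])             # most recent year in the data

new_years = [last + step, last + 2 * step]
n_countries = data["country"].nunique()

# One new row per country for each new year.
rows_to_add = n_countries * len(new_years)

print(new_years)    # [2012, 2017]
print(rows_to_add)  # 6
```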

4

In 1992 Rwanda had the lowest life expectancy, at about 23 years. The life expectancy was so low in that year because from 1990 to 1994 Rwanda was experiencing a civil war in which the majority of casualties were civilians. This resulted in 500,000 to 800,000 deaths. With almost the entire population aged under 64, and half of it under 15, it is expected that in a civil war with such a high casualty rate much of the population did not live past 23.

5

country	year	popXgdp
Spain	2007	1.165760e+12
Italy	2007	1.661264e+12
France	2007	1.861228e+12
Germany	2007	1.165760e+12

6

& - AND operator. Evaluates to True only if both conditions specified are true.

ex) If you needed to select rows that match a combination of two columns in a dataframe, like a specific country for a given year, you could use the & operator.

| - OR operator. Evaluates to True if one or both conditions specified are true.

ex) If you wanted to find one of two countries in a dataframe, you could use the | operator.

== - Equality operator. It does not set a value (that is what a single = does); it evaluates whether the left side is equal to the right side. For example, if you wrote

4 == 2*2

it would evaluate to True because 2*2 equals 4.

^ - EXCLUSIVE OR operator. Evaluates to True only if exactly one of the two conditions specified is true.

ex) If you need to locate one country or another in a dataframe, but only would like one or the other, you could use the ^ operator.
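The operators above can be compared side by side on a small table. This is a sketch with made-up rows; the names `is_spain` and `is_2007` are illustrative.

```python
import pandas as pd

# Hypothetical toy data.
data = pd.DataFrame({
    "country": ["Spain", "Spain", "Italy", "France"],
    "year": [2002, 2007, 2007, 2002],
})

# == compares element by element and returns a True/False column.
is_spain = data["country"] == "Spain"
is_2007 = data["year"] == 2007

both = data[is_spain & is_2007]      # Spain AND 2007
either = data[is_spain | is_2007]    # Spain OR 2007 (or both)
only_one = data[is_spain ^ is_2007]  # exactly one condition is true

print(list(both.index))      # [1]
print(list(either.index))    # [0, 1, 2]
print(list(only_one.index))  # [0, 2]
```

When the conditions are written inline, each one needs its own parentheses, for example `data[(data["country"] == "Spain") & (data["year"] == 2007)]`, because `&`, `|` and `^` bind more tightly than `==`.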

7

.loc and .iloc are both indexers on dataframes that allow you to select rows and columns. .iloc is used for position-based indexing and .loc is used for label-based indexing.

If I wanted to extract a series of consecutive observations from position n through position k, I would use .iloc[n:k+1], because an .iloc slice stops just before its end position.

Stretch goal: to extract all observations from a series of consecutive columns from position n through position k, I would use .iloc[:, n:k+1]
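A short sketch of both indexers on a made-up table follows. The key point it illustrates is that an `.iloc` slice excludes its end position, while a `.loc` slice includes its end label. The labels `v` through `z` and the values of `n` and `k` are illustrative.

```python
import pandas as pd

# Hypothetical toy data with text labels as the row index.
data = pd.DataFrame(
    {"a": [10, 20, 30, 40, 50], "b": [1, 2, 3, 4, 5], "c": [5, 4, 3, 2, 1]},
    index=["v", "w", "x", "y", "z"],
)

n, k = 1, 3

# Position-based: the end of the slice is excluded, so use k + 1
# to include position k. This selects positions 1, 2 and 3.
rows_by_position = data.iloc[n:k + 1]

# Label-based: the end label is included.
rows_by_label = data.loc["w":"y"]

# All rows, columns at positions 0 and 1.
cols_by_position = data.iloc[:, 0:2]

print(list(rows_by_position.index))    # ['w', 'x', 'y']
print(list(cols_by_position.columns))  # ['a', 'b']
```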

8

API stands for Application Programming Interface. An API is the interface that consumers interact with when requesting access to a remote server. For example, when accessing COVID-19 tracking data from covidtracking.com, the API is the website’s interface, from which we can request data.

To construct a request to a remote server, pull the response, and write it to a local file:

import requests

r = requests.get(url)  # send a GET request to the server
with open(filename, 'wb') as f:  # 'wb' writes the raw bytes of the response
    f.write(r.content)

Where filename is a name you construct and url is the URL of the server you are pulling from.

9

The apply() function is a series method that sends each item of a series into a function (the argument of apply). Its purpose is to apply a function once over the whole series, rather than having to repeat the function call for every row of the series. Using apply() is an alternative approach to iterating over a series with a for loop. apply() could be a preferred approach to iterating over a series because it requires fewer lines of code and can be faster than explicit iteration.
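The difference between the two approaches can be sketched as follows. The series values and the helper `round_to_year` are made up for illustration.

```python
import pandas as pd

# Hypothetical series of values.
life_exp = pd.Series([23.6, 70.2, 80.9])

def round_to_year(value):
    """Round a single value to the nearest whole number."""
    return int(round(value))

# apply(): one call sends every item of the series through the function.
applied = life_exp.apply(round_to_year)

# Equivalent for loop: the function call is repeated item by item.
looped = []
for value in life_exp:
    looped.append(round_to_year(value))

print(applied.tolist())  # [24, 70, 81]
print(looped)            # [24, 70, 81]
```

Both produce the same values; `apply()` returns them as a new series that keeps the original index.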

To import a saved file into your current work session:

df = pd.read_csv(filename)

10

An alternative approach to filtering a data frame is subsetting with brackets ([]). You can filter or select data from a dataframe, and to assign the result to a new dataframe you would use the syntax:

new_df = old_df[old_df[column] == 'data']
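Bracket subsetting can select columns or rows depending on what goes inside the brackets. The sketch below shows both on a made-up table; the column names and values are illustrative.

```python
import pandas as pd

# Hypothetical toy data.
old_df = pd.DataFrame({
    "country": ["Spain", "Italy", "Spain"],
    "year": [2002, 2007, 2007],
    "pop": [1, 2, 3],
})

# A list of column names inside the brackets keeps only those columns.
two_columns = old_df[["country", "year"]]

# A True/False condition inside the brackets keeps only matching rows.
new_df = old_df[old_df["country"] == "Spain"]

print(list(two_columns.columns))  # ['country', 'year']
print(list(new_df.index))         # [0, 2]
```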
