Python code (please see the attachment below)
{
“cells”: [
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“# Python Basics (Instructor: Dr. Milad Baghersad)\n”,
“\n”,
“## Module 8: Data Analysis with Python Part 3\n”,
“\n”,
“- Reference: McKinney, Wes (2018) Python for data analysis: Data wrangling with Pandas, NumPy, and IPython, Second Edition, O’Reilly Media, Inc. ISBN-13: 978-1491957660 ISBN-10: 1491957662”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“---\n”,
“### Chapter 14: Data Analysis Examples (five examples)\n”,
“Here we review one of them: example 3. ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“## US Baby Names 1880–2010\n”,
“- The United States Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through the present. \n”,
“- SSA makes available data files, one per year, containing the total number of births for each sex/name combination.\n”,
“- The raw archive of these files can be obtained from http://www.ssa.gov/oact/babynames/limits.html.\n”,
“\n”,
“There are many things you might want to do with the dataset:\n”,
“- Visualize the proportion of babies given a particular name (your own, or another name) over time\n”,
“- Determine the relative rank of a name\n”,
“- Determine the most popular names in each year or the names whose popularity has advanced or declined the most\n”,
“- Analyze trends in names: vowels, consonants, length, overall diversity, changes in spelling, first and last letters\n”,
“- Analyze external sources of trends: biblical names, celebrities, demographic changes”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“import pandas as pd\n”,
“import numpy as np\n”,
“import matplotlib.pyplot as plt”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Read the 1880 file\n”,
“names1880 = pd.read_csv('US Names Dataset- Module 8/yob1880.txt', names=['name', 'sex', 'births'])”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“names1880.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“names1880.info()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“#Calculate the total number of births in that year:\n”,
“names1880.births.sum()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Total number of births in that year, split by sex:\n”,
“names1880.groupby('sex').births.sum()”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Use a for loop to read all text files in the dataset folder”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“years = np.arange(1880, 2011)\n”,
“years”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# an empty list\n”,
“pieces = []”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“for year in years:\n”,
“    path = f'US Names Dataset- Module 8/yob{year}.txt'\n”,
“\n”,
“    # Read the text file for this year\n”,
“    frame = pd.read_csv(path, names=['name', 'sex', 'births'])\n”,
“\n”,
“    # Add a column recording the year\n”,
“    frame['year'] = year\n”,
“\n”,
“    # Append the DataFrame to the list\n”,
“    pieces.append(frame)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“print(pieces)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“len(pieces)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“pieces[0]”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“pieces[1]”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Concatenate everything into a single DataFrame\n”,
“names = pd.concat(pieces, ignore_index=True)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“names.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“names.tail()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“names.info()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Save it as a CSV file\n”,
“names.to_csv('names.csv')”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Calculate total births by sex and year\n”,
“###### Use the pivot_table method:\n”,
“DataFrame has a pivot_table method. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.\n”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“total_births = names.pivot_table('births', index='year', columns='sex', aggfunc='sum')”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“total_births.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“total_births.plot()\n”,
“plt.title('Total births by sex and year')”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Next, insert a column prop with the fraction of babies given each name relative to the total number of births in each year and sex.\n”,
“A prop value of 0.02 indicates that 2 out of every 100 babies of that sex were given a particular name in that year.”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“def add_prop(group):\n”,
“    group['prop'] = group.births / group.births.sum()\n”,
“    return group\n”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
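“# group_keys=False keeps the original row index instead of adding year/sex index levels\n”,
“grouped = names.groupby(['year', 'sex'], group_keys=False)\n”,
“names = grouped.apply(add_prop)”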
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“names”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Sanity check: prop should sum to 1 within every year/sex group\n”,
“names.groupby(['year', 'sex']).prop.sum()”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Extract a subset of the data to facilitate further analysis: \n”,
“- top 1,000 names for each sex/year combination. ”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“def get_top1000(group):\n”,
“    return group.sort_values(by='births', ascending=False)[:1000]”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“grouped = names.groupby(['year', 'sex'])\n”,
“top1000 = grouped.apply(get_top1000)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“top1000”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Drop the group index, not needed\n”,
“top1000.reset_index(inplace=True, drop=True)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“top1000”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Analyzing Naming Trends\n”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Split the top 1,000 names into boy and girl portions:\n”,
“boys = top1000[top1000.sex == 'M']\n”,
“girls = top1000[top1000.sex == 'F']”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“boys.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Form a pivot table of the total number of births by year and name:\n”,
“total_births = top1000.pivot_table('births', index='year', columns='name', aggfunc='sum')”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“total_births”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“total_births.info()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“subset = total_births[['John', 'Harry', 'Mary', 'Marilyn']]”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“subset”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“subset.plot(subplots=True, figsize=(12, 10), grid=False, title='Number of births per year')”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Note: On looking at this, you might conclude that these names have grown out of favor with the American population. But the story is actually more complicated than that, as will be explored in the next section.”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“### Measuring the increase in naming diversity\n”,
“One explanation for the declines in the plots above is that fewer parents are choosing common names for their children. \n”,
“\n”,
“This hypothesis can be explored and confirmed in the data.”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# One measure is the proportion of births represented by the top 1,000 most popular names:\n”,
“table = top1000.pivot_table('prop', index='year', columns='sex', aggfunc='sum')”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“table.plot(title='Sum of top1000.prop by year and sex', yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“You can see that, indeed, there appears to be increasing name diversity (decreasing total proportion in the top 1,000).\n”,
“\n”,
“___”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“- Another interesting metric is the number of distinct names, taken in order of popularity from highest to lowest, in the top 50% of births. \n”,
“\n”,
“This number is a bit more tricky to compute. Let’s consider just the boy names from 2010:”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“df = boys[boys.year == 2010]”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“df”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“After sorting prop in descending order, we want to know how many of the most popular names it takes to reach 50%.”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“df_sorted = df.sort_values(by='prop', ascending=False)\n”,
“df_sorted.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“prop_cumsum = df_sorted.prop.cumsum()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“prop_cumsum[:10]”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Calling the method searchsorted returns the position in the cumulative sum at which 0.5 would need to be inserted to keep it in sorted order:”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“prop_cumsum.values.searchsorted(0.5)”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“Since arrays are zero-indexed, adding 1 to this result gives you a result of 117.”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# By contrast, in 1880 this number was much smaller:\n”,
“df = boys[boys.year == 1880]\n”,
“in1880 = df.sort_values(by='prop', ascending=False).prop.cumsum()\n”,
“in1880.values.searchsorted(0.5) + 1”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“___\n”,
“You can now apply this operation to each year/sex combination: group by those fields, and apply a function returning the count for each group:”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“def get_quantile_count(group, q=0.5):\n”,
“    group = group.sort_values(by='prop', ascending=False)\n”,
“    return group.prop.cumsum().values.searchsorted(q) + 1\n”,
“\n”,
“diversity = top1000.groupby(['year', 'sex']).apply(get_quantile_count)\n”,
“diversity = diversity.unstack('sex')”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“diversity.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“diversity.plot(title='Number of popular names in top 50%')”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“As you can see, girl names have always been more diverse than boy names, and they have only become more so over time. ”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“___\n”,
“#### The “last letter” revolution\n”,
“In 2007, baby name researcher Laura Wattenberg pointed out on her website (http://www.babynamewizard.com/) that the\n”,
“distribution of boy names by final letter has changed significantly over the last 100\n”,
“years. \n”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“names.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# extract last letter from name column\n”,
“last_letters = names['name'].str[-1]”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“last_letters.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“last_letters.name = 'last_letter'”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“last_letters”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“table = names.pivot_table('births', index=last_letters, columns=['sex', 'year'], aggfunc='sum')”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“table”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Select three representative years spanning the history:\n”,
“subtable = table.reindex(columns=[1910, 1960, 2010], level='year')”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“subtable.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“#Next, normalize the table by total births \n”,
“#to compute a new table containing proportion of total births for each sex ending in each letter:\n”,
“subtable.sum()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“letter_prop = subtable / subtable.sum()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“letter_prop”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“fig, axes = plt.subplots(2, 1, figsize=(10, 8))\n”,
“letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')\n”,
“letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“fig, axes = plt.subplots(2, 1, figsize=(10, 8))\n”,
“letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')\n”,
“letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female')\n”,
“plt.tight_layout()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“letter_prop = table / table.sum()\n”,
“dny = letter_prop.loc[['d', 'n', 'y'], 'M']\n”,
“dny.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“dny_ts = dny.T\n”,
“dny_ts”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“dny_ts.plot()”
]
},
{
“cell_type”: “markdown”,
“metadata”: {},
“source”: [
“___\n”,
“\n”,
“#### Boy names that became girl names (and vice versa)\n”,
“Another fun trend is names that were more popular with one sex earlier in the sample but have since become more popular with the other. \n”,
“- One example is the name Lesley or Leslie.”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Compute the names in the dataset that contain 'lesl' (case-insensitive):\n”,
“all_names = pd.Series(top1000.name.unique())”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“lesley_like = all_names[all_names.str.lower().str.contains('lesl')]”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“lesley_like”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“#we can filter down to just those names and sum births grouped by name\n”,
“#to see the relative frequencies:\n”,
“\n”,
“filtered = top1000[top1000.name.isin(lesley_like)]\n”,
“filtered.groupby('name').births.sum()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“# Next, aggregate by sex and year, then normalize within each year:\n”,
“\n”,
“table = filtered.pivot_table('births', index='year', columns='sex', aggfunc='sum')\n”,
“table = table.div(table.sum(axis=1), axis=0)\n”,
“table.tail()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“table.head()”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: [
“table.plot(style={'M': 'k-', 'F': 'k--'})”
]
},
{
“cell_type”: “code”,
“execution_count”: null,
“metadata”: {},
“outputs”: [],
“source”: []
}
],
“metadata”: {
“kernelspec”: {
“display_name”: “Python 3”,
“language”: “python”,
“name”: “python3”
},
“language_info”: {
“codemirror_mode”: {
“name”: “ipython”,
“version”: 3
},
“file_extension”: “.py”,
“mimetype”: “text/x-python”,
“name”: “python”,
“nbconvert_exporter”: “python”,
“pygments_lexer”: “ipython3”,
“version”: “3.7.3”
}
},
“nbformat”: 4,
“nbformat_minor”: 2
}
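
For anyone who wants to try the two central steps of the notebook without downloading the SSA files, here is a minimal sketch on a tiny made-up sample. The six rows and their birth counts are invented for illustration; only the column layout (name, sex, births, year) matches the notebook. It computes prop with groupby(...).transform('sum') instead of the notebook's add_prop function, which is a swapped-in alternative and not the notebook's own approach, and then counts how many top names are needed to reach 50% of births using cumsum and searchsorted as the notebook does.

```python
import pandas as pd

# Made-up sample in the same layout as the notebook's `names` frame.
names = pd.DataFrame({
    "name":   ["Mary", "Anna", "Emma", "John", "William", "James"],
    "sex":    ["F", "F", "F", "M", "M", "M"],
    "births": [60, 30, 10, 40, 35, 25],
    "year":   [1880] * 6,
})

# Fraction of births per name within each year/sex group.
# transform("sum") returns the group total aligned to every original row.
group_totals = names.groupby(["year", "sex"])["births"].transform("sum")
names["prop"] = names["births"] / group_totals


def get_quantile_count(prop, q=0.5):
    """Number of most popular names needed to cover a share q of births."""
    cumulative = prop.sort_values(ascending=False).cumsum()
    # searchsorted gives a zero-based position, so add 1 to get a count.
    return int(cumulative.to_numpy().searchsorted(q)) + 1


diversity = (
    names.groupby(["year", "sex"])["prop"]
    .apply(get_quantile_count)
    .unstack("sex")
)
print(diversity)
```

In this sample one girl name already covers half of the female births (60 of 100), while the boys need two names (40 + 35 of 100), so the F column holds 1 and the M column holds 2.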