{ "cells": [ { "cell_type": "markdown", "id": "175d63c4-931e-4133-b363-74c16c6b72d0", "metadata": {}, "source": [ "<span>Welcome to my code log! These are just some codes / packages that I find useful and often use alongside while performing data analysis.It makes it easier to copy, paste and conform to my code.</span>\n", "<span>Pay attention to the section headers, as Ill try to group codes the best that I can. </span>\n", "<span>This will be an ongoing project, as Ill add more codes / usages that I learn along the way! *V1.0 Jan 24 2024*</span>\n", "\n", "<span>Hoping this is a good resource, feel free to contact with critiques / suggestions at dgs-ds.com</span>" ] }, { "cell_type": "markdown", "id": "5d4ed0a1-969a-4395-bc38-4d327df3d051", "metadata": { "tags": [] }, "source": [ "# For Beginners" ] }, { "cell_type": "markdown", "id": "0dc2d7c1-acd0-47d6-bf81-cc4e210a4617", "metadata": {}, "source": [ "## <span style=\"background-color: lightblue\">Markdown</span>" ] }, { "cell_type": "markdown", "id": "3d0180ad-4f77-4372-9116-0b68bdeb7a0a", "metadata": {}, "source": [ "<span>Markdown codes are below, they help organize your notebook. Make sure to change the cell type to \"markdown\" and run the cell!</span><br>\n", "<span>The hashtag also allows you to annotate your code, whats after the # wont run in your cell</span>" ] }, { "cell_type": "raw", "id": "86e728f1-85e3-4918-971c-852d9b830d50", "metadata": {}, "source": [ "# Heading 1\n", "## Heading 2\n", "\n", "### Heading 3\n", "#### Heading 4\n", "\n", "OR \n", "\n", "<h1>Heading 1</h1>\n", "<h1>Heading 2</h1>" ] }, { "cell_type": "markdown", "id": "1d1ca213-933d-4a53-829d-71b4a0c6e3d7", "metadata": { "tags": [] }, "source": [ "#### <span style=\"background-color: lightblue\"> Formatting text in Markdown</span>" ] }, { "cell_type": "raw", "id": "93b3d345-2a10-4722-add2-29133412b31f", "metadata": {}, "source": [ "**Bold**\n", "*Italic*" ] }, { "cell_type": "markdown", "id": "505d247a-4423-448d-91bd-65784a1393a3", "metadata": { "tags": [] }, "source": [ "## <span style=\"background-color: lightblue\">Libraries</span>" ] }, { "cell_type": "markdown", "id": "a7fee2c7-a5a6-4d6f-8b2e-351b6a5ed728", "metadata": {}, "source": [ "<span>First we want to import libraries. By importing libraries into your python notebook. Metaphorically, your python notebook is like a canvas and libraries are paintbrushes.</span>\n", "\n", "<span>Typically, if this is your first python notebook you will get an error. You need to install the library that you are using in your computers bucket of paintbrushes</span><br><br>\n", "<span style=\"text-decoration: underline;\">To install it, type in !pip install (your library)</span><br><br>\n", " \n", "<span>To import a library, you follow the general format \"import x as y\" x is your library, and y is the code to dip your paintbrush in the paint</span>\n", "\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">pandas is used for alot of data manipulation</span>\n", "**<p>import pandas as pd</p>**\n", "\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">numpy does alot of numerical operations</span>\n", "**<p>import numpy as np</p>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">matplotlib.pyplot helps visualize stats</span>\n", "**<p>import matplotlib.pyplot as plt</p>**" ] }, { "cell_type": "markdown", "id": "661ed93e-2094-48d2-9987-a8ed75b6cba6", "metadata": {}, "source": [ "## <span style=\"background-color: lightblue\">Putting our Libraries to use</span>" ] }, { "cell_type": "markdown", "id": "77fb9c5d-c279-4c67-9555-a1c80937622f", "metadata": {}, "source": [ "<span>I used pandas and numpy to create a dataset to follow along with for beginners, the csv file in this section of the website.</span>\n", "<span>Lets put some paint on the pallete so we can start painting</span>\n", "\n", "**df = pd.read_csv(\"example_dataset.csv\")**\n", "\n", "<span>df is essentially the name of a color on our pallete. Its commonly used in much code, as it stands for \"Data Frame\". Its the dataset loaded into the python notebook and saved under df for future use.</span>\n", "<span>read_csv is an operation performed from the pandas library. We need to reference it when we use it. Kinda like calling its name</span>\n", "<span>example_dataset.csv is the name of our file, make sure it is in the same directory as your python notebook so python can find it.</span>\n" ] }, { "cell_type": "markdown", "id": "8b6a5415-dac5-48c0-a8c2-9200931f4844", "metadata": {}, "source": [ " <span style=\"text-decoration: underline;\">Now, lets use a function the numpy library to put our dataset to use. Theres 99 rows total, as .tail() looks at the last 5 automatically. Lets look at our csv first</span>\n", "**<p>df.tail()<p>**" ] }, { "cell_type": "markdown", "id": "5ca54e64-7b5f-4ae6-b07e-86555bcf514a", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Now, lets build on using our functions</span>\n", "\n", "**print(df['Salary'].mean())**\n", "\n", "<br><span style=\"text-decoration: underline; font-size: larger;\">Print tells the output </span>\n", "<p>df['Salary'] says \"look for the salary column in df, the \"\"example_dataset.csv\" file. We use brackets to specify columns in the csv file.</p>\n", "<p>.mean() says \"take the mean of what you found\"</p>\n", "<p>Make sure each \"(\" has a closing \")\"!</p>\n", "\n", "<br><span style=\"text-decoration: underline; font-size: larger;\">What if you want to store the mean and use it later?</span>\n", "\n", "**<p>mean_salary = df['Salary'].mean()</p>**" ] }, { "cell_type": "markdown", "id": "1a3b9495-1222-45a3-8ee8-1092ac2e5b4e", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "# Importing and Exporting Files" ] }, { "cell_type": "markdown", "id": "a53ed21c-8f27-4b25-8f17-fa590a21df75", "metadata": { "tags": [] }, "source": [ "## <span style=\"background-color: lightblue\"> .csv files</span>" ] }, { "cell_type": "markdown", "id": "82149646-0cf6-4050-b121-4e741730cbda", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Import</span>\n", "**<p>df = pd.read_csv(\"your_file.csv\")</p>**\n", "\n", "<br><span style=\"text-decoration: underline; font-size: larger;\">Export</span>\n", "**<p>df.to_csv(\"your_file.csv\")</p>**\n", " **<p>df.to_csv(\"your_file.csv\", index = False)</p>**\n", " <p>Index specifies if the index column is written or not in the csv file.</p>" ] }, { "cell_type": "markdown", "id": "98ee05ce-d507-47e3-9c9d-cafb121cc3ce", "metadata": { "tags": [] }, "source": [ "## <span style=\"background-color: lightblue\">.xlsx files</span>" ] }, { "cell_type": "markdown", "id": "aa89a26c-b862-41a0-bf85-d4d716883ef3", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Import</span>\n", "**<p>df = pd.ExcelFile(\"your_file.xlsx\")</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To read a single sheet from a .xlsx file with multiple sheets(After you already loaded in the entire file)</span>\n", "**<p>df = your_file.parse(your_file_sheetname)</p>**\n", "\n", "<br><span style=\"text-decoration: underline; font-size: larger;\">Export</span>\n", "**<p>df.to_excel(\"your_file.xlsx\")</p>**" ] }, { "cell_type": "markdown", "id": "813f2be8-a420-4673-a309-663e60296052", "metadata": { "tags": [] }, "source": [ "## <span style=\"background-color: lightblue\">.json files</span>" ] }, { "cell_type": "markdown", "id": "72b79c07-4206-4088-ace0-29dbd18b9b74", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Import</span>\n", "**<p>df = pd.read_json(\"your_file.json\")</p>**\n", "\n", "<br><span style=\"text-decoration: underline; font-size: larger;\">Export</span>\n", "**<p>df.to_json(\"your_file.json\")</p>**" ] }, { "cell_type": "markdown", "id": "097c4dee-10d6-40ef-a0a8-17bb8d8c9570", "metadata": { "tags": [] }, "source": [ "## <span style=\"background-color: lightblue\">Specifying Output Guidelines</span>" ] }, { "cell_type": "markdown", "id": "f03a5bbc-5067-4153-b9b3-afae18e7253e", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Ensure to run these after you load your file. It can make what is printed from your code organized.</span><br><br>\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">Sets the maximum number of rows</span><br>\n", "**pd.set_option('display.max_rows', 1000)**\n", "\n", "**pd.options.display.max_rows = 1000**\n", "\n", "<br><span style=\"text-decoration: underline; font-size: larger;\">Sets the maximum number of columns</span><br>\n", "**pd.set_option('display.max_columns', 1000)**\n", "\n", "**pd.set_option('display.width', 1000)**" ] }, { "cell_type": "markdown", "id": "daeb3327-d388-43dd-8d15-4b7875b05b46", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "# Brief dataframe inspections" ] }, { "cell_type": "markdown", "id": "3245ce68-524e-4d68-a340-f6e4cd6c5573", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Display the first x rows.</span>\n", "**<p>df.head(x)</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">Display the last x rows.</span>\n", "**<p>df.tail(x)</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To see the shape of your dataset (How many rows and how many columns)</span>\n", "**<p>df.shape()</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">General df information</span>\n", "**<p>df.info()</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">Statistical df information</span>\n", "**<p>df.describe()</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">Display column names</span>\n", "**<p>df.columns()</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">Check for missing values</span>\n", "**<p>df.isnull().sum()</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To look at all of the unique values in a column</span>\n", "**<p>df['column_name'].unique()</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To look at how many times a unique value in a column occurs</span>\n", "**<p>df['column_name'].value_counts()</p><br>**" ] }, { "cell_type": "markdown", "id": "5d6cb20e-1654-429d-98d6-0cc76e144c2b", "metadata": {}, "source": [ "<span style=\"text-decoration: underline;\">Ensure to run these after you load your file. It can make what is printed from your code organized.</span>" ] }, { "cell_type": "markdown", "id": "46806698-a4a6-4541-9bab-1fa747eba1cf", "metadata": { "tags": [] }, "source": [ "# Null Values" ] }, { "cell_type": "markdown", "id": "9e391f7c-0768-4339-87f4-d862c728bddb", "metadata": {}, "source": [ "## <span style=\"background-color: lightblue\">Displaying Null Values</span>" ] }, { "cell_type": "markdown", "id": "09858c33-9115-468b-95e5-afc594a140b6", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Check for null values in entire dataframe</span>\n", "**<p>df.isnull()</p>**<br>\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">Check for null values in a dataframe column</span>\n", "**<p>df[\"column_name\"].isnull()</p>**<br>\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To get a ratio of null values in each dataframe column</span>\n", "**<p>nulls = pd.Dataframe(df.isna().sum()</p>**\n", "**<p>nulls['Ratio'] = nulls / df.shape[0]</p>**\n", "**<p>print(nulls)</p><br>**" ] }, { "cell_type": "markdown", "id": "e5e289ef-4978-4c9b-bfde-634629d75aa7", "metadata": {}, "source": [ "## <span style=\"background-color: lightblue\">Replacing Null Values</span>" ] }, { "cell_type": "markdown", "id": "e6ae0cff-d3ce-4a91-a1ad-031a7342227c", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">To replace null values in a column with a single value</span>\n", "**<p>df['Column_Name'].fillna(\"None\", inplace = True)</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To replace with a constant value</span>\n", "**<p>df_filled_constant = df.fillna(0)</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To replace null values with \"median\", \"Mean\" or \"Mode\" </span>\n", "**<p>df_filled_mean = df.fillna(df.mean())</p><br>**\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">To interpolate / estimate null values based on existing data</span>\n", "**<p>df_estimated = df.interpolate()</p><br>**" ] }, { "cell_type": "markdown", "id": "21a57f68-4ea4-490d-8b9f-1ea0959b1a8e", "metadata": {}, "source": [ "# Dataframe Modifications" ] }, { "cell_type": "markdown", "id": "5924cb39-5f70-4b1f-81ce-6303c394456e", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\">Joining 2 columns into one with a space character</span>\n", "\n", "<p>df['New_Column_Name'] = df['Column1'] + '_' + df['Column2']</p><br>\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\">Standardizing the values</span>\n", "\n", "<p> df['Standardized_New_Column'] = df['New_Column_Name'].apply(lambda x: ''.join(sorted(x))) </p>" ] }, { "cell_type": "markdown", "id": "06635c7f-0ecc-458f-8ece-fefba958b98e", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\"> Creating a new column based on text content </span>\n", "\n", "<p>df['New_Column_Name'] = df['Column'].apply(lambda x: 'Yes\" if 'text' in x else 'no')</p><br>" ] }, { "cell_type": "markdown", "id": "a80af635-8138-4d18-8494-08d4eb14893b", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\"> Filtering rows based on conditions </span>\n", "\n", "<p> OR operator | AND operator & </p>\n", "\n", "<p>filtered_df = df[(df['Column1'] > 30) & (df['Column2'] > 4 </p>\n", " \n", "<span style=\"text-decoration: underline; font-size: larger;\"> Filtering rows using a query string </span> \n", " \n", "<p> df_query = ('Column1 > 30 & Column2 > 4') </p><br> " ] }, { "cell_type": "markdown", "id": "7123566e-e32a-498f-9b1b-c4ee9fa29289", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\"> Grouping / Aggregating Data </span> \n", " \n", "<p> avg_x_by_y = df.groupby('ColumnX')['ColumnY'].mean() </p><br>" ] }, { "cell_type": "markdown", "id": "fe75edff-c9f8-496c-a6a0-8506d557eae4", "metadata": {}, "source": [ "<span style=\"text-decoration: underline; font-size: larger;\"> Value Counts </span> \n", "df['Column'].value_counts()\n", "\n", "<span style=\"text-decoration: underline; font-size: larger;\"> To save value counts to a variable </span>\n", "<p> value_counts = df.grouby('Column').size() </p><br>" ] }, { "cell_type": "markdown", "id": "e630570a-c1fd-46f6-ba87-ee6f4710c57b", "metadata": {}, "source": [ "# Plotting Data" ] }, { "cell_type": "code", "execution_count": null, "id": "5f2b5845-eacf-4132-96a3-fab4b850fac9", "metadata": {}, "outputs": [], "source": [ "# Specify figure dimensions\n", "plt.figure(figsize=(10,6))\n", " \n", "# Specify data and type of plot | line, bar, barh, hist, box, area | Specify fig size | Specify colors\n", "colors = ['red', 'green', 'blue']\n", "\n", "data.plot(kind = 'bar', figsize = (10,6), colors = colors)\n", "\n", "# Figure Title\n", "plt.title('Title')\n", " \n", "# X and Y Labels\n", "plt.xlabel('Label1')\n", "plt.ylabel('Label2')\n", " \n", "# Rotate Tick Labels\n", "plt.xticks(rotation = 45)\n", "\n", "# To label bars with values\n", "for index, value in enumerate(data)\n", " plt.text(index, value, str(value), ha = center, va = bottom)\n", " \n", "# Print figure\n", "plt.show" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }