{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "196a647a-6faa-4aee-a0bf-a345852251dd",
   "metadata": {},
   "source": [
    "## 深入浅出pandas\n",
    "\n",
    "pandas是一个支持数据分析全流程的Python开源库，它的作者Wes McKinney于2008年开始开发这个库，其主要目标是提供一个大数据分析和处理的工具。pandas封装了从数据加载、数据重塑、数据清洗到数据透视、数据呈现等一系列操作，提供了三种核心的数据类型：\n",
    "1. `Series`：数据系列，表示一维的数据。跟一维数组的区别在于每条数据都有对应的索引，处理数据的方法比`ndarray`更为丰富。\n",
    "2. `DataFrame`：数据框、数据窗、数据表，表示二维的数据。跟二维数组相比，`DataFrame`有行索引和列索引，而且提供了100+方法来处理数据。\n",
    "3. `Index`：为`Series`和`DataFrame`提供索引服务。"
   ]
  },
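  {
   "cell_type": "markdown",
   "id": "added-example-core-types",
   "metadata": {},
   "source": [
    "As a minimal sketch of the three core types described above, the following builds a small `Series` and `DataFrame` from made-up values. The names `ser` and `df` and all numbers are illustrative assumptions, not data from this notebook.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical values, for illustration only\n",
    "ser = pd.Series([95, 91, 69], index=['a', 'b', 'c'])\n",
    "df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])\n",
    "\n",
    "# A Series is one-dimensional, with a label for each value\n",
    "print(ser['b'])\n",
    "# A DataFrame is two-dimensional, with a row index and a column index\n",
    "print(df.loc['b', 'y'])\n",
    "# Both kinds of labels are stored in Index objects\n",
    "print(isinstance(ser.index, pd.Index), isinstance(df.columns, pd.Index))\n",
    "```\n",
    "\n",
    "Label-based access like this is the main practical difference from a plain `ndarray`, which can only be addressed by position."
   ]
  },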
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb84f909-921a-47da-87b1-61578c871422",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.rcParams['font.sans-serif'].insert(0, 'SimHei')\n",
    "plt.rcParams['axes.unicode_minus'] = False\n",
    "get_ipython().run_line_magic('config', \"InlineBackend.figure_format = 'svg'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2102e83e-2a6d-47aa-b449-c058bea1a601",
   "metadata": {},
   "source": [
    "### 创建DataFrame对象"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87dbde08-dcab-4ede-a791-b56e11dd9115",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c5b2767-2074-4cdf-b1ba-beff6f425942",
   "metadata": {},
   "outputs": [],
   "source": [
    "stu_names = ['狄仁杰', '白起', '李元芳', '苏妲己', '孙尚香']\n",
    "cou_names = ['语文', '数学', '英语']\n",
    "scores_arr = np.random.randint(60, 101, (5, 3))\n",
    "scores_arr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8c2a6bf-ca5e-479d-ab63-f5c3620186e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方法一：通过二维数组构造DataFrame对象\n",
    "df1 = pd.DataFrame(data=scores_arr, columns=cou_names, index=stu_names)\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "baad5381-fb7d-4cc9-9288-a05d750144af",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 行索引\n",
    "df1.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7f06b76-b60b-49cb-be72-adafb0978fca",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 列索引\n",
    "df1.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13b1275d-77e5-4d5d-b227-19db3f4196fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 值 - 二维数组\n",
    "df1.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbf5bb11-1600-4ae4-bc95-369bc8189c20",
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_dict = {\n",
    "    '语文': [95, 91, 69, 82, 92],\n",
    "    '数学': [86, 88, 80, 67, 100],\n",
    "    '英语': [75, 86, 71, 94, 81]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c300bbbd-329a-4852-bf76-78ce1de02b8f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方法二：通过数据字典构造DataFrame对象\n",
    "df2 = pd.DataFrame(data=scores_dict, index=stu_names)\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "705c0de6-43ff-46c6-85d5-301743d18d43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 查看DataFrame信息\n",
    "df2.info(memory_usage='deep')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71417ac2-8f4b-4950-9336-de6fbc1f5da4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方法三：从CSV文件加载数据创建DataFrame对象\n",
    "df3 = pd.read_csv(\n",
    "    'res/2023年北京积分落户数据.csv',\n",
    "    # encoding='utf-8',  # 指定字符编码\n",
    "    # sep='',  # 指定字段的分隔符（默认逗号）\n",
    "    # delimiter='#',\n",
    "    # header=0,  # 表头所在的行\n",
    "    # quotechar='\"',  # 包裹字符串的字符（默认双引号）\n",
    "    # index_col='公示编号',  # 索引列\n",
    "    # usecols=['公示编号', '姓名', '积分分值'],  # 指定加载的列\n",
    "    # nrows=10,  # 加载的行数\n",
    "    # skiprows=np.arange(1, 101),  # 跳过哪些行\n",
    "    # true_values=['是', 'Yes', 'YES'],  # 哪些值会被视为布尔值True\n",
    "    # false_values=['否', 'No', 'NO'],  # 哪些值会被视为布尔值False\n",
    "    # na_values=['---', 'N/A'],  # 哪些值会被视为空值\n",
    "    # iterator=True,  # 开启迭代器模式\n",
    "    # chunksize=1000,  # 每次加载的数据体量\n",
    ")\n",
    "df3"
   ]
  },
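  {
   "cell_type": "markdown",
   "id": "added-example-chunksize",
   "metadata": {},
   "source": [
    "The commented-out `chunksize` option above changes what `read_csv` returns. A minimal sketch, assuming a small in-memory CSV (the `csv_text` content is made up so the example does not depend on any file): with `chunksize` set, `read_csv` yields DataFrames of at most that many rows, which can be processed one at a time.\n",
    "\n",
    "```python\n",
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical CSV content: a header row plus five data rows\n",
    "csv_text = 'id,score\\n1,90\\n2,85\\n3,70\\n4,60\\n5,95\\n'\n",
    "\n",
    "total = 0\n",
    "sizes = []\n",
    "# With chunksize set, read_csv returns an iterator of DataFrames\n",
    "for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):\n",
    "    sizes.append(len(chunk))\n",
    "    total += chunk['score'].sum()\n",
    "\n",
    "print(sizes, total)\n",
    "```\n",
    "\n",
    "Processing chunk by chunk keeps only one piece of the file in memory at a time, which may help when a file is too large to load in one go."
   ]
  },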
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67b86b13-566b-4f97-86fd-3723ef21a87f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df4 = pd.read_csv('res/big_data_file.csv.gz', low_memory=False)\n",
    "df4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e52ff38d-8e40-4532-8df9-4d2807a3e2ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "df4.info(memory_usage='deep')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65e871ff-e87c-4e6b-86cc-624af7ccbdc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %pip install pyarrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48fab84a-8b86-4405-966a-6bfb99582de5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df5 = pd.read_csv('res/big_data_file.csv.gz', engine='pyarrow')\n",
    "df5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea575de9-4398-46fe-b2f8-8fb37b93179b",
   "metadata": {},
   "outputs": [],
   "source": [
    "df5.info(memory_usage='deep')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5723cf03-b78f-4fc9-943c-f9b10036affa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %pip install xlrd xlwt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb3387b9-3402-4b25-a5d5-ff9690a1ac06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方法四：从Excel文件加载数据创建DataFrame对象\n",
    "df6 = pd.read_excel(\n",
    "    'res/2020年销售数据.xlsx',\n",
    "    sheet_name='data',\n",
    ")\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d06abbd8-9a34-4ab3-a75c-76e3ed8eb36c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %pip install -U pymysql cryptography sqlalchemy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5aa0e35f-2a13-4c8e-a9fd-87b0bf72307e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方法五：从数据服务器加载数据创建DataFrame对象\n",
    "from sqlalchemy import create_engine\n",
    "\n",
    "# URL \n",
    "engine = create_engine('mysql+pymysql://guest:Guest.618@47.109.26.237:3306/hrs')\n",
    "engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b344f17-f5a1-4d7d-ad3c-ede4b122609c",
   "metadata": {},
   "outputs": [],
   "source": [
    "dept_df = pd.read_sql('tb_dept', engine, index_col='dno')\n",
    "dept_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5d1ffa3-6962-4c26-ae92-a8d7bc7da0cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "emp_df1 = pd.read_sql('tb_emp', engine, index_col='eno')\n",
    "emp_df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f84b6886-09d8-4f13-89cc-487574991dba",
   "metadata": {},
   "outputs": [],
   "source": [
    "emp_df2 = pd.read_sql('tb_emp2', engine, index_col='eno')\n",
    "emp_df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c60e96d2-9a0d-4901-b39c-c31760de47a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 关闭连接释放资源\n",
    "engine.connect().close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12086a7a-c161-4753-9a8e-180f9e8b2edf",
   "metadata": {},
   "source": [
    "### 查看信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "785e58f9-b3f7-49a6-affc-8caaa66cebf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd8a9156-3939-430d-9738-60b3d8a95563",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取前N行\n",
    "df6.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b75ace23-9b92-4425-b58f-bcd81e8d72e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 获取后N行\n",
    "df6.tail(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2b2a909-0b40-473c-bb3f-85aca1925a19",
   "metadata": {},
   "source": [
    "### 操作行、列、单元格"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe964b3b-7f51-4202-b528-f5102d9be9f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 访问列\n",
    "df6['销售日期']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2e5ccb3-4b97-4a02-8316-b1321390f286",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.销售渠道"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80ad78dc-4f47-4421-8478-ba7797350db4",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6['销售渠道']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b970671-6f16-4e07-8666-715495de2832",
   "metadata": {},
   "outputs": [],
   "source": [
    "type(df6['销售日期'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c9cb56b-6a2b-479e-8c57-c61683858387",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6[['销售渠道']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75730cd3-0459-4a62-97ee-e037256cc98a",
   "metadata": {},
   "outputs": [],
   "source": [
    "type(df6[['销售渠道']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e097e49-b762-4c9f-9d93-98abb1701d97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 访问多个列 - 花式索引\n",
    "df6[['销售日期', '销售区域', '直接成本']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf31a169-549e-4182-8206-789f97316115",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.columns[3:7]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "792713c0-13bc-4810-86cc-5f6f6ce78719",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6[df6.columns[3:7]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02d43b17-15e3-44d5-844b-a50d365bf863",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 访问行 - loc属性\n",
    "df6.loc[1944]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79da6932-f985-44dc-9f4b-e051e4749c65",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.iloc[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6246b39b-7229-4e0f-af7b-0915e707492a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 访问多行 - 花式索引\n",
    "df6.loc[[0, 100, 58, 1000, 1000, 1000, 1099]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77321324-0ca9-4c2e-a792-3c717189cb27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 访问多行 - 切片索引\n",
    "df6.loc[101:200]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5eb250eb-18e0-4181-a37a-dec55c633116",
   "metadata": {},
   "outputs": [],
   "source": [
    "# df6[101:200]\n",
    "df6.iloc[101:200]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2daddd7-3635-40b1-9416-c1137315948c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.iloc[-1:-101:-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9321811f-e62b-4db5-a478-cdc0934f097b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 访问单元格\n",
    "df6.at[2, '售价']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd1670bc-0a13-457f-95f1-352a4d61b3a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.at[2, '售价'] = 999\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7460ef03-3f45-4cc0-99a3-85039c2606b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.iat[2, -3] = 888\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34c81da6-f58f-4c36-8596-004266e9374b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 添加列\n",
    "df6['销售额'] = df6['售价'] * df6['销售数量']\n",
    "df6['季度'] = df6['销售日期'].dt.quarter\n",
    "df6['月份'] = df6['销售日期'].dt.month\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3c60210-202d-4bd8-8804-1d657746b29c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 添加行 - 实际工作中基本没有意义"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bf78f3d-05a2-4c7a-a0f0-fb6659f1bd6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 删除列\n",
    "# inplace=False - 默认设定 - 不修改原对象返回修改后的新对象\n",
    "# inplace=True - 直接修改DataFrame对象不返回新对象 - 方法没有返回值\n",
    "df6.drop(columns=['季度'], inplace=True)\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdf8cf10-5193-4c38-8fef-bc3d38a8a0a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 删除行\n",
    "# df6.drop(index=[0, 1, 2, 100, 1944, 1943])\n",
    "df6.drop(index=[0, 1, 2, 100, 1944, 1943], inplace=True)\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ddfe77d-aa92-4d6a-b2db-8469b1222ed3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.drop(index=df6.index[100:200], inplace=True)\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8020bbb0-740e-496a-9224-fe3495a19c92",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 重命名\n",
    "df6.rename(columns={'销售区域': '区域', '销售渠道': '渠道', '销售订单': '订单号'}, inplace=True)\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d028d2be-0944-4b70-a3ea-f7d06cdd458f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 重置索引\n",
    "# drop=False - 默认值 - 原来的索引变成一个普通列\n",
    "# drop=True - 原来的索引直接丢弃\n",
    "df6.reset_index(drop=True, inplace=True)\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb55a518-f4bd-4fac-8554-4353c0798bc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 设置索引\n",
    "df6.set_index('订单号', inplace=True)\n",
    "df6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "101bd804-5a90-4cd3-a545-613df6d9b8e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 筛选数据 - 布尔索引\n",
    "df6[df6['销售额'] > 100000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64c83a43-fcb0-4ba1-9400-ae4a5b21715c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6[(df6['销售额'] > 100000) & (df6['月份'] == 6)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22c01e56-b188-40f7-9e53-3a3d2f0bcb29",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6[(df6['销售额'] > 100000) | (df6['月份'] == 6)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5adb86b9-8b31-49cb-9292-94189f3714c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.query('销售额 > 100000')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b768afa0-7066-4a1d-8f10-b88386587388",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.query('月份 == 6 and 渠道 == \"实体\"')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e57b21c-0565-4352-8924-de169497bce0",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.query('销售额 > 100000 and 月份 == 6')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ef8ba56-5293-41b0-8208-85a0eed735e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 随机抽样\n",
    "df6.sample(n=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfcd52d7-eac4-4776-b0e3-a37e67e349f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df6.sample(frac=0.05)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c654ca8-3179-4fa2-9213-7d7029357342",
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace=False - 无放回抽样\n",
    "ignore_rows = np.random.choice(np.arange(1, 1946), size=int(1945 * 0.9), replace=False)\n",
    "pd.read_excel(\n",
    "    'res/2020年销售数据.xlsx',\n",
    "    sheet_name='data',\n",
    "    skiprows=ignore_rows\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2037ed6a-d616-4c67-9f5d-ea517d6e1c6b",
   "metadata": {},
   "source": [
    "### 数据重塑\n",
    "\n",
    "1. 拼接（合并结构一致的数据）\n",
    "2. 合并（事实表连接维度表）"
   ]
  },
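  {
   "cell_type": "markdown",
   "id": "added-example-concat-merge",
   "metadata": {},
   "source": [
    "The difference between the two operations listed above can be sketched on tiny made-up tables. The names `emp_a`, `emp_b`, and `dept` and their contents are illustrative assumptions, not the tables loaded from the database in this notebook.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical tables, for illustration only\n",
    "emp_a = pd.DataFrame({'ename': ['Ann', 'Bob'], 'dno': [10, 20]})\n",
    "emp_b = pd.DataFrame({'ename': ['Cid'], 'dno': [30]})\n",
    "dept = pd.DataFrame({'dno': [10, 20], 'dname': ['Sales', 'R&D']})\n",
    "\n",
    "# Concatenation stacks tables that share the same columns\n",
    "emps = pd.concat([emp_a, emp_b], ignore_index=True)\n",
    "# Merging matches rows on a key; an inner join drops keys without a match\n",
    "joined = pd.merge(emps, dept, how='inner', on='dno')\n",
    "# A left join keeps every row of the left table instead\n",
    "left_joined = pd.merge(emps, dept, how='left', on='dno')\n",
    "\n",
    "print(emps.shape, joined.shape, left_joined.shape)\n",
    "```\n",
    "\n",
    "In this sketch the employee in department 30 disappears from the inner join because no such department exists, while the left join keeps that row with a missing `dname`."
   ]
  },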
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2184fd4-bd44-459f-bda4-6dc11c09c219",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 拼接两个DataFrame - union\n",
    "all_emp_df = pd.concat([emp_df1, emp_df2])\n",
    "all_emp_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05bc65a1-42ac-463c-a089-08fb8dc60855",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 连表 - 连接事实表和维度表 - 用维度把数据分组然后再做聚合\n",
    "# 连接两个DataFrame（内连接、左外连接、右外连接、全外连接）- join\n",
    "# how - 连表方式 - inner、left、right、outer\n",
    "# on - 基于哪个字段连表 - left_on、right_on\n",
    "all_emp_df = pd.merge(all_emp_df, dept_df, how='inner', on='dno')\n",
    "all_emp_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6a3d52d-a04c-494d-9ee9-2dad9805b1c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 作业：在jobs目录下有若干个CVS文件，它们的数据结构是一样的，现在需要把所有CSV文件的数据拼接到一个DataFrame中\n",
    "import os\n",
    "\n",
    "dfs = [pd.read_csv(os.path.join('res/jobs', filename))\n",
    "       for filename in os.listdir('res/jobs') \n",
    "       if filename.endswith('.csv')]\n",
    "pd.concat(dfs, ignore_index=True).to_csv('res/all_jobs.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b9ad1e1-fe5d-45a0-8755-ac6720a32ba0",
   "metadata": {},
   "source": [
    "### 数据清洗\n",
    "\n",
    "1. 缺失值\n",
    "2. 重复值\n",
    "3. 异常值\n",
    "4. 预处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45c835c4-559f-45f1-a501-70a8c12bbbb1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 甄别缺失值\n",
    "all_emp_df.isna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd7fbdf8-ebf2-463b-ac3b-cdb24560873a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# all_emp_df['comm'].isna()\n",
    "all_emp_df['comm'].isnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4f16d30-83e9-4761-92a1-780e85e721e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# all_emp_df['comm'].notna()\n",
    "all_emp_df['comm'].notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f2a153d-ab4a-475e-9ee3-0d623a289f7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df['comm'].notna().value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d388d57-fa1a-405b-880e-9316354a6f05",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 删除空值 - 删除带有空值的行\n",
    "all_emp_df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b40fa037-3fab-454e-a300-2e9dcf4b2b60",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df.dropna(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67ae21a1-7dc1-496b-85b5-013d79d25a63",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df.mgr.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66745379-9db7-42b0-ab6b-a55e870a515b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 填充空值\n",
    "all_emp_df.fillna(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1664115-30e0-4946-ae4b-c919bb319ddc",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df.comm.fillna(0).astype('i8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1743531-66a2-42a4-8c28-ad268efc848c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将空值下方的非空值向上填充 - backward fill\n",
    "all_emp_df.comm.bfill()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fcef0a0-ff29-42bd-9955-5a97595390fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将空值上方的非空值向下填充 - forward fill\n",
    "all_emp_df.comm.ffill()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eeeb9be3-802c-44e3-80a0-465aba1a485a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 通过插值算法填充空值 - interpolate\n",
    "all_emp_df['comm'] = all_emp_df.comm.interpolate(method='linear')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1f094c3-1cc2-4826-a04a-24150ea9cef8",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df['comm'] = all_emp_df.comm.astype('i8')\n",
    "all_emp_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a739242d-ebd2-42d2-9ec7-9a5939cbf74a",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df['mgr'] = all_emp_df.mgr.fillna(-1).astype('i8')\n",
    "all_emp_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd376d13-2245-48b8-ba14-3315d4c48f9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 甄别重复值\n",
    "all_emp_df.ename.duplicated()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e107a38-c5e8-4e5e-9e42-71481c54e0d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df.duplicated(['ename', 'job'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "097eaaf2-1112-4e0f-b361-786bf91d6c1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 统计每个元素出现的频次\n",
    "all_emp_df.ename.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6494bb56-7ac7-47df-a9f1-960b02586e31",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_emp_df.job.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "172e4d9a-63bd-44ca-98ea-e4614c8823ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 统计不重复的元素的个数\n",
    "all_emp_df.ename.nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6fa062c-d338-407f-8647-e84878a5642e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 删除重复值\n",
    "# keep='first' - 默认值，重复元素保留第一项 - 'last' / False\n",
    "all_emp_df.drop_duplicates(['ename', 'job'], keep='last', inplace=True)\n",
    "all_emp_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "832a2ea2-6941-4364-b143-af7db9ff9701",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 异常值的甄别\n",
    "# 数值判定法（data < Q1 - 1.5 * IQR 或者 data > Q3 + 1.5 * IQR）\n",
    "\n",
    "\n",
    "def find_outliers_by_iqr(data, whis=1.5):\n",
    "    q1, q3 = np.quantile(data, [0.25, 0.75])\n",
    "    iqr = q3 - q1\n",
    "    return data[(data < q1 - whis * iqr) | (data > q3 + whis * iqr)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cd5d6aa-c60e-483e-995c-a627a0dfec15",
   "metadata": {},
   "outputs": [],
   "source": [
    "temp = np.random.normal(80, 8, 50).round(0)\n",
    "temp = np.append(temp, [120, 160, 200, 40, 20, -50])\n",
    "temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2121dab4-0efc-4fcd-a5fe-67585552cb53",
   "metadata": {},
   "outputs": [],
   "source": [
    "find_outliers_by_iqr(temp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da048825-3f88-4009-9db5-159e8e883b10",
   "metadata": {},
   "outputs": [],
   "source": [
    "find_outliers_by_iqr(temp, whis=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0da7034b-2350-43ff-a6eb-9e7f4361bdee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# zscore判定法（三西格玛法则 ---> 68-95-99.7法则）\n",
    "\n",
    "\n",
    "def find_outliers_by_zscore(data, mul=3):\n",
    "    mu, sigma = np.mean(data), np.std(data)\n",
    "    zscore = (data - mu) / sigma\n",
    "    return data[np.abs(zscore) > mul]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e88616c0-a4d8-4fd8-9ec2-e761cb5ba056",
   "metadata": {},
   "outputs": [],
   "source": [
    "find_outliers_by_zscore(temp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c902031c-2f78-4721-9734-5c5b0ca81650",
   "metadata": {},
   "outputs": [],
   "source": [
    "find_outliers_by_zscore(temp, mul=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e295014-d582-4e78-b5b9-6d9f0463ff8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "find_outliers_by_zscore(df6.直接成本)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97b98c82-fd09-42a9-8a75-a3e71ae10fbc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 根据离群点的行索引删除行\n",
    "df6.drop(index=find_outliers_by_zscore(df6.直接成本).index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0053ed12-c09f-4331-a6dd-487ff990c680",
   "metadata": {},
   "outputs": [],
   "source": [
    "med_value = np.median(temp)\n",
    "med_value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f02c2985-1b07-4b1c-b248-aa1de9e98451",
   "metadata": {},
   "outputs": [],
   "source": [
    "find_outliers_by_zscore(temp, mul=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "485adc15-f39d-419b-9869-2b366f5d88ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "np.in1d(temp, find_outliers_by_zscore(temp, mul=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce92f242-1f0f-476e-ae85-91e1615783ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 替换离群点\n",
    "np.place(temp, np.in1d(temp, find_outliers_by_zscore(temp, mul=2)), med_value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10b0b0bc-f98c-40fe-890f-976df9d9c52b",
   "metadata": {},
   "outputs": [],
   "source": [
    "temp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d970e838-42f2-44d0-8f2d-07ebbf6de2b0",
   "metadata": {},
   "source": [
    "#### 案例1：招聘数据清洗和预处理\n",
    "\n",
    "1. 数据加载\n",
    "2. 去重\n",
    "3. 数据抽取\n",
    "4. 拆分列\n",
    "5. 替换值\n",
    "6. 数据筛选"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ec417a9-457f-434e-96a6-f4fd35d75987",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs_df = pd.read_csv('res/all_jobs.csv')\n",
    "jobs_df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74e0e4a5-3c03-4617-9661-8cfa03b88fd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 根据URI列去重\n",
    "jobs_df.drop_duplicates('uri', inplace=True)\n",
    "jobs_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cca7b8b-25f1-46b8-9946-34ba90f42116",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 通过正则表达式从列中提取信息\n",
    "jobs_df[['salary_lower', 'salary_upper']] = jobs_df.salary.str.extract(r'(\\d+)-(\\d+)').astype('i8')\n",
    "jobs_df['salary'] = (jobs_df.salary_lower + jobs_df.salary_upper) / 2\n",
    "jobs_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffaea2af-09f6-4577-9c0d-024966d6854f",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs_df.drop(columns=['uri', 'city'], inplace=True)\n",
    "jobs_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9ba5998-ca1d-44c8-87ca-363356074dd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 拆分列\n",
    "jobs_df['city'] = jobs_df.site.str.split(expand=True)[0]\n",
    "jobs_df.drop(columns='site', inplace=True)\n",
    "jobs_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "933e9006-4f5e-4238-b6d9-940dfeb6caf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 字符串正则表达式替换\n",
    "jobs_df['year'] = jobs_df.year.replace(r'5-10年|10年以上', '5年以上', regex=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d10a9c1c-a9d5-49e1-8fdf-a68b5bb3d59a",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs_df.year.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d248e233-bac5-48d5-8a69-a1f04350867a",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs_df['edu'] = jobs_df.edu.replace(r'中专|高中', '学历不限', regex=True)\n",
    "jobs_df['edu'] = jobs_df.edu.replace(r'硕士|博士', '研究生', regex=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eec6fbd5-2355-4674-9e5d-7f47a5a808a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs_df.edu.unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "352b1921-aa2b-4016-af3e-02032b2a3935",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs_df['job_name'] = jobs_df.job_name.str.lower()\n",
    "jobs_df = jobs_df[jobs_df.job_name.str.contains('python|数据|产品|运营|data', regex=True)]\n",
    "jobs_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df370013-1278-48d2-9891-8647df3c5e15",
   "metadata": {},
   "outputs": [],
   "source": [
    "jobs_df.to_csv('res/cleand_jobs.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ee07676-737c-420e-b11a-235ff7f2c4c8",
   "metadata": {},
   "source": [
    "#### 案例2：北京积分落户数据预处理\n",
    "\n",
    "1. 加载数据\n",
    "2. 日期时间处理\n",
    "3. 年龄段分箱\n",
    "4. 落户积分归一化"
   ]
  },
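  {
   "cell_type": "markdown",
   "id": "added-example-binning",
   "metadata": {},
   "source": [
    "The binning step listed above can be sketched with `pd.cut` on a few made-up ages. The `ages` values and the label strings are illustrative assumptions. With `right=False`, each bin includes its left edge and excludes its right edge, so 40 falls into the second bin rather than the first.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical ages, for illustration only\n",
    "ages = pd.Series([35, 39, 40, 59])\n",
    "\n",
    "# Bins are [35, 40), [40, 45), [45, 50), [50, 55), [55, 60)\n",
    "groups = pd.cut(\n",
    "    ages,\n",
    "    bins=[35, 40, 45, 50, 55, 60],\n",
    "    labels=['35-39', '40-44', '45-49', '50-54', '55-59'],\n",
    "    right=False\n",
    ")\n",
    "print(groups.tolist())\n",
    "```\n",
    "\n",
    "Values outside every bin, such as an age of 60 here, would come back as missing values rather than raise an error."
   ]
  },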
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1232d023-7591-47b3-b67b-4920642dd28d",
   "metadata": {},
   "outputs": [],
   "source": [
    "settle_df = pd.read_csv('res/2023年北京积分落户数据.csv', index_col='公示编号')\n",
    "settle_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "734eb268-3ad7-4e67-9661-08328075992b",
   "metadata": {},
   "outputs": [],
   "source": [
    "settle_df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63698465-ddcd-430c-bd96-e78abaaebda3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将字符串处理成日期\n",
    "settle_df['出生年月'] = pd.to_datetime(settle_df['出生年月'])\n",
    "settle_df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "989c56c7-85fa-4180-9b86-5247a41cdbab",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将生日换算成年龄\n",
    "settle_df['年龄'] = (pd.to_datetime('2023-01-01') - settle_df.出生年月).dt.days // 365\n",
    "settle_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4191c7a2-19fd-4347-ac79-2371c8e59c10",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将年龄划分到年龄段 - 分箱 - 数据桶\n",
    "settle_df['年龄段'] = pd.cut(\n",
    "    settle_df.年龄,\n",
    "    bins=np.arange(35, 61, 5),\n",
    "    labels=['35~39岁', '40~44岁', '45~49岁', '50~54岁', '55~59岁'],\n",
    "    right=False\n",
    ")\n",
    "settle_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea2e0c9b-0aa0-41d3-a52a-6926b797465c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 统计每个元素出现的频次\n",
    "temp = settle_df.年龄段.value_counts()\n",
    "temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30843274-b940-4527-92ed-97db86bb4ec7",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.cm.Greens(np.linspace(0.9, 0.1, 5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "375dd407-9d0a-4788-a38e-3a37efbb6d3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 绘制柱状图\n",
    "temp.plot(\n",
    "    kind='bar',      # 图表类型\n",
    "    figsize=(8, 4),  # 图表尺寸\n",
    "    xlabel='',       # 横轴标签\n",
    "    ylabel='Count',  # 纵轴标签\n",
    "    width=0.5,       # 柱子宽度\n",
    "    hatch='//',      # 柱子条纹\n",
    "    color=plt.cm.Greens(np.linspace(0.9, 0.3, temp.size))  # 颜色值\n",
    ")\n",
    "\n",
    "for i in range(temp.size):\n",
    "    # plt.text(横坐标, 纵坐标, 标签内容)\n",
    "    plt.text(i, temp.iloc[i] + 30, temp.iloc[i], ha='center')\n",
    "\n",
    "# 定制横轴的刻度\n",
    "plt.xticks(rotation=0)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e020ba6c-d16d-482f-ad3b-a9e855257b91",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 绘制饼图\n",
    "temp.plot(\n",
    "    kind='pie',\n",
    "    ylabel='',\n",
    "    autopct='%.1f%%',           # 自动计算并显示百分比\n",
    "    wedgeprops={'width': 0.3},  # 环状结构部分的宽度\n",
    "    pctdistance=0.85,           # 百分比到圆心的距离\n",
    "    labeldistance=1.1,          # 标签到圆心的距离\n",
    "    # shadow=True,              # 阴影效果\n",
    "    # startangle=0,             # 起始角度\n",
    "    counterclock=True,          # 是否反时针方向绘制\n",
    ")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e846eec2-6c95-409c-8b15-2b14cab3f57c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# agg - aggregate - 聚合\n",
    "settle_df.积分分值.agg(['mean', 'max', 'min', 'std', 'skew', 'kurt'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1669102-1c03-4751-813c-b241a05718e3",
   "metadata": {},
   "source": [
    "线性归一化：\n",
    "$$\n",
    "x^{\\prime} = \\frac{x - x_{min}}{x_{max} - x_{min}}\n",
    "$$"
   ]
  },
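  {
   "cell_type": "markdown",
   "id": "added-example-minmax",
   "metadata": {},
   "source": [
    "As a worked instance of the formula above, the sketch below applies it to three made-up scores using vectorized `Series` arithmetic. The `scores` values are an illustrative assumption, and the vectorized form is one possible way to write the calculation.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical scores, for illustration only\n",
    "scores = pd.Series([100.0, 110.0, 120.0])\n",
    "\n",
    "# (x - x_min) / (x_max - x_min), computed for every element at once\n",
    "normalized = (scores - scores.min()) / (scores.max() - scores.min())\n",
    "print(normalized.tolist())\n",
    "```\n",
    "\n",
    "The smallest score maps to 0, the largest to 1, and the midpoint to 0.5."
   ]
  },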
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8d9dca7-b976-43ab-96b8-abefca66cc53",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将积分分值处理成0~1范围的值\n",
    "max_score, min_score = settle_df.积分分值.agg(['max', 'min'])\n",
    "max_score, min_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10acd550-8422-4934-b38f-03554f86d305",
   "metadata": {},
   "outputs": [],
   "source": [
    "# map - 映射 - 将指定的函数作用到数据系列的每个元素上\n",
    "# apply - 应用 - 将指定的函数应用到数据系列的每个元素上\n",
    "settle_df['线性归一化积分'] = settle_df.积分分值.map(lambda x: (x - min_score) / (max_score - min_score)).round(2)\n",
    "settle_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55e57b00-cb9e-4c9e-bc59-e99b738e2f5d",
   "metadata": {},
   "source": [
    "zscore标准化：\n",
    "$$\n",
    "x^{\\prime} = \\frac{x - \\mu}{\\sigma}\n",
    "$$"
   ]
  },
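  {
   "cell_type": "markdown",
   "id": "added-example-zscore",
   "metadata": {},
   "source": [
    "As a worked instance of the formula above, the sketch below standardizes three made-up scores. The `scores` values are an illustrative assumption. Note that `Series.std` computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` on an array defaults to the population version (`ddof=0`), so the two can give slightly different z-scores.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical scores, for illustration only\n",
    "scores = pd.Series([100.0, 110.0, 120.0])\n",
    "\n",
    "mu = scores.mean()    # 110.0\n",
    "sigma = scores.std()  # sample standard deviation (ddof=1)\n",
    "z = (scores - mu) / sigma\n",
    "print(z.tolist())\n",
    "```\n",
    "\n",
    "After standardization the values are centered on 0 and expressed in units of standard deviations."
   ]
  },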
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5fc6260-5337-4161-99f0-d7be43d59361",
   "metadata": {},
   "outputs": [],
   "source": [
    "mu, sigma = settle_df.积分分值.agg(['mean', 'std'])\n",
    "settle_df['zscore评分'] = settle_df.积分分值.apply(lambda x: (x - mu) / sigma)\n",
    "settle_df"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}