{ "cells": [ { "cell_type": "markdown", "id": "c664c108-059f-402a-b216-5ba4caa2d98b", "metadata": {}, "source": [ "## Python Data Analysis: Day 1\n", "\n", "### Warm-up Exercise\n", "\n", "The lists below hold our company's sales (in millions of yuan) from January to December 2022 for five sales regions (南京 Nanjing, 无锡 Wuxi, 苏州 Suzhou, 徐州 Xuzhou, 南通 Nantong). Use this data to complete the following tasks:\n", "\n", "```python\n", "sales_month = [f'{i:>2d}月' for i in range(1, 13)]\n", "sales_area = ['南京', '无锡', '苏州', '徐州', '南通']\n", "sales_data = [\n", "    [32, 17, 12, 20, 28],\n", "    [41, 30, 17, 15, 35],\n", "    [35, 18, 13, 11, 24],\n", "    [12, 42, 44, 21, 34],\n", "    [29, 11, 42, 32, 50],\n", "    [10, 15, 11, 12, 26],\n", "    [16, 28, 48, 22, 28],\n", "    [31, 40, 45, 30, 39],\n", "    [25, 41, 47, 42, 47],\n", "    [47, 21, 13, 49, 48],\n", "    [41, 36, 17, 36, 22],\n", "    [22, 25, 15, 20, 37]\n", "]\n", "```\n", "\n", "1. Compute the company's total sales for each month.\n", "2. Compute the month-over-month change in the company's sales.\n", "3. Compute each sales region's total sales for the year.\n", "4. Sort the sales regions and their sales from highest to lowest.\n", "5. Find the month and region with the highest sales figure of the year.\n", "6. Find which sales region has the most volatile performance." ] }, { "cell_type": "code", "execution_count": null, "id": "f9d87cfc-deb0-46eb-b98c-2799a4908bc8", "metadata": {}, "outputs": [], "source": [ "sales_month = [f'{i:>2d}月' for i in range(1, 13)]\n", "sales_area = ['南京', '无锡', '苏州', '徐州', '南通']\n", "sales_data = [\n", "    [32, 17, 12, 20, 28],\n", "    [41, 30, 17, 15, 35],\n", "    [35, 18, 13, 11, 24],\n", "    [12, 42, 44, 21, 34],\n", "    [29, 11, 42, 32, 50],\n", "    [10, 15, 11, 12, 26],\n", "    [16, 28, 48, 22, 28],\n", "    [31, 40, 45, 30, 39],\n", "    [25, 41, 47, 42, 47],\n", "    [47, 21, 13, 49, 48],\n", "    [41, 36, 17, 36, 22],\n", "    [22, 25, 15, 20, 37]\n", "]" ] }, { "cell_type": "code", "execution_count": null, "id": "dc581dfc-9108-46fa-ace2-60ace650434e", "metadata": {}, "outputs": [], "source": [ "# Magic command - %whos - list variables\n", "%whos" ] }, { "cell_type": "code", "execution_count": null, "id": "a50e4c3e-6dc1-426f-977b-aef9a5c9a02f", "metadata": {}, "outputs": [], "source": [ "# Accidentally shadow the built-in print function\n", "print = 100" ] }, { "cell_type": "code", "execution_count": null, "id": "4c0b54ca-1556-4a14-9a6a-b6bd6af5d822", "metadata": {}, "outputs": [], "source": [ "# Magic command - %xdel - delete a variable\n", 
"%xdel print" ] }, { "cell_type": "code", "execution_count": null, "id": "fe8eb05f-f45b-491a-b98e-6f6c924997ff", "metadata": {}, "outputs": [], "source": [ "# 1. 统计本公司每个月的销售额。\n", "monthly_sales = []\n", "for i, month in enumerate(sales_month):\n", " monthly_sales.append(sum(sales_data[i]))\n", " print(f'{month}销售额: {monthly_sales[i]}百万')" ] }, { "cell_type": "code", "execution_count": null, "id": "53e6bf88-e6a9-4ac9-a7fe-bd1d18ff88f5", "metadata": {}, "outputs": [], "source": [ "# 2. 统计本公司销售额的月环比。\n", "for i in range(1, len(monthly_sales)):\n", " temp = (monthly_sales[i] - monthly_sales[i - 1]) / monthly_sales[i - 1]\n", " print(f'{sales_month[i]}: {temp:.2%}')" ] }, { "cell_type": "code", "execution_count": null, "id": "f5a130d6-b781-4ee3-a96b-d1fe5e3b4b90", "metadata": {}, "outputs": [], "source": [ "# 3. 统计每个销售区域全年的销售额。\n", "arealy_sales = {}\n", "for j, area in enumerate(sales_area):\n", " temp = [sales_data[i][j] for i in range(len(sales_month))]\n", " arealy_sales[area] = sum(temp)\n", " print(f'{area}: {arealy_sales[area]}')" ] }, { "cell_type": "code", "execution_count": null, "id": "a7bd0510-5e68-4e58-ac3b-6c531f7abccb", "metadata": {}, "outputs": [], "source": [ "# 4. 按销售额从高到低排序销售区域及其销售额。\n", "sorted_keys = sorted(arealy_sales, key=lambda x: arealy_sales[x], reverse=True)\n", "for key in sorted_keys:\n", " print(f'{key}: {arealy_sales[key]}')" ] }, { "cell_type": "code", "execution_count": null, "id": "b4b2f3e8-c5c2-481e-b277-9623d30892ac", "metadata": {}, "outputs": [], "source": [ "# 5. 
Find the month and region with the highest sales figure of the year.\n", "max_value = sales_data[0][0]\n", "max_i, max_j = 0, 0\n", "for i in range(len(sales_month)):\n", "    for j in range(len(sales_area)):\n", "        temp = sales_data[i][j]\n", "        if temp > max_value:\n", "            max_value = temp\n", "            max_i, max_j = i, j\n", "print(sales_month[max_i], sales_area[max_j])" ] }, { "cell_type": "markdown", "id": "647d0a87-b672-4e0c-81cc-a3bbb76dca11", "metadata": {}, "source": [ "Population variance:\n", "$$\n", "\\sigma^{2} = \\frac{1}{N} \\sum_{i=1}^{N}(x_{i} - \\mu)^{2}\n", "$$\n", "\n", "Sample variance:\n", "$$\n", "s^{2} = \\frac{1}{n - 1} \\sum_{i=1}^{n}(x_{i} - \\bar{x})^{2}\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "id": "b43fb247-32fc-4e10-a9ee-488fd1f56a9a", "metadata": {}, "outputs": [], "source": [ "# 6. Find which sales region has the most volatile performance (largest population variance).\n", "import statistics as stats\n", "\n", "arealy_vars = []\n", "for j, area in enumerate(sales_area):\n", "    temp = [sales_data[i][j] for i in range(len(sales_month))]\n", "    arealy_vars.append(stats.pvariance(temp))\n", "sales_area[arealy_vars.index(max(arealy_vars))]" ] }, { "cell_type": "markdown", "id": "3ea677d0-7a33-43e5-b10b-ddfcb82f7f6a", "metadata": {}, "source": [ "### Essential Libraries\n", "\n", "1. numpy - Numerical Python - its core is the `ndarray` type, which represents N-dimensional arrays and comes with a range of operations, functions, and methods for processing data.\n", "2. pandas - Panel Data - wraps the types, functions, and many methods needed for data analysis (loading, reshaping, cleaning, preprocessing, pivoting, presenting), offering a one-stop solution. Its three core data types are `Series`, `DataFrame`, and `Index`.\n", "3. matplotlib - wraps a wide range of common statistical charts, helping us present data visually.\n", "4. scipy - Scientific Python - complements NumPy well, providing functions and methods for advanced numerical computation.\n", "5. scikit-learn - wraps common machine learning algorithms (classification, clustering, regression, etc.) and also provides functions and methods for data preprocessing, feature engineering, and model validation.\n", "6. 
sympy - Symbolic Python - wraps operations for symbolic computation." ] }, { "cell_type": "code", "execution_count": null, "id": "0db758cc-d83c-47c4-9a0b-c7ef5abd6c18", "metadata": {}, "outputs": [], "source": [ "# Magic command - %pip - invoke the pip package manager\n", "# %pip install numpy pandas matplotlib openpyxl" ] }, { "cell_type": "code", "execution_count": null, "id": "8eb6970b-3907-4b84-af60-67cbf67f2e74", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Prefer the SimHei font so Chinese labels render, and keep minus signs displaying correctly\n", "plt.rcParams['font.sans-serif'].insert(0, 'SimHei')\n", "plt.rcParams['axes.unicode_minus'] = False" ] }, { "cell_type": "code", "execution_count": null, "id": "5fb76dec-cd51-4e79-9bd2-3b210ae20522", "metadata": {}, "outputs": [], "source": [ "np.__version__" ] }, { "cell_type": "code", "execution_count": null, "id": "e6369df9-7577-496c-bfc1-2fce096c0162", "metadata": {}, "outputs": [], "source": [ "pd.__version__" ] }, { "cell_type": "code", "execution_count": null, "id": "eb5733cd-38f7-4afd-b45b-70c1439ab36b", "metadata": {}, "outputs": [], "source": [ "# Convert the nested list into a two-dimensional array\n", "data = np.array(sales_data)\n", "data" ] }, { "cell_type": "code", "execution_count": null, "id": "da304104-8cf0-4425-b3b4-dcb148ac4b3a", "metadata": {}, "outputs": [], "source": [ "# Sum along axis 1 (sales for each month)\n", "data.sum(axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "1507ac63-f53b-4e36-a7fb-b9c636fd81ea", "metadata": {}, "outputs": [], "source": [ "# Sum along axis 0 (sales for each region)\n", "data.sum(axis=0)" ] }, { "cell_type": "code", "execution_count": null, "id": "26be450d-44ba-4d83-9351-c52a13c2c338", "metadata": {}, "outputs": [], "source": [ "# Population variance (ddof=0 by default)\n", "data.var(axis=0).round(1)" ] }, { "cell_type": "code", "execution_count": null, "id": "81e5b2a0-c86e-4720-909f-ce8b1b6fdd58", "metadata": {}, "outputs": [], "source": [ "# Sample variance (ddof=1)\n", "data.var(axis=0, ddof=1).round(1)" ] }, { "cell_type": "code", "execution_count": null, "id": "ba4e0f0a-e711-4041-8834-1e3be86ce8a4", "metadata": {}, "outputs": [], "source": 
[ "# Build a DataFrame object (for two-dimensional data)\n", "df = pd.DataFrame(data, columns=sales_area, index=sales_month)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "9d1a6a43-6dfc-41e3-98c8-be2681e0d547", "metadata": {}, "outputs": [], "source": [ "# Sum (along axis 0 by default)\n", "df.sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "a478ec0e-499f-4e31-b8c2-ba45e691b834", "metadata": {}, "outputs": [], "source": [ "# Sort in descending order\n", "df.sum().sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "6f221833-855c-45ad-91b2-e3f4da627704", "metadata": {}, "outputs": [], "source": [ "# Sum (explicitly along axis 1)\n", "df.sum(axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "80df8865-4ea0-4c72-a581-215cd953cfbe", "metadata": {}, "outputs": [], "source": [ "# Compute the month-over-month change\n", "df.sum(axis=1).pct_change()" ] }, { "cell_type": "code", "execution_count": null, "id": "ea4579c3-11cd-4179-9c96-8dbe9a033da2", "metadata": {}, "outputs": [], "source": [ "# Add a total column (合计) and a month-over-month column (月环比)\n", "df['合计'] = df.sum(axis=1)\n", "df['月环比'] = df['合计'].pct_change()\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "3c660052-dded-4a0a-8b72-7747d3cae816", "metadata": {}, "outputs": [], "source": [ "# Render the DataFrame with styling\n", "df.style.format(\n", "    formatter={'月环比': '{:.2%}'},\n", "    na_rep='------'\n", ").bar(\n", "    subset='合计'\n", ").background_gradient(\n", "    'RdYlBu', subset='月环比'\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "a092c12c-dab6-4272-b1cd-5218998fcd90", "metadata": {}, "outputs": [], "source": [ "# Write the DataFrame to an Excel file (requires openpyxl)\n", "df.to_excel('sales.xlsx', sheet_name='data')" ] }, { "cell_type": "code", "execution_count": null, "id": "54c3f505-e866-4c4e-a3f8-f55a71a95c3f", "metadata": {}, "outputs": [], "source": [ "# Magic command - %config - change configuration\n", "# %config InlineBackend.figure_format = 'svg'\n", "get_ipython().run_line_magic('config', 'InlineBackend.figure_format = \"svg\"')" ] }, { "cell_type": "code", "execution_count": null, "id": "3951055d-d5d2-4e4e-bbe7-a1b40a6731e0", 
"metadata": {}, "outputs": [], "source": [ "# 绘制柱状图\n", "plt.figure(figsize=(8, 4), dpi=200)\n", "df.plot(ax=plt.gca(), kind='bar', y='合计', legend=False)\n", "plt.xticks(rotation=0)\n", "plt.savefig('aa.png')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "8a5236f7-072b-466c-9be3-afbab394f5cb", "metadata": {}, "source": [ "### 魔法指令" ] }, { "cell_type": "code", "execution_count": null, "id": "d5c6a18b-2863-4855-8ef7-2c0aa99b7d5c", "metadata": {}, "outputs": [], "source": [ "# 查看当前工作路径 - print working directory\n", "%pwd" ] }, { "cell_type": "code", "execution_count": null, "id": "80a9f9e0-1528-40cf-910c-f3c8e5e7e3b9", "metadata": {}, "outputs": [], "source": [ "# 查看指定路径文件列表 - list directory contents\n", "%ls" ] }, { "cell_type": "code", "execution_count": null, "id": "620a54ed-9c29-4058-9d20-c4df72ba4c62", "metadata": {}, "outputs": [], "source": [ "# 执行系统命令\n", "%system date" ] }, { "cell_type": "code", "execution_count": null, "id": "659215ed-113a-4d8f-9036-0fcf47c96021", "metadata": {}, "outputs": [], "source": [ "# 保存运行过的代码\n", "%save temp.py" ] }, { "cell_type": "code", "execution_count": null, "id": "8fc9c4e4-1423-40f3-b4ee-db2ba2e5d125", "metadata": {}, "outputs": [], "source": [ "# 加载指定文件内容\n", "%load temp.py" ] }, { "cell_type": "code", "execution_count": null, "id": "58a08283-561c-43d4-8db6-74cde401b8a9", "metadata": {}, "outputs": [], "source": [ "# 统计代码执行时间\n", "%timeit (1, 2, 3, 4, 5)" ] }, { "cell_type": "code", "execution_count": null, "id": "22a271ab-3f5c-4167-b89e-66a31e891cbd", "metadata": {}, "outputs": [], "source": [ "# 查看历史输入\n", "%hist" ] }, { "cell_type": "code", "execution_count": null, "id": "d4ffa792-f1a0-4be9-b2aa-642ee0b9a1ae", "metadata": {}, "outputs": [], "source": [ "# 查看魔法指令\n", "%lsmagic" ] }, { "cell_type": "markdown", "id": "a15db907-c068-41d7-a24c-8f1c5c20d4ec", "metadata": {}, "source": [ "### 获取帮助" ] }, { "cell_type": "code", "execution_count": null, "id": "5e037694-9357-46b9-864a-c5f93e1aa8c8", "metadata": {}, 
"outputs": [], "source": [ "np.random?" ] }, { "cell_type": "code", "execution_count": null, "id": "11a97abd-d73d-493e-b727-9c4ded3e5060", "metadata": {}, "outputs": [], "source": [ "np.random.normal?" ] }, { "cell_type": "code", "execution_count": null, "id": "66503921-cd69-4394-80ea-7fecf6ecdc33", "metadata": {}, "outputs": [], "source": [ "np.random.r*?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }