橙色云资讯 - Industrial Internet Industry Information Portal

How to Build a Stroke Prediction Model

磐创AI 2021-07-15


Introduction

Stroke is a serious condition that can be fatal, and timely treatment can save a patient's life. Many factors can contribute to a stroke, and this article tries to analyze some of them. The dataset comes from Kaggle and has 11 variables and 5110 observations.

Importing libraries

Every task needs tools, and Python offers plenty of them. Let's start by importing the libraries we need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report, precision_score, recall_score
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

Reading the CSV

Read the CSV file that holds our data. With it we will try to understand the patterns in the data and build our prediction model.

data = pd.read_csv('healthcare-dataset-stroke-data.csv')
# Display the first 10 rows
data.head(10)
# Show information about the dataset
data.info()
# Show the data's statistical summary
data.describe()
Head:

Info:
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   id                 5110 non-null   int64
 1   gender             5110 non-null   object
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64
 4   heart_disease      5110 non-null   int64
 5   ever_married       5110 non-null   object
 6   work_type          5110 non-null   object
 7   Residence_type     5110 non-null   object
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object
 11  stroke             5110 non-null   int64
dtypes: float64(3), int64(4), object(5)
Describe:

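The info() output reports 4909 non-null bmi values out of 5110 rows, so bmi is the only column with gaps. As a minimal sketch of how to list such columns directly, the snippet below uses a small made-up frame (the `toy` values are hypothetical; only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke dataset (hypothetical values, real column names).
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked", "smokes"],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Run against the real `data` frame, the same two lines should single out bmi.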

EDA

ID

The ID is just a unique number assigned to each patient to keep track of them. It carries no predictive information, so let's drop it.

data.drop("id", inplace=True, axis=1)

Gender

This attribute gives the patient's gender. Let's look at the gender breakdown and at how gender relates to the stroke rate.

# Unique values of the attribute and the count of each value
print('Unique values', data['gender'].unique())
print('Value Counts', data['gender'].value_counts())
# Count plot showing how many records fall in each category
sns.countplot(data=data, x='gender')
# Count plot split by stroke, to see how gender relates to the chance of stroke
sns.countplot(data=data, x='gender', hue='stroke')
Unique values ['Male' 'Female' 'Other']
Value Counts
Female    2994
Male      2115
Other        1
Gender plot:

Stroke by gender:

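Because the classes are so unequal in size, count plots can be hard to read. One alternative, sketched here on made-up data, is to compute the share of stroke cases per category. The helper name `stroke_rate_by` and the `toy` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def stroke_rate_by(df, column, target="stroke"):
    """Return the group size and the share of positive cases for each category."""
    grouped = df.groupby(column)[target].agg(["count", "mean"])
    return grouped.rename(columns={"count": "n", "mean": "stroke_rate"})

# Hypothetical rows, only to show the shape of the result.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female", "Female"],
    "stroke": [1, 0, 0, 0, 1, 0],
})
rates = stroke_rate_by(toy, "gender")
print(rates)
```

The same helper could be applied to the other categorical attributes (ever_married, work_type, Residence_type, smoking_status) to compare rates rather than raw counts.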
Observation: The dataset looks imbalanced. Even so, the stroke rate does not differ much between genders.

Age

Here age is not just a number. It is one of the most important factors, arguably a critical one. Let's analyze the data to see how large its effect actually is.

# Number of unique values in this attribute
data['age'].nunique()
# Distribution plot of age
sns.displot(data['age'])
# Box plot of age against the target attribute stroke
plt.figure(figsize=(15, 7))
sns.boxplot(data=data, x='stroke', y='age')
Number of unique values: 104
Distribution plot:

Age and stroke:

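For a numeric attribute such as age, the box plot can be complemented by a per-band stroke rate. The sketch below bins age with `pd.cut`. The band edges and the `toy` rows are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical ages and outcomes.
toy = pd.DataFrame({
    "age": [5, 18, 35, 45, 59, 62, 70, 81],
    "stroke": [0, 0, 0, 0, 0, 1, 0, 1],
})

# Right-closed bins: (0, 20], (20, 40], (40, 60], (60, 120].
bins = [0, 20, 40, 60, 120]
labels = ["0-20", "21-40", "41-60", "61+"]
toy["age_band"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Share of stroke cases within each age band.
band_rates = toy.groupby("age_band", observed=False)["stroke"].mean()
print(band_rates)
```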
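Observation: People over 60 are more prone to stroke. A few outliers show people under 20 having a stroke. These may be valid records, since stroke also depends on diet and lifestyle. Another observation is that the group without stroke also includes people over 60.

Hypertension

Hypertension can lead to stroke. Let's see what the data says.

# Unique values of the hypertension attribute and the count of each value
print('Unique Value', data['hypertension'].unique())
print('Value Counts', data['hypertension'].value_counts())
# Count plot of hypertension
sns.countplot(data=data, x='hypertension')
# Hypertension with respect to stroke
sns.countplot(data=data, x='hypertension', hue='stroke')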
Unique values and value counts:
Unique Value [0 1]
Value Counts
0    4612
1     498
Count plot:

Hypertension and stroke:

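Observation: Hypertension is rare in young people and common in older people. Hypertension can lead to stroke, but in our data the picture is not very clear, because there are relatively few records of patients with hypertension.

Heart disease

Without proper care, people with heart disease tend to be at higher risk of stroke.

# Unique values of the heart disease attribute and the count of each value
print('Unique Value', data['heart_disease'].unique())
print('Value Counts', data['heart_disease'].value_counts())
# Count plot of heart disease
sns.countplot(data=data, x='heart_disease')
# Heart disease with respect to stroke
sns.countplot(data=data, x='heart_disease', hue='stroke')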
Unique values and counts:
Unique Value [1 0]
Value Counts
0    4834
1     276
Count plot:

Heart disease and stroke:

Observation: Because the dataset is imbalanced, this is a little hard to read. Going by this plot alone, heart disease does not appear to affect stroke.

Ever married

This attribute tells us whether the patient has ever been married. Let's see how it relates to stroke.

# Unique values of the attribute and the count of each value
print('Unique Values', data['ever_married'].unique())
print('Value Counts', data['ever_married'].value_counts())
# Count plot of ever_married
sns.countplot(data=data, x='ever_married')
# ever_married with respect to stroke
sns.countplot(data=data, x='ever_married', hue='stroke')
Unique values and counts:
Unique Values ['Yes' 'No']
Value Counts
Yes    3353
No     1757
Count plot:

Ever married and stroke:

Observation: People who have been married show a higher stroke rate.

Work type

This attribute records what kind of work the patient does. Different kinds of work bring different problems and challenges, which can be a source of excitement, thrill, stress, and so on. Stress is never good for health, so let's see how this variable relates to the chance of stroke.

# Unique values of the attribute and the count of each value
print('Unique Value', data['work_type'].unique())
print('Value Counts', data['work_type'].value_counts())
# Count plot of work_type
sns.countplot(data=data, x='work_type')
# work_type with respect to stroke
sns.countplot(data=data, x='work_type', hue='stroke')
Unique values and counts:
Unique Value ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
Value Counts
Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Count plot:

Work type and stroke:

Observation: People working in the private sector appear to have a higher risk of stroke. Those who have never worked show a very low stroke rate.

Residence type

This attribute tells us whether the patient lives in an urban or a rural area.

# Unique values of the variable and the count of each value
print('Unique Values', data['Residence_type'].unique())
print("Value Counts", data['Residence_type'].value_counts())
# Count plot of Residence_type
sns.countplot(data=data, x='Residence_type')
# Residence_type with respect to stroke
sns.countplot(data=data, x='Residence_type', hue='stroke')
Unique values and counts:
Unique Values ['Urban' 'Rural']
Value Counts
Urban    2596
Rural    2514
Count plot:

Residence type and stroke:

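Observation: This attribute is of little use. As we can see, there is not much difference between its two values. We may have to drop it.

Average glucose level

This attribute gives the patient's average blood glucose level. Let's see whether it affects the chance of stroke.

# Number of unique values
data['avg_glucose_level'].nunique()
# Distribution of avg_glucose_level
sns.displot(data['avg_glucose_level'])
# avg_glucose_level with respect to stroke
sns.boxplot(data=data, x='stroke', y='avg_glucose_level')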
Number of unique values: 3979
Distribution plot:

Glucose and stroke:

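Observation: From the plot above, patients who had a stroke tend to have an average glucose level above 100. Among patients without a stroke there are some clear outliers, but some of these may be genuine records.

BMI

Body mass index is a measure of body fat based on height and weight that applies to adult men and women. Let's see how it relates to the chance of stroke.

# Number of null values
data['bmi'].isna().sum()
# Fill null values with the column mean (assigned back rather than using inplace on a column selection)
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())
# Number of unique values in the attribute
data['bmi'].nunique()
# Distribution of bmi
sns.displot(data['bmi'])
# BMI with respect to stroke
sns.boxplot(data=data, x='stroke', y='bmi')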
Null values: 201
Number of unique values: 419
Distribution plot:

BMI and stroke:

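Mean imputation, as used for bmi, can be checked on a tiny series. The sketch below uses made-up values and also computes the median, which is sometimes preferred when a column has outliers. Which one suits this dataset better is not something the article establishes.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with two gaps.
bmi = pd.Series([20.0, 30.0, np.nan, 25.0, np.nan], name="bmi")

# Keep a flag for rows that were originally missing before filling them.
was_missing = bmi.isna()

# Mean of the observed values (20, 30, 25) is 25.0; the median is also 25.0 here.
filled = bmi.fillna(bmi.mean())
median_value = bmi.median()
print(filled.tolist(), median_value, int(was_missing.sum()))
```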
Observation: There is no prominent pattern in how BMI affects the chance of stroke.

Smoking status

This attribute tells us whether the patient smokes. Smoking is harmful to health and can lead to heart disease. Let's see what our data shows.

# Unique values and the count of each value
print('Unique Values', data['smoking_status'].unique())
print('Value Counts', data['smoking_status'].value_counts())
# Count plot of smoking status
sns.countplot(data=data, x='smoking_status')
# Smoking status with respect to stroke
sns.countplot(data=data, x='smoking_status', hue='stroke')
Unique values and counts:
Unique Values ['formerly smoked' 'never smoked' 'smokes' 'Unknown']
Value Counts
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Count plot:

Smoking and stroke:

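Observation: According to these plots, the chance of stroke does not differ much across smoking statuses.

Stroke

This is our target variable. It tells us whether the patient had a stroke.

# Unique values and the count of each value
print('Unique Value', data['stroke'].unique())
print('Value Counts', data['stroke'].value_counts())
# Count plot of stroke
sns.countplot(data=data, x='stroke')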
Unique values and counts:
Unique Value [1 0]
Value Counts
0    4861
1     249
Count plot:

Feature engineering

Label encoding

Our dataset is a mix of categorical and numeric data. Since ML algorithms work with numeric data, let's use a label encoder to turn the categorical columns into numbers. A label encoder takes the distinct values of a column in sorted order and maps them to the integers 0 to n-1.

# Fetch the columns whose data type is object
cols = data.select_dtypes(include=['object']).columns
print(cols)
# Initialize the LabelEncoder object
le = LabelEncoder()
# Convert the categorical columns to numeric codes
data[cols] = data[cols].apply(le.fit_transform)
print(data.head(10))
Columns:
Index(['gender', 'ever_married', 'work_type', 'Residence_type',
       'smoking_status'],
      dtype='object')

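To make the sorted 0 to n-1 mapping concrete, the sketch below encodes two columns of a made-up frame and keeps one fitted encoder per column. When a single encoder is refitted column by column, only the last column's mapping stays on the encoder object, so keeping one per column is a convenience for decoding later. The `encoders` and `mapping` names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows using category values that appear in the dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

# Fit a separate encoder per column so each mapping can be inspected or inverted.
encoders = {}
for col in ["gender", "ever_married"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ holds the original labels in sorted order; position == encoded value.
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)
```

Here 'Female', 'Male', 'Other' map to 0, 1, 2 and 'No', 'Yes' map to 0, 1. `encoders['gender'].inverse_transform(...)` recovers the original labels.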
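Correlation

plt.figure(figsize=(15, 10))
sns.heatmap(data.corr(), annot=True, fmt='.2')

Observation: The variables showing some meaningful correlation are age, hypertension, heart disease, ever married, and average glucose level. To be on the safe side, let's check our features with SelectKBest and f_classif.

# Score every feature against the target with the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=5)
fits = selector.fit(data.drop('stroke', axis=1), data['stroke'])
# Build a two-column table of attribute names and their scores
x = pd.DataFrame(fits.scores_)
columns = pd.DataFrame(data.drop('stroke', axis=1).columns)
fscores = pd.concat([columns, x], axis=1)
fscores.columns = ['Attribute', 'Score']
# Show the attributes from highest to lowest score
fscores.sort_values(by='Score', ascending=False)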

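In the result above, age has by far the highest score, and the scores drop off after it. I set the score threshold at 50, which leaves us with the same features that stood out in the heatmap.

# Keep the attributes whose F-score is above 50
cols = fscores[fscores['Score'] > 50]['Attribute']
print(cols)
1                  age
2         hypertension
3        heart_disease
4         ever_married
7    avg_glucose_level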
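
The score table can also be read straight off the fitted selector. As a minimal sketch on synthetic data (the `informative`, `noise_a`, and `noise_b` columns are made up for illustration), `get_support()` returns a boolean mask of the k columns SelectKBest kept, and `scores_` holds the F-score of every column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: one feature that tracks the target and two pure-noise features.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.3, size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

# Names of the columns kept by the selector.
selected = X.columns[selector.get_support()].tolist()

# F-scores indexed by column name, highest first.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(selected)
print(scores)
```

Thresholding on the score, as done in the article, and taking the top k are two ways of reading the same ranking.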

Splitting the data

Now let's split the features into training and test sets so we can train and test our classification models.

# Split the data
train_x, test_x, train_y, test_y = train_test_split(data[cols], data['stroke'], random_state=1255, test_size=0.25)
# Shape of the data
train_x.shape, test_x.shape, train_y.shape, test_y.shape
Result:
((3832, 5), (1278, 5), (3832,), (1278,))
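
With only 249 positive cases, a purely random split can leave the two parts with slightly different stroke rates. `train_test_split` accepts a `stratify` argument that keeps the class ratio roughly equal in both parts. The sketch below illustrates this on labels with the same class counts as the dataset. It is an optional variation, not what the article's split does, and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the dataset's class counts: 249 positives, 4861 negatives.
y = np.array([1] * 249 + [0] * 4861)
# Placeholder feature matrix (row indices), only so there is something to split.
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=1255, stratify=y
)

# The positive rate in each part should stay close to the overall rate.
print(round(y.mean(), 4), round(y_tr.mean(), 4), round(y_te.mean(), 4))
```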
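
Balancing the dataset

As we know, our dataset is imbalanced, so let's balance it. We will use SMOTE for this. It pads the data with synthetic records similar to those of the minority class. Usually we would do this on the whole dataset, but because the minority class has so few records I applied it to the training and the test data separately. Earlier I tried resampling only the training set, but it did not work well, so I tried this approach and got decent results.

smote = SMOTE()
# Oversample the minority class in each split separately
train_x, train_y = smote.fit_resample(train_x, train_y)
test_x, test_y = smote.fit_resample(test_x, test_y)
Shape of the data:
print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)
(7296, 5) (7296,) (2426, 5) (2426,)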
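
Resampling the test split means the reported scores are partly computed on synthetic records. A commonly used alternative is to balance only the training split and evaluate on the untouched test split, which the author reports trying with worse results. The sketch below shows that pattern with plain random oversampling in NumPy, a simpler stand-in for SMOTE chosen to keep the example dependency-free. The helper `oversample_minority` and the toy labels are assumptions for illustration, and the helper assumes two classes.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Draw extra minority rows with replacement to close the gap.
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Hypothetical training labels: 90 negatives and 10 positives.
y_train = np.array([0] * 90 + [1] * 10)
X_train = np.arange(len(y_train), dtype=float).reshape(-1, 1)

# Only the training split is balanced; a test split would be left as it is.
X_bal, y_bal = oversample_minority(X_train, y_train)
print(np.bincount(y_bal))
```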
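
Model creation

Let's start building models. I built several: logistic regression, a random forest classifier, SVC, and XGBClassifier. Of these, XGBClassifier performed very well, so in this post I only include XGBClassifier, but you can check the performance of the other models here.

XGBClassifier

xgc = XGBClassifier(objective='binary:logistic', n_estimators=100000, max_depth=5, learning_rate=0.001, n_jobs=-1)
xgc.fit(train_x, train_y)
predict = xgc.predict(test_x)
# scikit-learn metrics expect the true labels first, then the predictions
print('Accuracy -->', accuracy_score(test_y, predict))
print('F1 Score -->', f1_score(test_y, predict))
print('Classification Report -->', classification_report(test_y, predict))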
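
To see why accuracy alone can mislead on imbalanced data, consider the class counts of the original dataset (4861 negatives, 249 positives) and a hypothetical model that always predicts "no stroke". The helper `binary_metrics` below is an illustrative sketch that computes the usual metrics from confusion-matrix counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Always predicting the negative class: every positive becomes a false negative.
always_negative = binary_metrics(tp=0, fp=0, fn=249, tn=4861)
print(always_negative)
```

The accuracy comes out at about 0.95 even though recall and F1 are both 0, because the model never finds a single stroke case.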
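
With a balanced dataset we can rely on accuracy, but here the dataset is imbalanced, so I use the F1 score. A good classifier should have both good precision and good recall. Of all the models, XGBClassifier scored well, so it is the model I choose.

Conclusion

In this small project we looked at some of the factors that may lead to stroke. Age is the most strongly correlated, followed by hypertension, heart disease, average glucose level, and marital status. XGBClassifier performed well. Some variables contain outliers. I kept them because such values either depend on other factors or may be genuine records. For example, a person may have a high BMI and still not have had a stroke because they are young or have no heart disease.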
