{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 10 - Chi-Square and Logistic Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the juri.xlsx dataset, carry out the following analyses:\n", "\n", "- Note: verdict is a categorical variable where 0 = no death penalty and 1 = death penalty\n", "- The independent variables are linear scales with values from 0 to 10." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pingouin\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "juri = pd.read_excel('../data/juri.xlsx')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
   subject  verdict  danger  rehab  punish  gendet  specdet  incap
0        1        0       2      2       2       2        0      7
1        2        0       0      9       0       6        8      2
2        3        1       6      3       2      10       10      4
3        4        1       1      3       2       3        2      1
4        5        0       0      7       4       1        1     10
" ], "text/plain": [ " subject verdict danger rehab punish gendet specdet incap\n", "0 1 0 2 2 2 2 0 7\n", "1 2 0 0 9 0 6 8 2\n", "2 3 1 6 3 2 10 10 4\n", "3 4 1 1 3 2 3 2 1\n", "4 5 0 0 7 4 1 1 10" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "juri.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PARTE 1 - Modelo de Regressão Logística Nominal\n", "\n", "Previsão do veredito à partir das 6 variáveis \n", "\n", "Responda:\n", "\n", "1. Qual a significância do modelo (Prob>ChiSq)?\n", "2. Considerando os coeficientes de regressão, quais variáveis são significativas como forma de prever o veredito? Quais os respectivos coeficientes e significâncias?\n", "3. Execute os testes de Wald para as variáveis (Triângulo vermelho ao lado de 'Nominal Logistic Fit for verdict'). Qual o Qui-quadrado (Wald ChiSquare) de 'danger'?\n", "4. De acordo com este modelo, qual o incremento de probabilidade do veredito de pena de morte para cada unidade de periculosidade (rehab)?" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
       names  coef    se     z  pval  CI[2.5%]  CI[97.5%]
0  Intercept -1.75  0.92 -1.91  0.06     -3.55       0.05
1     danger  0.29  0.09  3.16  0.00      0.11       0.48
2      rehab -0.19  0.08 -2.31  0.02     -0.35      -0.03
3     punish  0.07  0.07  0.99  0.32     -0.07       0.21
4     gendet  0.19  0.08  2.40  0.02      0.03       0.34
5    specdet  0.01  0.08  0.08  0.94     -0.15       0.16
6      incap  0.00  0.08  0.05  0.96     -0.15       0.15
" ], "text/plain": [ " names coef se z pval CI[2.5%] CI[97.5%]\n", "0 Intercept -1.75 0.92 -1.91 0.06 -3.55 0.05\n", "1 danger 0.29 0.09 3.16 0.00 0.11 0.48\n", "2 rehab -0.19 0.08 -2.31 0.02 -0.35 -0.03\n", "3 punish 0.07 0.07 0.99 0.32 -0.07 0.21\n", "4 gendet 0.19 0.08 2.40 0.02 0.03 0.34\n", "5 specdet 0.01 0.08 0.08 0.94 -0.15 0.16\n", "6 incap 0.00 0.08 0.05 0.96 -0.15 0.15" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Regressão Logística Múltipla Binária\n", "X = juri[['danger', 'rehab', 'punish', 'gendet', 'specdet', 'incap']] # Variáveis preditoras\n", "y = juri['verdict'] # Variável dependente binária [0, 1]\n", "lom = pingouin.logistic_regression(X, y, remove_na=True)\n", "lom.round(2)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.pointplot(\n", " data=juri[['danger', 'rehab', 'punish', 'gendet', 'specdet', 'incap']],\n", " errorbar=(\"pi\", 100), capsize=.4, join=False,)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " test lambda chi2 dof pval cramer power\n", "0 pearson 1.000000 18.923347 10.0 0.041247 0.435010 0.868967\n", "1 cressie-read 0.666667 19.633344 10.0 0.032918 0.443095 0.883531\n", "2 log-likelihood 0.000000 23.423537 10.0 0.009287 0.483979 0.940215\n", "3 freeman-tukey -0.500000 NaN 10.0 NaN NaN NaN\n", "4 mod-log-likelihood -1.000000 inf 10.0 0.000000 inf 1.000000\n", "5 neyman -2.000000 NaN 10.0 NaN NaN NaN\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/pingouin/contingency.py:150: UserWarning: Low count on observed frequencies.\n", " warnings.warn(\"Low count on {} frequencies.\".format(name))\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/pingouin/contingency.py:150: UserWarning: Low count on expected frequencies.\n", " warnings.warn(\"Low count on {} frequencies.\".format(name))\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/scipy/stats/_stats_py.py:7169: RuntimeWarning: divide by zero encountered in power\n", " terms = f_obs * ((f_obs / f_exp)**lambda_ - 1)\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/scipy/stats/_stats_py.py:7169: RuntimeWarning: invalid value encountered in multiply\n", " terms = f_obs * ((f_obs / f_exp)**lambda_ - 1)\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/scipy/stats/_stats_py.py:7166: RuntimeWarning: divide by zero encountered in divide\n", " terms = 2.0 * special.xlogy(f_exp, f_exp / f_obs)\n" ] } ], "source": [ "# Teste de independência Chi-quadrado: A variável x 
independe de y?\n", "expected, observed, stats = pingouin.chi2_independence(data=juri, x='verdict',\n", "y='danger', correction=False)\n", "print(stats)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualização exploratoria de proporções - gráfico de barras 100% empilhadas\n", "props = juri.groupby('verdict')['danger'].value_counts(normalize=True)\n", "wide_props = props.unstack()\n", "wide_props.plot(kind=\"bar\", stacked=True)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualização exploratoria de proporções - gráfico de barras 100% empilhadas\n", "props = juri.groupby('danger')['verdict'].value_counts(normalize=True)\n", "wide_props = props.unstack()\n", "wide_props.plot(kind=\"bar\", stacked=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " test lambda chi2 dof pval cramer power\n", "0 pearson 1.0 18.923347 10.0 0.041247 0.43501 0.868967\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/pingouin/contingency.py:150: UserWarning: Low count on observed frequencies.\n", " warnings.warn(\"Low count on {} frequencies.\".format(name))\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/pingouin/contingency.py:150: UserWarning: Low count on expected frequencies.\n", " warnings.warn(\"Low count on {} frequencies.\".format(name))\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/scipy/stats/_stats_py.py:7169: RuntimeWarning: divide by zero encountered in power\n", " terms = f_obs * ((f_obs / f_exp)**lambda_ - 1)\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/scipy/stats/_stats_py.py:7169: RuntimeWarning: invalid value encountered in multiply\n", " terms = f_obs * ((f_obs / f_exp)**lambda_ - 1)\n", "/home/marianne/code/Analise_Estatistica/venv/lib/python3.10/site-packages/scipy/stats/_stats_py.py:7166: RuntimeWarning: divide by zero encountered in divide\n", " terms = 2.0 * special.xlogy(f_exp, f_exp / f_obs)\n" ] } ], "source": [ "# Teste de dupla independência Chi-quadrado: As variáveis x e y são independentes entre si?\n", "expected, observed, stats = pingouin.chi2_independence(data=juri, x='verdict', y='danger')\n", "print(stats[stats['test'] == 'pearson'])" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "## PARTE 2: Modelo de Regressão Logística Nominal\n", "\n", "Previsão do veredito à partir das variáveis periculosidade (danger), reabilitação (rehab) e dissuasão geral (gendet)\n", "\n", "Responda:\n", "1. Qual a significância do modelo (Prob>ChiSq)?\n", "2. Considerando os coeficientes de regressão, quais os coeficientes e significâncias das variáveis? São iguais ou diferentes do modelo anterior? Porquê?\n", "3. De acordo com este modelo, qual o incremento de probabilidade do veredito de pena de morte para cada unidade de periculosidade (rehab)?" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
       names  coef    se     z  pval  CI[2.5%]  CI[97.5%]
0  Intercept -1.35  0.68 -1.97  0.05     -2.69      -0.01
1     danger  0.28  0.09  3.15  0.00      0.11       0.45
2      rehab -0.18  0.08 -2.24  0.02     -0.34      -0.02
3     gendet  0.19  0.08  2.46  0.01      0.04       0.34
" ], "text/plain": [ " names coef se z pval CI[2.5%] CI[97.5%]\n", "0 Intercept -1.35 0.68 -1.97 0.05 -2.69 -0.01\n", "1 danger 0.28 0.09 3.15 0.00 0.11 0.45\n", "2 rehab -0.18 0.08 -2.24 0.02 -0.34 -0.02\n", "3 gendet 0.19 0.08 2.46 0.01 0.04 0.34" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Regressão Logística Múltipla Binária\n", "X = juri[['danger', 'rehab', 'gendet']] # Variáveis preditoras\n", "y = juri['verdict'] # Variável dependente binária [0, 1]\n", "lom = pingouin.logistic_regression(X, y, remove_na=True)\n", "lom.round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://medium.com/@ginoasuncion/visualizing-logistic-regression-results-using-a-forest-plot-in-python-bc7ba65b55bb" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 ('venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "17a21100134af3592d09988eee871935e79466d45670c914d929fa5f969b25f9" } } }, "nbformat": 4, "nbformat_minor": 2 }