### Final changes to project

master Ryan Stewart 11 months ago
parent
commit e9cf46acfa
2 changed files with 97 additions and 192 deletions
1. 202 project.html
2. 87 project.rmd

#### 87 project.rmd

    @@ -7,7 +7,7 @@ output: html_document
     # Introduction
    -Crime is an issue that has always plagued the world. Many families have been affected by it, and politicians are often asked to solve various issues related to crime. Not to mention other issues like young poverty, poor education or high unemployment. As statisticians, it is important to be able to take in data and be able to create a useful analysis of it in order to answer big questions like these. Although this data set used is not with real data, it provides a realistic scenario to analyze different situations relevant a summary of different rates on a state by state basis.
    +Societal issues are an important topic of conversation. Many families are afflicted by issues such as crime, poverty, unemployment, and plethora of other issues. Politicians are often asked to solve these issues. As statisticians, it is important to be able to take in data and be able to create a useful analysis of it in order to better the lives of others. Although this data set used does not contain real data, it provides a realistic scenario to analyze different situations on a state by state basis.
     #### Import libraries
    @@ -17,14 +17,11 @@ library(janitor)
     Read in the data
     {r}
     df = read.csv("crime.csv") %>% clean_names()
     summary(df)
    -# Analysis of the dataset
    +# Format of the Dataset
     Before we ask any questions about the data, we must understand what all of the columns in the data set mean. Below is a description of each column in the data set.
    @@ -47,28 +44,30 @@ Before we ask any questions about the data, we must understand what all of the c
     Although it did not get used in this analysis, there are the same columns with 10 appended to the end of it to signify data 10 years later.
     # Analysis Questions
     ## Do states in the south have a higher crime rate?
    -This is analyzed by using a two sample t.test with and alpha of 0.95 between the states deemed as southern United States and northern United States. Variance was checked between data sets of the states to set the t.test to a more accurate command.
    +A two sample t.test was used with a p-critical value of 0.05 between northern and southern United States. Variance was checked between data sets of the states to set the t.test to a more accurate command.
     {r}
    -# Separating results from the original data set
    +# Separate results from the original data set
     s = split(df, df$southern)
     northern = data.frame(s)
     southern = data.frame(s)
    -# test for similar variance
    -var.test(x = northern$X0.crime_rate, y = southern$X1.i_crime_rate)
    +# Test for similar variance
    +var.test(x = northern$X0.crime_rate, y = southern$X1.crime_rate)
    -This variance test shows a fairly large p-value of 0.1824 compared to the standard 0.05, an indication that the null hypothesis of the true ration of variances being equal to one cannot be rejected. Therefore it will be assumed that variances are equal.
    +This variance test shows a fairly large p-value. It is much larger than the chosen p-critical value of 0.05. This indicates that the null hypothesis of the true ratio of variances being equal to one cannot be rejected. Therefore it will be assumed that the variances are equal.
     {r}
    -t.test(x = northern$X0.i_crime_rate, y = southern$X1.i_crime_rate, var.equal = TRUE)
    +t.test(x = northern$X0.crime_rate, y = southern$X1.crime_rate, var.equal = TRUE)
    -This two sample t-test shows how close the means of these two categories (northern and southern states) are as they have no significant difference between them. The mean of northern states crime rate value is 103, while the mean of the southern states crime rate value is 101. It should also be mentioned that the p-value of 0.8488 shows to provide no strong evidence that the true difference in means is not a zero value.
    +This two sample t-test shows how close the means of these two categories (northern and southern states) are. They have no significant difference between them. The mean of the northern states' crime rate value is 103, while the mean of the southern states' crime rate value is 101. It should also be mentioned that the p-value of is significantly larger than the chosen p-critical value. This means there is no strong evidence that the true difference in means is not a zero value.
     ## Is there a relationship between high youth unemployment and southern states?
    @@ -84,96 +83,96 @@ chisq.test(df$southern, df$youth_unemployment)
     ## Is there any indication that more males commit crimes?
    -This is done by using a two sample t.test between the states indicated to have a larger male population and states without this fact. Variance was checked between data sets to set the t.test to a more accurate command.
    +A two sample t.test was used between the states indicated to have a larger male population and states with a smaller male population. Variance was checked between data sets to set the t.test to a more accurate command.
     {r}
     s2 = split(df, df$more_males)
     less = data.frame(s2)
     more = data.frame(s2)
    -var.test(x = less$X0.i_crime_rate, y = more$X1.i_crime_rate)
    +var.test(x = less$X0.crime_rate, y = more$X1.crime_rate)
    -This variance test shows a fairly large p-value of 0.3943 compared to the standard 0.05, an indication that the null hypothesis of the true ration of variances being equal to one cannot be rejected. Therefore it will be assumed that variances are equal.
    +This variance test shows a fairly large p-value compared to the chosen p-critical value of 0.05. This indicates that the null hypothesis of the true ratio of variances being equal to one cannot be rejected. Therefore it will be assumed that variances are equal.
     {r}
    -t.test(x = less$X0.i_crime_rate, y = more$X1.i_crime_rate, var.equal = TRUE)
    +t.test(x = less$X0.crime_rate, y = more$X1.crime_rate, var.equal = TRUE)
    -From the output of this t-test it is implied that there are different means between states with less males and more males. However, the p-value received shows that there is no significant statistical difference between them to show that the mean is not the same. Therefore, states with higher or lower male ratios does not indicate crime rate.
    +From the output of this t-test it is implied that there are different means between states with less males and more males. However, the p-value received shows that there is no significant statistical difference between them. Therefore, states with higher or lower male ratios does not indicate crime rate.
     ## What factors affect crime rate?
    -In order to see which factors affect crime rate, a multiple regression test was used to see which independent variables affected crime rate. The following factors were analysed:
    -* education level
    -* police expenditure
    -* youth unemployment
    -* state size
    -* average wage per week
    +In order to see which factors affect crime rate, a multiple regression test was used. The following factors were analysed:
    +* Education Level
    +* Police Expenditure
    +* Youth Unemployment
    +* State Size
    +* Average Wage per Week
     {r}
    -mdl3 = lm(i_crime_rate ~ education+expenditure_year0+youth_unemployment+state_size+wage, df)
    +mdl3 = lm(crime_rate ~ education+expenditure_year0+youth_unemployment+state_size+wage, df)
     summary(mdl3)
    -Looking at the results, it is presented that very little affects the crime rate because most p-values are significantly higher than our critical p value of 0.05. However, one result stood out and has a very strong relationship with crime rate. The results of this test show expenditure has a clear proportional relationship with crime rate given it had a p-value of 0.000241. This test shows that the higher the police expenditure, the higher the crime rate. However, it would not make sense to think that higher police budgets leads to more crime. What this test most likely indicates is that when more money is spent on crime, the police are more likely to catch more crime that already occurred but otherwise would go unreported. Using this result, further testing could be used to justify what a reasonable police budget is.
    +Looking at the results, it is presented that very little affects the crime rate. Most p-values are significantly higher than our p-critical value of 0.05. However, one result stood out and has a very strong relationship with crime rate. The results of this test show expenditure has a clear directly proportional relationship with crime rate, given it has a p-value much smaller than our p-critical. This test shows that the higher the police expenditure, the higher the crime rate. However, it would not make sense to think that higher police budgets leads to more crime. What this test most likely indicates, is that when more money is given to police departments, the police are more likely to catch crime that already occurred, but otherwise would go unreported. Using this result, further testing could be used to justify what a reasonable police budget is.
     {r}
     plot(mdl3)
    -For the assumptions of this data, homoscedasticity can be questioned from the changing residual values on the residuals vs fitted graph, but it does stay fairly close to the zero line. The normality of the errors can be questioned more as there is less of a closeness to the expected line in the Q-Q plot. There are a possibility of three outliers.
    -Here is an example plot
    +For the assumptions of this data, homoscedasticity can be questioned from the changing residual values on the residuals vs fitted graph. However, it does stay fairly close to the zero line. The normality of the errors can be questioned, because there is less of a closeness to the expected line in the Q-Q plot. There is a possibility of about three outliers.
     {r}
    -ggplot(df, aes(x=expenditure_year0, y= i_crime_rate)) + geom_point() + theme_minimal() + geom_smooth(method=lm)
    +ggplot(df, aes(x=expenditure_year0, y= crime_rate)) + geom_point() + theme_minimal() + geom_smooth(method=lm)
    -This plot shows an expected increase of crime rate with increased expenditure on police enforcement.
    +This plot shows a linear regression of the expected increase of crime rate with increased expenditure on police enforcement.
    -## How does average education level of people in the area affect the amount of crime that occurs?
    +## How does the average education level affect the amount of crime that occurs?
     {r}
    -ggplot(df, aes(education, i_crime_rate)) + geom_point() + theme_minimal()
    +ggplot(df, aes(education, crime_rate)) + geom_point() + theme_minimal()
    -As you can see by looking at this graph, there is little to no correlation at all between any of these points. Creating any type of model would be ineffective. Therefore, it can be inferred education level is not a good indicator of predicting an effect on crime rate.
    +As you can see by looking at this scatterplot, there is little to no correlation at all between any of these points. Creating any type of model would be ineffective. Therefore, it can be inferred education level is not a good indicator of predicting an effect on crime rate.
    -## Comparison of education and poverty:
    +## Comparison of education and poverty
     For this comparison a linear regression model was used.
     {r}
    -# plotting education vs. below wage value for state with a fitted line and confidence intervals for expected value
    +# Plotting education vs. below wage value for state with a fitted line and confidence intervals for expected value
     ggplot(df, aes(x=education, y= below_wage)) + geom_point() + theme_minimal() + geom_smooth(method=lm)
    -Here, it can be seen that as the value of education goes higher, the value of lower wage jobs decreases.
    +Here, it can be seen that the more years of average education goes higher, the number of people below median wage decreases.
     {r}
    -#modeling of data
    +# Modeling of data
     mdl1=lm(below_wage~education,data=df)
     summary(mdl1)
    -If null hypothesis is that $\beta_1$ = 0, showing no correlation, the p-value would be evaluated highly. In this case, the p-value of 6.70e-11 is low enough to reject that null hypothesis. Instead, $\beta_1$ is a value of -20.947, showing both great significance for change for increase education and a negative expected fitted line.
    +If the null hypothesis is that $\beta_1 = 0$, showing no correlation, the p-value would be evaluated highly. In this case, the p-value is practically zero, so the null hypothesis is rejected. Instead, $\beta_1 = -20.947$, showing both great significance for change for increased education and a negative expected fitted line.
     {r}
    -#plotting to test if linear regression is a correct model
    +# Plotting to test if linear regression is a correct model
     plot(mdl1)
    -It should be noted that the assumptions of homoscedasticity and errors of a normal distribution may be challenged as shown by the residuals vs. fitted graph, that shows changing residuals away from the zero line and that the Normal Q-Q is not closely following the line as would be liked. There are a possibility of three outliers.
    +It should be noted that the assumptions of homoscedasticity and errors of a normal distribution may be challenged as shown by the residuals vs. fitted graph. This shows changing residuals away from the zero line and that the Normal Q-Q is not closely following the line as wanted. There is also a possibility of about three outliers.
     # Conclusion
    -From this analysis of the data set, it is shown that there is no significant difference between crime rates in northern and southern states, high youth unemployment does not contribute to crime rates, and there is no significant evidence that more males lead to higher crime rates.
    +From this analysis of the data set, it is shown that there is no significant difference between crime rates in northern and southern states. High youth unemployment does not contribute to crime rates. There is no significant evidence that states with more males leads to higher crime rates, and it also seems that education has no relation with crime rate. In addition, education does have an effect on poverty, poverty and crime rate may not be closely related.
    -In terms of find a true cause for a higher crime rate, expenditure on police may be an significant factor. It also seems that by this data, education has no rule for crime rate and based on the fact that education does have an effect on poverty, poverty and crime rate may not be as close as suspected.
    +From reviewing the findings, it is shown that there exists a correlation with higher crime rate and higher police expenditure. As stated before, it is unlikely that increased police spending actually increases the amount of crimes that occur, but a potential cause for this increase could be that police are able to find more crime that would already occur. In order to further test this theory, another study should be conducted with more appropriate parameters included.
    -Overall, not many correlations could be made. However, the significance of the discovery that was found can be used in the political world. Decisions on funding for the police should take into account what was found and used wisely for the benefit of citizens. Expenditure on police in relation to crime rate was found, but that itself may have more factors to look at, such as more police in relation to increase in petty crimes or false arrests.
    +Overall, not many correlations could be made. However, this is still a significant discovery that can be useful for politicians or others interested in this data. Decisions on funding for the police should take into account what was found and be used wisely for the benefit of the public.
    -In order to move forward with the work done here, many more factors would like to be considered. This may include mental health, political setting, types of communities (i.e. urban, country), and population density. There is much more to learn about the causes of crime that is relevant to the safety of the citizens.
    +In order to move forward with the work done here, many more factors should be considered. This may include mental health, political setting, different types of communities (i.e. urban, country), and population density. There is much more to learn about how societal issues are related and how to treat them, but through statistical analysis there can be a much better understanding of where to look.
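
A possible alternative for the north/south comparison in this change, offered as a sketch and not as the author's method: the variance check and the two sample t-test can both be written with the formula interface of `var.test` and `t.test`, which groups `crime_rate` by `southern` directly and does not need the `split` / `data.frame(s)` step or the `X0.` / `X1.` column prefixes. The data frame below is a made-up stand-in (synthetic values, hypothetical group sizes of 12 and 8) so the block runs without `crime.csv`; only the column names `southern` and `crime_rate` are taken from the report.

```r
# Hypothetical stand-in for the cleaned crime data. The values are synthetic
# and exist only so this sketch runs without crime.csv.
set.seed(1)
df <- data.frame(
  southern   = rep(c(0, 1), times = c(12, 8)),
  crime_rate = c(rnorm(12, mean = 103, sd = 25),
                 rnorm(8, mean = 101, sd = 25))
)

# F test for equal variances, grouping crime_rate by the southern indicator.
variance_check <- var.test(crime_rate ~ southern, data = df)

# Use the pooled-variance t-test only if the F test does not reject equal
# variances at the report's critical value of 0.05; otherwise t.test falls
# back to Welch's test (var.equal = FALSE).
equal_var <- variance_check$p.value > 0.05
result <- t.test(crime_rate ~ southern, data = df, var.equal = equal_var)

print(variance_check$p.value)
print(result$estimate)  # mean crime rate in each group
print(result$p.value)
```

Because the formula interface works on the original rows, it does not depend on the two groups having the same number of states.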