
Made some edits to the writeup

master
Ryan Stewart 11 months ago
parent
commit
389109ec45
  1. 189
      project.html
  2. 79
      project.rmd

189
project.html
File diff suppressed because it is too large

79
project.rmd

@ -5,8 +5,8 @@ date: "7/28/2021"
output: html_document
---
####Introduction
Crime is a constant topic that appears from the political world to families that have been affected by it. As with many vague and reoccurring topics, there have been speculations made from stereotypes and the question is asked to how it comes to be. In this data set provided (SHOULD WE TELL WHERE IT IS FROM OR EVEN PUT A CITATION?) some stereotypes will be tested for their significance and theories to causes of education will be looked into using the R programming language.
# Introduction
Crime is an issue that has always plagued the world. Many families have been affected by it, and politicians are often asked to address and reduce it. As with many broad and recurring topics, speculation about crime often rests on stereotypes, which raises the question of what actually causes it. Using the provided data set (SHOULD WE TELL WHERE IT IS FROM OR EVEN PUT A CITATION?), some of these stereotypes will be tested for significance, and theories about possible causes, such as education, will be examined using the R programming language.
# Import libraries
```{r}
@ -20,18 +20,22 @@ df = read.csv("crime.csv") %>% clean_names()
summary(df)
```
####Questions to Answer
## Analysis of the dataset
### Do states in the south have a higher crime rate?
We have analyzed this by using a two sample t.test with and alpha of 0.95 between the states deemed as southern United States and northern United States. Variance was checked between data sets to set the t.test to a more accurate command.
### 1. Do states in the south have a higher crime rate?
We test this with a two-sample t-test (`t.test`) comparing the states classified as southern with those classified as northern. The variances of the two groups were compared first to decide whether the t-test should assume equal variances.
```{r}
#Separating results from the original data set
# Separating results from the original data set
s = split(df, df$southern)
northern = data.frame(s[1])
southern = data.frame(s[2])
var.test(x = northern$X0.crime_rate, y = southern$X1.crime_rate) #test for similar variance
# test for similar variance
var.test(x = northern$X0.crime_rate, y = southern$X1.crime_rate)
```
This variance test gives a p-value of 0.1824, well above the usual 0.05 threshold, so the null hypothesis that the true ratio of variances equals one cannot be rejected. The variances are therefore assumed to be equal.
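As a rough illustration of this two-step workflow, the sketch below runs an F test for equal variances and then a two-sample t-test on synthetic data. The `toy` data frame, its values, and the rule of pooling variances when the F test p-value exceeds 0.05 are assumptions for illustration; none of the numbers come from `crime.csv`.

```r
# Synthetic stand-in for the crime data (hypothetical values and group sizes)
set.seed(1)
toy <- data.frame(
  southern   = rep(c(0, 1), times = c(30, 17)),
  crime_rate = c(rnorm(30, mean = 103, sd = 28),
                 rnorm(17, mean = 101, sd = 22))
)

# F test of the null hypothesis that the ratio of group variances is 1
vt <- var.test(crime_rate ~ southern, data = toy)

# Pool the variances only if the F test does not reject equality at 0.05
pooled <- vt$p.value > 0.05

# Two-sample t-test; var.equal = FALSE would give Welch's t-test instead
tt <- t.test(crime_rate ~ southern, data = toy, var.equal = pooled)

vt$p.value   # p-value of the variance test
tt$p.value   # p-value of the t-test
tt$conf.int  # 95% confidence interval for the difference in means
```

The formula interface (`crime_rate ~ southern`) avoids splitting the data frame by hand. Leaving `var.equal` at its default of `FALSE` runs Welch's t-test, which does not depend on a preliminary variance test.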
```{r}
@ -40,16 +44,21 @@ t.test(x = northern$X0.crime_rate, y = southern$X1.crime_rate, var.equal = TRUE)
This two-sample t-test shows no significant difference between the mean crime rates of northern and southern states. The mean crime rate is 103 for northern states and 101 for southern states. The p-value of 0.8488 provides no strong evidence that the true difference in means is nonzero.
###1. Is there a relationship between high youth unemployment and southern states?
(CAN YOU PUT WHAT THIS IS?)
### Is there a relationship between high youth unemployment and southern states?
To analyze whether there is a relationship between high youth unemployment and southern states, we use a chi-squared test. A p-value below 0.05 would indicate a statistically significant association between the two variables.
```{r}
chisq.test(df$southern, df$youth_unemployment)
```
Using the chi squared test, it can see that there is not a statistically significant difference between southern and northern states in youth unemployment. This is because the p value is nearly 3 times greater than our critical p value of 0.05.
The results of the test show no statistically significant association between region (southern or northern) and youth unemployment: the p-value is nearly three times greater than our significance level of 0.05.
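To make the mechanics of the chi-squared test concrete, the sketch below builds a 2x2 contingency table from two 0/1 indicators and tests it. The counts are made up, and treating both `southern` and `youth_unemployment` as 0/1 indicators is an assumption about the data.

```r
# Hypothetical 0/1 indicators for 47 states (counts are invented)
toy <- data.frame(
  southern           = rep(c(0, 1), times = c(31, 16)),
  youth_unemployment = c(rep(c(0, 1), times = c(18, 13)),
                         rep(c(0, 1), times = c(8, 8)))
)

# Cross-tabulate the two indicators into a 2x2 contingency table
tab <- table(southern = toy$southern,
             youth_unemployment = toy$youth_unemployment)

# Chi-squared test of independence; for 2x2 tables R applies
# Yates' continuity correction by default (correct = TRUE)
res <- chisq.test(tab)

tab           # observed counts
res$expected  # counts expected under independence
res$p.value   # compare against the 0.05 significance level
```

Inspecting `res$expected` is a quick check on the test itself: the chi-squared approximation is usually considered unreliable when expected counts are very small.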
###1. Is there any indication that more males commit crimes?
### Is there any indication that more males commit crimes?
We test this with a two-sample t-test (`t.test`) comparing the states flagged as having a larger male population with the states that are not. The variances of the two groups were compared first to decide whether the t-test should assume equal variances.
```{r}
s2 = split(df, df$more_males)
less = data.frame(s2[1])
@ -62,15 +71,23 @@ var.test(x = less$X0.crime_rate, y = more$X1.crime_rate)
```{r}
t.test(x = less$X0.crime_rate, y = more$X1.crime_rate, var.equal = TRUE)
```
While the means of the two variables between less males and more males in state seem to have very different means (99 and 113 respectively) for the value(IS SAYING VALUE OKAY? CAUSE I ACTUALLY DONT KNOW WHAT THE VALUE IS) of crime rate, there is a large p-value of 0.3414 to not reject the null hypothesis of a zero difference in means, and within the 95% confidence interval of this t-test the value of zero does lie there to be a possibility.
From the output of this t-test we can see that the sample means differ between states with fewer males and states with more males. However, the p-value shows that this difference is not statistically significant, so we cannot conclude that the true means differ. Therefore, whether a state has a higher or lower share of males is not a significant indicator of crime rate.
### What factors affect crime rate?
To see which factors affect crime rate, we fit a multiple regression model with crime rate as the response. We analyzed the following predictors:
* education level
* police expenditure
* youth unemployment
* state size
* average wage per week
###1. See what, if anything, affects crime rate
```{r}
mdl3 = lm(i_crime_rate ~ education+expenditure_year0+youth_unemployment+state_size+wage,df)
mdl3 = lm(crime_rate ~ education+expenditure_year0+youth_unemployment+state_size+wage, df)
summary(mdl3)
```
By looking at the p value, we can see that very little affects the crime rate, but only thing that has a very strong relationship with crime rate is expenditure with a p-value of 0.000241. This test shows that the more people spend on police, the more crime they find in those areas.
Looking at the results, most predictors have p-values well above our significance level of 0.05, so there is little evidence that they affect crime rate. One result stands out, however: police expenditure has a p-value of 0.000241, indicating a strong positive association with crime rate. In other words, higher police expenditure goes with a higher crime rate. It would not make sense, though, to conclude that higher police budgets lead to more crime. What this result most likely indicates is that when more money is spent on policing, the police record more of the crime that already occurred but would otherwise go unreported. Using this result, further testing could help determine what a reasonable police budget is.
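One way to read such a regression summary programmatically is to pull out the coefficient table and keep only the rows below the significance level. The sketch below does this on synthetic data in which only expenditure is constructed to drive crime rate; the `toy` data frame, its values, and the reduced set of predictors are assumptions for illustration.

```r
# Synthetic data: only expenditure_year0 is built to influence crime_rate
set.seed(3)
n <- 47
toy <- data.frame(
  education         = rnorm(n, mean = 10.5, sd = 1),
  expenditure_year0 = rnorm(n, mean = 85, sd = 30),
  wage              = rnorm(n, mean = 525, sd = 95)
)
toy$crime_rate <- 20 + 0.9 * toy$expenditure_year0 + rnorm(n, sd = 20)

# Multiple regression with crime rate as the response
fit <- lm(crime_rate ~ education + expenditure_year0 + wage, data = toy)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- coef(summary(fit))

# Keep only the terms whose p-value is below 0.05
significant <- coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
significant
```

A small p-value here only indicates an association after adjusting for the other predictors. As with the police expenditure result, it does not by itself establish the direction of cause and effect.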
```{r}
plot(mdl3)
@ -78,36 +95,26 @@ plot(mdl3)
Regarding the model assumptions, homoscedasticity can be questioned given the changing residual values on the residuals vs. fitted plot, although the residuals stay fairly close to the zero line. The normality of the errors is more doubtful, as the points stray further from the expected line in the Q-Q plot. There are possibly three outliers.
###1. Is crime reported more often in places that spend more on police?
For this comparison a linear regression model was used.
```{r}
ggplot(df, aes(x=expenditure_year0, y= i_crime_rate)) + geom_point() + theme_minimal() + geom_smooth(method=lm)
```
This plot shows an expected increase of crime rate with increased expenditure on police enforcement.
```{r}
mdl2=lm(i_crime_rate~expenditure_year0,data=df)
summary(mdl2)
```
If our null hypothesis is $\beta_1 = 0$, meaning no linear relationship between these two variables, it can be rejected based on the small p-value of 1.03e-05. Based on this summary, crime rate is expected to increase by 0.6283 for each one-unit increase in expenditure on police.
Here is an example plot
```{r}
plot(mdl2)
ggplot(df, aes(x=expenditure_year0, y= crime_rate)) + geom_point() + theme_minimal() + geom_smooth(method=lm)
```
This analysis assumed homoscedasticity and normally distributed errors. Based on the residuals vs. fitted and Normal Q-Q plots, those assumptions can be questioned, though not drastically, since each plot shows values close to the expected line. There are possibly three outliers.
This plot shows an expected increase of crime rate with increased expenditure on police enforcement.
###1. How does average education level of people in the area affect the amount of crime that occurs?
### How does average education level of people in the area affect the amount of crime that occurs?
```{r}
ggplot(df, aes(education, i_crime_rate)) + geom_point() + theme_minimal()
ggplot(df, aes(education, crime_rate)) + geom_point() + theme_minimal()
```
As the graph shows, there is little to no correlation between education and crime rate, so fitting any type of model to these points would be ineffective. Education level therefore does not appear to be a good predictor of crime rate.
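A visual impression of "no correlation" can be backed by a number. The sketch below computes a Pearson correlation with a confidence interval on synthetic, independently generated data; the `toy` values are invented, and using `cor.test` here is a suggestion rather than part of the original analysis.

```r
# Synthetic, independently drawn variables (hypothetical values)
set.seed(4)
toy <- data.frame(
  education  = rnorm(47, mean = 10.5, sd = 1),
  crime_rate = rnorm(47, mean = 100, sd = 28)
)

# Pearson correlation test: null hypothesis is a true correlation of 0
ct <- cor.test(toy$education, toy$crime_rate)

ct$estimate  # sample correlation coefficient
ct$conf.int  # 95% confidence interval for the correlation
ct$p.value   # p-value for the null of zero correlation
```

A confidence interval that comfortably includes zero would support the reading of the scatter plot, while a narrow interval away from zero would suggest the plot is hiding a weak but real trend.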
###1. Comparison of education and poverty:
### Comparison of education and poverty:
For this comparison a linear regression model was used.
```{r}
#plotting education vs. below wage value for state with a fitted line and confidence intervals for expected value
# plotting education vs. below wage value for state with a fitted line and confidence intervals for expected value
ggplot(df, aes(x=education, y= below_wage)) + geom_point() + theme_minimal() + geom_smooth(method=lm)
```
Here, it can be seen that as the value of education goes higher, the value of lower wage jobs decreases.
@ -126,6 +133,8 @@ plot(mdl1)
It should be noted that the assumptions of homoscedasticity and normally distributed errors may be challenged: the residuals vs. fitted plot shows residuals drifting away from the zero line, and the points on the Normal Q-Q plot do not follow the line as closely as would be liked. There are possibly three outliers.
####Conclusion
The conclusion that can be made from this data set include the true significance of stereotypes, such as there is no significant difference between crime rates in northern and southern states, based on the last point high youth unemployment does not contribute to crime rates, and there is no significant evidence that more males lead to higher crime rates.
In terms of find a true cause for a higher crime rate, expenditure on police could be the lead cause. This goes with the conclusion that expenditure on police and crime rate in general may have a significant correlation with the null hypothesis of no correlation being ignored. It also seems that by this data, education has no rule for crime rate and based on the fact that education does have an effect on poverty, poverty and crime rate may not be as close as suspected.
# Conclusion
From this analysis of the data set, we see no significant difference between crime rates in northern and southern states, no evidence that high youth unemployment contributes to crime rates, and no significant evidence that a larger male population leads to higher crime rates.
(I think we should rewrite this conclusion) In terms of finding a true cause for a higher crime rate, expenditure on police could be the leading factor. This is consistent with the finding that expenditure on police and crime rate are significantly correlated, with the null hypothesis of no correlation rejected. The data also suggest that education has no role in crime rate, and since education does have an effect on poverty, poverty and crime rate may not be as closely linked as suspected.