by Jennifer Berk, Rick Cavolo, and Sarah Young
Raising over $26 million last year and reaching thousands of students a week, DonorsChoose.org is a nonprofit crowdfunding site for education. Founded in 2000, its mission is to engage "...the public in public schools by giving people a simple, accountable and personal way to address education inequity." DonorsChoose.org accomplishes this mission by providing a website and tools for crowdfunding projects proposed by teachers and vetted by the site. This year, DonorsChoose.org released data from its first 10 years in a contest called "Hacking Education," hoping to receive interesting data analyses and useful web apps. This analysis was prepared in the Data Mining class taught by Professor Matt Taddy at the University of Chicago Booth School of Business, as a contribution to the data analysis portion of the contest.
The process from idea to completion for a teacher using DonorsChoose.org works as follows. A teacher has a need, say 10 microscopes. They join DonorsChoose.org and submit an essay about their students (freshmen in biology), their project (science labs), and their need (10 microscopes). They submit the dollar amount needed to purchase the microscopes and a project expiration date, then spread the word about their project to friends and family. If the project becomes fully funded before it expires, DonorsChoose.org purchases the requested items (microscopes) and ships them to the school. If the project does not get fully funded, each donation is redirected in one of three ways: the donor can choose a new project, the site can choose a new project for the donor, or the teacher can choose a new project for the donor.
Our interest is in identifying trends and perhaps predicting which projects will be fully funded. If we could give teachers and DonorsChoose.org a better understanding of which projects, schools, and areas of study get funded, they could target their projects and essays to get more projects fully funded. We are also interested in using the words in the essay to predict the subject or items requested, and linking that with what we learn in the fundedness study to make suggestions to teachers posting their projects.
The data for this project was downloaded from the DonorsChoose Developer site (http://developer.donorschoose.org/the-data) on May 23, 2011.
The data we are using include the projects, the donations, and the individual teacher essays written and posted for each project. The data sets as posted are very clean, so making the projects and donations data ready for analysis involved (1) dropping projects with funding status "live," since we don't know whether they will be fully funded, and creating a Boolean fundedness response variable instead of one with several factor levels; (2) setting variable classes to factor/logical/date as appropriate, reordering levels for the poverty and grade-level variables, and correcting zip codes to have leading zeroes and one mistyped state from "La" to "LA"; and (3) building additional variables to use in visualizations and models.
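The cleaning steps above can be sketched in R roughly as follows. The column names used here (funding_status, date_posted, poverty_level, school_zip, school_state) are assumptions about the export's schema, not confirmed field names.

```r
# Hedged sketch of the cleaning pipeline; column names are illustrative
projects <- read.csv("projects.csv", stringsAsFactors = FALSE)

# (1) Drop "live" projects and build a Boolean fundedness response
projects <- subset(projects, funding_status != "live")
projects$funded <- projects$funding_status == "completed"

# (2) Set classes, reorder levels, and repair known data issues
projects$date_posted   <- as.Date(projects$date_posted)
projects$poverty_level <- factor(projects$poverty_level,
                                 levels = c("low", "moderate", "high"))
projects$school_zip <- sprintf("%05d", as.integer(projects$school_zip))  # restore leading zeroes
projects$school_state[projects$school_state == "La"] <- "LA"             # fix mistyped state
```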
Variables created about projects and donations:
To model whether a project would be fully funded, we used a subset of this data, not including any information about project donations since teachers and DonorsChoose don't have that information before a project is posted. We considered including geographic information (school state) but determined that our analyses didn't run in reasonable time if we included that variable with 50+ factor levels. The variables used were:
To get a good sense of this data, we decided to visualize some of it. Because of our interest in differentiating between projects that reached fully funded status and those that did not, we first plotted our funded factor against a variety of variables. This first plot shows the projects requested per state vs. the funded factor. While we hoped to see more of a trend, we suspect that fully funded projects differ more at the regional or district level.
Fig 1: Fully funded by State of Project Request (sorted by number of projects within state)
This second graph shows the number of projects posted by the school vs. the funded factor. We expected the downward trend present in this plot: experience is crucial here, and the first time a school posts a DonorsChoose project is different from its tenth or fiftieth posting.
Fig. 2: Fully Funded by School's Number of DonorsChoose Projects
Another interesting variable to explore is the total amount donated per project, so we graphed a histogram of it. This gives an idea of both the average and the dispersion of the amount donated per project, split by fully funded status. Overall the average (median) project raised about $350, and the plot shows distinct differences between fully funded and not fully funded projects.
Fig 3: Histogram of log(Total Donations per Project)
Using this total-donated variable, we checked for variation across states. Using the same sort (states ranked by number of participating schools, most to least, left to right), we see there is not much variation between states in the average amount, but the states with more schools utilizing DonorsChoose do have longer tails.
Fig 4: Boxplot of log(Total Donations) by Number of Projects per State
Another way to understand state-by-state donations is to plot them on a map. The following map shows donations to schools, with each point representing a school's location.
Fig 5: Map of United States by Donation to School
Next we wanted to understand the average dollar amount requested by teachers and its dispersion. We had two variables for the amount requested, one that included the dollar amount with "optional support" to DonorsChoose and one without it. This optional support is a suggested amount that teachers can add onto their requests that goes to support the DonorsChoose organization and all of its efforts. After looking at plots both with and without optional support, we found very little difference, so for presentation purposes we show only the version excluding optional support. The overall average (median) amount requested was about $380, quite similar to the average amount raised. We again split by fully funded and not fully funded, which shows that on average the projects requesting higher amounts were the ones not funded.
Fig 6: Histogram of log(Total Amount Requested)
Knowing the dispersion of both the total amount requested and the total amount donated per project, we visualized this data by fully funded status. These two plots show that funded projects on average requested a smaller amount (target amount) and received a larger total of donations.
Fig 7: log($ Requested) by Fully Funded; and log(Total Donations) by Fully Funded
To get an idea of how funding has changed over time, we plotted a time trend of the fully funded variable.
Fig. 8: % Projects Funded over Time
Finally, we wanted to visualize some of our factor variables against these main variables. The first two graphs show charter school by fully funded and special teacher by fully funded. Both of these factors, when true, see increases in fully funded projects. The next two graphs show fully funded projects by school metro and by school poverty level. With the metro area of the school, more urban schools and fewer rural schools have fully funded projects. With the poverty level, high-poverty schools are firstly the biggest proportion of schools, and secondly have more fully funded projects than not. Interestingly, seeing these two charts side by side suggests that high-poverty schools extend beyond just the urban schools.
Fig 9: Fully Funded by Charter School; and Fully Funded by Special Teacher
Fig 10: School Metro by Fully Funded; and School Poverty Level by Fully Funded
The essay data were processed to extract summary-level data, extract topics by fitting a latent topic model, and determine document-topic weights for the extracted topics. While most of the processing was completed in R, a pre-processing step in a text editor removed the project and teacher IDs and replaced various punctuation marks with spaces. This pre-processing converted the CSV file into a text file with one line per project containing all of that project's essays concatenated together. We later discovered that non-ASCII characters such as curly quotes had not been removed and remained in the data. Terms including these characters were ignored in a later step, since the error was discovered after 48 hours of continuous processing and time was not available to redo the analysis. If we were to rerun the data cleaning, we would remove or replace these characters and keep the affected terms in our analysis.
Following the pre-processing, the data were processed in R (the code used for this processing is documented in Appendix B). Many of the steps were repeated multiple times on subsets of the data so that execution could complete on a machine with only 4 GB of memory. We calculated statistics such as word count, syllable count, and sentence count in order to compute a Flesch-Kincaid grade level for each essay. After that, punctuation and numbers were removed and all words were converted to lower case. The results were stemmed using the Snowball stemmer, which is based on the Porter stemmer. The entries needed to construct a simple triplet matrix holding the document-term matrix were written to a file before the next chunk was processed.
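The three counts feed the standard Flesch-Kincaid grade-level formula (also noted in the Appendix B code):

```r
# Flesch-Kincaid grade level from per-essay word, sentence, and syllable counts
fk_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}
fk_grade(words = 250, sentences = 15, syllables = 380)  # 8.846
```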
After processing was completed, the top 1,000 terms were determined by ranking average TF-IDF values. We selected a random sample of 3,000 essays (1% of all essays) to estimate 50 latent topics. These latent topics were then used to calculate document-topic weights for all 300,000 essays, processing 10,000 essays at a time to decrease execution time. The document-topic weights, word counts, and Flesch-Kincaid grade level for each project's essays were stored for use in logistic regression and topic prediction.
We built a variety of models to understand the variables that might affect a project being fully funded.
First was a full logistic regression model with no interactions, plus a model using false discovery rate control (FDR) to cut down overfitting and hopefully improve predictions.
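A minimal sketch of the two models, assuming the cleaned project data sit in a data frame model_data with the Boolean funded response. The Benjamini-Hochberg step-up rule at q = 0.1 stands in here for whatever FDR level was actually used:

```r
# Full logistic regression (no interactions)
full <- glm(funded ~ ., data = model_data, family = binomial)

# FDR control: keep only coefficients surviving a Benjamini-Hochberg cut
pvals <- summary(full)$coefficients[-1, "Pr(>|z|)"]  # drop the intercept
q <- 0.1                                             # assumed FDR level
p_sorted <- sort(pvals)
below <- p_sorted <= q * seq_along(p_sorted) / length(p_sorted)
cutoff <- if (any(below)) max(p_sorted[below]) else 0
surviving <- names(pvals)[pvals <= cutoff]  # refit the FDR model on these terms
```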
The full model regression output (only variables significant at the 5% level):

                                         Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                             7.280e+05   3.939e+05    1.848  0.064564 .
primary_focus_areaHistory & Civics      3.239e-01   1.543e-01    2.100  0.035753 *
primary_focus_areaMath & Science        2.519e-01   1.166e-01    2.160  0.030802 *
resource_usageenrichment               -3.051e+00   9.841e-01   -3.100  0.001937 **
resource_usageessential                -3.036e+00   9.844e-01   -3.084  0.002043 **
resource_typeOther                     -2.573e-01   1.166e-01   -2.206  0.027396 *
resource_typeSupplies                  -2.613e-01   9.560e-02   -2.733  0.006274 **
resource_typeTechnology                -3.711e-01   1.005e-01   -3.692  0.000222 ***
resource_typeVisitors                  -2.191e+00   6.756e-01   -3.242  0.001185 **
grade_levelGrades 9-12                  2.336e-01   9.146e-02    2.554  0.010653 *
total_price_excluding_optional_support -6.131e-03   2.005e-03   -3.059  0.002222 **
total_price_including_optional_support  4.518e-03   1.661e-03    2.719  0.006538 **
eligible_double_your_impact_matchTRUE   2.140e-01   5.981e-02    3.578  0.000347 ***
eligible_almost_home_matchTRUE          3.268e-01   8.787e-02    3.719  0.000200 ***
date_posted                             2.313e-04   1.074e-04    2.152  0.031368 *
essaymethodnewTRUE                     -2.785e-01   9.024e-02   -3.087  0.002024 **
school_funded                           1.634e+00   1.554e-01   10.521   < 2e-16 ***
teacher_funded                          9.379e+00   1.491e-01   62.912   < 2e-16 ***
teacher_projectfrequency               -8.182e-02   6.112e-03  -13.386   < 2e-16 ***
word_count                             -4.769e-04   1.680e-04   -2.838  0.004536 **
The largest coefficient is for teacher_funded. The negative coefficients for resource_usage and resource_type appear because a small number of projects with "unknown" resource_usage have high fundedness, and because the baseline level of resource_type (Books) has the highest fundedness.
Fig 11: The FDR model
                                         Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                            -2.159e+00   6.094e-02  -35.435   < 2e-16 ***
resource_usageenrichment               -1.025e+00   5.237e-02  -19.570   < 2e-16 ***
resource_usageessential                -1.078e+00   5.243e-02  -20.564   < 2e-16 ***
resource_typeOther                     -4.780e-01   2.362e-02  -20.233   < 2e-16 ***
resource_typeSupplies                  -2.347e-01   1.689e-02  -13.901   < 2e-16 ***
resource_typeTechnology                -5.728e-01   1.811e-02  -31.627   < 2e-16 ***
resource_typeTrips                      4.282e-02   6.362e-02    0.673     0.501
resource_typeVisitors                  -5.430e-01   1.244e-01   -4.365  1.27e-05 ***
total_price_excluding_optional_support -8.807e-04   1.470e-05  -59.910   < 2e-16 ***
eligible_double_your_impact_matchTRUE   1.644e-01   1.504e-02   10.931   < 2e-16 ***
eligible_almost_home_matchTRUE          2.739e-01   2.642e-02   10.369   < 2e-16 ***
essaymethodnewTRUE                     -8.179e-02   1.550e-02   -5.275  1.32e-07 ***
school_funded                           5.800e-01   4.267e-02   13.592   < 2e-16 ***
teacher_funded                          7.109e+00   3.580e-02  198.542   < 2e-16 ***
teacher_projectfrequency               -1.154e-02   3.235e-04  -35.681   < 2e-16 ***
word_count                             -1.989e-04   4.206e-05   -4.730  2.25e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 335280 on 273994 degrees of freedom
Residual deviance: 171633 on 273979 degrees of freedom
R2 = 1 - (171633/335280) = 0.4881
Fig 12: Comparison of fits using full and FDR models
Fig 13: ROC curve for FDR model (high sensitivity and specificity)
Then we used LASSO so we could include interactions with the special-teacher summary variable (the data set is too big to include all interactions given the computing power and memory we had available). The first variables to become significant at very high penalty values were resource type levels Books and Technology, the proportion of that school's and that teacher's projects to be fully funded, and three of the essay topics. Topic 9 is a technology topic (computer, projector, laptop), Topic 32 seems to be furniture-related (rug, carpet, chair), and Topic 34 is about the DonorsChoose website and logistics (wwwdonorschooseorg, html, href, onclick). Most interactions were not very significant.
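A sketch of the penalized fit with glmnet; any L1-penalized logistic routine would do. The names model_data and special_teacher are placeholders for our design frame and special-teacher summary variable, and note that glmnet's s is the penalty weight itself, which need not match the s scale reported in the figures:

```r
library(Matrix)
library(glmnet)

# Main effects plus interactions with the special-teacher flag only
# (the full interaction set was too large for our memory budget)
X <- sparse.model.matrix(funded ~ . + .:special_teacher, data = model_data)[, -1]
fit <- glmnet(X, model_data$funded, family = "binomial")

plot(fit, xvar = "lambda")  # trace of loading magnitudes across penalty levels
coef(fit, s = 0.01)         # loadings at one particular penalty value
```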
Fig 14: Comparison of most significant loading magnitudes by LASSO penalty level
Fig 15: ROC curves for s=0.0005 (left) and s=0.01 (right)
These are surprisingly similar, and the one with more variables included, s=0.01, is actually worse.
We attempted to find a lower-dimensional model using principal component analysis (PCA), and found that the first four components were a fairly good basis, using both visual determination of where the scree plot leveled off and the heuristic of taking the components with variances above 1.
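Both selection heuristics can be read off a prcomp fit; num_vars here is an assumed frame of the numeric predictors:

```r
# PCA on scaled predictors; keep components with variance above 1
pc <- prcomp(num_vars, scale. = TRUE)
plot(pc, type = "l")   # scree plot: look for where it levels off
pc$sdev^2              # per-component variances; the first four exceed 1
scores <- pc$x[, 1:4]  # first four components, reusable as regressors
```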
Fig 16: Principal components
The first principal component correlates very well with number of donors - the PC1/2 graph on the left is colored with red dots for projects with more than 3 donors (mean is 3.48). The second and third principal components both correlate with whether the project is fully funded or not - the PC1/2 graph on the right and the PC3/4 one below are both colored with red dots for fully funded projects. It's not immediately clear what PC4 is measuring.
Fig 17: First and second principal components
Fig 18: Third and fourth principal components
Running a regression of fundedness on the first four principal components:
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    1.273329    0.006011   211.83    <2e-16 ***
pc[, 1:4]PC1   0.762647    0.004548   167.67    <2e-16 ***
pc[, 1:4]PC2  -0.854576    0.004754  -179.75    <2e-16 ***
pc[, 1:4]PC3  -0.744529    0.005432  -137.06    <2e-16 ***
pc[, 1:4]PC4  -0.373431    0.005358   -69.69    <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 335280 on 273994 degrees of freedom
Residual deviance: 218218 on 273990 degrees of freedom
R2 = 1 - (218218/335280) = 0.3491
Fig 19: Fit using regression on four principal components
Fig 20: ROC curve
Last we built a tree model and trimmed it. Here the most important variable was the average fundedness of a teacher's projects. A teacher who has had projects funded before may have built up a roster of regular donors, in addition to seeing what project characteristics have led to funded projects in the past.
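A sketch with the tree package, again assuming the model_data frame used in the regressions above:

```r
library(tree)

# Grow the full tree on the Boolean fundedness response
full_tree <- tree(factor(funded) ~ ., data = model_data)
plot(full_tree)
text(full_tree)  # the dominant early split is on the teacher's average fundedness
```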
Fig 21: Untrimmed tree
Fig 22: Trimmed tree
Fig 23: ROC curve
This is considerably worse than the models above, and the curve is not smooth because it comes from a tree model.
We tested the predictive power of these models by running an out-of-sample prediction. For this prediction problem, FDR and LASSO (even lasso.small, which uses a very high penalty) have a lower mean squared error and outperform PCA (with four principal components) and trees.
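The comparison can be sketched as follows for any one of the fitted models; the 80/20 split is an assumption, as the holdout size is not stated above:

```r
# Out-of-sample MSE for one model; repeat for each model and compare
set.seed(1)
train <- sample(nrow(model_data), 0.8 * nrow(model_data))
fit <- glm(funded ~ ., data = model_data[train, ], family = binomial)
phat <- predict(fit, newdata = model_data[-train, ], type = "response")
mse <- mean((phat - as.numeric(model_data$funded[-train]))^2)
```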
Fig 24: Out of sample prediction
Our biggest finding is that teachers' experience with getting projects funded is hugely important in whether their next project will get funded. This is true across all the models we looked at. Other factors such as type of resource requested and specific essay topics were significant but less important, and these additional explanatory variables were somewhat different across our different models.
Our out of sample prediction showed similar predictive power for FDR and varying-penalty LASSO models. Because we had a large number of variables and observations and limited computing power, we weren't able to use the full power of LASSO by including all interactions. For future predictions, we expect the best model to use would be a LASSO model with more interactions included (but still using very few variables).
Besides using an analysis of the DonorsChoose essays to predict which projects get funded, we also wanted to explore how DonorsChoose could use the essays to automatically identify the primary focus area of the project or the primary subject. In addition, we looked for trends in the language used in essays to explore how submissions have evolved over time.
We first explored and visualized the identified latent topics using graphical techniques. After that, we analyzed the essay data using multinomial logistic modeling and decision trees to develop a prediction model for primary focus area (as identified by the teacher) based on the latent topic scores, word count, and reading grade level of the essay. In addition, we explored the results of these prediction methods to better understand the relationship between the latent topics and primary focus areas. Finally, we explored macro trends in latent topics by utilizing partial least squares.
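The multinomial fit can be sketched with nnet::multinom, assuming a frame essay_features holding the 50 topic weights, word count, grade level, and the teacher-identified focus area:

```r
library(nnet)

# Multinomial logistic regression of focus area on essay features
mfit <- multinom(primary_focus_area ~ ., data = essay_features,
                 MaxNWts = 10000, maxit = 500)
pred <- predict(mfit)  # class with the highest fitted probability
mean(pred == essay_features$primary_focus_area)  # share classified correctly
```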
We developed 50 latent topics from the essays. Given more time and computing power, we would have run the analysis multiple times and varied the number of latent topics to determine which number made the most sense. Below is a plot that shows the document-topic weights averaged across documents, which shows the relative usage of the topics in the dataset. The grand average weight for the 50 topics is 0.02.
Fig 25: Plot of Average Document-Topic Weight by Topic Number
The following word maps visualize the importance of various words in a few select topics. The size of the word is relative to the importance of the term. We used www.wordle.net to create these maps.
Fig 26: Topic 1: Reading is Fun
Fig 27: Topic 2: Yes We Can
Fig 28: Topic 4: Time Matters
Fig 29: Topic 6: Global Community
Fig 30: Topic 18: Science is Cool
Fig 31: Topic 19: Math Solves Problems
Using these topics, we originally attempted to predict the primary subject area of the project, as seen in our word map titles. After running the initial multinomial logistic model, which regressed topics onto the project subject, we decided that, given the number of similar categories and poor results of this analysis, we should attempt to predict the primary focus area instead. The plot below shows the range of probability estimates for predicting the primary subject area.
Fig 32: Predicting primary subject area
In contrast, regressing the topic areas onto the focus area of the project produced the following fitted probabilities.
Fig 33: Predicting focus area
While this plot shows that a few of the focus areas are difficult to predict accurately, others such as Math & Science and Literacy & Language projects are relatively easy to predict using the identified latent topics. The following table reports the loadings of a subset of the topics. The most important word in the topic is used to represent the topic.
Factor | Applied Learning | Health & Sports | History & Civics | Literacy & Language | Math & Science | Music & The Arts | Special Needs |
---|---|---|---|---|---|---|---|
book | -0.59 | -1.5 | -0.37 | 0.84 | -0.66 | -0.69 | 0.39 |
can | -0.04 | 0.04 | -0.03 | -0.05 | 0.02 | -0.01 | -0.03 |
year | 0.13 | 0.18 | -0.03 | -0.02 | -0.05 | 0.06 | -0.12 |
scienc | -0.33 | -0.2 | -0.79 | -0.97 | 1.36 | -0.7 | 0 |
math | -0.31 | -0.62 | -0.76 | -0.71 | 1.17 | -0.84 | 0.08 |
magazin | -0.27 | -0.35 | 0.8 | 0.03 | -0.31 | -0.22 | -0.16 |
special | -0.19 | 0.02 | -0.29 | -0.2 | -0.27 | -0.31 | 0.87 |
music | -0.07 | 0.14 | -0.23 | -0.29 | -0.39 | 1.15 | -0.05 |
ball | 0.05 | 0.84 | -0.46 | -0.45 | -0.02 | -0.31 | 0.15 |
art | 0.03 | -0.11 | -0.2 | -0.16 | -0.44 | 0.96 | 0.06 |
The following word maps visualize the loadings of each of these topics utilizing the most important word in each topic. The size of the word indicates the magnitude of the loading for that topic and focus area. Green words are positive loadings and red words are negative loadings. Also included in these loadings are the effect of word count and the Flesch-Kincaid grade level for the essay.
Fig 34: Word plot for Applied Learning
Fig 35: Word plot for Health & Sports
Fig 36: Word plot for History & Civics
Fig 37: Word plot for Literacy and Language
Fig 38: Word plot for Math & Science
Fig 39: Word plot for Music & Arts
Fig 40: Word plot for Special Needs
The multinomial logistic regression correctly classified 64.43% of primary focus areas, with classification determined by the area with the highest fitted probability.
For comparison purposes, a decision tree for classification was also developed on the same data.
Fig 41: Classification tree
This tree correctly classifies 70.66% of projects based on essays. The output for this tree is available in Appendix D.
Fig 42: Deviance vs nodes, based on cross-validation with 10 folds
Based on this graph, the tree was pruned to have 10 leaves instead of 12. The resulting tree, which is plotted below, correctly classified 67.51% of projects.
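The pruning step can be sketched with the tree package's cross-validation helpers; focus_tree stands for the 12-leaf classification tree above:

```r
library(tree)

# 10-fold cross-validation over tree sizes, then prune to 10 leaves
cvres <- cv.tree(focus_tree, FUN = prune.misclass, K = 10)
plot(cvres$size, cvres$dev, type = "b")           # deviance vs. number of leaves
pruned <- prune.misclass(focus_tree, best = 10)   # the pruned tree plotted below
```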
Fig 43: Classification tree
Based on the results of these tests, it appears that decision trees are better able to classify projects and also provide some intuition as to how the classification is made. One drawback of using decision trees here is that, no matter the tree size, no project is ever classified as "Applied Learning." This may indicate that Applied Learning is used for projects that cannot be classified under any other focus area. If more time were available, we would measure these methods with out-of-sample prediction to select the best classification method among decision trees, dyna-trees, multinomial logistic regression, and K-nearest neighbors.
We also used the topic data to attempt to identify trends in essay writing over time. This analysis was completed by using partial least squares with time regressed onto each of the topic weights. The following is a plot of the first two partial-least squares components and the resulting correlation between the fitted values and the date of submission.
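A sketch with the pls package (textir's PLS routine would work similarly); topic_weights and essay_dates are assumed names for the document-topic weight matrix and submission dates:

```r
library(pls)

# Partial least squares: submission date regressed on the 50 topic weights
d <- data.frame(date = as.numeric(essay_dates), topics = I(topic_weights))
pfit <- plsr(date ~ topics, ncomp = 2, data = d)
cor(fitted(pfit)[, 1, 2], d$date)  # correlation of two-component fit with date
```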
Fig 44: Partial least squares components
From this plot, we see that there is little additional information available in the residuals. In addition, we see a dramatic shift in topics in 2007. The topics that exhibited the largest changes over time are displayed in the following word plot. Green topics have increased in importance over time while red topics have been used less over time.
Fig 45: Topic frequency changes
Based on this analysis, we plotted latent topics over time to look for changes in the usage of certain words by teachers applying for funding through DonorsChoose. The average weight of a topic in a document is 0.02 since there are 50 topics.
Fig 46: Document-topic weights, Topic 34
This plot shows that language related to the DonorsChoose website dropped abruptly in the second quarter of 2007. During that time, the website was being modified in preparation for its national launch and we hypothesize that changes to the website removed the need to discuss the logistical details that may have been a part of successful essays in the past.
Fig 47: Document-topic weights, Topic 6
In contrast to the changes in the DonorsChoose topic, use of Topic 6, which we've called 'Global Community,' has steadily increased over the eight years of essays submitted to DonorsChoose. Topic 7, which appears to describe an eagerness to learn, has increased over time in a similar fashion.
With all of this information from our analyses, we have been able to see trends that cross both essay and donation data. We believe that this data will be helpful to both the teachers using DonorsChoose.org and the website itself.
The biggest factor in our donation prediction was the teacher having a high proportion of successfully funded projects. Similarly, the school having a high proportion of fully funded projects also had predictive power. However, more projects per teacher is associated with a lower likelihood of successful funding, perhaps because the teacher's network of potential donors is spread too thin. These facts imply there are variables we currently do not have access to that might help predict a teacher's success. Independent of the school/teacher funded factors, projects were on average more likely to be funded if they requested books or supplies. Projects requesting technology were the least likely to be fully funded, even after accounting for the amount requested.
Another finding was that concise essays (lower word count) were more likely to be funded. Given that books are the most funded resource, it was not surprising that Reading was the most important topic (Topic 1). Based on our topic analysis, we discovered an increased use over time of Topics 6 and 7 (which focus on community and aspirations).
Given these types of trends our advice to teachers would be:
Our advice to DonorsChoose.org would be:
We would recommend in the future focusing efforts in the following areas:
Prior to starting our analysis, we interviewed Annie, a teacher who uses DonorsChoose.org to obtain materials for her classroom. Over time, she has noticed that the website has moved from a free-form system to one with more step-by-step examples and forms for teachers applying for funding. The site provides guidance on how to present your request so that the items needed and reason for the proposal are clear. In addition, the site used to suggest a donation of $5 but now leaves the donation field blank: this change has driven larger average donations.
Based on her experience, technology, math, and science get funded more often than books, speakers, and physical education supplies. This means that she needs to be more creative with history and social studies projects. Also, less expensive projects are more likely to be funded and get funded faster.
She now only posts when she is able to get a matching donation from a corporate sponsor. Most of her donations come from friends and family. She uses Facebook to drive awareness for her projects using the DonorsChoose application on Facebook. This application posts on each donor's wall as well as the teacher's wall. These posts allow friends of friends to learn about the project and has driven donations. In addition, she has received funding from people in Texas. The three Teach for America teachers that she had at her school last year have also had great success posting their projects on Facebook.
She has had a great deal of success funding her Model UN and has also received funding for laptops and prep books. She noted that her department needs technology so they have an initiative to make laptop labs via donations with a goal of creating 5 labs with 20 laptops each. Without DonorsChoose, there is no way that she would have been able to get laptops for her students.
Most first-time donors seem to know a teacher that has posted a project but many of them seem to come back to search for and fund interesting projects. Any donations over $100 get hand written notes from students. These thank you notes are well received and help to drive repeat donations.
library(RWekajars) library(RWeka) library(Snowball) library(tm) library(textir) # Open custom functions setwd("~/My Dropbox/DMF/R Code") source("syllable_count_funct.R") setwd("~/41201-Data_Mining/Final/Data") # Reading in text files essay_in <- file("essays_Processed_v2.txt", "r", raw=TRUE) #open file for reading essay_out <- file("essays_stemmed.txt", "w+", raw=TRUE) #open file for writing and truncate stats_out <- file("essay_stats.csv", "w+", raw=TRUE) #open file for writing and truncate contents shortstop <- scan("shortstoplist.txt", 'character', quiet=TRUE) # Initialize variables count = 0 essays_dat <- readLines(essay_in, n=1) # discard first row with headers writeLines(essays_dat, essay_out) # write first line for consistency close(essay_out) stats_out <- file("essay_stats.csv", "w+", raw=TRUE) #open file for writing and truncate contents t <- data.frame(sentence_count=sentences, word_count=words, syllable_count=syllables) write.csv(t, stats_out, row.names=FALSE, col.names=TRUE) # write first line for consistency close(stats_out) # Cycle until end of file is reached, processing n projects per pass while(length(essays_dat <- readLines(essay_in, n=1000))>0) { # read essays for 10,000 projects # Pre-processing statistics captured sentences <- SentenceCount(essays_dat) # Stores the number of sentences in the essay # Initial Processing essays <- Corpus(VectorSource(essays_dat)) # Put documents in a structure for processing essays <- tm_map(essays, stripWhitespace) # Remove extra spaces essays <- tm_map(essays, removeNumbers) # Remove all numbers essays <- tm_map(essays, tolower) # Set all documents ot lower case essays <- tm_map(essays, removePunctuation) # Remove all punctuation # Intermediate Statistics Captured dtm_t <- DocumentTermMatrix(essays) # constructs a Document - Term matrix words <- row_sums(dtm_t) # counts the total number of words used #counts the number of syllables in each word and multiples by usage of each word syllables <- (as.matrix(dtm_t) 
        %*% SyllableCount(colnames(dtm_t)))  # finish the per-document syllable totals

# Final processing - remove stop words and stem
essays <- tm_map(essays, removeWords, words=shortstop)  # Remove stopwords
essays <- tm_map(essays, stemDocument)                  # Stem the documents
essays <- tm_map(essays, stripWhitespace)               # Remove extra spaces

# Store the stemmed essays, appending to the end of the file
essay_out <- file("essays_stemmed.txt", "a", raw=TRUE)
writeLines(as.character(essays), essay_out)
close(essay_out)

# Store the readability statistics in a CSV for reading later, appending to the end of the file
stats_out <- file("essay_stats.csv", "a", raw=TRUE)
# store the results in a data frame
t <- data.frame(sentence_count=sentences, word_count=words, syllable_count=syllables)
write.table(t, stats_out, sep=",", row.names=FALSE, col.names=FALSE)  # write.csv() would ignore col.names
close(stats_out)
print((count <- count + 1))
}

## Note: gradelevel <- ((0.39 * words / sentences) + (11.8 * syllables / words) - 15.59)

### Post-stemming processing
# Read in the project IDs
essays_IDs <- read.csv("essay_ids.csv", colClasses=c("character", "character"))

# Read in the processed and stemmed documents
essay_out <- file("essays_stemmed.txt", "r", raw=TRUE)  # open for reading
dtm_out <- file("sparse_matrix.csv", "w", raw=TRUE)     # open for writing, truncating contents
essay <- readLines(essay_out, n=1)                      # Discard the header row

# Read the first record and initialize variables
### i = document index, j = term index, v = count in doc, terms = dictionary of terms
essays_dat <- readLines(essay_out, n=1)                 # read one essay
writeLines("i,j,v", dtm_out)                            # write the CSV header
close(dtm_out)
t <- termFreq(Corpus(VectorSource(essays_dat))[[1]])    # calculate the term frequencies
i <- rep.int(1, length(t))    # set the document indices
j <- c(1:length(t))           # set the term indices
v <- as.vector(t)             # set the frequency counts
terms <- c(1:length(t))       # build a vector of term indices
names(terms) <- names(t)      # name the indices
doc_num <- 2                  # counter for the number of documents processed

### Build a document-term matrix to perform TF-IDF
while(length(essays_dat <- readLines(essay_out, n=1)) > 0) {  # read one essay at a time
  t <- termFreq(Corpus(VectorSource(essays_dat))[[1]])  # calculate the term frequencies
  i <- c(i, rep.int(doc_num, length(t)))   # set the document index to the current document number
  old <- t[names(t) %in% names(terms)]     # identify the terms already in the dictionary
  j <- c(j, as.vector(terms[names(old)]))  # use the dictionary values to set j
  new <- t[!(names(t) %in% names(terms))]  # identify the new terms
  if(length(new) > 0) {
    j <- c(j, (max(terms)+1):(max(terms)+length(new)))          # store the new term indices
    terms <- c(terms, (max(terms)+1):(max(terms)+length(new)))  # create the new term indices
    # store the new term names
    names(terms) <- c(names(terms[1:(length(terms)-length(new))]), names(new))
  }
  v <- c(v, as.vector(old), as.vector(new))  # store the term frequencies for the document
  if(!(length(new)+length(old) == length(t))) { print(doc_num) }  # sanity checks
  if(!(length(i) == length(j))) { print(doc_num) }
  doc_num <- doc_num + 1  # cycle to the next document number
  if((doc_num %% 1000) == 0) {  # flush the triplets to disk every 1,000 documents
    dtm_out <- file("sparse_matrix.csv", "a", raw=TRUE)  # open for appending
    writeLines(paste(i, j, v, sep=","), dtm_out)
    close(dtm_out)
    write.csv(data.frame(term=names(terms), index=as.vector(terms)), "terms.csv")
    i <- j <- v <- NULL
    print(c(doc_num, "Done"), quote=FALSE)
  }
}
dtm_out <- file("sparse_matrix.csv", "a", raw=TRUE)  # flush the remaining triplets
writeLines(paste(i, j, v, sep=","), dtm_out)
close(dtm_out)
write.csv(data.frame(term=names(terms), index=as.vector(terms)), "terms.csv")

### Read the files back in to construct a document-term matrix, trimmed to the 1,000 most important terms
# Read in the terms
terms_df <- read.csv("terms.csv", header=TRUE, colClasses=c("double", "character", "double"))
# Read in the doc-term matrix
dtm_df <- read.csv("sparse_matrix.csv", header=TRUE, colClasses=c("double", "double", "double"))
# Create the doc-term matrix
dtm <- simple_triplet_matrix(dtm_df$i, dtm_df$j, dtm_df$v, dimnames=list(NULL, terms_df$term))
rm(dtm_df, terms_df)    # Remove temp variables from memory
tfm <- tfidf(dtm)       # calculate the TF-IDF
csums <- col_sums(tfm)  # calculate the column sums from the TF-IDF
keep <- (csums >= sort(csums, decreasing=TRUE)[1000])  # keep the top 1,000 terms for fast processing
# dimnames(dtm)[[2]][keep]  # see the terms being kept
dtm <- dtm[, keep]      # drop the other terms
rm(csums, keep, tfm)    # Remove temp variables from memory
gc()                    # Run garbage collection
# save(dtm, file="dtm.dat")

### Run the topic extraction and save a topics data file
dtm <- dtm[row_sums(dtm) > 0, ]    # remove documents without any words
keep <- sample(1:nrow(dtm), 3000)  # create a sample of 3,000 documents for topic extraction
dtms <- dtm[keep, ]                # store these in a separate variable
top <- topics(dtms, 50)            # fit 50 topics on the trimmed data set
# summary(top, 10)                 # display the topics
save(top, file="topics.dat")       # Save the topics variable

### Code to find the topic loadings for each document - run in chunks to fit in memory
load("topics.dat")  # Load the topics variable from disk
pred <- NULL        # A variable to store the matrix of predictions
batch <- 10000
for(n in 1:ceiling(nrow(dtm) / batch)) {  # iterate over batches of 10,000 documents
  start <- (n-1)*batch + 1
  end <- min(n*batch, nrow(dtm))
  dtms <- dtm[start:end, ]
  predt <- predict(top, dtms)
  pred <- rbind(pred, predt)
}
save(pred, file="predict.dat")
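The commented `gradelevel` note in the preprocessing code above is the Flesch-Kincaid grade-level formula. A minimal worked example with toy counts (the names `words`, `sentences`, and `syllables` match the per-essay totals computed earlier; the numbers here are illustrative, not from the data):

```r
# Flesch-Kincaid grade level for one essay (toy counts, illustrative only)
words <- 120      # total words in the essay
sentences <- 8    # total sentences
syllables <- 180  # total syllables
gradelevel <- (0.39 * words / sentences) + (11.8 * syllables / words) - 15.59
round(gradelevel, 2)  # 7.96, i.e. roughly an 8th-grade reading level
```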
Topic % Usage:
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
 4.9  3.2  2.8  2.7  2.7  2.6  2.5  2.5  2.5  2.4  2.4  2.3  2.3  2.3  2.3  2.3  2.2  2.2  2.1  2.1
  21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40
 2.1  2.1  2.0  2.0  1.9  1.9  1.8  1.8  1.8  1.8  1.8  1.7  1.7  1.7  1.7  1.6  1.5  1.5  1.5  1.5
  41   42   43   44   45   46   47   48   49   50
 1.5  1.5  1.4  1.3  1.3  1.3  1.3  1.3  1.2  1.1
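Usage percentages like those above can be computed as each topic's average share of the per-document topic weights. A minimal sketch, using a toy 3-by-2 weight matrix in place of the real `pred` matrix produced by `predict(top, dtms)` above (the normalize-then-average step is our assumption about how the table was built):

```r
# Toy document-by-topic weight matrix: 3 documents, 2 topics
pred <- matrix(c(0.8, 0.2,
                 0.5, 0.5,
                 0.2, 0.8), nrow = 3, byrow = TRUE)
shares <- pred / rowSums(pred)             # normalize each document's weights to sum to 1
usage <- round(100 * colMeans(shares), 1)  # average share per topic, as a percentage
usage  # 50 50
```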
Top 15 phrases by topic-over-null log odds:
Tree for Predicting Project Funding Before Pruning:
Regression tree:
tree(formula = funded ~ ., data = projects[train, c(8, 11:19, 22:26, 35:92)])
Variables actually used in tree construction:
[1] "teacher_funded"                         "total_price_excluding_optional_support"
[3] "total_price_including_optional_support"
Number of terminal nodes:  6
Residual mean deviance:  0.0393 = 72.86 / 1854
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -0.9167   0.0000   0.0000   0.0000   0.0000   0.9801

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1860 273.400 0.8210
   2) teacher_funded < 0.683333 418 86.390 0.2919
     4) teacher_funded < 0.366667 201 3.920 0.0199 *
     5) teacher_funded > 0.366667 217 53.830 0.5438
      10) total_price_excluding_optional_support < 254.345 48 3.667 0.9167 *
      11) total_price_excluding_optional_support > 254.345 169 41.600 0.4379 *
   3) teacher_funded > 0.683333 1442 36.050 0.9743
     6) teacher_funded < 0.933036 166 28.750 0.7771
      12) total_price_including_optional_support < 672.85 127 13.980 0.8740 *
      13) total_price_including_optional_support > 672.85 39 9.692 0.4615 *
     7) teacher_funded > 0.933036 1276 0.000 1.0000 *
Tree for Predicting Project Funding After Pruning:
Regression tree:
snip.tree(tree = tr, nodes = c(5, 3))
Variables actually used in tree construction:
[1] "teacher_funded"
Number of terminal nodes:  3
Residual mean deviance:  0.04915 = 2480 / 50470
Distribution of residuals:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 -0.96250   0.03752   0.03752   0.00000   0.03752   0.99980

node), split, n, deviance, yval
      * denotes terminal node

1) root 50470 7675.0000 0.8129000
  2) teacher_funded < 0.585714 9849 1553.0000 0.1962000
    4) teacher_funded < 0.225 5789    0.9998 0.0001727 *
    5) teacher_funded > 0.225 4060 1013.0000 0.4756000 *
  3) teacher_funded > 0.585714 40621 1467.0000 0.9625000 *
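The pruned tree above was cut by hand with `snip.tree()`. An alternative is to let cross-validation choose the tree size before pruning; a minimal sketch using the same `tree` package on simulated data (the data frame `df` and fitted tree `tr2` are illustrative stand-ins, not the project data):

```r
library(tree)  # the package used for the trees above
set.seed(1)
# Simulated stand-in data: the response depends on a single threshold in x
df <- data.frame(x = runif(500))
df$y <- as.numeric(df$x > 0.5)
tr2 <- tree(y ~ x, data = df)
cv <- cv.tree(tr2)                        # 10-fold CV deviance at each candidate size
best <- max(2, cv$size[which.min(cv$dev)])  # best size (at least 2, so pruning is possible)
pruned <- prune.tree(tr2, best = best)    # prune back to that many terminal nodes
summary(pruned)
```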
Tree for Predicting Project Focus Area Before Pruning:
Classification tree:
tree(formula = focus ~ d, mincut = 1)
Variables actually used in tree construction:
[1] "d.book"    "d.scienc"  "d.math"    "d.ball"    "d.music"   "d.art"
[7] "d.magazin" "d.special"
Number of terminal nodes:  12
Residual mean deviance:  1.9 = 520400 / 274000
Misclassification error rate:  0.2934 = 80386 / 273995

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 273995 851600 Literacy & Language ( 0.0840015 0.0267925 0.0557565 0.4438074 0.2394095 0.0914652 0.0587675 )
    2) d.book < 0.0316299 177635 607100 Math & Science ( 0.1114476 0.0402511 0.0595097 0.2576576 0.3312129 0.1293777 0.0705435 )
      4) d.scienc < 0.0538795 148229 525600 Literacy & Language ( 0.1275864 0.0465159 0.0679624 0.3039554 0.2223789 0.1510231 0.0805780 )
        8) d.math < 0.0521466 124446 439700 Literacy & Language ( 0.1447535 0.0543449 0.0794642 0.3490028 0.1070987 0.1771130 0.0882230 )
         16) d.ball < 0.0591003 113674 378200 Literacy & Language ( 0.1447561 0.0078294 0.0854285 0.3777381 0.1093038 0.1887679 0.0861763 )
           32) d.music < 0.0755316 103603 341000 Literacy & Language ( 0.1557580 0.0082334 0.0926228 0.4093993 0.1191182 0.1225351 0.0923332 )
             64) d.art < 0.060396 92189 290700 Literacy & Language ( 0.1667661 0.0089599 0.1002723 0.4447711 0.1311111 0.0508737 0.0972459 )
              128) d.magazin < 0.0488028 80659 242700 Literacy & Language ( 0.1793600 0.0098811 0.0380367 0.4709580 0.1422284 0.0533480 0.1061878 )
                256) d.special < 0.0469463 68022 193400 Literacy & Language ( 0.1911293 0.0098645 0.0411485 0.5055864 0.1529064 0.0594073 0.0399577 ) *
                257) d.special > 0.0469463 12637 34920 Special Needs ( 0.1160085 0.0099707 0.0212867 0.2845612 0.0847511 0.0207328 0.4626889 ) *
              129) d.magazin > 0.0488028 11530 29680 History & Civics ( 0.0786644 0.0025152 0.5356461 0.2615785 0.0533391 0.0335646 0.0346921 ) *
             65) d.art > 0.060396 11414 23960 Music & The Arts ( 0.0668477 0.0023655 0.0308393 0.1237077 0.0222534 0.7013317 0.0526546 ) *
           33) d.music > 0.0755316 10071 11720 Music & The Arts ( 0.0315758 0.0036739 0.0114189 0.0520306 0.0083408 0.8701221 0.0228379 ) *
         17) d.ball > 0.0591003 10772 30760 Health & Sports ( 0.1447271 0.5452098 0.0165243 0.0457668 0.0838284 0.0541218 0.1098218 ) *
        9) d.math > 0.0521466 23783 34400 Math & Science ( 0.0377581 0.0055502 0.0077787 0.0682420 0.8255897 0.0145062 0.0405752 ) *
      5) d.scienc > 0.0538795 29406 33850 Math & Science ( 0.0300959 0.0086717 0.0169013 0.0242808 0.8798204 0.0202680 0.0199619 ) *
    3) d.book > 0.0316299 96360 164400 Literacy & Language ( 0.0334060 0.0019822 0.0488377 0.7869655 0.0701743 0.0215753 0.0370589 )
      6) d.scienc < 0.0541045 91154 139900 Literacy & Language ( 0.0344472 0.0020295 0.0498936 0.8187244 0.0344362 0.0224455 0.0380236 )
       12) d.magazin < 0.0556062 79116 103700 Literacy & Language ( 0.0367182 0.0020729 0.0106679 0.8512185 0.0352900 0.0232949 0.0407377 )
         24) d.math < 0.0489715 76092 88280 Literacy & Language ( 0.0368896 0.0021159 0.0108553 0.8723125 0.0135231 0.0238396 0.0404642 ) *
         25) d.math > 0.0489715 3024 6153 Math & Science ( 0.0324074 0.0009921 0.0059524 0.3204365 0.5830026 0.0095899 0.0476190 ) *
       13) d.magazin > 0.0556062 12038 24180 Literacy & Language ( 0.0195215 0.0017445 0.3076923 0.6051670 0.0288254 0.0168633 0.0201861 ) *
      7) d.scienc > 0.0541045 5206 9152 Math & Science ( 0.0151748 0.0011525 0.0303496 0.2308874 0.6959278 0.0063388 0.0201690 ) *
Tree for Predicting Project Focus Area After Pruning:
Classification tree:
snip.tree(tree = topic_tree, nodes = c(3, 64))
Variables actually used in tree construction:
[1] "d.book"   "d.scienc" "d.math"   "d.ball"   "d.music"  "d.art"
Number of terminal nodes:  7
Residual mean deviance:  2.153 = 589800 / 274000
Misclassification error rate:  0.3249 = 89012 / 273995

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 273995 851600 Literacy & Language ( 0.084002 0.026792 0.055756 0.443807 0.239409 0.091465 0.058767 )
   2) d.book < 0.0316299 177635 607100 Math & Science ( 0.111448 0.040251 0.059510 0.257658 0.331213 0.129378 0.070544 )
     4) d.scienc < 0.0538795 148229 525600 Literacy & Language ( 0.127586 0.046516 0.067962 0.303955 0.222379 0.151023 0.080578 )
       8) d.math < 0.0521466 124446 439700 Literacy & Language ( 0.144754 0.054345 0.079464 0.349003 0.107099 0.177113 0.088223 )
        16) d.ball < 0.0591003 113674 378200 Literacy & Language ( 0.144756 0.007829 0.085429 0.377738 0.109304 0.188768 0.086176 )
          32) d.music < 0.0755316 103603 341000 Literacy & Language ( 0.155758 0.008233 0.092623 0.409399 0.119118 0.122535 0.092333 )
            64) d.art < 0.060396 92189 290700 Literacy & Language ( 0.166766 0.008960 0.100272 0.444771 0.131111 0.050874 0.097246 ) *
            65) d.art > 0.060396 11414 23960 Music & The Arts ( 0.066848 0.002366 0.030839 0.123708 0.022253 0.701332 0.052655 ) *
          33) d.music > 0.0755316 10071 11720 Music & The Arts ( 0.031576 0.003674 0.011419 0.052031 0.008341 0.870122 0.022838 ) *
        17) d.ball > 0.0591003 10772 30760 Health & Sports ( 0.144727 0.545210 0.016524 0.045767 0.083828 0.054122 0.109822 ) *
       9) d.math > 0.0521466 23783 34400 Math & Science ( 0.037758 0.005550 0.007779 0.068242 0.825590 0.014506 0.040575 ) *
     5) d.scienc > 0.0538795 29406 33850 Math & Science ( 0.030096 0.008672 0.016901 0.024281 0.879820 0.020268 0.019962 ) *
   3) d.book > 0.0316299 96360 164400 Literacy & Language ( 0.033406 0.001982 0.048838 0.786966 0.070174 0.021575 0.037059 ) *
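The misclassification error rates reported for both classification trees are simply the share of projects whose predicted focus area differs from the labeled one. A minimal sketch with hypothetical labels:

```r
# Misclassification error rate: fraction of documents assigned the wrong class
predicted <- c("Math & Science", "Literacy & Language", "Music & The Arts")
actual    <- c("Math & Science", "Special Needs",       "Music & The Arts")
err <- mean(predicted != actual)  # one mistake out of three documents
round(err, 4)  # 0.3333
```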