DonorsChoose.org Data Mining

by Jennifer Berk, Rick Cavolo, and Sarah Young

I. Introduction

Raising over 26 million dollars last year and reaching thousands of students a week, DonorsChoose.org is a nonprofit crowdfunding site for education. It was founded in 2000, and its mission is to engage "...the public in public schools by giving people a simple, accountable and personal way to address education inequity." DonorsChoose.org accomplishes this mission by providing a website and tools for crowdfunding projects that are proposed by teachers and vetted by the site. This year, DonorsChoose.org released data from its first 10 years in a contest called "Hacking Education," hoping to receive interesting data analyses and useful web apps. This analysis was prepared in the Data Mining class taught by Professor Matt Taddy at the University of Chicago Booth School of Business, as a contribution to the data analysis portion of the contest.

The process from idea to completion for a teacher using DonorsChoose.org works as follows. A teacher has a need, say 10 microscopes. They join DonorsChoose.org and submit an essay about their students (freshmen in biology), their project (science labs), and their need (10 microscopes). They submit the dollar amount required to purchase the microscopes and the project expiration date, and they spread the word about the project to friends and family. If the project becomes fully funded before it expires, DonorsChoose.org purchases the requested items (the microscopes) and ships them to the school. If the project does not get fully funded, the money is redirected in one of three ways: the donor can choose a new project, the site can choose a new project for the donor, or the teacher can choose a new project for the donor.

Our interest is in helping identify trends and, perhaps, in predicting fully funded projects. If we could give teachers and DonorsChoose.org a better understanding of which projects, schools, and areas of study get funded, they could target their projects and essays to get more projects fully funded. We are also interested in using the words in the essay to predict the project's subject or items, and in linking that with what we learn in the fundedness study to make suggestions to teachers posting their projects.

II. The Fundedness Data

The data for this project was downloaded from the DonorsChoose Developer site (http://developer.donorschoose.org/the-data) on May 23, 2011.

The data we are using include the projects, the donations, and the individual teacher essays written and posted for each project. The data sets as posted are very clean, so making the projects and donations data ready for analysis involved (1) dropping projects with funding status "live", since we don't know whether they will be fully funded, and making a Boolean fundedness response variable instead of one with several factor levels; (2) setting variable classes to factor/logical/date as appropriate, reordering levels for the poverty and grade level variables, restoring leading zeroes to zip codes, and correcting a mistyped state from "La" to "LA"; and (3) building additional variables to use in visualizations and models.
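A minimal sketch of these cleaning steps in R, assuming the projects file has been read into a data frame called projects; the column names and factor levels shown here follow the DonorsChoose export but are illustrative rather than the exact code we ran:

projects <- read.csv("projects.csv", stringsAsFactors = FALSE)
projects <- subset(projects, funding_status != "live")        # (1) drop live projects
projects$funded <- projects$funding_status == "completed"     # Boolean fundedness response
projects$date_posted <- as.Date(projects$date_posted)         # (2) set classes
projects$poverty_level <- factor(projects$poverty_level,      # reorder poverty levels (illustrative)
    levels = c("low poverty", "moderate poverty", "high poverty", "highest poverty"))
projects$school_zip <- sprintf("%05d", as.integer(projects$school_zip))  # restore leading zeroes
projects$school_state[projects$school_state == "La"] <- "LA"             # fix mistyped state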

Variables created about projects and donations:

To model whether a project would be fully funded, we used a subset of this data, not including any information about project donations, since teachers and DonorsChoose don't have that information before a project is posted. We considered including geographic information (school state) but determined that our analyses didn't run in a reasonable amount of time when we included that variable, which has more than 50 factor levels. The variables used were:

To get a good sense of this data, we decided to visualize some of it. Because of our interest in differentiating between projects that reached fully funded status and those that did not, we first plotted our funded factor against a variety of variables. The first plot shows the projects requested per state vs. the funded factor. While we hoped to see more of a trend, we think that fully funded projects probably differ more at a regional or district level.

Fig 1: Fully funded by State of Project Request (sorted by number of projects within state)

The second graph shows the number of projects posted by the school vs. the funded factor. We expected the downward trend present in this plot, as experience is crucial here: the first time a school posts a DonorsChoose project is different from its tenth or fiftieth posting.

Fig. 2: Fully Funded by School's Number of DonorsChoose Projects

Another interesting variable to explore is the total amount donated per project, so we graphed a histogram of it. This gives us an idea of both the average and the dispersion of the amount donated per project, for fully funded projects and for those that were not. Overall, the average (median) project raised about $350, but this plot shows the distinct differences between fully funded and not fully funded projects.

Fig 3: Histogram of log(Total Donations per Project)
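A minimal sketch of how a plot like Fig 3 could be produced in R, assuming a constructed total_donations column (the column name is illustrative; projects with no donations are dropped before taking logs):

funded_don   <- projects$total_donations[projects$funded & projects$total_donations > 0]
unfunded_don <- projects$total_donations[!projects$funded & projects$total_donations > 0]
hist(log(funded_don), col = rgb(0, 0, 1, 0.5), xlab = "log(total donations)",
     main = "log(Total Donations per Project)")
hist(log(unfunded_don), col = rgb(1, 0, 0, 0.5), add = TRUE)  # overlay the not-funded projects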

Using this total donated variable, we decided to see whether there was variation across states. Using the same sort (states ranked by the number of participating schools, from most to least, left to right), we can see there is not much variation between states in the average amount, but the states with more schools using DonorsChoose do have longer tails.

Fig 4: Boxplot of log(Total Donations) by Number of Projects per State

Another way to understand the state-by-state donations is to plot them on a map. The following map shows donations to schools, with each point representing a school's location.

Fig 5: Map of United States by Donation to School

Next, we wanted to understand the average dollar amount requested by teachers and its dispersion. We had two variables for the amount requested: one that included the dollar amount with "optional support" to DonorsChoose and one without it. This optional support is a suggested amount that teachers can add onto their requests, which goes to support the DonorsChoose organization and all of its efforts. After looking at plots both with and without optional support, we decided there was very little difference, so for presentation purposes we show only the amount excluding optional support. The overall average (median) amount requested was about $380, quite similar to the average amount raised. We again plotted both fully funded and not fully funded projects, and this shows that, on average, the higher requested amounts were the ones not funded.

Fig 6: Histogram of log(Total Amount Requested)

Knowing the dispersion of both the total amount requested and the total amount donated per project, we visualized these variables by funded status. These two plots show that funded projects, on average, had a smaller requested (target) amount and a larger total amount of donations.

Fig 7: log($ Requested) by Fully Funded; and log(Total Donations) by Fully Funded

To get an idea of how funding has changed over time, we plotted a time trend of the fully funded variable.

Fig. 8: % Projects Funded over Time

Finally, we wanted to visualize some of our factor variables against these main variables. The first two graphs show charter school by fully funded and special teacher by fully funded. When either of these factors is true, the proportion of fully funded projects increases. The next two graphs show fully funded projects by school metro area and by school poverty level. For metro area, more urban schools and fewer rural schools have fully funded projects. For poverty level, high-poverty schools are the biggest proportion of schools and also have more fully funded projects than not. Interestingly, seeing these two charts side by side suggests that more than just the urban schools are high-poverty schools.

Fig 9: Fully Funded by Charter School; and Fully Funded by Special Teacher

Fig 10: School Metro by Fully Funded; and School Poverty Level by Fully Funded

III. The Essay Data

The essay data was processed to extract summary-level data, extract topics by fitting a latent topic model, and determine document-topic weights for these extracted topics. While most of the processing was completed in R, a pre-processing step using a text editor removed the project and teacher IDs and replaced various punctuation marks with spaces. This pre-processing converted the CSV file into a text file with one line per project containing all of that project's essays concatenated together. We later discovered that non-ASCII characters such as curly quotes were not removed from the file and remained in the data. Terms containing these characters were ignored during a later step, since the error was discovered after 48 hours of continuous processing and time was not available to redo the analysis. If we were to rerun the data cleaning, we would remove or replace these characters and keep the terms in our analysis.

Following the pre-processing, the data were processed in R (the code used is documented in Appendix B). Many of the steps were repeated multiple times on subsets of the data so that execution could complete on a machine with only 4 GB of memory. We calculated statistics such as word count, syllable count, and sentence count in order to compute a Flesch-Kincaid grade level for each essay. After that, punctuation and numbers were removed and all words were set to lower case. The results were stemmed using the Snowball stemmer, which is based on the Porter stemmer. The entries needed to construct a simple triplet matrix holding the document-term matrix were written to a file before the next chunk was processed.
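The grade-level calculation itself is a simple function of those three counts; a minimal sketch using the formula noted at the end of Appendix B (the example counts are made up):

# Flesch-Kincaid grade level from word, sentence, and syllable counts
fk_grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}
fk_grade(words = 250, sentences = 14, syllables = 380)  # roughly a grade 9 essay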

After processing was completed, the top 1,000 terms were determined by ranking the terms' average TF-IDF values. We selected a random sample of 3,000 essays (1% of all essays) to estimate 50 latent topics. These latent topics were then used to calculate document-topic weights for all 300,000 essays, processing 10,000 essays at a time to keep execution time manageable. The document-topic weights, word counts, and Flesch-Kincaid grade level for each project's essays were stored for use in logistic regression and topic prediction.

IV. Fully Funded Analysis

We built a variety of models to understand the variables that might affect a project being fully funded.

First, we fit a full logistic regression model with no interactions, plus a model using false discovery rate (FDR) control to cut down on overfitting and, we hoped, improve predictions.
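A minimal sketch of this step, assuming the cleaned data frame projects with the Boolean response funded and a character vector covars naming the predictor columns described above (the q level shown is illustrative):

full <- glm(funded ~ ., data = projects[, c("funded", covars)], family = binomial)
pvals <- summary(full)$coefficients[-1, 4]        # p-values, dropping the intercept

# Benjamini-Hochberg cutoff: the largest p-value p_(k) with p_(k) <= q*k/m
fdr_cut <- function(pvals, q = 0.1) {
  p <- sort(pvals)
  k <- which(p <= q * seq_along(p) / length(p))
  if (length(k) == 0) 0 else p[max(k)]
}
keep <- names(pvals)[pvals <= fdr_cut(pvals, q = 0.1)]  # coefficients surviving the cut
# the FDR model is then refit using only the variables behind the surviving coefficients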

The full model regression output (only variables significant at 5% level):

                                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)                             7.280e+05  3.939e+05   1.848 0.064564 .  
primary_focus_areaHistory & Civics      3.239e-01  1.543e-01   2.100 0.035753 *  
primary_focus_areaMath & Science        2.519e-01  1.166e-01   2.160 0.030802 *  
resource_usageenrichment               -3.051e+00  9.841e-01  -3.100 0.001937 ** 
resource_usageessential                -3.036e+00  9.844e-01  -3.084 0.002043 ** 
resource_typeOther                     -2.573e-01  1.166e-01  -2.206 0.027396 *  
resource_typeSupplies                  -2.613e-01  9.560e-02  -2.733 0.006274 ** 
resource_typeTechnology                -3.711e-01  1.005e-01  -3.692 0.000222 ***
resource_typeVisitors                  -2.191e+00  6.756e-01  -3.242 0.001185 ** 
grade_levelGrades 9-12                  2.336e-01  9.146e-02   2.554 0.010653 *  
total_price_excluding_optional_support -6.131e-03  2.005e-03  -3.059 0.002222 ** 
total_price_including_optional_support  4.518e-03  1.661e-03   2.719 0.006538 ** 
eligible_double_your_impact_matchTRUE   2.140e-01  5.981e-02   3.578 0.000347 ***
eligible_almost_home_matchTRUE          3.268e-01  8.787e-02   3.719 0.000200 ***
date_posted                             2.313e-04  1.074e-04   2.152 0.031368 *  
essaymethodnewTRUE                     -2.785e-01  9.024e-02  -3.087 0.002024 ** 
school_funded                           1.634e+00  1.554e-01  10.521  < 2e-16 ***
teacher_funded                          9.379e+00  1.491e-01  62.912  < 2e-16 ***
teacher_projectfrequency               -8.182e-02  6.112e-03 -13.386  < 2e-16 ***
word_count                             -4.769e-04  1.680e-04  -2.838 0.004536 **

The largest coefficient is for teacher_funded. The large negative coefficients for resource_usage arise because a small number of "unknown" resource_usage projects have high fundedness, and the negative coefficients for resource_type arise because the baseline level of resource_type (Books) has the highest fundedness.

Fig 11: The FDR model

                                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)                            -2.159e+00  6.094e-02 -35.435  < 2e-16 ***
resource_usageenrichment               -1.025e+00  5.237e-02 -19.570  < 2e-16 ***
resource_usageessential                -1.078e+00  5.243e-02 -20.564  < 2e-16 ***
resource_typeOther                     -4.780e-01  2.362e-02 -20.233  < 2e-16 ***
resource_typeSupplies                  -2.347e-01  1.689e-02 -13.901  < 2e-16 ***
resource_typeTechnology                -5.728e-01  1.811e-02 -31.627  < 2e-16 ***
resource_typeTrips                      4.282e-02  6.362e-02   0.673    0.501    
resource_typeVisitors                  -5.430e-01  1.244e-01  -4.365 1.27e-05 ***
total_price_excluding_optional_support -8.807e-04  1.470e-05 -59.910  < 2e-16 ***
eligible_double_your_impact_matchTRUE   1.644e-01  1.504e-02  10.931  < 2e-16 ***
eligible_almost_home_matchTRUE          2.739e-01  2.642e-02  10.369  < 2e-16 ***
essaymethodnewTRUE                     -8.179e-02  1.550e-02  -5.275 1.32e-07 ***
school_funded                           5.800e-01  4.267e-02  13.592  < 2e-16 ***
teacher_funded                          7.109e+00  3.580e-02 198.542  < 2e-16 ***
teacher_projectfrequency               -1.154e-02  3.235e-04 -35.681  < 2e-16 ***
word_count                             -1.989e-04  4.206e-05  -4.730 2.25e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

    Null deviance: 335280  on 273994  degrees of freedom
Residual deviance: 171633  on 273979  degrees of freedom

R2 = 1 - (171633/335280) = 0.4881

Fig 12: Comparison of fits using full and FDR models

Fig 13: ROC curve for FDR model (high sensitivity and specificity)

Then we used the LASSO so we could include interactions with the special-teacher summary variable (the data set is too big to include all interactions given the computing power and memory we had available). The first variables to enter the model, at very high penalty values, were the resource type levels Books and Technology, the proportions of that school's and that teacher's projects that were fully funded, and three of the essay topics. Topic 9 is a technology topic (computer, projector, laptop), Topic 32 seems to be furniture-related (rug, carpet, chair), and Topic 34 is about the DonorsChoose website and logistics (wwwdonorschooseorg, html, href, onclick). Most interactions were not very significant.
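A sketch of such a fit; the report does not record which LASSO implementation we used, so the idea is shown here with glmnet, and the data frame lassodata and the special_teacher column name are assumptions:

library(glmnet)
# design matrix with main effects plus all interactions with special_teacher
x <- model.matrix(funded ~ . * special_teacher, data = lassodata)[, -1]
y <- as.numeric(lassodata$funded)
fit <- glmnet(x, y, family = "binomial")       # full lasso path over penalty values
plot(fit, xvar = "lambda")                     # which coefficients enter first as the penalty drops
probs <- predict(fit, newx = x, s = 0.01, type = "response")  # fitted probabilities at s = 0.01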

Fig 14: Comparison of most significant loading magnitudes by LASSO penalty level

Fig 15: ROC curves for s=0.0005 (left) and s=0.01 (right)
These are surprisingly similar, and the one with more variables included, s=0.01, is actually worse.

We attempted to find a lower-dimensional model using principal component analysis (PCA), and found that the first four components were a fairly good basis, using both visual determination of where the scree plot leveled off and the heuristic of taking the components with variances above 1.
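A minimal sketch of the PCA step, assuming xnumeric is a numeric model matrix built from the same predictors (the name is illustrative):

pc <- prcomp(xnumeric, scale. = TRUE)  # principal components on standardized variables
plot(pc)                               # scree plot: look for where the variances level off
summary(pc)                            # variances and cumulative proportion explained
z <- pc$x[, 1:4]                       # scores on the first four components, used below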

Fig 16: Principal components

The first principal component correlates very well with number of donors - the PC1/2 graph on the left is colored with red dots for projects with more than 3 donors (mean is 3.48). The second and third principal components both correlate with whether the project is fully funded or not - the PC1/2 graph on the right and the PC3/4 one below are both colored with red dots for fully funded projects. It's not immediately clear what PC4 is measuring.

Fig 17: First and second principal components

Fig 18: Third and fourth principal components

Running a regression of fundedness on the first four principal components:

              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.273329   0.006011  211.83   <2e-16 ***
pc[, 1:4]PC1  0.762647   0.004548  167.67   <2e-16 ***
pc[, 1:4]PC2 -0.854576   0.004754 -179.75   <2e-16 ***
pc[, 1:4]PC3 -0.744529   0.005432 -137.06   <2e-16 ***
pc[, 1:4]PC4 -0.373431   0.005358  -69.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Null deviance: 335280  on 273994  degrees of freedom
Residual deviance: 218218  on 273990  degrees of freedom

R2 = 1 - (218218/335280) = 0.3491

Fig 19: Fit using regression on four principal components

Fig 20: ROC curve

Last, we built a tree model and trimmed it. Here the most important variable was the average fundedness of a teacher's projects. A teacher who has had projects funded before may have built up a roster of regular donors, in addition to seeing what project characteristics have led to funded projects in the past.
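A sketch of the tree fit, following the call recorded in Appendix D (Appendix D trims with snip.tree; prune.tree to the same number of leaves is shown here for brevity, funded is coded 0/1 to give a regression tree, and train is a training-set index):

library(tree)
tr <- tree(funded ~ ., data = projects[train, c(8, 11:19, 22:26, 35:92)])  # as in Appendix D
plot(tr); text(tr)                      # Fig 21: untrimmed tree
cvtr <- cv.tree(tr, K = 10)             # 10-fold cross-validation over subtree sizes
plot(cvtr)                              # deviance vs. number of leaves
trimmed <- prune.tree(tr, best = 3)     # keep a 3-leaf subtree, as in the pruned output
plot(trimmed); text(trimmed)            # Fig 22: trimmed tree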

Fig 21: Untrimmed tree

Fig 22: Trimmed tree

Fig 23: ROC curve
Considerably worse than the models above, and not smooth because it's a tree model

We tested the predictive power of these models by running an out-of-sample prediction. For this prediction problem, FDR and LASSO (even lasso.small, which uses a very high penalty) have lower mean squared error and do better than PCA (with four principal components) and trees.
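A sketch of the comparison, assuming each model was fit on the training rows only and that fdr_model is one of the fitted objects (the other models are scored the same way):

oos <- setdiff(1:nrow(projects), train)                        # held-out rows
p_fdr <- predict(fdr_model, newdata = projects[oos, ], type = "response")
mse_fdr <- mean((as.numeric(projects$funded[oos]) - p_fdr)^2)  # out-of-sample MSE
# repeat for the lasso, principal-component, and tree fits and compare the MSEs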

Fig 24: Out of sample prediction

Our biggest finding is that teachers' experience with getting projects funded is hugely important in whether their next project will get funded. This is true across all the models we looked at. Other factors such as type of resource requested and specific essay topics were significant but less important, and these additional explanatory variables were somewhat different across our different models.

Our out of sample prediction showed similar predictive power for FDR and varying-penalty LASSO models. Because we had a large number of variables and observations and limited computing power, we weren't able to use the full power of LASSO by including all interactions. For future predictions, we expect the best model to use would be a LASSO model with more interactions included (but still using very few variables).

V. Essay Subject Analysis

Besides using an analysis of the DonorsChoose essays to predict which projects get funded, we also wanted to explore how DonorsChoose could use the essays to automatically identify the primary focus area of the project or the primary subject. In addition, we looked for trends in the language used in essays to explore how submissions have evolved over time.

We first explored and visualized the identified latent topics using graphical techniques. After that, we analyzed the essay data using multinomial logistic modeling and decision trees to develop a prediction model for primary focus area (as identified by the teacher) based on the latent topic scores, word count, and reading grade level of the essay. In addition, we explored the results of these prediction methods to better understand the relationship between the latent topics and primary focus areas. Finally, we explored macro trends in latent topics by utilizing partial least squares.
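A minimal sketch of the focus-area model described above; the report does not record which multinomial implementation we used, so nnet::multinom is shown, with topicdata assumed to hold the 50 topic weights plus word_count, grade_level, and the focus label:

library(nnet)
mfit <- multinom(focus ~ ., data = topicdata, maxit = 500)  # multinomial logistic regression
probs <- predict(mfit, type = "probs")                      # fitted focus-area probabilities
pred <- predict(mfit, type = "class")                       # classify by highest probability
mean(pred == topicdata$focus)                               # in-sample classification rate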

We developed 50 latent topics from the essays. Given more time and computing power, we would have run the analysis multiple times and varied the number of latent topics to determine which number made the most sense. Below is a plot of the document-topic weights averaged across documents, which gives the relative usage of the topics in the dataset. The grand average weight across the 50 topics is 0.02.

Fig 25: Plot of Average Document-Topic Weight by Topic Number

The following word maps visualize the importance of various words in a few select topics. The size of the word is relative to the importance of the term. We used www.wordle.net to create these maps.

Fig 26: Topic 1: Reading is Fun

Fig 27: Topic 2: Yes We Can

Fig 28: Topic 4: Time Matters

Fig 29: Topic 6: Global Community

Fig 30: Topic 18: Science is Cool

Fig 31: Topic 19: Math Solves Problems

Using these topics, we originally attempted to predict the primary subject area of the project, as seen in our word map titles. After running the initial multinomial logistic model, which regressed topics onto the project subject, we decided that, given the number of similar categories and poor results of this analysis, we should attempt to predict the primary focus area instead. The plot below shows the range of probability estimates for predicting the primary subject area.

Fig 32: Predicting primary subject area

In contrast, regressing the topic areas onto the focus area of the project produced the following fitted probabilities.

Fig 33: Predicting focus area

While this plot shows that a few of the focus areas are difficult to predict accurately, others such as Math & Science and Literacy & Language projects are relatively easy to predict using the identified latent topics. The following table reports the loadings of a subset of the topics. The most important word in the topic is used to represent the topic.

Factor     Applied    Health &   History &  Literacy &  Math &     Music &    Special
           Learning   Sports     Civics     Language    Science    The Arts   Needs
book         -0.59      -1.50      -0.37       0.84       -0.66      -0.69      0.39
can          -0.04       0.04      -0.03      -0.05        0.02      -0.01     -0.03
year          0.13       0.18      -0.03      -0.02       -0.05       0.06     -0.12
scienc       -0.33      -0.20      -0.79      -0.97        1.36      -0.70      0.00
math         -0.31      -0.62      -0.76      -0.71        1.17      -0.84      0.08
magazin      -0.27      -0.35       0.80       0.03       -0.31      -0.22     -0.16
special      -0.19       0.02      -0.29      -0.20       -0.27      -0.31      0.87
music        -0.07       0.14      -0.23      -0.29       -0.39       1.15     -0.05
ball          0.05       0.84      -0.46      -0.45       -0.02      -0.31      0.15
art           0.03      -0.11      -0.20      -0.16       -0.44       0.96      0.06

The following word maps visualize the loadings of each of these topics using the most important word in each topic. The size of the word indicates the magnitude of the loading for that topic and focus area. Green words are positive loadings and red words are negative loadings. Also included in these plots are the effects of word count and of the Flesch-Kincaid grade level of the essay.

Fig 34: Word plot for Applied Learning

Fig 35: Word plot for Health & Sports

Fig 36: Word plot for History & Civics

Fig 37: Word plot for Literacy and Language

Fig 38: Word plot for Math & Science

Fig 39: Word plot for Music & Arts

Fig 40: Word plot for Special Needs

The multinomial logistic regression correctly classified 64.43% of primary focus areas, with classification determined by the area with the highest fitted probability.

For comparison purposes, a decision tree for classification was also developed on the same data.

Fig 41: Classification tree

This tree correctly classifies 70.66% of projects based on essays. The output for this tree is available in Appendix D.

Fig 42: Deviance vs nodes, based on cross-validation with 10 folds

Based on this graph, the tree was pruned to have 10 leaves instead of 12. The resulting tree, which is plotted below, correctly classified 67.51% of projects.

Fig 43: Classification tree

Based on the results of these tests, it appears that decision trees are better able to classify results and also provide some intuition as to how the classification is made. One drawback of the decision trees is that, no matter the tree size, no project is ever classified as an "Applied Learning" project. This may indicate that applied learning is used for projects that cannot be classified as one of the other focus areas. If more time were available, we would compare these methods using out-of-sample prediction to select the appropriate classification method from among decision trees, dyna-trees, multinomial logistic regression, and K-nearest neighbors.

We also used the topic data to attempt to identify trends in essay writing over time. This analysis was completed using partial least squares, with time regressed onto the topic weights. The following is a plot of the first two partial least squares components and the resulting correlation between the fitted values and the date of submission.
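A sketch of this step; the report does not record the PLS implementation, so the pls package's plsr() is shown, with W assumed to be the matrix of 50 document-topic weights aligned to the projects:

library(pls)
days <- as.numeric(projects$date_posted)        # posting date as days since 1970-01-01
plsfit <- plsr(days ~ W, ncomp = 2)             # regress time onto the topic weights
plot(scores(plsfit)[, 1], scores(plsfit)[, 2])  # first two PLS components
cor(fitted(plsfit)[, , 2], days)                # correlation of the 2-component fit with date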

Fig 44: Partial least squares components

From this plot, we see that there is little additional information available in the residuals. In addition, we see a dramatic shift in topics in 2007. The topics that exhibited the largest changes over time are displayed in the following word plot. Green topics have increased in importance over time while red topics have been used less over time.

Fig 45: Topic frequency changes

Based on this analysis, we plotted latent topics over time to look for changes in the usage of certain words by teachers applying for funding through DonorsChoose. The average weight of a topic in a document is 0.02 since there are 50 topics.
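A sketch of one such plot, assuming pred holds the document-topic weights computed in Appendix B and that its rows line up with projects:

qtr <- cut(projects$date_posted, breaks = "quarter")   # group essays by posting quarter
avg34 <- tapply(pred[, 34], qtr, mean)                 # average Topic 34 weight per quarter
plot(avg34, type = "l", xaxt = "n", ylab = "Average Topic 34 weight")
axis(1, at = seq_along(avg34), labels = names(avg34))  # label the quarters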

Fig 46: Document-topic weights, Topic 34

This plot shows that language related to the DonorsChoose website dropped abruptly in the second quarter of 2007. During that time, the website was being modified in preparation for its national launch and we hypothesize that changes to the website removed the need to discuss the logistical details that may have been a part of successful essays in the past.

Fig 47: Document-topic weights, Topic 6

In contrast to the changes in the DonorsChoose topic, use of Topic 6, which we've called the 'Global Community' topic, has steadily increased over the eight years of essays submitted to DonorsChoose. Topic 7, which appears to describe an eagerness to learn, has also increased over time in a similar fashion.

VI. Conclusion

With all of this information from our analyses, we have been able to see trends that cross both essay and donation data. We believe that this data will be helpful to both the teachers using DonorsChoose.org and the website itself.

The biggest factor in our funding prediction was the teacher having a high proportion of successfully funded past projects. Similarly, the school having a high proportion of fully funded projects also had predictive power. However, more projects per teacher is associated with a lower likelihood of successfully funded projects, perhaps because the teacher's network of potential donors is being spread too thin. These facts imply there are variables we currently do not have access to that might help predict a teacher's success. Independent of the school/teacher funded factors, on average projects were more likely to be funded if they were requesting books or supplies. Projects requesting technology were the least likely to be fully funded, even after accounting for the amount requested.

Another finding was that concise essays (lower word count) were more likely to be funded. Given that books are the most funded resource, it was not surprising that Reading was the most important topic (Topic 1). Based on our topic analysis, we also found an increased use over time of Topics 6 and 7 (which focus on community and aspirations).

Given these types of trends our advice to teachers would be:

Our advice to DonorsChoose.org would be:

We would recommend in the future focusing efforts in the following areas:

A. Notes From an Interview With a Teacher Who Uses DonorsChoose

Prior to starting our analysis, we interviewed Annie, a teacher who uses DonorsChoose.org to obtain materials for her classroom. Over time, she has noticed that the website has moved from a free-form system to one with more step-by-step examples and forms for teachers applying for funding. The site provides guidance on how to present your request so that the items needed and reason for the proposal are clear. In addition, the site used to suggest a donation of $5 but now leaves the donation field blank: this change has driven larger average donations.

Based on her experience, technology, math, and science get funded more often than books, speakers, and physical education supplies. This means that she needs to be more creative with history and social studies projects. Also, less expensive projects are more likely to be funded and get funded faster.

She now only posts when she is able to get a matching donation from a corporate sponsor. Most of her donations come from friends and family. She uses Facebook to drive awareness for her projects using the DonorsChoose application on Facebook. This application posts on each donor's wall as well as the teacher's wall. These posts allow friends of friends to learn about the project and has driven donations. In addition, she has received funding from people in Texas. The three Teach for America teachers that she had at her school last year have also had great success posting their projects on Facebook.

She has had a great deal of success funding her Model UN and has also received funding for laptops and prep books. She noted that her department needs technology so they have an initiative to make laptop labs via donations with a goal of creating 5 labs with 20 laptops each. Without DonorsChoose, there is no way that she would have been able to get laptops for her students.

Most first-time donors seem to know a teacher who has posted a project, but many of them come back to search for and fund interesting projects. Donations over $100 get handwritten notes from students. These thank-you notes are well received and help to drive repeat donations.

B. Code Utilized to Process Essays in R

library(RWekajars)
library(RWeka)
library(Snowball)
library(tm)
library(slam)   # row_sums, col_sums, and simple_triplet_matrix used below
library(textir)

# Open custom functions
setwd("~/My Dropbox/DMF/R Code")
source("syllable_count_funct.R")
setwd("~/41201-Data_Mining/Final/Data")

# Reading in text files
essay_in <- file("essays_Processed_v2.txt", "r", raw=TRUE) #open file for reading
essay_out <- file("essays_stemmed.txt", "w+", raw=TRUE) #open file for writing and truncate
stats_out <- file("essay_stats.csv", "w+", raw=TRUE) #open file for writing and truncate contents
shortstop <- scan("shortstoplist.txt", 'character', quiet=TRUE)

# Initialize variables
count = 0
essays_dat <- readLines(essay_in, n=1) # discard first row with headers
writeLines(essays_dat, essay_out) # write first line for consistency
close(essay_out)
writeLines("sentence_count,word_count,syllable_count", stats_out) # write the CSV header row
close(stats_out)

# Cycle until end of file is reached, processing n projects per pass
while(length(essays_dat <- readLines(essay_in, n=1000))>0) {   # read essays for 1,000 projects per pass

  # Pre-processing statistics captured
  sentences <- SentenceCount(essays_dat) # Stores the number of sentences in the essay

  # Initial Processing
  essays <- Corpus(VectorSource(essays_dat)) # Put documents in a structure for processing 
  essays <- tm_map(essays, stripWhitespace) # Remove extra spaces
  essays <- tm_map(essays, removeNumbers) # Remove all numbers
  essays <- tm_map(essays, tolower) # Set all documents to lower case
  essays <- tm_map(essays, removePunctuation) # Remove all punctuation
  
  # Intermediate Statistics Captured
  dtm_t <- DocumentTermMatrix(essays) # constructs a Document - Term matrix
  words <- row_sums(dtm_t) # counts the total number of words used
    #counts the number of syllables in each word and multiplies by the usage of each word
  syllables <- (as.matrix(dtm_t) %*% SyllableCount(colnames(dtm_t))) 
  
  # Final processing - removing stop words and stemming
  essays <- tm_map(essays, removeWords, words=shortstop)   # Remove stopwords
  essays <- tm_map(essays, stemDocument)   # Stem the documents
  essays <- tm_map(essays, stripWhitespace) # Remove extra spaces

  # Store stemmed essays in a new file
    # open file for writing, appending to the end of the file
  essay_out <- file("essays_stemmed.txt", "a", raw=TRUE) 
  writeLines(as.character(essays), essay_out)
  close(essay_out)
  
  # Store statistics in a CSV for reading later
    #open the statistics file for appending
  stats_out <- file("essay_stats.csv", "a", raw=TRUE) 
    #store results in a dataframe
  t <- data.frame(sentence_count=sentences, word_count=words, syllable_count=syllables) 
  write.table(t, stats_out, sep=",", row.names=FALSE, col.names=FALSE) # append the rows without a header
  close(stats_out)
  
  print((count <- count + 1))
}

## Note: gradelevel <- ((0.39 * words / sentences) + (11.8 * syllables / words) - 15.59)


### Post stemming processing

# Read in Project IDs
essays_IDs <- read.csv("essay_ids.csv", colClasses = c( "character" , "character" ) )

# Read in processed and stemmed documents
  #open the stemmed essays file for reading
essay_out <- file("essays_stemmed.txt", "r", raw=TRUE) 
dtm_out <- file("sparse_matrix.csv", "w", raw=TRUE) #open file for writing and truncate contents
essay <- readLines(essay_out, n=1) # Discard header row

#Read first record and initialize variables
### i = document index, j = term index, v = count in doc, terms = dictionary of terms
essays_dat <- readLines(essay_out, n=1) # read an essay
writeLines("i,j,v", dtm_out)
close(dtm_out)
t <- termFreq(Corpus(VectorSource(essays_dat))[[1]]) # calculate the term frequency
i <- rep.int(1,length(t)) # document index for the first essay
j <- c(1:length(t)) # term indices
v <- as.vector(t) # set the frequency counts
terms <- c(1:length(t)) # build a vector of term indices
names(terms) = names(t) # name the indices
doc_num = 2 # counter for the number of documents processed

### Build a document-term matrix to perform TF-IDF
while(length(essays_dat <- readLines(essay_out, n=1))>0) {   # read one project's essays at a time
  t <- termFreq(Corpus(VectorSource(essays_dat))[[1]]) # calculate the term frequencies
  i <- c(i, rep.int(doc_num,length(t))) # set the document equal to the current document number

  old <- t[names(t) %in% names(terms)] # identify the old terms
  j <- c(j, as.vector(terms[names(old)])) # use the dictionary value to set j
  
  new <- t[!(names(t) %in% names(terms))] # identify the new terms
  if(length(new)>0) {
    j <- c(j, (max(terms)+1):(max(terms)+length(new))) # store new term indices
    terms <- c(terms, (max(terms)+1):(max(terms)+length(new))) # create new term indices
      # store the new term names
    names(terms) <- c(names(terms[1:(length(terms)-length(new))]), names(new)) 
  }

  v <- c(v, as.vector(old), as.vector(new)) # store the term frequency for the document
  
  if(!(length(new)+length(old)==length(t))) { print(doc_num)}
  if(!(length(i)==length(j))) { print(doc_num)}
  
  doc_num = doc_num + 1 # cycle to the next document number
  if((doc_num %% 1000) == 0) {
    dtm_out <- file("sparse_matrix.csv", "a", raw=TRUE) # open file for writing
    writeLines(paste(i,j,v, sep=","), dtm_out)
    close(dtm_out)
    write.csv(data.frame(term=names(terms), index=as.vector(terms)), "terms.csv")
    i <- j <- v <- NULL
    print(c(doc_num, "Done"), quote=FALSE)
  }
}

dtm_out <- file("sparse_matrix.csv", "a", raw=TRUE) #open file for appending
writeLines(paste(i,j,v, sep=","), dtm_out)
close(dtm_out)
write.csv(data.frame(term=names(terms), index=as.vector(terms)), "terms.csv")


### Read files in to construct a document-term matrix, trimmed to the 1,000 most important terms

  # Read in the terms
terms_df <- read.csv("terms.csv", header=TRUE, colClasses=c("double", "character", "double")) 
  # Read in the doc-term matrix
dtm_df <- read.csv("sparse_matrix.csv", header=TRUE, colClasses=c("double", "double", "double")) 
  # Create the doc-term matrix
dtm <- simple_triplet_matrix(dtm_df$i, dtm_df$j, dtm_df$v, dimnames=list(NULL, terms_df$term))
rm(dtm_df, terms_df) # Remove temp variables from memory
tfm <- tfidf(dtm) # calculate the TF-IDF weights
csums <- col_sums(tfm) # Calculates the column sums from TF-IDF
keep <- (csums >= sort(csums, decreasing = TRUE)[1000]) # keep the top 1,000 terms by TF-IDF
# dimnames(dtm)[[2]][keep] # see terms being kept
dtm <- dtm[,keep] # drop all other terms
rm(csums, keep, tfm) # Remove temp variables from memory
gc() # Run garbage collection
# save(dtm, file="dtm.dat")


### Run the topic extraction and save a topics data file

dtm <- dtm[row_sums(dtm)>0,] # remove documents without any words
keep <- sample(1:nrow(dtm), 3000) # Create a sample of 3000 documents for topic extraction
dtms <- dtm[keep,] # Store these in a separate variable
top <- topics(dtms, 50) #calculate topics for the trimmed data set
# summary(top,10) # display topics
save(top, file="topics.dat") # Save the topics variable


### Code to find the loads for each document - run in chunks to fit in memory

load("topics.dat") # Load the topics variable from disk

pred <- NULL # A variable to store the matrix of predictions
batch <- 10000
for(n in 1:(1+ (nrow(dtm) %/% batch))) { # Iterate over the documents in batches of 10,000
  start <- (n-1)*batch+1
  end <- min(n*batch, nrow(dtm))
  dtms <- dtm[start:end, ]
  predt <- predict(top, dtms)
  pred <- rbind(pred, predt)
}

save(pred, file="predict.dat")

C. Latent Topic Output

Topic % Usage: 

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
4.9 3.2 2.8 2.7 2.7 2.6 2.5 2.5 2.5 2.4 2.4 2.3 2.3 2.3 2.3 2.3 2.2 2.2 2.1 2.1 
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
2.1 2.1 2.0 2.0 1.9 1.9 1.8 1.8 1.8 1.8 1.8 1.7 1.7 1.7 1.7 1.6 1.5 1.5 1.5 1.5 
 41  42  43  44  45  46  47  48  49  50 
1.5 1.5 1.4 1.3 1.3 1.3 1.3 1.3 1.2 1.1 

Top 15 phrases by topic-over-null log odds:

  1. 'book', 'chapter', 'librari', 'read', 'reader', 'nonfict', 'seri', 'select', 'aloud', 'lifelong', 'genr', 'hook', 'choos', 'interest', 'fiction'
  2. 'can', 'want', 'huge', 'know', 'let', 'sure', 'actual', 'see', 'more', 'best', 'make', 'noth', 'possibl', 'just', 'alreadi'
  3. 'get', 'dont', 'out', 'kid', 'could', 'seen', 'someon', 'walk', 'happen', 'wait', 'cant', 'instead', 'pick', 'run', 'everyon'
  4. 'year', 'last', 'two', 'month', 'three', 'past', 'four', 'old', 'five', 'start', 'alreadi', 'end', 'next', 'left', 'longer'
  5. 'incom', 'low', 'veri', 'much', 'realli', 'thank', 'consider', 'quit', 'generos', 'donor', 'generous', 'consid', 'appreci', 'few', 'lot'
  6. 'live', 'chang', 'care', 'dream', 'peopl', 'life', 'realiti', 'societi', 'real', 'citizen', 'passion', 'inspir', 'realiz', 'generat', 'role'
  7. 'learn', 'tag', 'touch', 'enthusiast', 'style', 'eager', 'best', 'sens', 'environ', 'basi', 'through', 'way', 'daili', 'enhanc', 'readi'
  8. 'materi', 'budget', 'item', 'cut', 'suppli', 'district', 'purchas', 'fund', 'supplement', 'due', 'necessari', 'extra', 'bright', 'list', 'pack'
  9. 'comput', 'projector', 'laptop', 'lcd', 'technolog', 'netbook', 'internet', 'powerpoint', 'onlin', 'websit', 'softwar', 'centuri', 'web', 'screen', 'smart'
  10. 'abl', 'classroom', 'should', 'have', 'energet', 'everyday', 'type', 'add', 'due', 'differ', 'imagin', 'various', 'outsid', 'without', 'donat'
  11. 'success', 'achiev', 'goal', 'increas', 'academ', 'succeed', 'motiv', 'potenti', 'reach', 'confid', 'improv', 'accomplish', 'strong', 'foundat', 'greater'
  12. 'level', 'grade', 'third', 'fourth', 'fifth', 'below', 'sixth', 'higher', 'second', 'abov', 'grader', 'lower', 'elementari', 'behind', 'progress'
  13. 'scienc', 'plant', 'microscop', 'garden', 'scientist', 'butterfli', 'cycl', 'anim', 'dissect', 'observ', 'biolog', 'investig', 'water', 'environment', 'earth'
  14. 'love', 'wonder', 'amaz', 'again', 'rememb', 'everyth', 'magic', 'favorit', 'put', 'grader', 'smile', 'break', 'truli', 'curious', 'whi'
  15. 'time', 'day', 'dure', 'minut', 'big', 'period', 'spend', 'hour', 'everi', 'lost', 'daili', 'each', 'amount', 'week', 'quick'
  16. 'children', 'parent', 'child', 'home', 'famili', 'preschool', 'night', 'send', 'kindergarten', 'pre', 'hous', 'friend', 'earli', 'singl', 'age'
  17. 'find', 'face', 'right', 'difficult', 'extrem', 'too', 'obstacl', 'feel', 'ive', 'poor', 'found', 'challeng', 'frustrat', 'seem', 'although'
  18. 'game', 'skill', 'puzzl', 'practic', 'reinforc', 'card', 'fact', 'build', 'play', 'fun', 'master', 'comprehens', 'taught', 'strategi', 'multipl'
  19. 'math', 'calcul', 'algebra', 'mathemat', 'fraction', 'manipul', 'solv', 'geometri', 'graph', 'count', 'concept', 'measur', 'pattern', 'number', 'shape'
  20. 'colleg', 'high', 'graduat', 'cours', 'scholar', 'prepar', 'chemistri', 'rate', 'career', 'obtain', 'compet', 'tradit', 'attend', 'poverti', 'expect'
  21. 'listen', 'center', 'headphon', 'cassett', 'phonic', 'fluent', 'player', 'tape', 'station', 'letter', 'rhyme', 'alphabet', 'audio', 'fluenci', 'along'
  22. 'group', 'small', 'set', 'individu', 'guid', 'town', 'block', 'whole', 'six', 'class', 'meet', 'busi', 'manag', 'effect', 'each'
  23. 'opportun', 'give', 'chanc', 'door', 'explor', 'open', 'experi', 'experienc', 'otherwis', 'familiar', 'joy', 'never', 'given', 'gift', 'eye'
  24. 'magazin', 'histori', 'subscript', 'map', 'geographi', 'global', 'histor', 'news', 'cultur', 'event', 'studi', 'sourc', 'social', 'countri', 'non'
  25. 'provid', 'resourc', 'request', 'limit', 'avail', 'check', 'lack', 'unabl', 'poverti', 'addit', 'near', 'number', 'strong', 'area', 'due'
  26. 'special', 'disabl', 'sensori', 'autism', 'instruct', 'inclus', 'educ', 'assist', 'general', 'differenti', 'regular', 'vari', 'peer', 'emot', 'function'
  27. 'divers', 'background', 'wide', 'socio', 'varieti', 'multi', 'serv', 'econom', 'expand', 'socioeconom', 'rang', 'disadvantag', 'popul', 'knowledg', 'collect'
  28. 'novel', 'citi', 'inner', 'boy', 'girl', 'middl', 'cannot', 'rais', 'club', 'afford', 'visit', 'charact', 'graphic', 'charter', 'pride'
  29. 'elmo', 'lesson', 'model', 'exampl', 'visual', 'attent', 'overhead', 'poster', 'wall', 'display', 'dvd', 'front', 'worksheet', 'clear', 'plan'
  30. 'camera', 'digit', 'photographi', 'yearbook', 'photo', 'photograph', 'camcord', 'memori', 'film', 'trip', 'flip', 'captur', 'moment', 'document', 'video'
  31. 'languag', 'english', 'dictionari', 'spanish', 'vocabulari', 'esl', 'bilingu', 'speak', 'spell', 'learner', 'word', 'sight', 'speech', 'refer', 'nativ'
  32. 'rug', 'carpet', 'chair', 'comfort', 'sit', 'tabl', 'bean', 'floor', 'place', 'space', 'seat', 'clean', 'bag', 'safe', 'desk'
  33. 'activ', 'engag', 'particip', 'meaning', 'interact', 'involv', 'fun', 'hand', 'energet', 'partner', 'bore', 'incorpor', 'focus', 'approach', 'various'
  34. 'wwwdonorschooseorg', 'html', 'fulfillmenthtm', 'href', 'fulfillwindow', 'onclick', 'openwindowhttp', 'fals', 'http', 'ship', 'fulfil', 'cost', 'return', 'target', 'carolina'
  35. 'project', 'share', 'creat', 'product', 'proud', 'produc', 'pride', 'member', 'draw', 'idea', 'name', 'dedic', 'staff', 'forward', 'communiti'
  36. 'lunch', 'free', 'reduc', 'receiv', 'african', 'qualifi', 'hispan', 'money', 'percent', 'american', 'approxim', 'popul', 'over', 'price', 'major'
  37. 'work', 'assign', 'drive', 'complet', 'homework', 'hard', 'machin', 'save', 'flash', 'finish', 'job', 'step', 'report', 'until', 'worksheet'
  38. 'curriculum', 'local', 'within', 'shown', 'base', 'implement', 'train', 'servic', 'program', 'research', 'design', 'approach', 'integr', 'report', 'grant'
  39. 'excit', 'york', 'puppet', 'spark', 'dramat', 'new', 'miss', 'bring', 'theater', 'lost', 'add', 'fall', 'forward', 'introduc', 'date'
  40. 'write', 'writer', 'journal', 'workshop', 'publish', 'poetri', 'author', 'stori', 'sentenc', 'illustr', 'edit', 'written', 'tell', 'piec', 'idea'
  41. 'literatur', 'text', 'copi', 'theme', 'relat', 'critic', 'content', 'discuss', 'expos', 'connect', 'relev', 'deal', 'issu', 'aspect', 'rich'
  42. 'sharpen', 'eras', 'pencil', 'board', 'dri', 'marker', 'paper', 'easel', 'whiteboard', 'crayon', 'white', 'glue', 'pen', 'electr', 'stick'
  43. 'recycl', 'organ', 'keep', 'bin', 'storag', 'pocket', 'qualiti', 'chart', 'full', 'pad', 'magnet', 'folder', 'store', 'easili', 'stand'
  44. 'music', 'instrument', 'drum', 'band', 'guitar', 'song', 'danc', 'sing', 'perform', 'cds', 'play', 'classic', 'talent', 'theater', 'movement'
  45. 'question', 'system', 'person', 'binder', 'respons', 'answer', 'immedi', 'track', 'profession', 'reflect', 'solut', 'thought', 'human', 'assess', 'portfolio'
  46. 'tool', 'microphon', 'voic', 'lamin', 'speaker', 'task', 'proper', 'record', 'difficulti', 'simpl', 'light', 'power', 'hear', 'smart', 'easier'
  47. 'ball', 'cook', 'physic', 'healthi', 'sport', 'team', 'health', 'fit', 'recess', 'exercis', 'eat', 'jump', 'cooper', 'equip', 'balanc'
  48. 'art', 'paint', 'artist', 'ipod', 'clay', 'creativ', 'fine', 'motor', 'express', 'techniqu', 'talent', 'draw', 'style', 'piec', 'movement'
  49. 'test', 'state', 'standard', 'score', 'exam', 'cover', 'review', 'financi', 'pass', 'requir', 'notebook', 'basic', 'act', 'textbook', 'reward'
  50. 'printer', 'print', 'self', 'communic', 'ink', 'behavior', 'devic', 'contain', 'esteem', 'posit', 'reward', 'contribut', 'emot', 'pictur', 'color'

D. Decision Tree Output

Tree for Predicting Project Funding Before Pruning:

Regression tree:
tree(formula = funded ~ ., data = projects[train, c(8, 11:19, 
    22:26, 35:92)])
Variables actually used in tree construction:
[1] "teacher_funded"                         "total_price_excluding_optional_support"
[3] "total_price_including_optional_support"
Number of terminal nodes:  6 
Residual mean deviance:  0.0393 = 72.86 / 1854 
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.9167  0.0000  0.0000  0.0000  0.0000  0.9801

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1860 273.400 0.8210  
   2) teacher_funded < 0.683333 418  86.390 0.2919  
     4) teacher_funded < 0.366667 201   3.920 0.0199 *
     5) teacher_funded > 0.366667 217  53.830 0.5438  
      10) total_price_excluding_optional_support < 254.345 48   3.667 0.9167 *
      11) total_price_excluding_optional_support > 254.345 169  41.600 0.4379 *
   3) teacher_funded > 0.683333 1442  36.050 0.9743  
     6) teacher_funded < 0.933036 166  28.750 0.7771  
      12) total_price_including_optional_support < 672.85 127  13.980 0.8740 *
      13) total_price_including_optional_support > 672.85 39   9.692 0.4615 *
     7) teacher_funded > 0.933036 1276   0.000 1.0000 *

Tree for Predicting Project Funding After Pruning:

Regression tree:
snip.tree(tree = tr, nodes = c(5, 3))
Variables actually used in tree construction:
[1] "teacher_funded"
Number of terminal nodes:  3 
Residual mean deviance:  0.04915 = 2480 / 50470 
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.96250  0.03752  0.03752  0.00000  0.03752  0.99980

node), split, n, deviance, yval
      * denotes terminal node

1) root 50470 7675.0000 0.8129000  
  2) teacher_funded < 0.585714 9849 1553.0000 0.1962000  
    4) teacher_funded < 0.225 5789    0.9998 0.0001727 *
    5) teacher_funded > 0.225 4060 1013.0000 0.4756000 *
  3) teacher_funded > 0.585714 40621 1467.0000 0.9625000 *

Tree for Predicting Project Focus Area Before Pruning:

Classification tree:
tree(formula = focus ~ d, mincut = 1)
Variables actually used in tree construction:
[1] "d.book"    "d.scienc"  "d.math"    "d.ball"    "d.music"   "d.art"    
[7] "d.magazin" "d.special"
Number of terminal nodes:  12 
Residual mean deviance:  1.9 = 520400 / 274000 
Misclassification error rate: 0.2934 = 80386 / 273995

  1) root 273995 851600 Literacy & Language ( 0.0840015 0.0267925 0.0557565 0.4438074 0.2394095 
0.0914652 0.0587675 )  
    2) t.book < 0.0316299 177635 607100 Math & Science ( 0.1114476 0.0402511 0.0595097 0.2576576 
0.3312129 0.1293777 0.0705435 )  
      4) t.scienc < 0.0538795 148229 525600 Literacy & Language ( 0.1275864 0.0465159 0.0679624 
0.3039554 0.2223789 0.1510231 0.0805780 )  
        8) t.math < 0.0521466 124446 439700 Literacy & Language ( 0.1447535 0.0543449 0.0794642 
0.3490028 0.1070987 0.1771130 0.0882230 )  
         16) t.ball < 0.0591003 113674 378200 Literacy & Language ( 0.1447561 0.0078294 0.0854285 
0.3777381 0.1093038 0.1887679 0.0861763 )  
           32) t.music < 0.0755316 103603 341000 Literacy & Language ( 0.1557580 0.0082334 
0.0926228 0.4093993 0.1191182 0.1225351 0.0923332 )  
             64) t.art < 0.060396 92189 290700 Literacy & Language ( 0.1667661 0.0089599 
0.1002723 0.4447711 0.1311111 0.0508737 0.0972459 )  
              128) t.magazin < 0.0488028 80659 242700 Literacy & Language ( 0.1793600 0.0098811 
0.0380367 0.4709580 0.1422284 0.0533480 0.1061878 )  
                256) t.special < 0.0469463 68022 193400 Literacy & Language ( 0.1911293 0.0098645 
0.0411485 0.5055864 0.1529064 0.0594073 0.0399577 )*
                257) t.special > 0.0469463 12637  34920 Special Needs ( 0.1160085 0.0099707 
0.0212867 0.2845612 0.0847511 0.0207328 0.4626889 )*
              129) t.magazin > 0.0488028 11530  29680 History & Civics ( 0.0786644 0.0025152 
0.5356461 0.2615785 0.0533391 0.0335646 0.0346921 )*
             65) t.art > 0.060396 11414  23960 Music & The Arts ( 0.0668477 0.0023655 0.0308393 
0.1237077 0.0222534 0.7013317 0.0526546 ) *
           33) t.music > 0.0755316 10071  11720 Music & The Arts ( 0.0315758 0.0036739 0.0114189 
0.0520306 0.0083408 0.8701221 0.0228379 ) *
         17) t.ball > 0.0591003 10772  30760 Health & Sports ( 0.1447271 0.5452098 0.0165243 
0.0457668 0.0838284 0.0541218 0.1098218 ) *
        9) t.math > 0.0521466 23783  34400 Math & Science ( 0.0377581 0.0055502 0.0077787 
0.0682420 0.8255897 0.0145062 0.0405752 ) *
      5) t.scienc > 0.0538795 29406  33850 Math & Science ( 0.0300959 0.0086717 0.0169013 
0.0242808 0.8798204 0.0202680 0.0199619 ) *
    3) t.book > 0.0316299 96360 164400 Literacy & Language ( 0.0334060 0.0019822 0.0488377 
0.7869655 0.0701743 0.0215753 0.0370589 )  
      6) t.scienc < 0.0541045 91154 139900 Literacy & Language ( 0.0344472 0.0020295 0.0498936 
0.8187244 0.0344362 0.0224455 0.0380236 )  
       12) t.magazin < 0.0556062 79116 103700 Literacy & Language ( 0.0367182 0.0020729 0.0106679 
0.8512185 0.0352900 0.0232949 0.0407377 )  
         24) t.math < 0.0489715 76092  88280 Literacy & Language ( 0.0368896 0.0021159 0.0108553 
0.8723125 0.0135231 0.0238396 0.0404642 ) *
         25) t.math > 0.0489715 3024   6153 Math & Science ( 0.0324074 0.0009921 0.0059524 
0.3204365 0.5830026 0.0095899 0.0476190 ) *
       13) t.magazin > 0.0556062 12038  24180 Literacy & Language ( 0.0195215 0.0017445 0.3076923 
0.6051670 0.0288254 0.0168633 0.0201861 ) *
      7) t.scienc > 0.0541045 5206   9152 Math & Science ( 0.0151748 0.0011525 0.0303496 
0.2308874 0.6959278 0.0063388 0.0201690 ) *

Tree for Predicting Project Focus Area After Pruning:

Classification tree:
snip.tree(tree = topic_tree, nodes = c(3, 64))
Variables actually used in tree construction:
[1] "d.book"   "d.scienc" "d.math"   "d.ball"   "d.music"  "d.art"   
Number of terminal nodes:  7 
Residual mean deviance:  2.153 = 589800 / 274000 
Misclassification error rate: 0.3249 = 89012 / 273995

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 273995 851600 Literacy & Language ( 0.084002 0.026792 0.055756 0.443807 0.239409 
0.091465 0.058767 )  
   2) d.book < 0.0316299 177635 607100 Math & Science ( 0.111448 0.040251 0.059510 0.257658 
0.331213 0.129378 0.070544 )  
     4) d.scienc < 0.0538795 148229 525600 Literacy & Language ( 0.127586 0.046516 0.067962 
0.303955 0.222379 0.151023 0.080578 )  
       8) d.math < 0.0521466 124446 439700 Literacy & Language ( 0.144754 0.054345 0.079464 
0.349003 0.107099 0.177113 0.088223 )  
        16) d.ball < 0.0591003 113674 378200 Literacy & Language ( 0.144756 0.007829 0.085429 
0.377738 0.109304 0.188768 0.086176 )  
          32) d.music < 0.0755316 103603 341000 Literacy & Language ( 0.155758 0.008233 0.092623 
0.409399 0.119118 0.122535 0.092333 )  
            64) d.art < 0.060396 92189 290700 Literacy & Language ( 0.166766 0.008960 0.100272 
0.444771 0.131111 0.050874 0.097246 ) *
            65) d.art > 0.060396 11414  23960 Music & The Arts ( 0.066848 0.002366 0.030839 
0.123708 0.022253 0.701332 0.052655 ) *
          33) d.music > 0.0755316 10071  11720 Music & The Arts ( 0.031576 0.003674 0.011419 
0.052031 0.008341 0.870122 0.022838 ) *
        17) d.ball > 0.0591003 10772  30760 Health & Sports ( 0.144727 0.545210 0.016524 0.045767 
0.083828 0.054122 0.109822 ) *
       9) d.math > 0.0521466 23783  34400 Math & Science ( 0.037758 0.005550 0.007779 0.068242 
0.825590 0.014506 0.040575 ) *
     5) d.scienc > 0.0538795 29406  33850 Math & Science ( 0.030096 0.008672 0.016901 0.024281 
0.879820 0.020268 0.019962 ) *
   3) d.book > 0.0316299 96360 164400 Literacy & Language ( 0.033406 0.001982 0.048838 0.786966 
0.070174 0.021575 0.037059 ) *