A Tree Grows in Brooklyn: Predicting Health of The Big Apple’s Trees

Eli Fulton
3 min read · Sep 25, 2020

Like many cities in the United States, New York City has a department of parks and recreation. This department, with the aid of volunteers and private environmental organizations, is responsible for monitoring the health of trees across all five boroughs. In this build, I examine three years of census data (1995, 2005, and 2015) and document which traits matter most when predicting tree health. The findings can help the parks and recreation department decide whether to focus on the trees themselves or on the locations where those trees are planted.

The main takeaways are:

  • By a very large margin, species is the most important factor when predicting the health of a tree (according to basic feature importances), followed by the tree’s diameter
  • Taking a look at location, the next most important factor after species and diameter, boroughs such as the Bronx and Manhattan appear to be poor locations for trees, while Staten Island appears to be excellent despite its smaller sample size

Modeling and Feature Importances

All three dataframes have a ‘status’ column that describes a tree’s condition. The majority class is ‘Good’, which occurs about 40.8 percent of the time, so a model that always predicts ‘Good’ would score an accuracy of 0.408. That makes for a deceptively low baseline. I say deceptively because cracking it is actually harder than it looks.
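To make the baseline concrete, here is a minimal sketch of how a majority-class accuracy is computed. The labels below are a made-up toy list, not the census data, and the variable names are my own for illustration; on a real dataframe column, pandas' value_counts(normalize=True) gives the same proportions.

```python
from collections import Counter

# Toy labels for illustration only; the real 'status' column is far larger
# and its exact set of labels is not reproduced here.
status = ["Good", "Good", "Fair", "Poor", "Good",
          "Fair", "Dead", "Good", "Poor", "Fair"]

# Count each class and pick the most frequent one.
counts = Counter(status)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is right this fraction of the time.
baseline_accuracy = majority_count / len(status)

print(majority_class, baseline_accuracy)
```

Any model worth keeping has to score above this number on held-out data.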

First, after performing a random train-test split, I tried a basic logistic regression model with ordinal encoding and a SimpleImputer. Even though the bar was only 0.408, the model’s accuracy on the test set fell short of it.

So I tried a RandomForestClassifier model next, again with ordinal encoding and a SimpleImputer. I ended up setting max_depth to 15 after some hand-tinkering so that my model would beat the baseline by a wide margin (0.79 on both training and test data), but would also take a reasonable amount of time and memory to run. After fitting, I plotted a basic feature importances bar graph:

Now we can see that the species of a tree, which corresponds to the spc_common and spc_latin features, is the most important feature for predictions by a country mile. The diameter of the tree, tree_dbh, has the highest importance out of the non-species features.
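The random forest setup described above can be sketched roughly as follows. This is not my exact notebook code: the data is synthetic, the toy labelling rule and the labels other than 'Good' are invented, and I am assuming scikit-learn's own OrdinalEncoder here (a different encoder library would work just as well). Only the column names spc_common, borough, tree_dbh, and status, plus max_depth=15, come from the build itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the census data (NOT real rows).
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "spc_common": rng.choice(["oak", "maple", "pine"], size=n),
    "borough": rng.choice(["Queens", "Bronx", "Staten Island"], size=n),
    "tree_dbh": rng.integers(1, 40, size=n).astype(float),
})
df.loc[rng.random(n) < 0.05, "tree_dbh"] = np.nan  # a few missing diameters

# Invented labelling rule so the toy model has something to learn.
df["status"] = np.where(
    df["spc_common"] == "oak", "Good",
    np.where(df["tree_dbh"] > 20, "Fair", "Poor"),
)

cat_cols = ["spc_common", "borough"]
num_cols = ["tree_dbh"]
features = cat_cols + num_cols

# Random train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["status"], test_size=0.2, random_state=42
)

# Ordinal-encode the categorical columns, pass the numeric one through,
# then impute whatever is still missing before the forest sees it.
preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                           unknown_value=-1), cat_cols),
    ("num", "passthrough", num_cols),
])
model = make_pipeline(
    preprocess,
    SimpleImputer(),
    RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42),
)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The transformer outputs columns in the order cat_cols + num_cols,
# so the importances line up with `features`.
forest = model.named_steps["randomforestclassifier"]
importances = dict(zip(features, forest.feature_importances_))

print(train_accuracy, test_accuracy)
print(sorted(importances.items(), key=lambda kv: kv[1], reverse=True))
```

Capping max_depth keeps each tree small, which is what holds run time and memory down on a dataset with hundreds of thousands of rows. The importances printed here only reflect the invented rule in the toy data, not the real ranking.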

To The 5 Boroughs

The next most important features, boro_ct and borough, relate to the borough that the trees are located in. Here are two folium choropleths of NYC showing the percentage of good trees (Map 1) and the percentage of bad trees (Map 2), divided by borough:

Map 1
Map 2

Darker means a higher percentage in both maps, be it good or bad trees.

Sample size is important to keep in mind here; Queens has the highest quantity of trees with 440,714, followed by Brooklyn with 250,183, the Bronx with 101,108, Manhattan with 91,292, and finally Staten Island with 74,237.

With this perspective, we can see that Staten Island, which is comparable in sample size to Manhattan or even the Bronx, has a significantly higher percentage of good trees and a much lower percentage of bad trees than both of them. More towards the middle of the spectrum in terms of percentage, but with a higher number of trees, Brooklyn has a slight edge on Queens, although Queens has almost twice the number of trees Brooklyn does.
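The per-borough comparison boils down to a count and a percentage for each borough. Here is a small sketch of that calculation; the records are hypothetical, the function name status_share_by_borough is my own, and which labels count as "bad" is left as a parameter rather than assumed.

```python
from collections import defaultdict

# Hypothetical (borough, status) pairs, made up for illustration.
records = [
    ("Staten Island", "Good"), ("Staten Island", "Good"),
    ("Staten Island", "Good"), ("Staten Island", "Poor"),
    ("Manhattan", "Good"), ("Manhattan", "Fair"),
    ("Manhattan", "Poor"), ("Manhattan", "Poor"),
    ("Queens", "Good"), ("Queens", "Good"),
    ("Queens", "Fair"), ("Queens", "Poor"),
]


def status_share_by_borough(records, label):
    """Return {borough: (tree_count, share of trees whose status == label)}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for borough, status in records:
        totals[borough] += 1
        if status == label:
            matches[borough] += 1
    return {b: (totals[b], matches[b] / totals[b]) for b in totals}


good_share = status_share_by_borough(records, "Good")

# Print boroughs from highest to lowest share, alongside the sample size.
for borough, (count, share) in sorted(
    good_share.items(), key=lambda kv: kv[1][1], reverse=True
):
    print(f"{borough}: {count} trees, {share:.0%} Good")
```

Reporting the tree count next to each percentage keeps the sample-size caveat visible when boroughs are compared.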

Conclusion

By and large, when it comes to bettering the health of NYC’s trees, planting species that are more suited to the local climate (NYC is notorious for being egregiously hot in the summer and cold in the winter) is the way to go. Bioengineers could use models similar to this one to estimate how well new species will fare when they’re planted in urban areas.
