I’ve been setting up a data mining framework for the MIT Open Access collection and testing some simple, naive analysis runs on it. Below are the results of a word count run aggregated over the entire content of the OA collection. (It is severely truncated because WordPress freaked out when I handed it the whole list… because it’s awesome.) Clearly there are a bunch of words that still need to go into the stop-words list, and a few interesting blips to investigate.
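For anyone curious what a naive word count with a stop-words list looks like, here is a minimal sketch in Python. It is not the framework's actual code: the `count_words` function, the tiny `STOP_WORDS` set, and the sample strings are all made up for illustration, and the tokenizer is deliberately crude (lowercase, ASCII letters only).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

# Crude tokenizer: runs of ASCII letters only, so "it's" becomes "it" and "s".
TOKEN_RE = re.compile(r"[a-z]+")


def count_words(documents, stop_words=STOP_WORDS):
    """Aggregate word counts over an iterable of document strings."""
    counts = Counter()
    for text in documents:
        tokens = TOKEN_RE.findall(text.lower())
        # Skip anything on the stop-word list before counting.
        counts.update(t for t in tokens if t not in stop_words)
    return counts


if __name__ == "__main__":
    # Made-up sample text standing in for the real collection.
    sample = ["The cat and the hat", "A cat in a hat"]
    for word, n in count_words(sample).most_common(5):
        print(word, n)
```

Words that show up near the top of `most_common()` but carry no meaning are the candidates for the stop-words list, which is how the additions mentioned above get spotted.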

In the coming months, I’ll be working on a number of exploratory data analysis (EDA) projects, discovery interfaces, complex data objects, graphs, etc. Watch this space for details.

