I’ve been setting up a data mining framework for the MIT Open Access collection and testing some initial, naive analysis runs. Below are the results of a word count run over the entire content of the OA collection. (The list is severely truncated because WordPress freaked out when I handed it the whole thing… because it’s awesome.) Clearly there are a bunch of additions to make to the stop-words list, and a few interesting blips to investigate.
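For anyone curious what a run like this looks like, here is a minimal Python sketch of a stop-word-filtered word count. It is not the framework's actual code: the names `STOP_WORDS`, `count_words`, and `format_counts` are made up for illustration, the stop-word list is a placeholder, and the tokenizer is an assumption. The four-character minimum is a guess based on the fact that no shorter token appears in the list below.

```python
import re
from collections import Counter

# Placeholder stop-word list (an assumption; the real list is not shown in the post).
STOP_WORDS = {"this", "that", "with", "from", "have", "which"}

# Assumed tokenizer: runs of lowercase letters, digits, and underscores.
TOKEN_RE = re.compile(r"[a-z0-9_]+")


def count_words(documents, stop_words=STOP_WORDS, min_length=4):
    """Count tokens across an iterable of document strings.

    Tokens shorter than min_length or present in stop_words are skipped.
    """
    counts = Counter()
    for text in documents:
        for token in TOKEN_RE.findall(text.lower()):
            if len(token) >= min_length and token not in stop_words:
                counts[token] += 1
    return counts


def format_counts(counts, limit=None):
    """Return 'word, count' lines, most frequent first."""
    return [f"{word}, {n}" for word, n in counts.most_common(limit)]


if __name__ == "__main__":
    docs = [
        "The data model uses data.",
        "This model will fit the data",
    ]
    for line in format_counts(count_words(docs), limit=2):
        print(line)
```

Extending the stop-word list is then just a matter of adding entries to the set and re-running the count. For a collection this size the counting would presumably be distributed, but the per-document logic stays the same.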

In the coming months, I’ll be working on a number of exploratory data analysis (EDA) projects, discovery interfaces, complex data objects, graphs, etc. Watch this space for details.

0000, 118739

data, 102927

using, 90690

time, 88965

model, 87614

will, 86509

between, 83933

figure, 72830

other, 71166

used, 66665

number, 60452

then, 60077

results, 58688

when, 57849

phys, 55129

cell, 53733

university, 52594

publisher, 52411

function, 51740

analysis, 51676

system, 51500

cells, 51493

state, 49607

high, 49290

first, 48682

section, 48373

2009, 48312

based, 47998

energy, 47963

author, 47323

agreement, 47125

given, 46486

over, 45379

same, 45173

shown, 44474

after, 43440

different, 42725

case, 42213

under, 42157

2010, 41450

however, 38996

2008, 37723

order, 37502

table, 36521

value, 36363

rate, 36062

values, 35830

large, 35773

information, 35627

thus, 35037

while, 34587

2007, 34150

well, 33992

three, 33968

signal, 33591

since, 33233

single, 32920

because, 32812

most, 32721

2000, 32577

null, 32412

manuscript, 31927

distribution, 31920

mass, 31765

physical, 31679

article, 31637

control, 31630

field, 31607

2006, 31349

terms, 30759

observed, 30547

within, 30256

during, 29765

through, 29714

2011, 29693

surface, 29662

work, 29557

models, 29308

events, 29275

level, 29033

research, 28843

similar, 28807

systems, 28729

small, 28725

point, 28582

journal, 28548

study, 28448

protein, 28375

show, 28357

structure, 28316

example, 28267

following, 28111

algorithm, 28075

form, 27731

total, 27418

second, 27113

phase, 26824

above, 26817

2005, 26719

publication, 26705

effect, 26594

power, 26544

line, 26522

those, 26423

without, 26413

available, 26405

expression, 26382

process, 26294

institute, 26146

region, 26070

therefore, 25732

does, 25722

result, 25618

subject, 25527

here, 25523

found, 25349

states, 25314

shows, 25258

problem, 24955

review, 24868

method, 24743

conditions, 24713

network, 24606

mean, 24452

parameters, 24407

range, 24330

space, 24264

theory, 24080

paper, 24052

sample, 23810

could, 23730

2004, 23693

current, 23686

error, 23674

size, 23645

license, 23634

average, 23612

physics, 23150

described, 23143

approach, 23135

methods, 23064

articles, 23034

type, 23022

general, 22909

gene, 22762

massachusetts, 22680

effects, 22572

technology, 22560

frequency, 22550

human, 22526

measured, 22395

higher, 22364

further, 22269

ieee, 22150

many, 22123

note, 22045

obtained, 21896

probability, 21762

tion, 21744

scale, 21667

possible, 21637

science, 21539

even, 21234

temperature, 21142

ctcf_known1, 21124

lower, 21005

date, 20978

performance, 20869

policy, 20853

linear, 20805

factor, 20760

genes, 20624

matrix, 20553

2003, 20531

solution, 20344

initial, 20312

including, 20292

very, 20247

density, 20230

response, 20150

associated, 20108

flow, 19972

defined, 19925

present, 19909

cross, 19781

ratio, 19658

background, 19593

respectively, 19552

must, 19537

before, 19473

standard, 19467

consider, 19248

changes, 19182

relative, 19179

design, 19152

specific, 19119

part, 19094

studies, 19034

change, 18936

2012, 18931

authors, 18925

potential, 18814

corresponding, 18755

like, 18691

group, 18687

experiments, 18670

test, 18620

version, 18442

compared, 18436

length, 18246

2002, 18217

lett, 18182

significant, 18095

measurements, 17905

source, 17858

functions, 17816

additional, 17718

less, 17613

experimental, 17506

important, 17501

long, 17481

binding, 17442

cambridge, 17412

expected, 17275

prime, 17270

provide, 17236

page, 17168

light, 17084

sequence, 17057

channel, 17036

regions, 16959

properties, 16944

vector, 16943

particular, 16926

multiple, 16926

image, 16893

required, 16891

either, 16882

samples, 16835

constant, 16794

activity, 16785

follows, 16740

equation, 16722

increase, 16692

chem, 16684

levels, 16584

several, 16506

known, 16457

parameter, 16442

volume, 16400

whether, 16284

points, 16282

term, 16248

related, 16183

department, 16163

limit, 16121

local, 16070

effective, 16055

limited, 16052

least, 15863

right, 15859

below, 15844

need, 15840

independent, 15802

random, 15800

zero, 15724

published, 15685

bound, 15650

four, 15590

across, 15567

final, 15552

layer, 15552

complex, 15509

along, 15469

noise, 15382

positive, 15304

input, 15302

although, 15273

2001, 15224

much, 15210

fact, 15185

full, 15182

rights, 15175

performed, 15155

consistent, 15136

theorem, 15130

lines, 15029

free, 14966

applied, 14936

cost, 14904

optimal, 14884

find, 14874

water, 14832

obtain, 14774

growth, 14716

termination, 14696

larger, 14682

2500, 14547

event, 14539

uncertainty, 14521

times, 14504

pubmed, 14488

step, 14481

addition, 14448

maximum, 14386

proof, 14313

particle, 14284

condition, 14281

open, 14263

estimate, 14166

proteins, 14146

previous, 14118

target, 14108

individual, 14066

cases, 14063

mice, 14012

center, 13952

distance, 13933

respect, 13871

nature, 13859

left, 13836

behavior, 13764

dependent, 13632

wave, 13597

measurement, 13569

interaction, 13562

proc, 13562

production, 13561

interactions, 13534

make, 13531

apply, 13477

simple, 13468

quantum, 13454

determined, 13427

national, 13422

simulation, 13420

previously, 13387

fixed, 13379

role, 13269

difference, 13261

color, 13210

1000, 13200

node, 13180

measure, 13162

area, 13088

genome, 13056

electron, 12973

resolution, 12971

every, 12936

output, 12869

rates, 12819

networks, 12811

mode, 12793

real, 12760

engineering, 12743

period, 12718

domain, 12594

induced, 12566

next, 12558

nodes, 12522

society, 12522

calculated, 12515

american, 12483

position, 12457

site, 12375

them, 12357

near, 12326

comparison, 12276

factors, 12267

d8cwct, 12255

online, 12248

features, 12246

spin, 12208

dimensional, 12199

variables, 12197

experiment, 12191

boundary, 12173

being, 12146

presented, 12138

development, 12126

global, 12121

provided, 12110

dynamics, 12110

access, 12107

optical, 12053

search, 12020

resulting, 11996

loss, 11976

likely, 11911

components, 11902

generated, 11872

processes, 11861

various, 11852

efficiency, 11836

might, 11782

best, 11779

decay, 11767

made, 11754

future, 11740

sites, 11684

spectrum, 11668

contrast, 11627

graph, 11600

correlation, 11552

images, 11486

negative, 11471

increased, 11453

velocity, 11448

component, 11443

1999, 11442

fraction, 11426

assume, 11404

presence, 11384

variable, 11382

prior, 11370

functional, 11347

width, 11345

reported, 11319

among, 11296

hence, 11272

take, 11263

directly, 11232

significantly, 11211

observations, 11169

selection, 11162

reduced, 11148

transition, 11118

upper, 11112

another, 11106

action, 11048

account, 11007

cancer, 11002

applications, 10978

formation, 10930

distributions, 10912

sets, 10912

increases, 10910

structures, 10908

learning, 10893

lemma, 10890

support, 10880

direction, 10872

1998, 10864

include, 10831

statistical, 10777

object, 10720

materials, 10711

strong, 10693

3333, 10631

copyright, 10623

provides, 10622

good, 10548

equilibrium, 10488

italy, 10485

particles, 10480

side, 10474

estimated, 10471

molecular, 10468

flux, 10457

spatial, 10454

evidence, 10451

path, 10444

direct, 10424

class, 10411

rather, 10409

entity, 10378

basis, 10367

derived, 10358

link, 10351

algorithms, 10349

together, 10336

biol, 10332

product, 10293

edge, 10277

errors, 10269

detection, 10221

original, 10203

dynamic, 10187

accepted, 10187

normal, 10168

smaller, 10159

differences, 10158

peak, 10144

main, 10137

journals, 10122

determine, 10097

program, 10087

require, 10086

finally, 10065

neurons, 10065

hand, 10046

processing, 10035