As one of the co-authors of the upcoming CCS forum (stay tuned this week, it should be great!), I've proposed looking at spam generation, and in particular at how auto-generated pseudo-texts can be used to evade content-based spam detectors.
I don't want to spoil things by writing too much now, since the forum will go live sometime this week. For now, I'll just say I spent a few hours this evening coming up with a CCS pseudo-text generator. I'll be explaining a bit more about how it works in the coming days, but the code is devilishly simple thanks to an existing Python library (nltk). To build it, I gathered:
- The text the co-authors and I have generated while brainstorming for the forum (this includes e-mails and wiki content)
- Recent CCS articles published by the invited forum cohosts
- Twitter comments posted using the #critcode hashtag in the last few days
All told, I ended up with 150 KB worth of text. That's not a lot by the standards of any modern corpus, but it's still a substantial amount of language generated in the past few weeks. Some more statistics:
That's actually really small for a corpus, unfortunately, but it'll get us started. For those who are curious, here are the top 20 words by raw count (I removed stop words but didn't stem, so "program" and "programming" show up as separate entries).
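For anyone who wants to try this kind of count on their own text, here is a minimal sketch of the idea using only Python's standard library. It is not the script I actually ran: the `top_words` helper, the tiny `STOP_WORDS` set, and the sample sentence are all made up for illustration, and a real run would want a much fuller stop list (nltk ships one in its stopwords corpus).

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not the list used for the post).
# A fuller one is available via nltk.corpus.stopwords.words("english").
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "we", "i",
}

def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words by raw count, without stemming."""
    # Lowercase the text and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Count every token that is not a stop word. No stemming is applied,
    # so "program" and "programming" are counted separately.
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Hypothetical sample text, just to show the call.
sample = "The program is code and programming is the reading of code in the program"
for word, count in top_words(sample, n=4):
    print(word, count)
```

Skipping the stemmer keeps the code short, at the cost of splitting counts across related forms of the same word.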