Visualizing Ingredient Networks in 498,243 Recipes

I’ve spent the past month or so studying the co-citation network analyses done by Neal Caren, Kieran Healy, and Jonathan Goodwin, especially Goodwin's work with the Signs at 40 project. The graphs aren’t (too?) difficult to understand: the nodes (circles) represent sources (Doe, J. “Some really important article”) and the edges (lines) connecting the nodes show that two sources have been cited together. Some thresholds are set: a source must be cited more than a certain number of times within the corpus, and for two sources to be connected they must be cited together more than X times. I can see the technique being useful for getting an idea of what a set of people are talking about, at least in the context of a specific journal or set of journals. 
 
I want, however, to see if this technique could be useful in analyzing material beyond citations. More specifically, I want to see what information can be gleaned by developing a “co-ingredient network” for recipes. I assume, perhaps naïvely, that an ingredient list isn’t too different from a bibliography or works cited page, and that individual ingredients aren’t too different from citations.
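 
To make the analogy concrete, here is a minimal sketch of how co-ingredient edges might be counted under the two thresholds described above. It is not the code behind this post: the function name `co_ingredient_edges`, the parameters `min_uses` and `min_together`, and the four toy recipes are all invented for illustration, and the thresholds here are inclusive ("at least") rather than strict.

```python
from collections import Counter
from itertools import combinations


def co_ingredient_edges(recipes, min_uses=2, min_together=2):
    """Return {(a, b): count} for ingredient pairs that pass both thresholds."""
    # Count the number of recipes each ingredient appears in;
    # set() keeps an ingredient listed twice in one recipe from counting twice.
    uses = Counter(ing for recipe in recipes for ing in set(recipe))
    keep = {ing for ing, n in uses.items() if n >= min_uses}

    pairs = Counter()
    for recipe in recipes:
        # Sorting gives each pair one canonical order, so (a, b) and (b, a) merge.
        kept = sorted(set(recipe) & keep)
        pairs.update(combinations(kept, 2))

    # Keep only pairs that co-occur in at least min_together recipes.
    return {pair: n for pair, n in pairs.items() if n >= min_together}


# Toy data, made up for this example.
recipes = [
    ["salt", "butter", "flour"],
    ["salt", "butter", "eggs"],
    ["salt", "olive oil", "garlic"],
    ["vodka", "tomato juice", "celery"],
]
edges = co_ingredient_edges(recipes)
print(edges)
```

With these toy recipes only salt and butter clear both thresholds, so the network has a single edge. Each surviving pair becomes an edge whose weight is the number of recipes the two ingredients share.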
 
While the aforementioned scholars are grabbing data from Web of Science, I opted for food.com. The choice was relatively arbitrary: allrecipes, Food Network (connected with but not actually food.com), or Epicurious would have been reasonable choices as well. Scraping the site took much longer than expected; my little MacBook Air was chugging along for a good week before it finally finished collecting all the recipes. In the end, I had a 132.1 MB text file that contained 498,243 recipes and 672,589 semi-unique ingredients.
 
The first problem was the “semi-unique” ingredients. Food.com doesn't appear to have strict style guides for their recipes. "Cream of mushroom soup" appears in 3,928 recipes, but "condensed cream of mushroom soup" also appears in 648 recipes, "cream of mushroom soup, undiluted" is used 212 times, and "condensed cream of mushroom soup, undiluted" shows up in 149 recipes. Are these unique ingredients? Should they all be considered “mushroom soup,” “condensed mushroom soup,” or just simply “mushrooms”? I haven’t decided what to do about this yet.
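 
As a rough illustration of one option (not a decision, and not something applied to the data in this post), the variants could be collapsed by stripping a hand-picked list of modifiers. The `MODIFIERS` tuple and the `normalize` function below are hypothetical, and which words belong on such a list is exactly the judgment call described above.

```python
import re

# Hypothetical list of modifiers to strip; choosing these words is a judgment call.
MODIFIERS = ("condensed", "undiluted")


def normalize(name):
    """Lowercase an ingredient name and drop the modifiers listed above."""
    name = name.lower()
    for word in MODIFIERS:
        # \b limits the match to whole words.
        name = re.sub(rf"\b{word}\b", " ", name)
    # Remove leftover commas, then collapse runs of whitespace.
    name = name.replace(",", " ")
    return " ".join(name.split())


variants = [
    "cream of mushroom soup",
    "condensed cream of mushroom soup",
    "cream of mushroom soup, undiluted",
    "condensed cream of mushroom soup, undiluted",
]
print({normalize(v) for v in variants})
```

All four variants reduce to "cream of mushroom soup" under this rule. The cost is that every stripped word is a distinction thrown away, so a list like this would need to be built with care.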
 
My second problem was one of magnitude. Early attempts at visualizing my results came out clean when I was only working with 20 recipes: 
 
 
But once I started using more recipes and more ingredients, everything got a bit too dense:
 
 
I was hoping for a clearer spread of ingredients, something more like the first image, but with more interesting connections and, well, a lot more ingredients. Instead, I got this jumbled mess of orange and blue surrounding caster and brown sugar, and wondered if I had failed. After a bit of tinkering, I realized, no, I hadn’t failed. This was a relatively accurate visualization in terms of my algorithm.
 
In both visualizations, the colors are determined by a Louvain community detection method. In the first graph, the orange group in the upper right looks like a bloody mary and another cocktail or two, and the ingredients floating off to the left are probably some set of baked goods. In the second graph, items like salt, butter, and olive oil DO appear in a ton of recipes, and, if I understand things correctly, are going to draw nodes closer together in Western cuisine, just as garlic/scallions/ginger are going to be linked together, likely with soy sauce, in a huge number of stir fry recipes. I suspect salt and butter are used more frequently across some set of recipes than Derrida and Foucault are cited in some set of journal articles, so perhaps these networks really are just that dense. As more and more nodes are “cited” together with more frequency, the bonds between them get stronger and the first visualization begins to look a lot like the second.
 
I’m still unsure how to deal with this. On the one hand, I’ve already learned a few interesting things from crunching numbers and ingredients in my data: cinnamon is the seventeenth most used ingredient on Food.com, appearing in 25,346 recipes; eggs are the most used protein, appearing in 24,271 recipes; and ground beef is the most used animal protein, appearing in 10,762 recipes. One of the weirdest ingredients I’ve found was “top round London broil beef (ours was a Montana steer named Clyde).” On the other hand, my co-ingredient network seems to be of less use than I had hoped, at least in terms of this particular visualization.
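 
Rankings like the ones quoted above come down to counting, for each ingredient, the number of recipes that use it. A minimal sketch follows, assuming each recipe is already a list of ingredient strings. The helper name `recipe_counts` and the toy recipes are invented for illustration and are not drawn from the Food.com data.

```python
from collections import Counter


def recipe_counts(recipes):
    """Count how many recipes each ingredient appears in."""
    counts = Counter()
    for recipe in recipes:
        # set() makes an ingredient count once per recipe, even if listed twice.
        counts.update(set(recipe))
    return counts


# Toy data, made up for this example.
recipes = [
    ["salt", "butter", "cinnamon"],
    ["salt", "eggs", "butter"],
    ["salt", "ground beef"],
]
counts = recipe_counts(recipes)
ranked = counts.most_common()  # sorted from most used to least used
for rank, (ingredient, n) in enumerate(ranked, start=1):
    print(rank, ingredient, n)
```

Ingredients with the same count can come out in any order here, so a rank such as "seventeenth" is only meaningful when ties are broken by some explicit rule.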
 
I’m still working through all of this, both in terms of understanding network analysis and co-citation networks, and in terms of how (if?) they can be applied to food and recipes. If you have any suggestions or comments, I’d love to hear them.
 
 