Parsing Bibliographic Reference Lists - Call for help!

Dear scholars and information technologists,

I am a software developer, and I am currently looking for a tool or library to parse bibliographic reference lists for a client of mine.



I need to import data from a website made of static HTML pages into Drupal, the content management system I use most often. Part of the data consists of reference lists. I need to parse each reference into its metadata parts (author, book title, journal, pages, etc.) according to its type (article, book, etc.). An example of a page containing reference lists to be parsed can be found here:



I have been investigating this subject, and I have gathered a number of tools and code libraries. I will test them, of course. However, I usually import structured data (old relational databases or structured files such as JSON or XML); this is my first time dealing with bibliographic references written as plain, unstructured lines of text.



Here is where I ask for help: I would like to hear from the community (you guys!) about your experience with a task like mine:

  • Could anyone who has used such a tool/library share his or her experience?
  • Is there any tool/library that you would recommend?
  • Do you know of any specific DH projects that have involved such parsing?
  • Any other tips ...?



  1. Apart from parsing the reference line into its metadata (title, author, pages, etc.), it would be great if the parsing results included a value that says what kind of reference it is (article, book, in book, booklet, proceedings, in proceedings, PhD thesis, Master's thesis, conference, etc.).
  2. The programming language and the platform it runs on are not important.
  3. Locating the reference lists is not an issue at all, nor is separating each list into individual reference lines. Only the parsing of a single reference line is the issue.
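
To make the notes above concrete, here is a minimal sketch in Python of the kind of output being asked for: one record per reference line, holding the metadata fields plus a value for the reference type. Everything in it is an assumption made up for illustration: the `ParsedReference` fields, the `parse_reference` function, the two regular expressions, and the "Authors (Year). Title. ..." layout they expect. The sample references are invented as well. Hand-written patterns like these only cover one citation style and will fail on real, mixed-style lists, which is why a dedicated tool or library is needed.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedReference:
    """Hypothetical target structure: metadata parts plus the reference type."""
    ref_type: str  # "article", "book", or "unknown" in this sketch
    authors: List[str] = field(default_factory=list)
    year: Optional[str] = None
    title: Optional[str] = None
    journal: Optional[str] = None
    volume: Optional[str] = None
    pages: Optional[str] = None
    publisher: Optional[str] = None


# Assumed layout: "Authors (Year). Title. Journal Volume: Pages."
ARTICLE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>.+?)\s+(?P<volume>\d+):\s*(?P<pages>\d+\s*[-–]\s*\d+)\.?$"
)

# Assumed layout: "Authors (Year). Title. Place: Publisher."
BOOK = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<place>[^:.]+):\s*(?P<publisher>[^.]+)\.?$"
)


def _split_authors(raw: str) -> List[str]:
    """Split an author string on ' and ' or ';' (an assumption about the style)."""
    return [a.strip() for a in re.split(r"\s+and\s+|;\s*", raw) if a.strip()]


def parse_reference(line: str) -> ParsedReference:
    """Parse a single reference line; returns ref_type='unknown' if nothing matches."""
    text = line.strip()

    # Try the article pattern first: the looser book pattern would also
    # accept an article line and mislabel it.
    m = ARTICLE.match(text)
    if m:
        return ParsedReference(
            ref_type="article",
            authors=_split_authors(m["authors"]),
            year=m["year"],
            title=m["title"],
            journal=m["journal"],
            volume=m["volume"],
            pages=m["pages"],
        )

    m = BOOK.match(text)
    if m:
        return ParsedReference(
            ref_type="book",
            authors=_split_authors(m["authors"]),
            year=m["year"],
            title=m["title"],
            publisher=m["publisher"].strip(),
        )

    return ParsedReference(ref_type="unknown")


if __name__ == "__main__":
    # Invented sample lines, not taken from the site to be imported.
    print(parse_reference(
        "Doe, J. and Roe, R. (2001). A study of things. Journal of Examples 12: 34-56."
    ))
    print(parse_reference("Smith, A. (1999). An Invented Book. London: Example Press."))
```

A record like this maps naturally onto fields of a Drupal content type during import, with the reference type deciding which fields are expected to be filled.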


Amir Simantov


1 comment

This is totally the kind of problem you should throw at the Code4Lib listserv. It'll be like catnip.