Dear scholars and information technologists,
I am a software developer, and I am currently looking for a tool or library to parse bibliographic reference lists for a client of mine.
I need to import data from a website with static HTML pages into Drupal, the content management system I most often use. Part of the data consists of reference lists. I need to parse each reference into its metadata parts, that is, author, book title, journal, pages, etc., according to its type (article, book, etc.). An example of a page containing reference lists to be parsed can be found here: http://faculty.washington.edu/kpotter/xtxt1.htm
WHAT I NEED
I have been investigating this subject, and I have gathered a bunch of tools and code libraries. I will test them, of course. However, I usually import structured data (old relational databases or structured files, such as JSON or XML); this is my first time dealing with reference lists written as plain, unstructured text.
WHAT I ASK
Here is where I ask for help: I would like to hear from the community (you guys!) about your experience with a task like mine:
- Could anyone who has used such a tool/library share his or her experience?
- Is there any tool/library that you would recommend?
- Do you know of any specific DH projects that have involved such parsing?
- Any other tips ...?
A few notes on my requirements:
- Apart from parsing the reference line into its metadata (title, author, pages, etc.), it would be great if the parsing results included a value that says what kind of reference it is (article, book, in book, booklet, proceedings, in proceedings, PhD thesis, Master's thesis, conference, etc.).
- The programming language and the platform it runs on are not important.
- Locating the reference lists is not an issue at all, nor is separating each list into individual reference lines. Only the parsing of a single reference line is the issue.
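
To make the target concrete, here is a minimal Python sketch of the kind of input and output I have in mind. Everything in it is an assumption for illustration only: the `ParsedReference` fields, the `parse_reference` function, the single regular expression, and the sample reference line are all made up and do not come from any existing tool or from the page linked above. The pattern recognizes just one layout ("Authors (Year). Title. Journal, Volume, Pages."); a real tool would have to cope with far more variation.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedReference:
    """Hypothetical result shape: metadata parts plus a reference type."""
    ref_type: str                    # e.g. "article", "book", or "unknown"
    authors: Optional[str] = None
    year: Optional[str] = None
    title: Optional[str] = None
    container: Optional[str] = None  # journal name (or book title)
    volume: Optional[str] = None
    pages: Optional[str] = None


# Assumed layout, one of many possible ones:
# "Authors (Year). Title. Journal, Volume, FirstPage-LastPage."
ARTICLE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<container>.+?),\s*(?P<volume>\d+),\s*"
    r"(?P<pages>\d+\s*[-\u2013]\s*\d+)\.?$"
)


def parse_reference(line: str) -> ParsedReference:
    """Parse a single reference line; return type 'unknown' if no pattern fits."""
    match = ARTICLE_PATTERN.match(line.strip())
    if match is None:
        return ParsedReference(ref_type="unknown")
    return ParsedReference(ref_type="article", **match.groupdict())


if __name__ == "__main__":
    # Invented sample line, not taken from the real data.
    sample = "Doe, J. (1999). A sample title. Journal of Examples, 12, 34-56."
    print(parse_reference(sample))
```

The point of the sketch is the shape of the result (one record per reference line, with a type field next to the metadata fields), not the regular expression itself, which is exactly the part I hope an existing tool or library can replace.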