It is always hard to get started with bibliographic research. Especially for those patrons and scholars who realize it is important not to just search by some words using a very general index, but to use keywords (sometimes also called subject headings). These users know that a very dedicated group of librarians has thoroughly examined the publications added to their OPAC and has enriched them with keywords. And therein lies a problem, because one might ask: "how do these keywords look like?".
In my last couple of blogs I was focusing on how to get an idea of subject areas (huge and small). I used Gephi to create maps which could indeed give an impression of subjects and subject areas. For technical reasons, however, these maps dealt with only a relatively small set of data. In the latest published map I only incorporated about 4,500 exposed titles to our users in the reading room of the library. This looks like it is much, but in fact it is not. Therefore smaller subjects may not appear during this specific time frame or stay unnoticed. To get a more thorough impression of the subjects and the keywords used in the library one should use as large a set of information as possible. Luckily I have collected such a set, but before I come to that, I like to sketch a situation.
So, suppose some patron is looking for publications about biological warfare and the Security Council (we are after all the Peace Palace Library). Chances are (about 70% of our users do so) he uses the all words index in our OPAC and types 'biological warfare security council'. No hits. He then tries 'biological warfare' using the same index. 92 hits, which he quickly scans to look for what interests him (about 6 pages of title info). He then tries 'security council'. 1,416, hits which he does not scan, because it is to much. Now suppose he all of a sudden realizes he should also use the keyword index. Biological warfare, 213 hits. Security council, 2,741 hits. Combined (we just assume that he knows how to do this), 1 hit (a freely available PDF file containing references to other freely available texts and links to websites).
It is my opinion that libraries should bring to the forefront sets of keywords, all related to one general subject. Library users then just have to check these overviews in order to comprehend which keywords they should use in their research. In order to build up this set it is important to collect just the keywords which were, preferably during a longer time span, exposed or used by our patrons. This way we can be sure all relevant keywords can be collected.
In January of this year I started to collect all the records, which were, in one way or another, seen or used by our patrons and visitors. This includes the records 'seen' by search robots, like the ones from Google. The database (we use for this MongoDB as a database system) contains now almost 7,500,000 records, but keep in mind that less then 10% of this number actually can be attributed to human beings. Each record contains, publication id, ip-number, time stamp and the keywords belonging to this publication. Given the size of the database it is possible to collect the really used or exposed keywords related to just one general subject.
I managed to create such a set of keywords as an example. They all were used in some combination with my 'main' keyword Biological and chemical weapons. The set contains a little bit more than 410 different keywords, some used quite a lot, others just a few times.
So what remains is, to determine what kind of presentation to use to get a quick and thorough impression about specific keywords. I decided for now to use Tableau and to draw three diagrams with different colors. Each diagram indicates the relative amount of use and the keyword description, each next diagram present the keywords used less and less. So if you are looking for a publication about chemical warfare and genetic manipulation, after a peek at the diagrams below you will know what keywords to use to get to this information. (Curious? Look at this chapter: Terrorism in the Genomic Age / John Ellis, 2004.)
And let's not forget serendipity.