Thursday, December 11, 2014

Country profiles and OPAC use.

The library of the Peace Palace serves a global community. I can see this in the standard logging of our website: every page request ends up in a log file, and every line in this log file contains the IP number of the visitor who requested that page. This IP number can be translated into a country of origin.
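Translating an IP number into a country of origin can be sketched with a sorted range table and a binary search. The tiny table below is purely hypothetical; a real setup would load many thousands of ranges from a GeoIP data file:

```python
import bisect
import ipaddress

# Hypothetical range table: (start of range as integer, country code).
# In practice this would be loaded from a GeoIP data file.
RANGES = sorted([
    (int(ipaddress.ip_address("5.0.0.0")), "NL"),
    (int(ipaddress.ip_address("80.0.0.0")), "FR"),
    (int(ipaddress.ip_address("130.0.0.0")), "US"),
])
STARTS = [start for start, _ in RANGES]

def country_for_ip(ip: str) -> str:
    """Return the country code of the range this IP number falls into."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, n) - 1   # last range starting at or before n
    return RANGES[i][1] if i >= 0 else "??"
```

Once every log line is mapped to a country this way, the counts per country are exactly what a tool like Tableau needs to draw a world map.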

And that is what I did in my blog "MOOC: learning and instruction: Tableau and library use". In that blog I presented several maps, one of them dealing with the use of our 'human rights' website pages. The map is to the left (sorry about Alaska). One might say that the use of these pages is at least partly driven by Internet searches, and that is indeed usually the case. Only a handful of people will have added the library of the Peace Palace to their bookmarks.

Of course it is possible to follow users as they move around our website, but in general they leave short tracks, too short, I think, to draw any substantiated conclusions about what exactly they are looking for. It is a lot easier to rise above the personal level and pay more attention to the geographical one. That means creating world maps, just to start with.

Next to a website, libraries also provide an OPAC, normally with a web search interface to their collections. And our OPAC server, you guessed it, produces log files. These log files look quite different from the log files of regular web servers. Since the beginning of this week we have been collecting these files (thanks to OCLC, The Netherlands) in order to parse them, store the relevant data in a database and draw some conclusions. As stated, these files are a bit more complicated than the web server log files I usually look at, but I am sure that in the next couple of months I will be able to deal with them. Just a random example of one log entry:

#XXX.XXX.XXX.XX 60765 1418079639.497074 GET /DB=1/SET=2/TTL=1/CMD?ACT=SRCHA&IKT=4&SRT=YOP&TRM=population HTTP/1.1
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
Cookie: DB="1"; PSC_1="$c%2526db%253D$d%07%08"

Where you see all the X's is the IP number, which I replaced for obvious reasons. On the same line I see 'IKT', which indicates the index used for searching, and 'TRM' with the actual term searched for: 'population'. I can also see that our user retrieved another set of results before (SET=2, so there must be a previous SET=1). In the Cookie: line I notice the session identifier (b6cadb8a-0), which I can use to reconstruct all the actions of this user. In summary, there are many possibilities here to collect useful information, information the library can use to provide up-to-date services, such as instruction, to its users.
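As a sketch of how such an entry could be pulled apart in Python (the field meanings are as described above; the helper function is my own, not part of any OCLC tooling):

```python
from urllib.parse import parse_qs

# The example request line from the log entry above (IP number replaced).
request = "GET /DB=1/SET=2/TTL=1/CMD?ACT=SRCHA&IKT=4&SRT=YOP&TRM=population HTTP/1.1"

def parse_opac_request(line: str) -> dict:
    """Split the request path into its segments (DB, SET, TTL) and query
    parameters (ACT, IKT, SRT, TRM) and return them as one dictionary."""
    path = line.split()[1]                     # "/DB=1/SET=2/.../CMD?ACT=..."
    path_part, _, query = path.partition("?")
    fields = {}
    for segment in path_part.strip("/").split("/"):
        key, _, value = segment.partition("=")
        if value:                              # skip bare segments like "CMD"
            fields[key] = value
    for key, values in parse_qs(query).items():
        fields[key] = values[0]
    return fields

info = parse_opac_request(request)
# info["TRM"] is the search term, info["IKT"] the index, info["SET"] the result set
```

Records like `info`, together with the IP number, time stamp and session identifier, are exactly what would go into the database for later analysis.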

To end this blog I will show two maps (made with Tableau) containing counts of successful searches in our OPAC during just two days, 8-9 December. One map shows the use of our OPAC in Europe and, not surprisingly, The Netherlands scores highest.

In the other map I had to leave out The Netherlands to create sufficient distinction in colour. This map contains the same data as the map above. 

Wednesday, December 3, 2014

Presenting keywords. Eh? Which keywords? And how?

It is always hard to get started with bibliographic research, especially for those patrons and scholars who realize it is important not to just search with a few words in a very general index, but to use keywords (sometimes also called subject headings). These users know that a dedicated group of librarians has thoroughly examined the publications added to the OPAC and has enriched them with keywords. And therein lies a problem, because one might ask: "what do these keywords look like?".

In my last couple of blogs I focused on how to get an idea of subject areas, large and small. I used Gephi to create maps that could indeed give an impression of subjects and subject areas. For technical reasons, however, these maps dealt with only a relatively small set of data. In the latest published map I incorporated only about 4,500 titles exposed to our users in the reading room of the library. This may seem like a lot, but in fact it is not: smaller subjects may not appear during that specific time frame, or stay unnoticed. To get a more thorough impression of the subjects and keywords used in the library, one should use as large a set of information as possible. Luckily I have collected such a set, but before I come to that, I would like to sketch a situation.

So, suppose a patron is looking for publications about biological warfare and the Security Council (we are, after all, the Peace Palace Library). Chances are (about 70% of our users do so) he uses the all-words index in our OPAC and types 'biological warfare security council'. No hits. He then tries 'biological warfare' using the same index. 92 hits, which he quickly scans for what interests him (about 6 pages of title info). He then tries 'security council'. 1,416 hits, which he does not scan, because it is too much. Now suppose he suddenly realizes he should also use the keyword index. Biological warfare: 213 hits. Security council: 2,741 hits. Combined (we just assume he knows how to do this): 1 hit (a freely available PDF file containing references to other freely available texts and links to websites).
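The 'combined' step our hypothetical patron struggles with is, under the hood, nothing more than a boolean AND over the two keyword result sets. With made-up publication ids, the idea looks like this:

```python
# Hypothetical sets of publication ids returned by each keyword search.
biological_warfare = {101, 102, 103, 104}
security_council = {103, 205, 206}

# Combining the two keyword searches is a set intersection (boolean AND):
# only publications carrying both keywords remain.
combined = biological_warfare & security_council
```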

It is my opinion that libraries should bring to the forefront sets of keywords, all related to one general subject. Library users then only have to check these overviews to understand which keywords they should use in their research. To build up such a set, it is important to collect just those keywords which were, preferably over a longer time span, exposed to or used by our patrons. This way we can be sure all relevant keywords are collected.

In January of this year I started to collect all the records which were, in one way or another, seen or used by our patrons and visitors. This includes the records 'seen' by search robots, like the ones from Google. The database (we use MongoDB as the database system) now contains almost 7,500,000 records, but keep in mind that less than 10% of this number can actually be attributed to human beings. Each record contains a publication id, an IP number, a time stamp and the keywords belonging to that publication. Given the size of the database it is possible to collect the actually used or exposed keywords related to just one general subject.
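Collecting the keywords that co-occur with one 'main' keyword can be sketched in plain Python. The records below are made up but mirror the schema just described; against the real MongoDB collection the same counting could be done with an aggregation query:

```python
from collections import Counter

# Hypothetical records mirroring the schema described above:
# publication id, IP number, time stamp, and the publication's keywords.
records = [
    {"pub_id": 1, "ip": "x", "ts": 0,
     "keywords": ["Biological and chemical weapons", "Terrorism"]},
    {"pub_id": 2, "ip": "y", "ts": 1,
     "keywords": ["Biological and chemical weapons", "Security Council"]},
    {"pub_id": 3, "ip": "z", "ts": 2,
     "keywords": ["Security Council", "Peacekeeping"]},
]

def cooccurring_keywords(recs, main):
    """Count every keyword that appears together with the main keyword."""
    counts = Counter()
    for rec in recs:
        if main in rec["keywords"]:
            counts.update(k for k in rec["keywords"] if k != main)
    return counts

counts = cooccurring_keywords(records, "Biological and chemical weapons")
```

Run over the full database, this is the kind of tally that produced the 410-keyword set discussed below.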

I managed to create such a set of keywords as an example. They were all used in some combination with my 'main' keyword, Biological and chemical weapons. The set contains a little more than 410 different keywords, some used quite a lot, others just a few times.

So what remains is to determine what kind of presentation gives a quick and thorough impression of specific keywords. For now I decided to use Tableau and to draw three diagrams in different colors. Each diagram shows the relative amount of use and the keyword description; each successive diagram presents keywords that were used less and less. So if you are looking for a publication about chemical warfare and genetic manipulation, after a peek at the diagrams below you will know which keywords will get you to this information. (Curious? Look at this chapter: Terrorism in the Genomic Age / John Ellis, 2004.)
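Splitting the keyword set into three such diagrams amounts to ranking the keywords by use and cutting the ranking into three bands. A minimal sketch, with made-up counts:

```python
from collections import Counter

# Hypothetical usage counts for keywords around one general subject.
counts = Counter({
    "Terrorism": 120, "Security Council": 95, "Disarmament": 40,
    "Genetic manipulation": 12, "Public health": 9, "Anthrax": 3,
})

def three_bands(counts):
    """Rank keywords by use and split them into three bands,
    from heavily used down to rarely used."""
    ranked = [kw for kw, _ in counts.most_common()]
    third = max(1, len(ranked) // 3)
    return ranked[:third], ranked[third:2 * third], ranked[2 * third:]

heavy, medium, rare = three_bands(counts)
```

Each band then feeds one Tableau diagram, with the counts driving the color intensity.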

And let's not forget serendipity.