Thursday 27 November 2014

Keywords. Collecting data.

In the last couple of weeks I blogged about keywords as they were displayed to the users of the OPAC in the library of the Peace Palace. I showed a couple of maps built with Gephi, some exhaustive, others very detailed.

But how did I collect and adapt the data for Gephi? In an earlier blog I already mentioned "keywords exposed to the user". So, to start with: what does "exposed keywords" mean? By this I mean the keywords as they occur in the presentation of the titles that were actually seen, perhaps even read, by the user. These are the keywords I am interested in.

The keywords listed in a single title can be considered a very small network: all of them are somehow linked to one another. The first step is therefore to gather all the presented titles, the second is to collect all these small networks of keywords, and the last is to combine them into one large file that Gephi can use.

The listings below give some examples of the file structure. The first listing shows five keywords (for Gephi they are nodes), each with a count of one, called 'use'. Underneath are the unique combinations of these keywords (for Gephi they are edges; five keywords give ten pairs), also with a count, called 'weight'. The code starting with a capital P is a unique keyword identifier: Gephi prefers working with simple codes to working with strings that are sometimes long and contain unusual characters. The second listing has the same structure, but with the keywords of another title. The third listing combines both sets of keywords. Note the keyword 'Women': it occurs in both titles, so in the third listing its 'use' is raised to two. Each listing produces its own Gephi map.

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",1
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076239519,"Refugees",1
P076242986,"Asylum",1
P076242366,"Women",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",2
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
P076239519,"Refugees",1
P076242986,"Asylum",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1

Of course I do not create these lengthy files (15,000 lines and more) by hand. I wrote a couple of crude PHP scripts to generate a rough file, which I then clean up with R and Microsoft Excel; the resulting file is ready to be used by Gephi. The scripts read a MongoDB collection that contains all the logging of OPAC use in our reading room. In this logging it is possible to detect 'exposed titles', and therefore also the keywords in them.
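
The PHP scripts themselves are not shown in this post. As a rough sketch of the counting step only, written in Python rather than PHP, the merge shown in the listings above could look like this. It assumes each title is already available as a list of (identifier, label) pairs; the function name `build_gdf` and that input shape are illustrative assumptions and say nothing about the actual structure of the MongoDB logging.

```python
from collections import Counter
from itertools import combinations

HEADER_NODES = "nodedef>name VARCHAR,label VARCHAR, use INT"
HEADER_EDGES = "edgedef>node1 VARCHAR,node2 VARCHAR,weight INT"


def build_gdf(titles):
    """Merge the keyword lists of several titles into one GDF text.

    `titles` is a list; each title is a list of (identifier, label)
    pairs in display order. This input shape is an assumption.
    """
    labels = {}         # identifier -> label, in first-seen order
    use = Counter()     # identifier -> number of titles containing it
    weight = Counter()  # (id1, id2) -> number of titles containing both

    for keywords in titles:
        for ident, label in keywords:
            labels.setdefault(ident, label)
        # Count each keyword once per title, keeping display order
        idents = list(dict.fromkeys(ident for ident, _ in keywords))
        use.update(idents)
        # Every unordered pair of keywords within a title is one edge
        for a, b in combinations(idents, 2):
            # Reuse the orientation seen first, so (a, b) and (b, a) merge
            key = (b, a) if (b, a) in weight else (a, b)
            weight[key] += 1

    lines = [HEADER_NODES]
    # Labels are assumed not to contain double quotes
    lines += [f'{ident},"{label}",{use[ident]}' for ident, label in labels.items()]
    lines.append(HEADER_EDGES)
    lines += [f"{a},{b},{count}" for (a, b), count in weight.items()]
    return "\n".join(lines)


# The two example titles from the listings above
title_1 = [
    ("P076244229", "Middle East"),
    ("P076242366", "Women"),
    ("P076239810", "Islam"),
    ("P076243265", "Family law"),
    ("P07624234X", "Islamic law"),
]
title_2 = [
    ("P076239519", "Refugees"),
    ("P076242986", "Asylum"),
    ("P076242366", "Women"),
    ("P356258076", "Girls"),
    ("P076256758", "Immigration"),
    ("P241824206", "Convention relating to the Status of Refugees (Geneva, 28 July 1951)"),
    ("P252290518", "Sex crimes"),
    ("P255990391", "Gender"),
    ("P332051005", "E-docs"),
]

gdf_text = build_gdf([title_1, title_2])
print(gdf_text)
```

Run on the two example titles, this yields 13 nodes, with 'Women' at a 'use' of 2, and 46 edges, as in the combined listing.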

To conclude: this is all very technical stuff, and we cannot expect our users to do this kind of research themselves on the basis of raw data provided by the library. However, some library staff members should certainly be able to do this, and then communicate about the results, using interesting maps for instance. Communicate to management about library collection issues, communicate to users about trends, communicate about almost lost niches in the collection, communicate about current, important subcollections which can be used in updating dossiers, research guides, alerting systems, etc.

My other blogs about 'Gephi in libraries':

Thursday 20 November 2014

Just below the surface.

Using Gephi ("an interactive visualization and exploration platform for all kinds of networks") to create unprocessed maps of the keywords exposed to the user in the library of the Peace Palace results in an image dominated by a few huge subjects. These subjects are indicators of the core business of the library: Human Rights, European Union, United States of America, International Law and International Criminal Law, to name just a few. To the left you see a very reduced image of such a map, but a few main keywords are still discernible.
These extra-large topics veil the keywords just below them. Zooming in will eventually bring you to the overshadowed keywords, but only at a very deep level, where you lose the overview of the structure. To the left we have zoomed in on an area clearly dominated by 'Human rights'. If I now remove 'Human rights' from this cluster, Gephi will recalculate a lot of values, because a predominant element has been removed. After all, all keywords constitute one network, so the map gets a new shape. This is especially true if all of the above-mentioned subjects are removed, and that is exactly what I have done. All the veiled keywords float to the surface.
Let us now choose another criterion, instead of the number of times a keyword occurs, to create a map. Gephi gives us a few other options; one of them is betweenness centrality.

Et voilà: after using the 'rank parameter' option in Gephi and choosing betweenness centrality, a new overall map appears, now with other nodes or keywords highlighted. Before zooming in, I will try to explain what betweenness centrality is. In brief, betweenness centrality is an indicator of a key position: the higher the value, the more important the role of the keyword. The value is calculated by looking at the shortest paths between every pair of keywords in the network and counting how often a keyword lies on such a path. The keyword that most often lies between two other keywords has the highest betweenness centrality value; such keywords are brokers or intermediaries. I used these values to create the map at the left.
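
To make the idea of a 'broker' concrete, here is a minimal Python sketch of betweenness centrality for an unweighted, undirected network, using the approach commonly known as Brandes' algorithm. It is an illustration of the definition and not necessarily how Gephi computes the value internally; the function name and the dictionary-of-sets input are assumptions. The example network is built from the keywords of two titles that share only the keyword 'Women'.

```python
from collections import deque
from itertools import combinations


def betweenness_centrality(adjacency):
    """Unnormalised betweenness for an unweighted, undirected graph.

    `adjacency` maps each node to the set of its neighbours.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from `source`, counting shortest paths
        order = []                                    # nodes in visiting order
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}       # shortest paths from source
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes, passing on each node's share
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Each pair of keywords was counted once from either end
    return {node: value / 2 for node, value in centrality.items()}


# Keywords of two titles; within a title every keyword is linked to every other
title_1 = ["Middle East", "Women", "Islam", "Family law", "Islamic law"]
title_2 = ["Refugees", "Asylum", "Women", "Girls", "Immigration",
           "Convention relating to the Status of Refugees (Geneva, 28 July 1951)",
           "Sex crimes", "Gender", "E-docs"]

adjacency = {}
for title in (title_1, title_2):
    for a, b in combinations(title, 2):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

scores = betweenness_centrality(adjacency)
print(scores["Women"])  # 32.0
```

'Women' scores 32 because it is the only link between the four other keywords of the first title and the eight other keywords of the second (4 × 8 pairs); every other keyword scores 0, since its neighbours are all directly connected to each other.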

After zooming in a little on the section of the map where the overall keyword 'Human Rights' used to be, a new picture arises. We see keywords like Children, Women and Family law, all of course related to Human rights and quite a few of them with a high key position or broker value. In short a new picture of related subjects emerges, indicating what the library of the Peace Palace could provide to its users.

By the way, the relations of keywords with a high betweenness centrality are not restricted to just one general subject. Keywords of this kind can have dense relations to other general subjects as well; see the image to the left.

Gephi maps not only give students and scholars a tool to explore the collections of libraries; they are also a clear reminder of the necessity of using keywords to conduct efficient bibliographic research.


Those of you who would like to have the data file used in Gephi to create all the maps shown, do contact me at a.janson at ppl dot nl.


Thursday 13 November 2014

Keywords! Maps! Let's dive in.


Last week I blogged about maps and keywords: Library and user: one interest? I presented a few maps, created with Gephi, with which I tried to compare the activities of the library staff with the interests of OPAC users. I talked about general subjects like 'international criminal law', 'space debris' and things like that.

These maps can also be used to get a detailed picture, although I admit the presented maps are a bit difficult to read after zooming in. However, librarians can use Gephi itself to do detailed research in order to find out what our patrons are looking for.
See for example this image, clipped from the Gephi overview graph frame, which shows keywords all about art, trade and illegal activities in just a tiny section of the map. I think librarians can use such insights to serve their users better, especially if they detect recurring patterns in searches over a longer period.

If librarians can 'translate' these insights into more relevant acquisitions, improvements to their research guides (in this case the Peace Palace Library, Cultural Heritage), or specific blogs or tweets, I am sure interested visitors will return to the library.

Of course users can manipulate the map with OPAC searches and focus on just one group (to the left you see the International Criminal Law group), but even one large group can be quite intimidating. Nevertheless those users who take some time can obtain a thorough knowledge about keywords grouped around one or two core subjects of the Peace Palace Library. Just start selecting a group using 'Group Selector' then click the largest bubble and check all the other keywords in the 'Information Pane'.

Librarians may use some of the more specific possibilities Gephi offers to look at maps in a very specific way. They may use for instance "Betweenness Centrality numbers" to look at 'broker' keywords, thus getting an idea about intermediaries. This knowledge too, I repeat, can be used to better respond to the needs of library users. I will write about this another time.

Thursday 6 November 2014

Library and user: one interest?

Quote: "But also interesting is, to see whether the library staff takes the interests of the patrons into account while acquiring documents for their collection? That is a subject for another blog."

Here I am referring to an earlier blogpost in which I tried to show what our users are looking for in the OPAC of the Peace Palace Library. In order to make this happen I focused on the use of our link resolver and presentation of a general subject in this link resolver. I used Tableau to create some graphs. 


However, the same thing can be done on the basis of the title descriptions which appeared on the screens in the reading room of the library after a successful search. So I collected all these titles and used all the keywords added to them to create a map using Gephi. In yet another blog I reported on this, although there I used the recent acquisitions of the month of September.


To gain insight into the question "whether the library staff takes the interests of the patrons into account while acquiring documents for their collection?" I created two maps for comparison: one about the acquisitions in October and the other about the use of the OPAC in the reading room in the same month.



Acquisitions OPAC

If I enumerate the main subjects that can be identified on the indicated webpages, we get the following lists (an asterisk marks an OPAC subject that also appears in the acquisitions list):


Acq:
  • International criminal law
  • Human rights
  • European Union
  • International trade
  • History
  • United Nations
  • Private international law
  • Islam/Islamic law
  • Law of the sea
  • Immigration
OPAC:
  • International criminal law*
  • Human rights*
  • European Union*
  • United Nations*
  • International humanitarian law
  • International commercial arbitration
  • History/Politics*
  • Environmental protection
  • Law of the sea*
  • Space law

So our user behavior indicates special interest in Space, Environment and Commerce, among other things, which were not covered by the acquisitions of our library staff. However, the library acquired material about Immigration, Islamic Law and Trade which was not looked for by our OPAC users. But of great importance is still the observation that both parties share their interest in the core business of the library of the Peace Palace: Criminal law, Human rights, European Union.
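
The comparison of the two lists can be sketched with plain set operations in Python. This is only an illustration of the reasoning above, not the way the lists were originally compared. One assumption is made explicit in the code: a combined entry such as 'History/Politics' or 'Islam/Islamic law' is compared on its first part only.

```python
def normalise(subject):
    # Assumption: a combined entry such as "History/Politics" is
    # compared on its first part only ("History")
    return subject.split("/")[0].strip()


acquisitions = [
    "International criminal law", "Human rights", "European Union",
    "International trade", "History", "United Nations",
    "Private international law", "Islam/Islamic law",
    "Law of the sea", "Immigration",
]
opac = [
    "International criminal law", "Human rights", "European Union",
    "United Nations", "International humanitarian law",
    "International commercial arbitration", "History/Politics",
    "Environmental protection", "Law of the sea", "Space law",
]

acq_set = {normalise(subject) for subject in acquisitions}
opac_set = {normalise(subject) for subject in opac}

shared = sorted(acq_set & opac_set)         # interest of both staff and users
only_opac = sorted(opac_set - acq_set)      # looked for, but not acquired
only_acquired = sorted(acq_set - opac_set)  # acquired, but not looked for

print("Shared:", shared)
print("OPAC only:", only_opac)
print("Acquisitions only:", only_acquired)
```

With this normalisation the six shared subjects are exactly the ones marked with an asterisk in the OPAC list.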

Differences exist only in the peripheral areas, and these can largely be related to current events, like boat refugees in the Mediterranean Sea, terrorism in the Middle East, space debris and environmental issues.

Anyway, the simple fact that the 'small subjects' are also found and acquired, means that the library of the Peace Palace is on the right track. The 'small subjects' looked for now, were added in the past!