Thoughts about libraries, data visualization and things like that: Keywords. Collecting data.

In the last couple of weeks I blogged about keywords as they were displayed to the users of the OPAC in the library of the Peace Palace. I showed a couple of maps built with Gephi, some exhaustive, others very detailed.

But how did I collect the data and adapt it for use in Gephi? I already mentioned "exposed keywords to the user" in an earlier blog. So, to start with: what do I mean by "exposed keywords"? I mean the keywords as they occur in the presentation of the titles which were actually seen, perhaps even read, by the user. These are the keywords I am interested in.

The enumeration of keywords in just one title can be considered a very small network: each keyword is linked to every other keyword assigned to that title. Therefore, the first step is to gather all the presented titles, the second step is to collect all these small networks of keywords, and the last step is to create one huge file which can be used by Gephi.
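A minimal Python sketch of the "small network" idea, assuming the keywords of one title are available as a list of identifiers. The function name `title_network` is hypothetical and this is not the script used for the maps; it only illustrates that every unordered pair of keywords in a title becomes one edge.

```python
from itertools import combinations


def title_network(keyword_ids):
    """Return (nodes, edges) for the keywords shown with one title.

    Every keyword becomes a node and every unordered pair of keywords
    becomes one edge, so n keywords give n * (n - 1) / 2 edges.
    """
    nodes = list(dict.fromkeys(keyword_ids))  # drop repeats, keep order
    edges = list(combinations(nodes, 2))
    return nodes, edges


nodes, edges = title_network(
    ["P076244229", "P076242366", "P076239810", "P076243265", "P07624234X"]
)
print(len(nodes), len(edges))  # prints: 5 10
```

A title with five keywords therefore contributes ten edges, and a title with nine keywords contributes thirty-six.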

In the table below I give some examples of the file structure. In the left column you see five keywords (for Gephi they are nodes), each with a count of one, called 'use'. Underneath you see the unique combinations of these keywords (for Gephi they are edges), also with a count, called 'weight'. The number after the capital P is a unique keyword identifier: Gephi works more easily with short codes than with strings that are sometimes long and contain unusual characters. The middle column shows the same for the keywords of another title. In the third column both sets of keywords are combined. Note the keyword 'Women': it occurs in both titles, so in the third column its 'use' is raised to two. At the bottom of each column you see the corresponding Gephi map.

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",1
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076239519,"Refugees",1
P076242986,"Asylum",1
P076242366,"Women",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1

nodedef>name VARCHAR,label VARCHAR, use INT
P076244229,"Middle East",1
P076242366,"Women",2
P076239810,"Islam",1
P076243265,"Family law",1
P07624234X,"Islamic law",1
P076239519,"Refugees",1
P076242986,"Asylum",1
P356258076,"Girls",1
P076256758,"Immigration",1
P241824206,"Convention relating to the Status of Refugees (Geneva, 28 July 1951)",1
P252290518,"Sex crimes",1
P255990391,"Gender",1
P332051005,"E-docs",1
edgedef>node1 VARCHAR,node2 VARCHAR,weight INT
P076244229,P076242366,1
P076244229,P076239810,1
P076244229,P076243265,1
P076244229,P07624234X,1
P076242366,P076239810,1
P076242366,P076243265,1
P076242366,P07624234X,1
P076239810,P076243265,1
P076239810,P07624234X,1
P076243265,P07624234X,1
P076239519,P076242986,1
P076239519,P076242366,1
P076239519,P356258076,1
P076239519,P076256758,1
P076239519,P241824206,1
P076239519,P252290518,1
P076239519,P255990391,1
P076239519,P332051005,1
P076242986,P076242366,1
P076242986,P356258076,1
P076242986,P076256758,1
P076242986,P241824206,1
P076242986,P252290518,1
P076242986,P255990391,1
P076242986,P332051005,1
P076242366,P356258076,1
P076242366,P076256758,1
P076242366,P241824206,1
P076242366,P252290518,1
P076242366,P255990391,1
P076242366,P332051005,1
P356258076,P076256758,1
P356258076,P241824206,1
P356258076,P252290518,1
P356258076,P255990391,1
P356258076,P332051005,1
P076256758,P241824206,1
P076256758,P252290518,1
P076256758,P255990391,1
P076256758,P332051005,1
P241824206,P252290518,1
P241824206,P255990391,1
P241824206,P332051005,1
P252290518,P255990391,1
P252290518,P332051005,1
P255990391,P332051005,1
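The way the two titles are combined in the last block can be sketched in Python as follows. This is an illustration under assumptions, not the code behind the maps: `merge_titles` is a hypothetical name, and the pairs are sorted so that the same two keywords always yield the same edge regardless of the order in which a title lists them.

```python
from collections import Counter
from itertools import combinations


def merge_titles(titles):
    """Count keyword 'use' and keyword-pair 'weight' over several titles.

    `titles` is a list of keyword-identifier lists, one per exposed title.
    """
    use = Counter()
    weight = Counter()
    for keyword_ids in titles:
        unique = sorted(set(keyword_ids))  # one count per title, stable pair order
        use.update(unique)
        weight.update(combinations(unique, 2))
    return use, weight


title_a = ["P076244229", "P076242366", "P076239810", "P076243265", "P07624234X"]
title_b = ["P076239519", "P076242986", "P076242366", "P356258076", "P076256758",
           "P241824206", "P252290518", "P255990391", "P332051005"]

use, weight = merge_titles([title_a, title_b])
print(use["P076242366"], len(use), len(weight))  # prints: 2 13 46
```

'Women' (P076242366) is the only keyword the two titles share, so its 'use' becomes two while every edge keeps a 'weight' of one. A weight above one appears only when two keywords occur together in more than one title.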

Of course I do not create these lengthy files (15,000 lines and more) by hand. I wrote a couple of crude PHP scripts that generate a rough file. I clean this file up with R and Microsoft Excel, and the result is ready to be used by Gephi. The scripts read from a MongoDB collection that contains all the logging of OPAC use in our reading room. From this logging it is possible to detect the 'exposed titles', and therefore also the keywords in them.
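The final step, writing the counts out in the node and edge layout shown in the examples above, might look roughly like this in Python. It is a sketch only and not the PHP scripts mentioned here; `to_gdf`, `labels`, `use` and `weight` are hypothetical names, and replacing double quotes inside a label is an assumption made to keep the quoted field intact.

```python
def to_gdf(labels, use, weight):
    """Render node and edge counts as GDF text for import into Gephi.

    labels: {keyword_id: keyword text}
    use:    {keyword_id: number of titles showing the keyword}
    weight: {(keyword_id_1, keyword_id_2): number of shared titles}
    """
    lines = ["nodedef>name VARCHAR,label VARCHAR,use INT"]
    for node_id, count in use.items():
        label = labels[node_id].replace('"', "'")  # assumption: avoid breaking the quotes
        lines.append(f'{node_id},"{label}",{count}')
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight INT")
    for (node1, node2), count in weight.items():
        lines.append(f"{node1},{node2},{count}")
    return "\n".join(lines) + "\n"


labels = {"P076242366": "Women", "P076239810": "Islam"}
use = {"P076242366": 2, "P076239810": 1}
weight = {("P076242366", "P076239810"): 1}
print(to_gdf(labels, use, weight))
```

The returned text can be saved with a `.gdf` extension and opened in Gephi.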

To conclude: this is all very technical stuff, and we cannot expect our users to do this kind of research themselves on the basis of raw data provided by the library. However, some library staff members should certainly be able to do it, and then communicate the results, using interesting maps for instance. Communicate to management about library collection issues, to users about trends, and more generally about almost lost niches in the collection and about current, important subcollections which can be used in updating dossiers, research guides, alerting systems, etc.

My other blogs about 'Gephi in libraries':

Thursday, 27 November 2014
