I’ve been curious about how to move an online collection from HTML into more of a Linked Data model. When I first started looking at it, schema.org was the Place to Be. Lately, though, it seems like Wikidata is where all the Cool Kids hang out.

So, I thought I would learn a bit about SPARQL, a query language for RDF-style semantic databases like Wikidata. RDF (Resource Description Framework) is a structure based on triples, variously described as subject-predicate-object, thing-relation-thing, or some similar analogy. It lets you query the whole pile of Wikidata by asking, for example, for all the objects that share the same predicate.

I took a couple of workshops at conferences about all this, and it left me both daunted and bowed.

But now I’m ready to figure it out.

So, I found a really good little video explaining the SPARQL language by Navino Evans, one of the founders of Histropedia, who used an example of the women who graduated from the University of Edinburgh. It goes through how to make stuff appear (using SELECT), how to define what you want in your table (using WHERE), how to use labels, and some of the different visualisation tools you can use.

It was really easy to follow along, so I did — substituting University of Toronto for his institution.

How to define your SPARQL query

On the first line, use SELECT to name the columns you want to see in your table. The variables themselves get defined in the WHERE section that follows. For example:

SELECT ?person ?personLabel ?birthPlaceLabel ?coordinates ?birthDate ?deathDate ?image

This says: show the person and their name, their place of birth and its latitude and longitude, their dates of birth and death (I am a librarian), and their image.

In the WHERE section, you write a triple: the subject (in this case ?person, the thing the statement is about), the property or predicate (wdt:P27, country of citizenship) and the value or object (wd:Q16, Canada). That is, give me a list of Canadians. You end each statement with a period. For example:

WHERE {
?person wdt:P27 wd:Q16 .
}

So, that would be a pretty huge result set!

It also shows you how to use a service, in this case the Label service, which puts a human-readable name onto the identifier. For example, the Canadian author Gail Bowen is entity Q1491217. Without the Label service, she would be referred to only as Q1491217 in your table. Labels are fun!

Here is the code I ended up running with (pun not really intended) in the Wikidata Query Service. The text after the # is a comment, explaining the code. It will run that way in the WQS, but not in some of the other ways of displaying your results.

SELECT ?person ?personLabel ?birthPlaceLabel ?coordinates ?birthDate ?deathDate ?image

WHERE {
?person wdt:P27 wd:Q16 .                        #country of citizenship (P27) is Canada (Q16)
?person wdt:P69 wd:Q180865 .                    #educated at (P69) UofT (Q180865)
?person wdt:P21 wd:Q6581072 .                   #sex or gender (P21) is female (Q6581072)
?person wdt:P19 ?birthPlace .                   #place of birth (P19) is named ?birthPlace, add ?birthPlace to query, with Label to call the Label service
?birthPlace wdt:P625 ?coordinates .             #co-ordinate location (P625) is named ?coordinates, returns latlong 
?person wdt:P569 ?birthDate .                   #date of birth (P569) is named ?birthDate
OPTIONAL {?person wdt:P570 ?deathDate .}        #date of death (P570) is named ?deathDate
OPTIONAL {?person wdt:P18 ?image .}             #image (P18) is called ?image

SERVICE wikibase:label {                        #label service, adds labels to the query results (which are otherwise just Q-numbers); add ?personLabel to the query
    bd:serviceParam wikibase:language "en" .
}
}

This says, basically: give me a list of people of Canadian citizenship, educated at the University of Toronto, who are female. Show me where they were born, and the latitude and longitude of that place. Give me their death dates and images, if available. Oh, and please use the names of the things I am looking for, not just their ID numbers.

If you don't wrap those two fields in OPTIONAL, they become mandatory. That means that if a person isn't dead, or if no picture of them exists, they don't make the list. For example, this result set currently has 289 results.

  • If the image is not optional, but death date is, there now are 67 results.
  • If the death date is not optional, but image is, there now are 98 results.
  • If neither is optional, there are only 15 results. For readability’s sake, this is the one pictured in the graphic interpretations, below.

I like this example, because you really do run up against the invisibility of the female gender in the Wikiverse (and, let’s face it, in the Universe). Imagine if I were looking for aboriginal women!

Wikidata Query Service

This visualisation is a table. By clicking on the little eye (top left, under the code), you can choose from a number of other ways of seeing your data, including map and timeline. To run the code, press the little blue arrow on the left, above the table and below the code.

Here is a timeline of the women who went to UofT, who have already died, and who have Wikidata records with images uploaded. (NOTE: You can change this by adding data to the dataset. Anyone can do it. Surely more than 15 women of note graduated from UofT, lived, made a valuable contribution, and died. The eldest was born in 1868, for goodness' sake.)

Screenshot: Histropedia Wikidata Query Timeline

This is a timeline visualisation using Histropedia's Wikidata viewer. By clicking the little eye-shaped icon on the right, below the code in the SPARQL query window, you can see a number of other options, including a map (if you include the P625, coordinate location, field in your query).

Here is a grid view of the same query.

Screenshot: Wikidata Query Service grid view

So, daunted I should not have been, really. I must address my fear of acronyms (FOA) and plough forward undaunted in the future. This was NHAA (not hard at all).

Next, I need to learn how to contribute to Wikidata. I made a foray a year or two ago, using a Wikidata game, but it didn't give me enough information and I mis-attributed a couple of pieces of data. They didn't slap me down with as much glee and (un)intentional violence as editors on the Wikipedia site, but it still left me feeling a bit back-footed.

Because I have a slower learning curve, I need to understand 100% before contributing so that I am not attacked. These projects are not exactly transparent to someone who doesn’t already understand the ecosystems (or who, like me, has been away from it for a couple of decades). Don’t bite the newbies, dude. It’s a thing.

Wikidata is very much more welcoming than Wikipedia, and less ambiguous, to boot. Be not afraid.

It’s been a while!

Toronto Public Library (TPL) has digitized about 11 thousand historical Canadian ebooks, which are freely downloadable on their website. It would be cool to lay them out on virtual shelves, and make them into a virtual library where you can browse items and get that library serendipity thing happening.

To do that, I want a list of the books by call number, title and publication date. If I sort this list into call-number order, I can find related books by their proximity to the one I am seeing.

I already have a PowerShell script to pull data from a website so I’ll start from there.

First, I search the archive to limit the results to the books I want, and then I get the web address of the RSS feed for those results (bottom left-hand corner of the first page of results).  For example, the link to the human-viewable page is: http://www.torontopubliclibrary.ca/search.jsp?Erp=20&N=38537+37906+38531&Ns=p_dig_date_record&Nso=1&view=grid

where, I guess

  • Erp is the number of results on the page
  • N=38537 is the Digital Archive
  • N=37906 is books
  • N=38531 is Baldwin, the name of the special collection.

The RSS feed for the same results is: http://www.torontopubliclibrary.ca/rss.jsp?Erp=20&N=38537+37906+38531&Ns=p_dig_date_record&Nso=1&view=grid

As you can see, the RSS feed has exactly the same parameters as the page URL. When we go to the second page, we get this URL (the change: &No=20): http://www.torontopubliclibrary.ca/search.jsp?Erp=20&N=38537+37906+38531&No=20&Ns=p_dig_date_record&Nso=1&view=grid

This allows us to page through the RSS, as well. This is handy, because it turns out they only ever serve 150 results per query. Grrr. Ask me how I know.

We can now build URLs that pull 150 results at a time, using Erp=150 and increasing No by 150 per result set.
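
Just to make the paging concrete, here is a quick sketch of what the first three RSS requests would look like, stepping No up by 150 each time (built from the feed URL above):

# build the first three page URLs, stepping No up by 150 each time
$pageSize = 150
foreach ($pageStart in 0, 150, 300) {
    "http://www.torontopubliclibrary.ca/rss.jsp?Erp=$pageSize&N=38537+37906+38531&No=$pageStart&Ns=p_dig_date_record&Nso=1&view=grid"
}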

Let’s look at some code.

#create object to grab thingies off the internet
$WebClient = New-Object System.Net.WebClient
$WebClient.Encoding = [System.Text.Encoding]::UTF8

# RSS returns only max 150 books at a time.  Paging 150 at a time, total books in this request = 11342
$pageSize = 150
$pageStart = 0
$totalRecords = 999999

The web client is our handy-dandy in-memory browser. Note that we set the encoding to UTF-8 so we can get all the accents and special characters. We also set variables for the page size and the starting record number, and a placeholder value for the total number of books in our search (to be replaced once we have the first page of results).

# force insertion order to be maintained on hash table
$list = [Ordered]@{}

We create a hash table to hold the line of data we create for each book: @{}. The [Ordered] part of the statement keeps the original insertion order, which is normally lost in a hash table.

Speedy retrieval + Order = Awesome!
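
If you want to see the difference for yourself, here's a tiny sketch: a plain hash table makes no promises about the order its keys come back in, while the [Ordered] dictionary returns them exactly as they were inserted.

# plain hash table: enumeration order is not guaranteed
$plain = @{}
$plain['zebra'] = 1
$plain['apple'] = 2
$plain['mango'] = 3
$plain.Keys -join ', '      # may come back in any order

# ordered dictionary: keys come back in insertion order
$ordered = [Ordered]@{}
$ordered['zebra'] = 1
$ordered['apple'] = 2
$ordered['mango'] = 3
$ordered.Keys -join ', '    # always: zebra, apple, mango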

This is the plan for our main loop, which processes each batch of 150 books.

while ($pageStart -lt $totalRecords) {
    #Build a URL for 150 books and retrieve the RSS data
    #Compare total books to default value, replace default with actual
    #Loop through each of the 150 books, extracting call number, date and title
    #Add this filtered data to the list from previous books
    #Move to the next page of results and repeat
}

To build a URL for the books, substitute in the starting point.

    #Build a URL for 150 books and retrieve the RSS data
    $link = "http://www.torontopubliclibrary.ca/rss.jsp?view=grid&Erp=$pageSize&No=$pageStart&Ntt=Books&N=38537+37906+38531"
    [xml]$books = $WebClient.DownloadString($link)

Find the number of books matching your search. This is a value in the RSS feed. Store it.

    #Compare total books to default value, replace default with actual
    if ($totalRecords -eq 999999) {
        $totalRecords = $books.rss.channel.results.'total-results'
        "Total records: $totalRecords"
    }

To extract the data, we look at each book, grab the columns we need, and write them to the list.

    #Loop through each of the 150 books, extracting call number, date and title
    foreach ($item in $books.rss.channel.item) {
        # create line of data for each book, fast way.
        $filteredItem = [PSCustomObject]@{
            CallNumber = $item.SelectSingleNode('./record/attributes/attr[@name="p_dig_identifier"]')."#text"
            PubDate = $item.SelectSingleNode('./record/attributes/attr[@name="p_dig_pub_date"]')."#text"
            Title = $item.title
        }
        #Add this filtered data to the list from previous books
        $list[$item.record.recordId] = $filteredItem
    }

Using a PSCustomObject to create the summary line is much faster and cleaner than the approach we used in previous projects.

We also used SelectSingleNode, a quicker, cleaner way to get named attributes from the RSS attribute list. For more information on how this works in this particular RSS feed and why that hurts so much, see our previous post, Looking at Toronto Public Library book data.
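
For comparison, here is roughly what the same lookup looks like without XPath, filtering the attribute list in the pipeline. This is only a sketch, based on the attr/@name layout that the XPath above relies on; it works, but it is slower and noisier when you are doing it thousands of times:

# same call-number lookup, filtering the attribute list instead of using XPath
$callNumber = ($item.record.attributes.attr |
    Where-Object { $_.GetAttribute('name') -eq 'p_dig_identifier' }).'#text'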

To move to the next page, advance the starting book number to the next batch of 150.

    # Move to the next page of results and repeat
    $pageStart += $pageSize

This is repeated once for each 150-book set, until there are no more books remaining in the search.

Once all books have been summarized, we write the full list to a comma-separated values (CSV) file. Note that we also write it in UTF-8 encoding, to capture all the special characters.

# export list to CSV
$list | Select -expand Values | Export-Csv '.\booklistrss.csv' -NoTypeInformation -Encoding UTF8

Here is the formatted, sorted Excel spreadsheet: booklistrss-csv
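
If you would rather do the sort in PowerShell instead of Excel, here is a minimal sketch that reads the CSV back in and sorts it on the CallNumber column (the output file name is just an example):

# re-import the CSV and write out a copy sorted into call-number order
Import-Csv '.\booklistrss.csv' |
    Sort-Object CallNumber |
    Export-Csv '.\booklistrss-sorted.csv' -NoTypeInformation -Encoding UTF8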

Source code

#extract-callnumbers.ps1

#create object to grab thingies off the internet
$WebClient = New-Object System.Net.WebClient
$WebClient.Encoding = [System.Text.Encoding]::UTF8

# RSS returns only max 150 books at a time.  Paging 150 at a time, total books in this request = 11342
$pageSize = 150
$pageStart = 0
$totalRecords = 999999

# force insertion order to be maintained on hash table
$list = [Ordered]@{}

while ($pageStart -lt $totalRecords) {
    #Build a URL for 150 books and retrieve the RSS data
    $link = "http://www.torontopubliclibrary.ca/rss.jsp?view=grid&Erp=$pageSize&No=$pageStart&Ntt=Books&N=38537+37906+38531"
    [xml]$books = $WebClient.DownloadString($link)

    #Compare total books to default value, replace default with actual
    if ($totalRecords -eq 999999) {
        $totalRecords = $books.rss.channel.results.'total-results'
        "Total records: $totalRecords"
    }

    "Processing $($pageStart + 1) - $($pageStart + $pageSize), $($books.rss.channel.item.Count) titles"
    "link: $link"

    #Loop through each of the 150 books, extracting call number, date and title
    foreach ($item in $books.rss.channel.item) {
        # create line of data for each book, fast way.
        $filteredItem = [PSCustomObject]@{
            CallNumber = $item.SelectSingleNode('./record/attributes/attr[@name="p_dig_identifier"]')."#text"
            PubDate = $item.SelectSingleNode('./record/attributes/attr[@name="p_dig_pub_date"]')."#text"
            Title = $item.title
            #PubYear = $item.SelectSingleNode('./record/attributes/attr[@name="Publication Year"]')."#text"
        }
        #Add this filtered data to the list from previous books
        $list[$item.record.recordId] = $filteredItem
    }
    
    #show first 10 to screen
    #$list[0..9] | ft CallNumber, PubDate, PubYear, Title

    # Move to the next page of results and repeat
    $pageStart += $pageSize
}

""
"Exporting to CSV..."
# export list to CSV
$list | Select -expand Values | Export-Csv '.\booklistrss.csv' -NoTypeInformation -Encoding UTF8

Image: Skyline of Toronto in golden tones with the words Open Toronto in the foreground

Open Toronto by Richard Pietro, 2015

So, since I started playing around with their data, it looks like TPL has published some open data on their website. To find it, go right to the very bottom of the tpl.ca website and look for Open Data and Feeds in the left-hand column.

I heard about this from the excellent Toronto Open Data Bookclub meetup, which will henceforth be meeting in the reference library at Yonge & Bloor.

Toronto Public Library open datasets

A lot of the datasets they offer are canned feeds from their website, so the work I did earlier with the live feed will still give me the most current data. They do, however, provide a real-time feed of searches done on their website, which is interesting (but I will look at that later).

Book jacket: Gardner, Five minds for the future.

Gardner, Howard. Five minds for the future. Boston, Mass.: Harvard Business School Press, c2007. ISBN 9781591399124.


American developmental psychologist Howard Gardner, Hobbs Professor of Cognition and Education at the Harvard Graduate School of Education, wrote this book to describe the intellectual skills and the type of learning he thinks will be necessary to succeed in the workplace of the future.

He proposes five “minds”, or domains of learning, that educators, employers and individuals must foster:

  • The disciplined mind
  • The synthesising mind
  • The creative mind
  • The respectful mind
  • The ethical mind

By disciplined mind, he means someone who has in-depth knowledge of her or his domain or profession (at least 10 years), and a grasp of the fundamentals of science, math, history and other intellectual pursuits. He sees this starting at a young age, after the mastery of the basic literacies in school, and continuing to develop and deepen throughout life. He also discusses interdisciplinary learning, where mastery in more than one domain is acquired, and where the two or more domains inform the decisions and practice of the individual.

A little while ago, I found an archived recording of a talk by Jamie Larue speaking at the University of Illinois’ ILEAD USA get-together this past June.

The Fourth Turning: Leadership and Social Change looked at libraries and leadership through a lens developed by William Strauss and Neil Howe, known as the Fourth Turning. The broad strokes of the theory are that generations succeed each other in a predictable order, and that this pattern can be used to anticipate what may happen in the future. Mr. Larue takes it and applies it to the Retiring Boomer phenomenon in libraries.

I found it very interesting, I must confess. And I learned a few things, too! For example, apparently Boomers feel that having a meeting IS doing something about a problem. This explains a lot, to me, about something that has been causing me a bit of stress: why spend all this time talking instead of getting something done? (You got it: I’m probably a GenX.)

One of the most interesting parts of his discussion was his take on the two ways libraries justify their existence as civic institutions.

  • The first, “what’s in it for me“, speaks to the Boomer population. It’s what gets us to refer to the people who use libraries as consumers, and explains why we are so often told to “run it like a business”. It’s resulted in decreasing support despite increasing usage and effectiveness.
  • The second, “what’s in it for us“, speaks in terms of the public good and of people as citizens. Think Carnegie. Think citizen as a contributing member of society and libraries helping make it so: very turn-of-the-last-century. It’s what happens when we refer to libraries as the third place, or as community hubs, strengthening the fabric of society with inclusiveness and respect.

Image: Inside shot of people working in the reading room of the Carnegie library in Munhall, PA, at the turn of the last century. Lots of women in long dresses and big oak furniture and very sepia.

Adult reading room, Carnegie library of Homestead, Munhall, Pennsylvania


Image: HTML5 logo

Back in June, I started the Learn HTML5 from W3C MOOC offered through edX. This is a 6-week course, with an estimated 6-8 hours of effort per week required. Myself, I spent about 45 hours on the course, but did not complete half of week 5 or any of week 6.

They are offering this course again beginning October 5, 2015.

This MOOC is free to register for, but you can get a verified certificate for US$129. It's taught by Michel Buffa, and each week begins with a video lecture and has exercises that can be done using a JS Bin console.

This course assumes you are familiar with the older HTML standards; it is not for a raw beginner. It describes only the changes the new standard makes to the way web pages work, assumes basic HTML and CSS experience, and uses some JavaScript.

I particularly liked how he focused on the need for accessible design and on how HTML5 facilitates access for all. He starts by discussing the structural elements: nav, article, section, headings, table of contents and the main element. He also goes into microdata, which is important for linked data.

Next, the audio and video elements are discussed, including designing a console, adding closed captions, and using your own web camera for video streaming. This is not really something I plan on doing a lot of, but it was very interesting and useful to understand how to do it.

Next, the canvas element: drawing little shapes and lines, doing transforms on them and, finally, animating them. He went over how to draw text, images, lines and paths, arcs and circles, rounded rectangles and quadratic curves before showing us how to animate all that. It was mad fun. I even made a little animation, which I never suspected I would ever do.

Where I got sidetracked by life was the HTML5 Forms (week 5), and the final week, Basic APIs. I will probably do the October 5th class in order to complete these two sections, but the learning curve steepened rather quickly.

Because it’s just an overview of the changes brought by the new standard, there is a lot of ground to cover and some sections are pretty heavy slogging. But it’s very well done. I am not uniformly interested in all of the changes, so that is likely why I found some parts more difficult than others.

It’s good to do if you have done HTML in the past and want to know how to use the new standard, which is the most inclusive standard yet. HTML5 really does rock.


I just completed Stanford's CS101 (Computer Science 101) MOOC. It is at a very beginner level, intended for people who use computers but who don't know a whole lot about them.

It’s free to register, although you can pay for a verified ID credit. Regardless of whether you pay or not, you can obtain a certificate of accomplishment — suitable for framing — at the end of the course if you do the exercises.

It is “self-paced”, basically an archive of a course taught by Nick Parlante in the summer of 2012. For me, that means I actually have a shot at completing it! I found it a very useful and enjoyable course with an enthusiastic and engaged professor.

This class provides a very good introduction to how computers work. It jumps right into some simple and fun coding exercises involving image manipulation that allow you to easily see what changes you are making with your code. It goes into variables a little bit, and forays into for-loops, using small image-related exercises and puzzles to illustrate the concepts. The exercises are very relevant and also quite fun. The blue-screen image manipulation section in particular was a lot of fun.

After images, he tackles computer hardware and the byte-kilobyte-megabyte-gigabyte issue. Next, software, networking and TCP/IP. His explanations are non-technical and interesting, his approach is less “For Dummies” than “For people who haven’t learned this yet”. Refreshing, not cheesy.

Then he gets into tables and how spreadsheets work, again with clear explanations and simple yet relevant tasks. Finally, he goes into the differences between analog and digital, and explores various types of digital media. The last section, Internet Security, should be incorporated into every library user ed course involving computers. It was well done, not alarmist but also very clear on the risks and how to manage them responsibly.

I would most definitely take another online class with Nick Parlante. He is one of the best instructors I have seen: he explains the material clearly, organises it in a very understandable manner, and has an obvious respect for the people he is teaching. Well done!