Scraping and grabbing

The project-based element of this course will be creating a set of spatio-temporal narratives about publishing and bookselling in Beirut.

One of the issues we have faced in planning this course is how to grab large amounts of publishing data about Beirut and transform it into a historical database of publishers, their locations, the subjects they published, the languages they published in and their dates of activity.

We have found some born-digital lists and some lists in books, and we are working on geocoding them to provide students with a baseline.

First, I used web scraping techniques to pull down lists of publishers and bookstores currently in operation from the Yellow Pages.
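For readers new to scraping, here is a minimal sketch of the general idea using only Python's standard library. It is not the script I actually ran: the markup, the `listing-name` class and the placeholder names in `SAMPLE_HTML` are all invented for illustration, and a real directory page would need its own selectors.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect the text of every element carrying a given class attribute.

    Assumes the target elements contain plain text only (no nested tags).
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._capturing = False
        self._buffer = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._capturing = True
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # The first closing tag after a match ends the capture.
        if self._capturing:
            self.names.append("".join(self._buffer).strip())
            self._capturing = False


# Invented sample markup; the class name and entries are placeholders.
SAMPLE_HTML = """
<ul>
  <li><span class="listing-name">Placeholder Publisher One</span></li>
  <li><span class="listing-name">Placeholder Bookstore Two</span></li>
</ul>
"""


def extract_names(html, target_class="listing-name"):
    """Return the text of all elements whose class list includes target_class."""
    parser = ListingParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.names


print(extract_names(SAMPLE_HTML))
```

In practice the HTML would come from a downloaded page rather than an inline string, and the extracted names would be written out for geocoding.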

Second, I wanted to automate the process of grabbing a list from the Library of Congress and NYPL. The best I thought of for now was

This searches trilingually for the city Beirut and year in the KPUB info of the LOC catalog.

(The first bolded string is بيروت and the second bolded integer is the year of publication.) As one might expect, the numbers grew slowly throughout the 1950s, 1960s and 1970s. From 1975 to 1976, the number of holdings drops from 229 to 89. It shoots back up to almost 300 in 1977.
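To put the 1975 to 1976 drop in proportion, a short calculation helps. The sketch below uses only the two exact counts quoted above; the helper name `percent_change` is my own, and other years are left out because I have not given exact figures for them here.

```python
def percent_change(previous, current):
    """Return the relative change from previous to current, in percent."""
    if previous == 0:
        raise ValueError("previous count must be non-zero")
    return (current - previous) / previous * 100


# Holdings counts quoted in the text for the two years with exact figures.
holdings = {1975: 229, 1976: 89}

change = percent_change(holdings[1975], holdings[1976])
print(f"1975 -> 1976: {change:.1f}%")  # prints "1975 -> 1976: -61.1%"
```

In other words, the catalog holds roughly three fifths fewer Beirut imprints for 1976 than for 1975.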

Third, I am going to strip this data of everything but its date, publisher and language, verifying that it actually contains just publications from Beirut. Students who would like to study the thematic evolution of publishing over the century will be able to mine this data.
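The reduction step could look something like the sketch below. This is an assumption about how the cleaning might be done, not a finished pipeline: the field names (`place`, `year`, `publisher`, `language`), the sample records and the list of spellings in `BEIRUT_SPELLINGS` are all hypothetical, and a real catalog export would need its own mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    """The three fields kept for each record."""
    year: int
    publisher: str
    language: str


# Assumed spellings of the city; only the Arabic form appears in the post.
BEIRUT_SPELLINGS = ("بيروت", "beirut", "beyrouth")


def is_beirut(place):
    """Return True if the place string contains a known spelling of Beirut."""
    lowered = place.casefold()
    return any(spelling in lowered for spelling in BEIRUT_SPELLINGS)


def reduce_records(records):
    """Keep Beirut records only, reduced to year, publisher and language."""
    kept = []
    for rec in records:
        if not is_beirut(rec.get("place", "")):
            continue  # skip records published elsewhere
        kept.append(
            Holding(
                year=int(rec["year"]),
                publisher=rec["publisher"].strip(),
                language=rec["language"],
            )
        )
    return kept


# Invented placeholder records with hypothetical field names.
sample = [
    {"place": "بيروت", "year": "1975", "publisher": " Placeholder Press ", "language": "ara"},
    {"place": "Beyrouth", "year": "1976", "publisher": "Placeholder Editions", "language": "fre"},
    {"place": "Cairo", "year": "1975", "publisher": "Another Placeholder", "language": "ara"},
]

for holding in reduce_records(sample):
    print(holding)
```

Matching on the place string is a crude check; records that list several cities or use a spelling not in the tuple would need manual review.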

More thought needs to be given to what kind of restricted set of publications the LOC holdings represent.
