Table of Contents

Writing a little web scraper is not difficult. However, I had to spend some time to find out the right libraries and resources. It is also not trivial at first to start a Lisp project, so

this tutorial is a mix of hopefully useful stuff, for people interested in Common Lisp and with not much experience with the language and its ecosystem.

In part 1, "Make it work", we explore how to do asynchronous web scraping. We try things out at the repl and we create a first package. We learn basic data structures and their gotchas in the way, we also see where to find documentation, some useful libraries and beginner-friendly tips.

Part 2 is a short introduction to the CL Object System (CLOS), which is much more a pleasure to work with than dealing with alists and other data structures.

In part 3, "Tests and CI", we write our first unit test, with mocks, which gives us the opportunity to write a first macro, and we setup Continuous Integration on Gitlab CI.

Part 4 is about building and delivering self-contained executables, parsing command line arguments and such. While it was necessary for us at the time to gather this information, it is now available in the Common Lisp Cookbook and is probably more up to date. We advise the reader to go there.

Part 5, "Moving functions into their own package", shows and explains things that appear strange at first, like "shadowing import", symbols conflicts and such. We also write helper functions and make good use of closures.

Part 6 is about error handling or rather condition handling.

Part 7, "More features", shows only a cache system at the moment.

Finally, we have a look at CL21.

1 Make it work

Scraping The Pirate Bay is easy, they don't rely on javascript-generated pages. We just have to:

  • get the html page (dexador: (dex:get <url>))
  • parse the html into a data structure (plump: (plump:parse <html>))
  • search with CSS selectors (lquery: (lquery:$ <parsed-html> <selectors>))

We suppose you have a ready Common Lisp development environment. If not, see the Cookbook/getting-started (and if that's not enough, fill an issue !).

Let's go.

Install our dependencies right away:

(ql:quickload '("dexador" "plump" "lquery" "str"))

To begin with, we do a search on the website and we copy-paste the url. We get one like this:

(defparameter *search-url* "https://piratebay.to/search/?FilterStr=matrix&ID=&Limit=800&Letter=&Sorting=DSeeder"
    "the url to search matrix.")

It has our search term in it (matrix) along with url parameters.

It also sorts the results by number of seeders for us :) (&Sorting=DSeeder).

We will use CSS selectors to extract information from the web page, so we can use our browser's developer tools to inspect the structure of the page and guess our selectors: right click on a result's title, choose "inspect element". It highlights some html similar to this:

<td class="Title">
  <span class="ColorA">
    <a href="https://piratebay.to/torrent/1922147/Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438/" onclick="Javascript:OpenDetailPage('https://piratebay.to/torrent/1922147/Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438/'); return false;">Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438 </a>
  </span>
  <br>
    <span class="ColorB VaA">Upload Date: 20.02.15 </span>
    <span class="ColorB VaA">Size: 796,86 MB </span>
    <span class="ColorB"> </span>
</td>

The title is well delimited so we'll start selecting our elements by the CSS class Title, which gives:

(defparameter *selectors* ".Title")

If you are not accostumed to CSS selectors, this post is a nice helper: https://codingsec.net/2016/12/select-specific-text-css-using-selectors/.

1.1 Trying out at the REPL

Let's try in the REPL:

(defparameter html (dex:get *search-url*)) ;; returns the html
(defparameter parsed (plump:parse html))   ;; returns a list of plump elements
(defparameter results (lquery:$ parsed *selectors*)) ;; returns a list of stuff
(setf results (lquery:$ parsed *selectors*))

it returns:

#(#<PLUMP-DOM:ELEMENT th {100369D943}> #<PLUMP-DOM:ELEMENT td {10036A7CE3}>
  #<PLUMP-DOM:ELEMENT td {10036B6163}> #<PLUMP-DOM:ELEMENT td {10036FC903}>
  ...
  #<PLUMP-DOM:ELEMENT td {10047C0BC3}> #<PLUMP-DOM:ELEMENT th {10047D44E3}>
  #<PLUMP-DOM:ELEMENT td {10047DE853}> #<PLUMP-DOM:ELEMENT th {10047F2333}>
  #<PLUMP-DOM:ELEMENT td {10047FC673}> #<PLUMP-DOM:ELEMENT td {100480ACF3}>
  #<PLUMP-DOM:ELEMENT td {10048195D3}> #<PLUMP-DOM:ELEMENT td {1004827D73}>)

We can check its inner text representation:

(lquery:$ parsed *selectors* (text))
;; =>
#("
Title
"
  "Matrix FRENCH DVDRIP 1999 COOLUpload Date: 05.06.15 Size: 700,30 MB"
  "The Matrix Reloaded (2003) FullHD, Dual Audio: English + SpaUpload Date: 12.04.15 Size: 8,51 GB"
[...]
"
  "Arturia - Matrix 12-V v1.0.1.9 OS X [HEXWARS][dada]Upload Date: 28.12.14 Size: 100,86 MB"
  "Native Instruments - Maschine Expansion Golden Kingdom HYBRIUpload Date: 03.06.15 Size: 267,84 MB"
  "Arturia - Matrix-12 V v1 1 0 522 R2 AU AAX VST VST3 ST OS X Upload Date: 02.04.15 Size: 100,49 MB"
  "Arturia - Matrix-12 V v1.1.0.522 OS X [PitcHsHiFTeR][dada]Upload Date: 28.03.15 Size: 130,44 MB")

A little explanation for lquery: the last (text) part is an lquery thing to get the text representation of the node, instead of a lquery internal object. Likewise we'll be able to call (attr :href) or (html), which are self-explanatory. If you want CL symbols there, use (inline (...)).

I like to check the html content of the plump nodes. We use the serialize plump function (the second function from the doc ;) ):

(plump:serialize (first results))

but we get an error:

The value #( … all the content of results here … ) is not of type LIST when binding LIST …

Sorry for the inconvenience. Indeed, lquery returns a vector, not a list (we can see that with #() that denotes a vector), so we can not use first but have to use (aref <vector> 0) instead, or we have to coerce the result to a list.

(see the Cookbook's Data Structures page).

Personnally I find this frustrating, particularly being used in Python to access lots of data structures in the same manner. If you feel like fixing this, have a look at CL21, "Common Lisp for the 21st century" which, amongst other features, redefines some functions to make them generic (that work on lists, vectors, hashmaps,…). CL21 is a CL library, meaning we can of course use the others CL libraries with it, or that we can use cl21 alongside CL in only in some places but not in all our project (like in only one file, one "package").

It is written by a super productive and innovative CL hacker and has 600+ stars on github. Nevertheless, it wasn't touched in two years and, as it lacks docstrings and direction, we can be surprised by the new implementation of some functions (see its issues). Some people discourage the use of it. I, at least, am a happy user for the moment :)

Allright so, back to printing the html content of our first result:

(plump:serialize (aref results 0))
<th class="Title header ">
<a href="https://piratebay.to/search/0/800/0/matrix/0/ATitle/1/">Title</a>
</th>
">"

gosh, there is not much in it. I was too quick in defining my CSS selector. This first result should not be catched by our selector, because it is not a link to a torrent but only the header of the table (see "Title header" and the th ?).

But that's not all. I also want to scrape the number of seeders and leechers and the .Title selector doesn't include them (this I see with the browser's web inspector). If I select the including tr I'll get all the data, and I must ensure to select those tr from inside the body of the table, tbody. So I'll use a first selector that will return a list of elements of the table:

(setf *selectors* "tbody tr")

and then for each result I'll get the title and the number of seeders.

So I can scrape again my search results with the new selector:

(setf results (lquery:$ parsed *selectors* (text)))

this should print interesting stuff, like our torrents titles and even strings like S: 16L: 1 which are indeed our seeders and leechers.

I check the html content and it seems allright. It has the link to the torrent's page inside the href of the "a" element, as well as the seeders count.

Note that we can also inspect the results with the mouse: left/right clicks on the elements printed in the REPL get us into the Slime inspector.

1.2 Putting it together in a function

We came up with this function:

(defun torrents (words)
  ""
  (let* ((html (dex:get *search-url*))
         (parsed (plump:parse html))
         (res (lquery:$ parsed *selectors*)))
    res))

and if you prefer a threading macro / pipes, no problem, but we'll load another external library:

(ql:quickload "arrow-macros")
;; and we import its symbols:
(use-package "arrow-macros")
(-<>> *search-url*
  (dex:get)
  (plump:parse)
  (lquery:$ <> *selectors*))

arrow-macros defines a few threading macros. The classic one would be ->, which inserts the result of the preceding form as first argument, ->> that puts it last, which is what we wanted for the two forms but not for the last one, with lquery, which needs the parsed html as first argument. So we use -<>>: the arrow will populate the last argument, except when it encounters a <> placeholder. -<>> has a little name, "Diamond Spear".

1.3 Creating a new project

Before we write more functions we should create a new project. For this I use a skeleton generator which will create the right defsystem, defpackage and so for us.

I use cl-project, which also generates a tests skeleton (in the contrary of quick-project):

(ql:quickload "cl-project")
(cl-project:make-project #P"~/path/to/cl-torrents/")

Note that it may be easier for you sometimes if you create your new Common Lisp projects into ~/.quicklisp/local-projects (known by Quicklisp) or ~/.local/share/common-lisp/ (known by ASDF). Doing so, you will be able to ql:quickload your project right away.

1.4 Adding our dependencies

Our new cl-torrents.asd looks like this:

#|
  This file is a part of cl-torrents project.
|#

(in-package :cl-user)
(defpackage cl-torrents-asd
  (:use :cl :asdf))
(in-package :cl-torrents-asd)

(defsystem cl-torrents
  :version "0.1"
  :author ""
  :license ""
  :depends-on ()  ;; <== list of dependencies
  :components ((:module "src"
                :components
                ((:file "cl-torrents"))))
  :description ""
  :long-description
  …)

For pythonistas, it is very similar to a setup.py.

It has the depends-on paramater which accepts a list of package names. We have to register here dexador and the others:

:depends-on (:str
             :dexador
             :plump
             :lquery)

and arrow-macros if you wish.

If you need to set package versions, use something like:

:depends-on ((:version "cl-ppcre" "2.0.11"))

but there is usually no need. To learn more about dependencies management in Common Lisp, what to do when you need project-local dependencies (like a virtual env), see the Cookbook.

1.5 Loading the project

Open the .asdf file and compile and load it. In Slime, it's with C-c C-k (slime-compile-and-load-file, see also the Emacs menu).

Now we can load the project at the REPL and install its dependencies:

(ql:quickload "cl-torrents" ;; or asdf:make
; compiling file "/home/vince/projets/cl-torrents/src/cl-torrents.lisp" (written 28 AUG 2017 10:21:07 PM):
; compiling (IN-PACKAGE :CL-USER)
; compiling (DEFPACKAGE CL-TORRENTS ...)
; compiling (IN-PACKAGE :CL-TORRENTS)
; compiling (DEFPARAMETER *SEARCH-URL* ...)
; compiling (DEFPARAMETER *SELECTORS* ...)
; compiling (DEFUN TORRENTS ...)

; /home/vince/.cache/common-lisp/sbcl-1.3.19-linux-x64/home/vince/projets/cl-torrents/src/cl-torrents-tmp5GEXGEG5.fasl written
; compilation finished in 0:00:00.029
; compilation unit finished
T

And now we can use our function at the REPL.

We go into our package so that we can call our functions directly:

(in-package :cl-torrents)

We could import the functions from our package and call them directly, but we need to export them and we'll see that shortly.

We could call them with the project prefix, but we need a doublon colon because our functions are not exported yet (so they're kinda private, but not strictly, like with a method starting with the underscore _ in Python).

(cl-torrents::torrents "matrix")

1.6 Searching with our keywords

Until now we only tried things out with a given search url, set in stone. It's time to insert our own search terms into this search url.

We'll put a {KEYWORDS} placeholder into the url:

(defparameter *search-url* "https://piratebay.to/search/?FilterStr={KEYWORDS}&ID=&Limit=800&Letter=&Sorting=DSeeder"
    "base search url. {KEYWORDS} to be replaced by + separated words.")

which we will replace with a +-separated list of keywords.

With a little look at the "strings" cookbook page, we'll go with the little str library (our lib actually):

(ql:quickload "str")

Let's try:

(defparameter words "matrix trilogy")
;; => WORDS
(str:words words)
;; => ("matrix" "trilogy")
(str:join "+" *) ;; the * is a REPL shortcut to insert the previous result. + inserts the previous input.
;; => "matrix+trilogy"

and voilà. We put this at the beginning of our search function and we get:

(defparameter *search-url* "https://piratebay.to/search/?FilterStr={KEYWORDS}&ID=&Limit=800&Letter=&Sorting=DSeeder" "base search url. KEYWORDS to be replaced.")

(defun torrents (words)
  "Search torrents."
  (let* ((terms (str:words words))
         (query (str:join "+" terms))
         (*search-url* (str:replace-all "{KEYWORDS}" query *search-url*))
         (req (dex:get *search-url*))
         (html (plump:parse req))
         (res (lquery:$ html *selectors*))
         (res-list (coerce res 'list))))

    res-list))

In the end we prefer to return a list, rather than a vector.

Let's try:

(torrents "matrix trilogy")
("Matrix FRENCH DVDRIP 1999 COOLUpload Date: 05.06.15 Size: 700,30 MB"
  "The Matrix Reloaded (2003) FullHD, Dual Audio: English + SpaUpload Date: 12.04.15 Size: 8,51 GB"
  "The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 DuUpload Date: 12.02.15 Size: 12,86 GB"
  "The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3Upload Date: 15.09.15 Size: 23,29 GB"
  "The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ HindUpload Date: 14.01.15 Size: 10,23 GB"
  "The Matrix Revolutions (2003) BRRip [Dual Audio] [Hindi+Eng]Upload Date: 24.02.15 Size: 496,36 MB"
  "Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438Upload Date: 20.02.15 Size: 796,86 MB"
  "The Matrix Reloaded (2003) BRRip [Dual Audio] [Hindi+Eng] 50Upload Date: 22.02.15 Size: 496,39 MB"
  [and more results]

Cool !

We can commit this, have a break and enjoy how things are going. It was very easy, except one or two gotchas :)

Of course, we need to get more stuff out of this, like the torrent's magnet link.

1.7 Formatting output

Our torrents function prints out intelligable output, but we don't control it yet. We want to iterate over the search results and print exactly what we want.

So first we need to extract the title, with the CSS selector we found at the beginning.

(defun result-title (node)
  "Return the title of a search result."
  (aref
   (lquery:$ node ".Title a" (text))
   0))

When we iterate over the result list:

(defun display-results (&optional (results *last-search*) (stream t))
  "Results: list of plump nodes. We want to print a numbered list with the needed information (torrent title, the number of seeders,..."
  (mapcar (lambda (it)
            ;; do not rely on *last-search*.
            (format stream "~a~%" (result-title it)))
          results)
  t)

it prints something like:

Matrix FRENCH DVDRIP 1999 COOL
The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa
The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du
The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3
The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ Hind
Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438
...

What we have in mind is to print the index of the result next to it, and for convenience to print the first result last, so that it stays to the bottom and it's easier to see from the prompt.

We have a quick look at the Cookbook for string formatting (the simplest directive is ~a, for aesthetics, and justifying text on the left is with ~@a). ~% is the newline.

(defun display-results (&optional (results *last-search*) (stream t))
  "Results: list of plump nodes. We want to print a numbered list with the needed information (torrent title, the number of seeders,..."
  (mapcar (lambda (it)
            (format stream "~3@a: ~65a ~%"
                    (position it *last-search*) ;; <-- find the position of the result in the list
                    (result-title it)))  ;; <-- we reverse the list
          (reverse results))

          t)

Here we use another global variable that we introduced eventually. In the end of our torrents function, we add this:

(setf *last-search* res-list)

so that our search results are saved in this variable which we define:

(defvar *last-search* nil
    "Remembering the last search.")

and we can easily access this result list elsewhere.

So, we get this formatting:

198: Arturia - Matrix-12 V v1.1.0.522 OS X [PitcHsHiFTeR][dada]
197: Arturia - Matrix-12 V v1 1 0 522 R2 AU AAX VST VST3 ST OS X
196: Native Instruments - Maschine Expansion Golden Kingdom HYBRI
195: Arturia - Matrix 12-V v1.0.1.9 OS X [HEXWARS][dada]
194: PPPD-374 Ikuchichi Beauty Salon That Just Busty Beauty Is In
193: THE MATRIX TRILOGY: Complete Collection - DVDRip
...
 10: Matrix Reloaded (2003)Blu-Ray 720p Dublado PT-BR - mo93438
  9: Matrix Revolutions (2003)Blu-Ray 720p Dublado PT-BR - mo9343
  8: Die Matrix Trilogie - 1 bis 3 - KOMPLETT
  7: The Matrix Reloaded (2003) BRRip [Dual Audio] [Hindi+Eng] 50
  6: The Matrix Revolutions (2003) BRRip [Dual Audio] [Hindi+Eng]
  5: Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438
  4: The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ Hind
  3: The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3
  2: The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du
  1: The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa
  0: Matrix FRENCH DVDRIP 1999 COOL
T

The indexes are aligned on 3 digits on the right with ~3@a and titles are truncated at 65 characters, nice :) It will be easy to add more information on the right side (seeders, leechers).

_Note__: We can also truncate the titles and add an ellipsis with str:prune:

2: The Matrix Trilogy  (...)
1: The Matrix Reloaded (...)
0: Matrix FRENCH DVDRIP(...)

1.8 Getting more torrent information

With plump:serialize we could check what html is inside our plump node:

(plump:serialize (second res))
<td class="Title">
<span class="ColorA">
<a href="https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/" onclick="Javascript:OpenDetailPage('https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/'); return false;">Matrix FRENCH DVDRIP 1999 COOL
</a>
</span>
<br/>
<span class="ColorB VaA">Upload Date: 05.06.15
</span>
<span class="ColorB VaA">Size: 700,30 MB
</span>
<span class="ColorB"/>
</td>

We want to get the torrent's page, the url in the firts href. From this page we'll be able to access the magnet link.

We know how to access the a:

(defparameter *elt* (first res))
(lquery:$ *elt* "a" (text))
;; => #("Matrix FRENCH DVDRIP 1999 COOL")

it returns a plump node.

We use (lquery:$ ... (attr :href)) as seen above to extract attributes:

(lquery:$ *elt* "a" (attr :href))
;; => #("https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/")

Ok. But watch out again, the result is a vector (of one element).

We put this in a function:

(defun detail-page-url (node)
  "Extract the link of the details page. `node': plump node, containing the url."
  (let ((href-vector (lquery:$ node "a" (attr :href))))
     (aref href-vector 0)))

which we can test (either write it at the REPL either write it in the project and compile, C-c C-c in Slime):

(mapcar #'detail-page-url res)  ;; #' is shorthand for function
;; =>
("https://piratebay.to/search/0/800/0/matrix/0/ATitle/1/"
 "https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/"
 "https://piratebay.to/torrent/2156107/The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa/"
 "https://piratebay.to/torrent/1885366/The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du/"
[…]

1.8.1 To the magnet link

We have the torrent's details page, we know how to request it, now we want to get the magnet link.

We experiment, and get a list of the links inside this page.

This is what we first came up with: (can be simpler)

(mapcar (lambda (it)
          (lquery:$ it (attr  :href)))
        (coerce (lquery:$ * "a") 'list))
;; =>
(NIL NIL NIL NIL NIL NIL NIL "https://piratebay.to/" "https://piratebay.to/"
 […]
 "http://imdb.com/title/tt1778413/" "https://piratebay.to/profile/Anonym"
 "https://piratebay.to/Downloader.php?ID=2289391&Filename=Matrix+FRENCH+DVDRIP+1999+COOL"
 "magnet:?xt=urn:btih:40eca43690cf1b99b0a4d485ebf4855d20b0bac5" "http://"
 […]
 "https://twitter.com/piratebayto" "https://www.facebook.com/thepiratebayto"
 "http://bitcoin.org" "bitcoin:1FX2wz8NiWPdtCGSrzn7j3NAg4VKqGovto" "/")

in the result, there's our magnet link.

At first I was frustrated to have to coerce something to a list but it is not mandatory.

Because mapcar expects a list and lquery returns a vector, we had to transform lquery's result to a list with coerce. But we can simply use map, that works on lists and vectors, and ask it to return a list:

(map 'list (lambda (it)
             (lquery:$ it (attr  :href)))
           (lquery:$ * "a"))

I find the name mapcar unusual too, it was frustrating at the beginning but it's a just a name after all.

We could also use cl21's map, which works on lists and vectors so no more questions, it will work.

Still with cl21, we can write shorter lambdas, with the shorthand lm or with the ^ reader macro and accessing arguments with %1%n or simply % for the first one:

(map ^(lquery:$ % (attr :href)) ...)

We filter the list above to extract the magnet link:

(remove-if-not (lambda (it)
                 (str:starts-with? "magnet" it))
               *)

Here, I used again a short verb from an external library for string manipulation. The CL way would be something like:

(string= "magnet-foo" "magnet" :start1 0 :end1 (length "magnet"))
T

and yet we must handle nils, differences of length,… so boring.

We end up with the following functions:

(defun magnet-link-from (node)
  "Extract the magnet link from a `torrent' result."
  (let* ((url (detail-page-url node))
         (html (request-details url))
         (parsed (plump:parse html)))
    (find-magnet-link parsed)))
  • we extract the magnet link from an html (the page of a torrent's page) parsed with plump like this:
(defun find-magnet-link (parsed)
  "Extract the magnet link. `parsed': plump:parse result."
  (let* ((hrefs (coerce (lquery:$ parsed "a" (attr :href)) 'list))
         (magnet (remove-if-not (lambda (it)
                                  (str:starts-with? "magnet" it))
                                hrefs)))
    (first magnet)))
  • this one gets a plump node (from the search results), extracts the url of the torrent's page and calls our function above to extract the magnet link:
(defun magnet-link-from (node)
  "Extract the magnet link from a `torrent' result."
  (let* ((url (detail-page-url node))
         (html (dex:get url))
         (parsed (plump:parse html)))
    (find-magnet-link parsed)))
  • finally we need an easy way to call the function above and give it a reference to a search result.
(defun magnet (index)
  "Search the magnet from last search's `index''s result."
  (magnet-link-from (elt *last-search* index)))

And we simply use it like so: given an output like

...
  5: Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438
  4: The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ Hind
  3: The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3
  2: The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du
  1: The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa
  0: Matrix FRENCH DVDRIP 1999 COOL
T

We request the magnet link with:

(magnet 0)
"magnet:?xt=urn:btih:40eca43690cf1b99b0a4d485ebf4855d20b0bac5"

That works, but beware if we ask for an index that does not exist. Let's take precautions:

(defun magnet (index)
  "Search the magnet from last search's `index''s result."
  (if *last-search*
      (if (< index (length *last-search*))
          (magnet-link-from (elt *last-search* index))
          (format t "The search returned ~a results, we can not access the magnet link n°~a.~&" (length *last-search*) index))
      (format t "The search returned no results, we can not return this magnet link.~&")))

1.9 Exporting functions

We need to export symbols in order to use them from the outside of their source file, in order to use them directly (use-package) or with (my-package:my-function). If we don't export them, we can still access them with a double colon: (my-package::my-function).

Our package definition contains this:

(defpackage cl-torrents
  (:use :cl))

We add it an export clause:

(defpackage cl-torrents
  (:use :cl)
  (:export :torrents
           :magnet))

We could also mark the functions to export with a decorator à-la Python, like this:

@export
(defun torrents (…)
    …)

which is quite elegant and can be handy. This is doable with the cl-annot library. It also requires a small Slime configuration.

1.10 Conclusion

This leads us to the end of part one.

We wrote a function that makes a query on TPB and returns a list of Plump objects. Then the function that prints the results calls other ones that extract the title, the seeders and leechers from the Plump objects: there is some coupling and that should be avoided. In reality, we quickly changed the torrents function so that it returns a list of alists with the required information extracted:

(defun torrents (words &key (stream t))
  "Search torrents."
  (format stream "searching on the Pirate Bay…")
  (let* ((query (str:join "+" words))
         (*search-url* (str:replace-all "{KEYWORDS}" query *search-url*))
         (req (request *search-url*))
         (html (plump:parse req))
         (res (lquery:$ html *selectors*))
         (toret (map 'list (lambda (node)
                             `((:title . ,(result-title node))
                               (:href . ,(result-href node))
                               (:leechers . ,(result-leechers node))
                               (:seeders . ,(result-peers node))
                               (:source . :tpb)))
                     res)))
    (format stream " found ~a results.~&" (length res))
    toret))

(defun result-title (node)
  "Return the title of a search result."
  (aref
   (lquery:$ node ".Title a" (text))
   0))

(defun result-href (node)
  (let* ((href-vector (lquery:$ node "a" (attr :href))))
    (aref href-vector 0)))

That way, the print function does not rely on TPB-specific code and we can give it other alists to print, from other torrent sites.

We now want or need more:

  • getting more content (seeders, leechers): this is done in the app.
  • downloading the torrent file ?
  • error handling (network errors or timeout, unexpected errors, user interrupts) (part 5),
  • scraping other sites, asynchronously (the asynchronous part will be straightforward, there's a library for that and it's one function change. See the Cookbook tutorial.)
  • some cache (part 6),
  • unit tests, "live" tests and continuous integration: see the next section.
  • building a standalone executable, parsing command line arguments: see part 3.

We'll carry on by writing tests, then we'll build a self-contained executable.

2 Designing the scrapers with CLOS

At this point, we know how to scrape The Pirate Bay and retrieve information like the torrents' title, href to their web page, their number of seeders, etc.

We only used regular data structures like alists. The library worked like this until version 0.12, but we always felt it clunky. In particular, dealing with lists of alists, accessing values with assoc-value from Alexandria or loosing our tests when we changed the data structures didn't feel right.

We also only considered one torrent source, but in reality we wrote scrapers for more (five at the time of writing). We must find solutions for code reuse and better representing each torrent source.

We'll now use the Common Lisp Object System (CLOS), which is a pleasure to work with. Does it seem verbose, too weird ? Well, just begin with a little class and experiment by yourself, you'll see.

You'll find more learning resources in the Cookbook.

Let's dive in:

(defclass torrent ()
  ((title
    :initarg :title
    :accessor title)
   (href
    :initarg :href
    :accessor href)
   (seeders
    :initarg :seeders
    :accessor seeders)
   (leechers
    :initarg :leechers
    :accessor leechers)
   (source
    :initarg :source
    :accessor source)))

(defmethod print-object ((it torrent) stream)
  (print-unreadable-object (it stream)
    (format stream "~a, ~a"
            (str:prune 30 (title it))
            (source it))))

(defun make-torrent (&key title href seeders leechers source)
  (assert title)
  (assert source)
  (assert href)
  (make-instance 'torrent
                 :title title
                 :href href
                 :seeders seeders
                 :leechers leechers
                 :source source))

Notes on the snippet above: the structure of defclass is

(defclass <name> (list of super-classes)
  (list of slots of the form
    (name attributes))
 (optionally class options))

An object is created with make-instance 'class, but is good practice to define a constructor like make-torrent.

The :initarg is the keyword argument to the constructor: (make-instance 'torrent :title "foo"), the accessor is both a getter and a setter:

(defvar *foo* (make-instance 'torrent :title "foo" :href "http://" :source :test))
(title *foo*) ;; => "foo"
(setf (title *foo*) "bar")

print-object is for pretty printing. The line

(defmethod print-object ((it torrent) stream)

specializes the print-object generic method for torrent classes. It takes a required stream parameter.

See the actual code here and how it is used in scrapers here.

Working with the CLOS is delightful. If we created objects at the REPL and then change the class definition, the objects are updated: deleted slots are deleted, slots added are instantiated (with rules we can control), accessor names are updated, etc.

Of course, working with an object system allows code-reuse, CLOS allows to specialize methods for different classes, it allows polymorphism, etc, which we don't show here.

Just dive into the CLOS and have fun !

3 Tests and CI

We wouldn't be called a developper if we didn't write any test.

Our favorite test framework (which we found on the Awesome CL list) is Prove.

The file t/cl-torrents.lisp, generated by cl-project, looks like this:

(in-package :cl-user)
(defpackage cl-torrents-test
  (:use :cl
        :cl-torrents  ;; => import our exported functions in cl-torrents.lisp
        :prove))      ;; => import all Prove verbs (like python's "from prove import *")
(in-package :cl-torrents-test)

;; NOTE: To run this test file, execute `(asdf:test-system :cl-torrents)' in your Lisp.

(plan nil)  ;; optional Prove setting.

;; blah blah blah.

(finalize)

We add our first and simplest test:

(ok (torrents "matrix"))

It only checks that this command doesn't fail. We compile it with C-c C-c and we see it run in the REPL.

This test does a network call: it is not an unit test. It's an "end-to-end" test instead, and that's ok we need one too :) We'll write unit tests now, and also hide the large output of the search results.

3.1 Unit tests

Since we do webscraping, the result from the network calls are likely to be different each time. That's good for "integration" or "end-to-end" tests but not for unit tests. We must find a way to fake the result of dex:get and return the same thing, always.

A solution is to save a piece of html in the testing directory and make sure that a call to dex:get returns it. In other words we're looking to mock functions calls. There's a library to do this and more, Mockingbird:

This package provides some useful stubbing and mocking macros for unit testing. Used when specified functions in a test should not be computed but should instead return a provided constant value.

It also makes possible to check if a given function was called, if so how many times, with what arguments, etc, which is very nice for tests.

Ok, let's go. We record the html of the search results:

mkdir t/assets/
wget  -O t/assets/search-matrix.html https://piratebay.to/search/\?FilterStr\=matrix\&ID\=1\&ID\=\&Limit\=800\&Letter\=\&Sorting\=DSeeder

We need to read this file into a string. A quick look to the Cookbook: (unfortunately this is not a one-liner :( )

(defun file-to-string (path)
  "Return the given file as a string."
    (with-open-file (stream path
                            :external-format :utf-8)
      (let ((data (make-string (file-length stream))))
        (read-sequence data stream)
        data)))

and we use it:

;; Load the search result html from a file.
(defparameter htmlpage (file-to-string #p"t/assets/search-matrix.html"))

From mockingbird, we need with-dynamic-stubs. We'll mock a call to dex:get:

(with-dynamic-stubs ((dex:get htmlpage))
  (ok (torrents "matrix") "torrent search ok"))

This test (run with C-c C-c) should not make a network call and should always return the matrix results. Indeed, if we write (torrents "dysney") instead it returns (and prints) the same.

So from here, we can write more unit tests. When we want to test the magnet function, we realize that we need to mock another dex:get call, the one that requests the html page of a particular result. We extract the network call from the function, what we should have done from the beginning as best practice actually (we'll also need to expand this with error checking and more):

(defun request-details (url)
  "Get the html page of the given url. Mocked in unit tests."
  (dex:get url))

Now we mock it. Extending the test above:

(with-dynamic-stubs ((dex:get htmlpage)
                     (cl-torrents::request-details resultpage))

  (ok (torrents "matrix" out) "torrent search ok")

  (ok (str:starts-with? "magnet" (magnet 0))
      "magnet <i> returns the the magnet link from search result."))

Our tests still write a lot of stuff on standard output, let's fix that.

3.2 Capturing output

We knew giving an optional stream parameter to our torrents function would be useful sometime:

(defun torrents (words &optional (stream t)) ...)

The t means "print to standard output". The trick is to give it another stream, notably one that goes to a string:

(ok (with-output-to-string (out)
      (torrents "matrix" out)) "torrent search ok")

and that's it, our tests are silent now.

We can write more of them.

3.3 Isolating tests (with a macro)

I'm not bothered (yet?) by the way we wrote tests above, all of them inside a with-dynamic-stubs macro. It's just that they are not isolated, at each C-c C-c it compiled and ran the whole form, running all our tests.

If we want, we can isolate them, each one under its own and same with-dynamic-stubs. But as soon as there's repetition… it's time to refactor with a macro. There's not much to it but we're glad for a little practice.

Each test will be of the form:

(with-dynamic-stubs (<stubs>)
    <tests>)

The only argument to our macro is a form containing the tests:

(defmacro with-mocked-search-results (body)

We get this simple macro:

(defmacro with-mocked-search-results (body)
    `(with-dynamic-stubs ((dex:get htmlpage)
                          (cl-torrents::request-details resultpage))
         ,body))

The backquote kind of warns that there will be variables inside this form, and the coma kind of says to not evaluate the argument but to put it as is.

So when we use it like this:

(with-mocked-search-results
    (ok (with-output-to-string (out)
          (torrents "foo" out))
        "search ok"))

we can see how it gets expanded like this:

  (macroexpand-1
   '(with-mocked-search-results ;; <-- note the quote
     (ok (with-output-to-string (out)
           (torrents "foo" out))
      "search ok"))
   )
;; (WITH-DYNAMIC-STUBS ((DEXADOR:GET HTMLPAGE)
;;                      (CL-TORRENTS::REQUEST-DETAILS RESULTPAGE))
;;    (OK (WITH-OUTPUT-TO-STRING (OUT) (TORRENTS "foo" OUT)) "search ok"))
;; T

Easy :)

3.4 make test

We must be able to run tests from the terminal for at least two reasons:

  • given the moving state of our Lisp image during development it is possible we get to a state that does not reflect the app at startup (for example, given that a defparameter symbol doesn't get re-evaluated with C-c C-c, are you not tempted to setf its value for testing ?)
  • Continuous Integration
  • I find it handy (that's not a solid reason though).

    We have a look at Prove's page and we see that we can run a test file with

(prove:run #P"myapp/tests/my-test.lisp")

Then we need to think about the steps needed when we arrive into a new SBCL repl.

  • we need to find our system definition.
(load #P"cl-torrents.asd")
  • we need to make all its symbols and dependencies available in the image.
(ql:quickload "cl-torrents")  ;; given we have Quicklisp installed and started with SBCL.
  • run the test suite.
  • quit the REPL.

    That gives this make target:

test:
        sbcl --load cl-torrents.asd \
             --eval '(ql:quickload :cl-torrents)' \
             --eval '(prove:run #P"t/cl-torrents.lisp")' \
             --eval '(quit)'

We can also have a look at the run-prove Roswell utility.

Continuous Integration is well explained here:

We'll finish with a note on Prove. It is easy to use and the doc is ok, but we were surprised that the tests output are not very informative (we don't have an example right now though). So we might as well either try Fukamashi's Rove, "intended to be Prove's next major release", either try Shinmera's Parachute, which we discovered only recently. Parachute also has includes basic fixtures support, test suite dependencies, conditionals, time limits (also in Prove ?), silent output, and interactive reports.

On a positive note, kuddos to Mockingbird it was very helpful.

3.5 Continuous Integration

Now that we have tests running on the terminal, we'll want to run them automatically. We probably want to run them on every commit but we could choose only on every tag, periodically (every sunday night), or just manually with a command in a commit message.

We'll use Gitlab's CI, which is free for public or private repos.

Gitab CI works with Docker images so we'll use the existing daewok/lisp-devel one. It includes SBCL, ECL, CCL and ABCL, and has Quicklisp installed in the home (/home/lisp/), so we can quickload packages right away. It also has a more bare bones option.

We used this example .gitlab-ci.yml successfully:

image: daewok/lisp-devel

before_script:
  - apt-get update -qy
  - apt-get install -y git-core
  - git clone https://github.com/Chream/mockingbird ~/quicklisp/local-projects/

test:
  script:
    - make test
  • "image": we set which Docker image to use.
  • "test": just the name of our pipeline.
  • "script": this is the recognized section to run any script we want. If it exits with success, our pipeline passes. So here we run our own "make test" command (Gitlab has cloned our project and put us at its root for us).
  • "beforescript": our tests rely on a library that isn't on Quicklisp yet, so we clone it into the usual ~/quicklisp/local-projects/. We have to update apt's cache and install git beforehand.

    And this is it ! You know have a working continuous integration stack.

    But while we're at it, let's see a couple more Docker commands and a way to test our CI commands locally.

3.6 A couple Docker commands

We have a Docker image ready to use, so let's see how to use it ourselves. Run:

sudo service docker start
docker run --rm -it daewok/lisp-devel:latest bash

This will download what's needed for the first time (around 400Mo) and then drop you into a bash prompt. Now you are in a fresh Debian-based system with Lisp stuff installed. You can enter an sbcl REPL and install Quicklisp libraries. But you don't have access to your code outside of Docker. Mount it:

docker run --rm -it -v /path/to/local/code:/usr/local/share/common-lisp/source daewok/lisp-devel:latest bash

This will put your local /path/to/local/code/ into Docker where indicated.

Now we can test things around. I don't know how to test our .gitlab-ci.yml straight away though.

Also see the Lisp image's README, they have more stuff on how to develop from our Emacs on the Lisp environment inside the image.

4 Building and delivering self-contained executables

4.1 SBCL.

So far we used the torrents command in our Slime REPL. That's fine and well, but what if we want to distribute our software to users and simply run it from the shell ?

Building (self-contained) executables is implementation-specific (Rowsell fixes that). With SBCL, as says its documentation, it is a matter of:

(sb-ext:save-lisp-and-die #P"torrents" :toplevel #'main :executable t)

where #p"torrents" is the pathname of the executable, :toplevel is the function to be called ("entry point" in Python), :executable t tells to build an executable instead of an image. We could build an image to save the state of our current Lisp image, to come back working with it later. Specially useful if we made a lot of work that is computing intensive.

sb-ext is an SBCL extension to run external processes. See other SBCL extensions (many of them are made implementation-portable in other libraries).

We have to define a main function that accepts keywords as arguments (we won't parse command line options just yet).

The Cookbook has us covered once again and, double coincidence, we rely on SBCL's sb-ext to access the list of command line arguments with sb-ext:*posix-argv* (this is not portable yet but a library, or Roswell, fixes that, follow the recipe). It returns a list of arguments, the first one being the name of the executable, as usual.

That gives:

(defun main ()
  "Get command line arguments with SBCL."
  (torrents (subseq sb-ext:*posix-argv* 1)))

Here we give a list to torrents but I found useful to accept both lists and a string (to allow the user/developper to call directly (torrents "foo bar") in Slime):

(defun torrents (words &optional (stream t))
  "Search torrents."
  (let* ((terms (if (listp words)
                    words
                    ;; The main gives words as a list,
                    ;; the user at the Slime REPL one string.
                    (str:words words)))

Cool we have basic command line arguments ! We'll parse them for real shortly.

Another thing to know about save-lisp-and-die is that we must run this function in another REPL than Slime.

So, let's launch an SBCL repl at our project root:

rlwrap sbcl

(rlwrap gives repl history. The default SBCL repl is the most basic one. We should have a look to the new cl-repl too, an ipython-like repl.).

If we run save-lisp-and-die directly, we'll get errors that "cl-torrents package does not exist", or that "The function COMMON-LISP-USER::MAIN is undefined". Indeed, as is SBCL has no way to know where is our system. We need to load our asd file, quickload our system and now that the ongoing image knows all the dependencies an symbols, we can create an executable out of it.

(load "cl-torrents.asd")
(ql:quickload :cl-torrents)
(use-package :cl-torrents)
(sb-ext:save-lisp-and-die #p"torrents" :toplevel #'main :executable t)

Obviously we want this in a Makefile.

4.2 make build

sbcl has the --eval and --load options.

build:
        sbcl --load cl-torrents.asd \
             --eval '(ql:quickload :cl-torrents)' \
             --eval '(use-package :cl-torrents)' \
             --eval "(sb-ext:save-lisp-and-die #p\"torrents\" :toplevel #'main :executable t)"

Now a make build will take the time to quickload the dependencies and build our executable. We can try it on the shell and give it to anyone, they won't have any dependency to install, not even a Lisp implementation.

Let's realize what we achieved for a moment. Personally I was unable to do something like this with my other languages before (Python, JS).

And it works for web apps too ! We just need one more line in the main function, like this:

;; https://stackoverflow.com/questions/30422451/sbcl-deploying-hunchentoot-application-as-executable
(sb-thread:join-thread (find-if (lambda (th)
                                (search "hunchentoot" (sb-thread:thread-name th)))
                              (sb-thread:list-all-threads))))

And we can embed a webapp in Electron for the desktop ! (with Ceramic).

Are there downsides ?

4.3 with ASDF or Roswell

We might enjoy a more shell-friendly way to build our executable, and most of all a portable one, so we would have the same command to work with various implementations.

Since version 3.1, ASDF, the de-facto build system, shipped with every implementation, allows that with the make command. It needs parameters in our .asd declaration:

:build-operation "program-op" ;; leave as is
:build-pathname "<binary-name>"
:entry-point "<my-package:main-function>"

and then a call to asdf:make :my-package.

So, in a Makefile:

LISP ?= sbcl

build:
    $(LISP) --non-interactive \
            --load cl-torrents.asd \
            --eval '(ql:quickload :cl-torrents)' \
            --eval '(asdf:make :cl-torrents)'

The --non-interactive option asks to not enter the debugger if an error were to happen, and exits after the last command (otherwise we would add a (quit)).

Finally Roswell, that does a lot, also has the ros build command, that should work for more implementations. It is a bit under-documented (issue).

4.4 Size and startup times of executables per implementation

$ ls -lh torrents
-rwxr-xr-x 1 vince vince 69M sept. 30 22:58 torrents

Our executables weighs 78Mo. Quite big for this little cli app. But it will be totally ok for a business application.

To ease the distribution of apps to people who already have a Lisp, we can use Roswell: we need to register our app in Quicklisp, install Roswell and install a program with ros install cl-torrents.

SBCL isn't the only Lisp implementation though. ECL, Embeddable Common Lisp, compiles Lisp programs to C. That should create a smaller executable.

Given this reddit comment, ECL produces indeed the smallest executables of all, an order of magnituted smaller than SBCL, but with a startup time an order of magnitude slower (or more).

ECL is particularly suited to this task. Since it transpiles to C, the C compiler will remove dead code automatically for you. You can get incredibly small executables this way.

program size implementation CPU startup time
28 /bin/true 15% .0004
1005 ecl 115% .5093
48151 sbcl 91% .0064
27054 ccl 93% .0060
10162 clisp 96% .0170
4901 ecl.big 113% .8223
70413 sbcl.big 93% .0073
41713 ccl.big 95% .0094
19948 clisp.big 97% .0259

We may have another trick to distribute small executables: to make the fasl files executables.

This will basically mean that your on-disk utilities will not include a copy of the SBCL core but use the central copy on the system, which makes them relatively small. Memory footprint will be the same though.

So we have room for improvement.

Finally, we tried the latest ECL version (16.1.3) with the lisp-devel Docker image (Debian's 13.x is too old, it has ASDF compatibility problems, that should be resolved with ASDF 3.3.2). We had a simple makefile target:

# Makefile
        ecl \
                -eval '(ql:quickload :closer-mop)' \  # seems needed O_o
                -eval '(load "cl-torrents.asd")' \
                -eval '(ql:quickload :cl-torrents)' \ # building everything...
                -eval '(asdf:make-build :cl-torrents :type :program :move-here #P"./")'

which we can call manually after loading the image (or integrate into Gitlab CI):

service docker start
docker run --rm -it -v /home/vince/projets/cl-torrents:/usr/local/share/common-lisp/source  daewok/lisp-devel:latest bash

In the end, we got a 52Mo executable, and unfortunately it got a runtime error, so we won't ship it, and we won't investigate this.

4.5 Parsing command line arguments

We saw earlier that SBCL stores the arguments into sb-ext:*posix-argv*. The first thing is that it differs from implementations, so we want a library to handle the differences for us.

We also want to parse them.

A quick look at the awesome-cl#scripting list and we'll try unix-opts.

(ql:quickload "unix-opts")

We can call it with its opts alias (nickname).

unix-opts allows to declare the arguments with opts:define-opts:

(opts:define-opts
    (:name :help
           :description "print this help text"
           :short #\h
           :long "help")
    (:name :nb-results
           :description "maximum number of results to print."
           :short #\n
           :long "nb"
           :arg-parser #'parse-integer)
    (:name :magnet
           :description "get the magnet link of the given search result."
           :short #\m
           :long "magnet"
           :arg-parser #'parse-integer))

Here parse-integer is a built-in CL function.

We parse and get them with opts:get-opts, which returns two values: the first is the list of valid options and the second the remaining free arguments. We then must use multiple-value-bind to catch everything:

(multiple-value-bind (options free-args)
    ;; opts:get-opts returns the list of options, as parsed,
    ;; and the remaining free args as second value.
    ;; There is no error handling yet (specially for options not having their argument).
    (opts:get-opts)
    ...

We can explore this by giving a list of strings (as options) to get-opts:

(multiple-value-bind (options free-args)
                   (opts:get-opts '("hello" "-h" "-n" "1"))
                 (format t "Options: ~a~&" options)
                 (format t "free args: ~a~&" free-args))
Options: (HELP T NB-RESULTS 1)
free args: (hello)
NIL

If we put an unknown option, we get into the debugger. We'll see error handling later on.

So options is a property list, i.e. a list that alternates a key and a value. It helps me to see that it looks like this:

[o|o]---[o|o]---[o|o]---[o|/]
 |       |       |       |
FOO     "foo"   BAR     "bar"

We can use getf and setf with plists, so that's how we do our logic. Below we print the help with opts:describe and then exit (portable way).

(multiple-value-bind (options free-args)
    ;; No error handling yet (for unknown options or one not having its argument).
    (opts:get-opts)

  (if (getf options :help)
      (progn
        (opts:describe
         :prefix "CL-torrents. Usage:"
         :args "[keywords]")
        (opts:exit))) ;; <= exit takes an optional return status.
  (if (getf options :nb-results)
      (setf *nb-results* (getf options :nb-results)))

  (torrents free-args)

  (if (getf options :magnet)
      ;; if we had caching we wouldn't have to search for torrents first.
      (progn
        (format t "~a~&" (magnet (getf options :magnet)))
        (opts:exit))))

The example in the unix-opts repository suggests a macro to do slightly better. And there is no error handling yet.

Here we are. We can build programs that run both in the REPL (well, that's obvious) and in the terminal.

If you like this and want more shiny stuff:

  • cl-ansi-term - print colorized text, horizontal lines, progress bars, list, tables,…
  • cl-readline - bindings to the GNU Readline library (Emacs and Vim input modes, history, basic expansion, custom completion,…). Example project: cl-repl. Demo example: cl-readine-example. shtookovina by the same author is an example command-line app that makes good use of both projects.
  • the Awesome-cl/scripting link above.

4.6 Automatically building and delivering the executable

We already have a continuous integration system with Gitlab CI (see above, Tests and CI). We're then very close in having a system to automatically build and deliver our executable. Moreover, we'll do that only when we push a new tag.

For this, we'll declare a job artifact.

First, we create a new job to run our command that builds the executable:

build:
  script:
    - make build

We'll see:

To be able to browse the artifacts, and get an url to download the last one, we add:

build:
  script:
    - make build
  artifacts:
    paths:
      - torrents

And now, to only to this on a new tag:

build:
  only:
    # Only when a new tag is pushed.
    - tags
  script:
    - make build
  artifacts:
    paths:
      - torrents

(see the doc).

We'll add the rule to run tests first and to not build the software if they fail (we use environments). We declare two names of "stages" (test and build) that respect the given order, and add a "stage: xxx" in each job. Our final .gitlab-ci.yml is this:

image: daewok/lisp-devel

stages:          # <= new
  - test
  - build

before_script:
  - apt-get update -qy
  - apt-get install -y git-core
  - git clone https://github.com/Chream/mockingbird ~/quicklisp/local-projects/

test:
  stage: test     # <= new
  only:

  - tags
  script:
    - make test

build:
  stage: build    # <= new
  only:
    # Only when a new tag is pushed.
    - tags
  script:
    - make build
  artifacts:
    paths:
      # xxx include the version number
      - torrents

And voilà, that's it.

We have an url to download the latest artifact:

https://gitlab.com/vindarel/cl-torrents/-/jobs/artifacts/master/raw/torrents?job=build

Let's just put the link on an icon and add the version number on the executable's name.

4.7 Final words

As said before, we didn't do error handling yet, and this can be annoying in the shell: on an error, including an user's C-c, we get a stacktrace and we are dropped into the Lisp debugger. We fix this below.

5 Moving functions into their own package

We have written a few helpers functions to treat with colors of keywords. As we were trying out things, we wrote those in our main file. But we'd like them to go into their own "utils" file and package.

(If you want to see the helper functions, go to the next section.)

5.1 Dealing with symbol conflicts. shadowing-import, uninterning.

We move them in a new src/utils.lisp file.

Our utils.lisp file declares a new cl-torrents.utils package, exporting what's needed:

(in-package :cl-user)
(defpackage cl-torrents.utils
  (:use :cl)
  (:export :sublist
           :helper-function-one))

(in-package :cl-torrents.utils)

We add this file to the .asd:

:components ((:module "src"
              :components
              ((:file "utils")  ;; <= added
               (:file "cl-torrents"))))

And on the other side into cl-torrents.lisp, either we :use all the export'ed utilities:

(:use :cl
      :cl-torrents.utils)

either we import explicitely what we need (and we'll do that as it is more explicit, although a bit redondant):

(defpackage cl-torrents
  (:use :cl)
  (:import-from :cl-torrents.utils
                :colorize-all-keywords
                :keyword-color-pairs
                :sublist)
  …

And then, as we quickload cl-torrents with its new organization, we get trapped trapped into the debugger:

IMPORT CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS causes
name-conflicts in #<PACKAGE "CL-TORRENTS"> between the following
symbols:
  CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS,
  CL-TORRENTS::KEYWORD-COLOR-PAIRS
   [Condition of type SB-EXT:NAME-CONFLICT]
See also:
  Common Lisp Hyperspec, 11.1.1.2.5 [:section]

Restarts:
 0: [SHADOWING-IMPORT-IT] Shadowing-import CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, uninterning KEYWORD-COLOR-PAIRS.
 1: [DONT-IMPORT-IT] Don't import CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, keeping KEYWORD-COLOR-PAIRS.
 2: [RESOLVE-CONFLICT] Resolve conflict.
 3: [RETRY] Retry compiling #<CL-SOURCE-FILE "cl-torrents" "src" "cl-torrents">.
 4: [ACCEPT] Continue, treating compiling #<CL-SOURCE-FILE "cl-torrents" "src" "cl-torrents"> as having been successful.
 5: [RETRY] Retry ASDF operation.
 […]

Fortunately there's nothing to be afraid of, it is pretty clear. We have a name conflict between those two symbols:

CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS,
CL-TORRENTS::KEYWORD-COLOR-PAIRS
  • the first one being our new function into its new cl-torrents.utils package,
  • the second being the old one, that was into cl-torrents' main file. The function has been moved, but the symbol is still around !

    We need to give priority to the new symbol, and that's just what offers the first restart n° 0, to "shadow import CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, uninterning KEYWORD-COLOR-PAIRS.". "uninterning" is getting rid of the symbol, shadowing-import is this process to import a symbol that hides one already present.

    We can press 0 and… we get the same messages. This time, for the other function that was being called directly in our main lisp code. We don't get this debugger for the "intermediate" functions like next-color, the ones that are called only by the two main ones.

    It was quick for two warnings. Still, we could use a way to see and accept everything in a row.

5.2 The helper functions. Ansi colors. Closures. Looping over a plist.

This was fun to do. Our goal was to colorize each keyword with a different color, like so:

This helps seeing the most relevant results, because sometimes the Pirate Bay website (at least this copy) returns not very related results.

We have the list of searched keywords in *keywords*. We want to associate each one with a different color.

We use the cl-ansi-text library, on Quicklisp. It offers to either colorize strings like this:

(with-color (:red)
    (princ "Gets printed red..."))

where :red is a CL symbol, or with functions:

(import 'cl-ansi-text:yellow)
(yellow "Yellow string")

we'll prefer this method as we are treating with text.

We are dealing with ansi colors in the terminal (either Slime or the terminal). So what this does is adding an extra prefix and suffix to our string:

(yellow "yellow"
;; "yellow"

We have to remember this.

We define the available colors into a variable:

(defparameter *colors* '(
                         cl-ansi-text:blue
                         cl-ansi-text:green
                         cl-ansi-text:yellow
                         cl-ansi-text:cyan
                         cl-ansi-text:magenta
                         cl-ansi-text:red
                         )
  "Functions to colorize text.")

There are 6 of them. Now we want a way to give each keyword a different color. And if we have more than 6 keywords, we cycle over the list. We want a function to cycle through a list… but it would not start at 0 everytime… that reminds me of closures !

;; closure to loop over the list of available colors.
(let ((index 0))
  (defun next-color ()
    "At each call, return the next color of the list -and start over. Uses *colors*."
    (let ((nb-colors (length *colors*))
          (color (elt *colors* index)))
      (incf index)
      (if (>= index nb-colors)
          (setf index 0))
      color))

  (defun reset-color ()
    (setf index 0))
  )

The function next-color uses the index that is defined outside of it (and which is still not global, but internal to this let only). So every time we call next-color, index is where it had been left. Cool ! I'd have used a class for this in another language I guess.

We define a function to associate a color to a keyword:

(defun keyword-color-pairs (&optional (keywords *keywords*))
  "Associate each keyword with a different color and return a list of pairs."
  (mapcar (lambda (it)
            `(,it . ,(next-color)))
          keywords))

A good use of backquotes and comas.

We have a function to colorize a keyword inside a given string with a color:

(defun colorize-keyword-in-string (title keyword color-f)
  "Colorize the given keyword in the title.
Keep the letters' possible mixed up or down case.
`color-f': color function (cl-ansi-text)."
  ;; It colorizes only the first occurence of the word.
  ;;
  ;; We begin by looking if the string contains our keyword. "search" returns the index.
  (let ((start (search keyword (string-downcase title) :test #'equalp))) ;; case insensitive
    (if (numberp start)
      ;; if "start" is not nil, we get the position of the last letter of the keyword in the string.
      (let* ((end (+ start (length keyword)))
             ;; and we extract the substring.
             ;; In doing so we keep the original case of each letter.
             (sub (subseq title start end))
             ;; We all our color function.
             (colored-sub (funcall color-f sub)))
        ;; We replace the substring with the colored substring, and return it.
        (str:replace-all sub colored-sub title))
      title)))

And we do the same for all keywords. Looping over a plist is easy:

(loop for (word . color) in keywords-color
   do (...
(defun colorize-all-keywords (title kw-color)
  "Colorize all the user's search keywords in the given title.
`kw-color': list of pairs with a keyword and a color (function)."
  (let ((new title))
    (loop for (word . color) in kw-color
       do (progn
            (setf new (colorize-keyword-in-string new word color))))
    new)
  )

(anyone have a functional version, with reduce maybe ?)

Now we're happy, we run a search query again and… damn, everything's colorized but the seeders and leechers are not aligned any more :( Indeed, we print our search results with a format string:

(format stream "~3@a: ~65a ~3@a/~3@a~%" index title-colored seeders leechers)

See the fixed 65 ? But our colored titles have different sizes, due to the ansi prefixes and suffixes :(

5.3 Formatting a format string

No big deal ! We have to adapt this "65", by adding the difference of length between the normal title and the colored one.

(let* ((title (result-title it))
       (title-colored (colorize-all-keywords title *keywords-colors*))
       (title-padding (+ 65
                         (- (length title-colored)
                            (length title))))

But how can we insert a variable in lieu of "65" ? If we insert another ~a, it will print the number but it will not be interpreted as an option of the format directive… we will then format the format string.

In the following, ~~ is the directive to print a ~, so ~~~aa with the padding as argument will render as ~65a (or ~75a depending on the new length).

(defun display-results (&optional (results *last-search*) (stream t))
  "Results: list of plump nodes. We want to print a numbered list with the needed information (torrent title, the number of seeders,... Print at most *nb-results*."
  (mapcar (lambda (it)
            (let* ((title (result-title it))
                   (title-colored (colorize-all-keywords title *keywords-colors*))
                   (title-padding (+ 65
                                     (- (length title-colored)
                                        (length title))))
                   ;; ~~ prints a ~ so here ~~~aa with title-padding gives ~65a or ~75a.
                   (format-string (format nil "~~3@a: ~~~aa ~~3@a/~~3@a~~%" title-padding)))

              (format stream format-string
                    (position it *last-search*)
                    title-colored
                    (result-peers it)
                    (result-leechers it))))
           (reverse (sublist results 0 *nb-results*)))
  t)

To see the larger picture, see display-results on the repository.

6 Error handling

6.1 Dealing with malformed command line arguments

If you try to give a bad argument on the command, you'll enter the debugger. Obviously, we prefer to print an informative message, print the help and exit.

unix-opts will throw (or rather signal) conditions (hear "exceptions" if you wish) on malformed options:

  • opts:unknown-option,
  • opts:missing-arg,
  • opts:parser-failed and
  • opts:missing-required-option.

    It also has a complete example. This library is really nice.

    The general form of condition handling is to use handler-bind:

(handler-bind ((one-condition #'its-handler)
               (another-condition #'another-handler))
   (some code))

where we are in charge in writing the #'handlers. They take the condition object as argument. We give them below. They use opts:describe to print the help message and (opts:raw-arg condition) to read the argument in cause from the condition object.

Yes, conditions are CLOS classes. When we create our owns, we can define different slots, etc. The article on z0ltan.wordpress.com will tell you more, and looking at unix-opts code may help (just keep if it's too much for now).

This is how it defines a condition: it first defines a generic troublesome-option class, that inherits from simple-error and has three slots, amongst which an option that will store the bad one. Then it defines unknown-option, which inherits from troublesome-option.

(define-condition troublesome-option (simple-error)
  ((option
    :initarg :option
    :reader option))
  (:report (lambda (c s) (format s "troublesome option: ~s" (option c))))
  (:documentation "Generalization over conditions that have to do with some
particular option."))

(define-condition unknown-option (troublesome-option)
  ()
  (:report (lambda (c s) (format s "unknown option: ~s" (option c))))
  (:documentation "This condition is thrown when parser encounters
unknown (not previously defined with `define-opts') option."))

Then, when it encounters an error, it signals a condition and sets its option slot like this:

(error 'unknown-option
   :option opt)

See the source.

Anyway, these are our little functions to display an helpful error message and exit:

(defun unknown-option (condition)
  (format t "~s option is unknown.~%" (opts:option condition))
  (opts:describe)
  (exit))

(defun missing-arg (condition)
  (format t "Bad options: ~a needs an argument.~&" (opts:option condition))
  (opts:describe)
  (exit))

(defun arg-parser-failed (condition)
  (format t "Error: could not parse ~a as argument of ~a~&."
          (opts:raw-arg condition)
          (opts:option condition))
  (opts:describe)
  (exit))

And so we surround the reading of the options with handler-bind:

(handler-bind ((opts:unknown-option #'unknown-option)
               (opts:missing-arg #'missing-arg)
               (opts:arg-parser-failed #'arg-parser-failed)
               ;; (opts:missing-required-option #'...) ;; => in upcoming version
  (opts:get-opts)) ;; <= this will signal conditions, if any.

We can build and try:

$ ./torrents -p foo
"-p" option is unknown.

Available options:
  -h, --help               print this help text
  -n, --nb ARG             maximum number of results to print.
  -m, --magnet ARG         get the magnet link of the given search result.

And all is well. We didn't catch everything though

6.2 Catching a user's C-c termination signal

Let's try a C-c and read the stacktrace:

$ ./torrents matrix
^Csearching on the Pirate Bay…                                 <== C-c
debugger invoked on a SB-SYS:INTERACTIVE-INTERRUPT in thread   <== condition name
#<THREAD "main thread" RUNNING {1003156A03}>:
  Interactive interrupt at #x7FFFF6C6C170.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [CONTINUE     ] Return from SB-UNIX:SIGINT.               <== it was a SIGINT indeed
  1: [RETRY-REQUEST] Retry the same request.

The signaled condition is named after our implementation: sb-sys:interactive-interrupt. We just have to surround our code with a try/catch or rather, with handler-case:

(handler-case
    (torrents free-args)
  (sb-sys:interactive-interrupt () (progn
                                     (format *error-output* "Abort.~&")
                                     (opts:exit))))

This code won't work with ECL or another implementation. We know about trivial-signal, but we were not satisfied with our test (it didn't work, see issue #3).

This will be good for now, but in case we found portable examples on other projects. We could do something like this (thanks Clack):

(handler-case
    <main>
  (#+sbcl sb-sys:interactive-interrupt
   #+ccl  ccl:interrupt-signal-condition
   #+clisp system::simple-interrupt-condition
   #+ecl ext:interactive-interrupt
   #+allegro excl:interrupt-signal
   ()
   <exit>))

here #+, you know, includes the line at compile time depending on the implementation. There's also #-. Actually #+ looks at symbols in the *features* list. We can also combine symbols with and, or and not.

6.3 Scrapers errors

Web scraping is by definition unstable, we should be prepared to all sort of errors.

For every function that parses result, we surround it with a handler-case:

(defun result-leechers (node)
  (handler-case
      (parse-integer (elt (lquery:$ node ".ttable_col1" (text)) 2))
    (error ()
      -1)))

We do that generally for all torrents functions:

(defun torrents (words &key (stream t))
  "Return a list of..."
  (format stream "searching on Kat...")
  (handler-case
      (let* ((query (str:join "+" words))
         ...
    (error ()
      (format stream " no results.~&"))))))

We chose to just print "no results" and return nil (the return value of format). We could also print the error message of the condition (optional argument of "error": (error (condition) ...)). We might want to not catch errors when we are in Slime too. Is that possible ?

7 More features

7.1 Cache

We'd like a cache system. This was quick and easy with the Clache library.

We instantiate a cache on the file system (a file per item saved, not on a DB):

(defvar *cache-directory*
  (merge-pathnames #p".cl-torrents/cache/" (user-homedir-pathname))
  "The directory where cl-torrents stores its cache.")

(defun ensure-cache ()
  (ensure-directories-exist
   (merge-pathnames *cache-directory*)))

(defparameter *store* (progn
                        (ensure-cache)
                        (make-instance 'file-store :directory *cache-directory*))
  "Cache. The directory must exist.")

the FS storage is provided by cl-store. We could also create an in-memory cache, suitable for memoization.

The function ensure-directories-exist creates a nested directory structure for us if needed. We wrap its call in the ensure-cache function in order to call it at the beginning of the main function, when cl-torrents is called from the command line.

Saving something in the cache and retrieving it is made with (setcache key val store) and getcache key store. We just create wrapper functions, to print a little confirmation message, and because we might want to mock them in unit tests:

(defun save-results (terms val store)
  "Save results in cache."
  (format t "Saving results for ~a.~&" terms)
  (setcache terms val store))

(defun get-cached-results (terms store)
  (when (getcache terms store)
    (progn
      (format t "Got cached results for ~a.~&" terms)
      (getcache terms store))))

and we use it like this:

;; in the middle of `torrents'.
(if (get-cached-results joined *store*)
                 (getcache joined *store*)
                 (tpb::torrents words :stream log-stream))

Better yet, with an optional store argument that defaults to the *store* parameter:

(defun save-results (terms val &key (store *store*))
   ...

setcache takes an optional expiry time, in seconds.

We re-build the project (make build), search for matrix once (we wait a bit) and a second time: we hit the cache.

Quick and easy ! (and too quick, we don't use the expiry time yet)

8 Conclusion

That the end of our tutorial (for now ?).

You can see the full sources and comment on the repository: https://github.com/vindarel/cl-torrents

9 Appendice A: the same, with CL21

We like CL21. It cleans many CL inconsistencies and brings "modern" features.

Yet it is only a CL extension, that we can use isolated in just one package of our code base if we don't want to use it everywhere, and we can still use regular CL symbols prefixed by cl:.

9.1 Installation

First, we need to install it from its own Quicklisp dist (repository).

(ql-dist:install-dist "http://dists.cl21.org/cl21.txt")
(ql:quickload :cl21)

To try things out at the REPL:

(in-package :cl21)

9.2 Replacing CL by CL21

We have to use cl21 instead of cl, and cl21-user instead of cl-user.

Our package definition:

(in-package :cl-user)
(defpackage torrents
  (:use :cl
        :clache)
  ...

becomes:

(in-package :cl21-user)
(defpackage torrents
  (:use :cl21
        :clache)
  ...

Our .asd gained a cl21 dependency. This is not enough to automatically install CL21 from its own dist though.

When we compiled our sources, the debugger complained about symbol conflicts and asked to choose between the old and the new one. Indeed, CL21 either redefines or re-exports all CL symbols, so that's a lot of conflicts in the running image. We restarted our Lisp.

  (ql:quickload "cl21")
  ;; To load "cl21":
  ;; Load 1 ASDF system:
  ;;     cl21
  ;; ;
  ;; Loading "cl21"
  ;; .
  ;; .......
  ;; (
   ;; "cl21")

  ;; Compiled cl-torrents.asd
  (ql:quickload "cl-torrents")
;; To load "torrents":
;;   Load 1 ASDF system:
;;     torrents
;; ; Load ing "torrents"
;; ..................................................
;; [package torrents.utils]..........................
;; [package tpb].....................................
;; [package torrentcd]...............................
;; [package kat].....................................
;; [package torrents]................................
;; [package torrents]................................
;; [package torrents]..
;; ("torrents")

(in-package :torrents)
;; ...

here we compiled the cl-torrents.lisp sources (C-c C-k) and had an error on the debugger, complaining that mapcar didn't exist. Indeed, CL21 uses the classical map name instead. It still has mapcan. BTW, "filter" is provided by keep[-if][-not] (instead of remove[-if][not]).

9.3 Benefits ?

I know I have been frustrated by first not being generic, thus not working on arrays. I wanted to use it when trying things with plump and lquery, but plump returns an array. And it is the same with other functions and other data structures. And there are inconsistencies, sometimes the first parameter is the list, sometimes not.

It just feels wrong to me, specially today, and specially since CL has this genericity mecanism.

So some CL functions become generic (and extensible):

  • getf (also for hash-maps,…)
  • equalp
  • emptyp
  • coerce

    and CL21 also defines new functions:

  • append
  • flatten
  • elt
  • emptyp
  • equalp
  • split, split-if
  • drop, drop-while
  • take, take-while
  • join
  • length
  • keep, keep-if, nkeep, nkeep-if
  • partition, partition-if
  • https://lispcookbook.github.io/cl-cookbook/cl21.html#generic-functions

    Same story with map and mapcar. You want to use what is named map in all other languages, but you get error messages, have to google and learn about mapcar and map 'list with a freaking first argument.

    In the end, we nearly didn't have to edit our code base to make it work with CL21. We don't have real benefits to do this to an existing code base either, they would have been noticeable (and huge for us) during development, and should be for future developments.

    Allright, I'm sold on CL21, if it was only for that, but there's more, check it out !

9.4 Downsides ?

Is using CL21 worth the risk ? It doesn't show much activity on Github, although from time to time an issue shows activity, and it has open issues. I'll let you judge, this was only an example in a little project.

There's a real consequence to the long dependency list of CL21 though: the executable size. We had one of 78Mo, and the CL21 one is 93Mo.

Author: vindarel

Created: 2018-09-20 jeu. 18:39

Validate