Table of Contents
- 1. Make it work
- 2. Designing the scrapers with CLOS
- 3. Tests and CI
- 4. Building and delivering self-contained executables
- 5. Moving functions into their own package
- 6. Error handling
- 7. More features
- 8. Conclusion
- 9. Appendix A: the same, with CL21
Writing a little web scraper is not difficult. However, I had to spend some time finding the right libraries and resources, and starting a Lisp project is also not trivial at first. So this tutorial is a mix of hopefully useful material for people interested in Common Lisp who don't have much experience with the language and its ecosystem.
In part 1, "Make it work", we explore how to do asynchronous web scraping. We try things out at the REPL and we create a first package. We learn basic data structures and their gotchas along the way; we also see where to find documentation, some useful libraries and beginner-friendly tips.
Part 2 is a short introduction to the CL Object System (CLOS), which is much more pleasant to work with than alists and other raw data structures.
In part 3, "Tests and CI", we write our first unit test, with mocks, which gives us the opportunity to write a first macro, and we set up Continuous Integration on Gitlab CI.
Part 4 is about building and delivering self-contained executables, parsing command line arguments and such. While it was necessary for us at the time to gather this information, it is now available in the Common Lisp Cookbook and is probably more up to date. We advise the reader to go there.
Part 5, "Moving functions into their own package", shows and explains things that appear strange at first, like "shadowing import", symbols conflicts and such. We also write helper functions and make good use of closures.
Part 6 is about error handling or rather condition handling.
Part 7, "More features", shows only a cache system at the moment.
Finally, we have a look at CL21.
1 Make it work
Scraping The Pirate Bay is easy: its pages are not javascript-generated. We just have to:
- get the html page (dexador: `(dex:get <url>)`)
- parse the html into a data structure (plump: `(plump:parse <html>)`)
- search with CSS selectors (lquery: `(lquery:$ <parsed-html> <selectors>)`)
We suppose you have a ready Common Lisp development environment. If not, see the Cookbook/getting-started (and if that's not enough, file an issue!).
Let's go.
Install our dependencies right away:
```lisp
(ql:quickload '("dexador" "plump" "lquery" "str"))
```
To begin with, we do a search on the website and we copy-paste the url. We get one like this:
```lisp
(defparameter *search-url* "https://piratebay.to/search/?FilterStr=matrix&ID=&Limit=800&Letter=&Sorting=DSeeder"
  "the url to search matrix.")
```
It has our search term in it (`matrix`) along with url parameters. It also sorts the results by number of seeders for us :) (`&Sorting=DSeeder`).
We will use CSS selectors to extract information from the web page, so we can use our browser's developer tools to inspect the structure of the page and guess our selectors: right click on a result's title, choose "inspect element". It highlights some html similar to this:
```html
<td class="Title">
  <span class="ColorA">
    <a href="https://piratebay.to/torrent/1922147/Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438/"
       onclick="Javascript:OpenDetailPage('https://piratebay.to/torrent/1922147/Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438/'); return false;">
      Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438
    </a>
  </span>
  <br>
  <span class="ColorB VaA">Upload Date: 20.02.15 </span>
  <span class="ColorB VaA">Size: 796,86 MB </span>
  <span class="ColorB"> </span>
</td>
```
The title is well delimited, so we'll start selecting our elements by the CSS class `Title`, which gives:

```lisp
(defparameter *selectors* ".Title")
```
If you are not accustomed to CSS selectors, this post is a nice helper: https://codingsec.net/2016/12/select-specific-text-css-using-selectors/.
1.1 Trying out at the REPL
Let's try in the REPL:
```lisp
(defparameter html (dex:get *search-url*))            ;; returns the html
(defparameter parsed (plump:parse html))              ;; returns a list of plump elements
(defparameter results (lquery:$ parsed *selectors*))  ;; returns a list of stuff
(setf results (lquery:$ parsed *selectors*))
```
it returns:
```
#(#<PLUMP-DOM:ELEMENT th {100369D943}> #<PLUMP-DOM:ELEMENT td {10036A7CE3}>
  #<PLUMP-DOM:ELEMENT td {10036B6163}> #<PLUMP-DOM:ELEMENT td {10036FC903}>
  ...
  #<PLUMP-DOM:ELEMENT td {10047C0BC3}> #<PLUMP-DOM:ELEMENT th {10047D44E3}>
  #<PLUMP-DOM:ELEMENT td {10047DE853}> #<PLUMP-DOM:ELEMENT th {10047F2333}>
  #<PLUMP-DOM:ELEMENT td {10047FC673}> #<PLUMP-DOM:ELEMENT td {100480ACF3}>
  #<PLUMP-DOM:ELEMENT td {10048195D3}> #<PLUMP-DOM:ELEMENT td {1004827D73}>)
```
We can check its inner text representation:
```lisp
(lquery:$ parsed *selectors* (text))
;; => #(" Title "
;;      "Matrix FRENCH DVDRIP 1999 COOLUpload Date: 05.06.15 Size: 700,30 MB"
;;      "The Matrix Reloaded (2003) FullHD, Dual Audio: English + SpaUpload Date: 12.04.15 Size: 8,51 GB"
;;      [...]
;;      "Arturia - Matrix 12-V v1.0.1.9 OS X [HEXWARS][dada]Upload Date: 28.12.14 Size: 100,86 MB"
;;      "Native Instruments - Maschine Expansion Golden Kingdom HYBRIUpload Date: 03.06.15 Size: 267,84 MB"
;;      "Arturia - Matrix-12 V v1 1 0 522 R2 AU AAX VST VST3 ST OS X Upload Date: 02.04.15 Size: 100,49 MB"
;;      "Arturia - Matrix-12 V v1.1.0.522 OS X [PitcHsHiFTeR][dada]Upload Date: 28.03.15 Size: 130,44 MB")
```
A little explanation for lquery: the final `(text)` form is an lquery thing to get the text representation of the node, instead of an lquery internal object. Likewise we'll be able to call `(attr :href)` or `(html)`, which are self-explanatory. If you want CL symbols there, use `(inline (...))`.
I like to check the html content of the plump nodes. We use plump's `serialize` function (the second function from the doc ;) ):

```lisp
(plump:serialize (first results))
```
but we get an error:
```
The value #( … all the content of results here … ) is not of type LIST when binding LIST …
```
Sorry for the inconvenience. Indeed, lquery returns a vector, not a list (we can see that with the `#()` syntax that denotes a vector), so we cannot use `first` but have to use `(aref <vector> 0)` instead, or we have to `coerce` the result to a list (see the Cookbook's Data Structures page).
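To make the gotcha concrete, here is a minimal sketch (plain CL, with a toy vector standing in for lquery's result) of the ways to read the first element of a vector:

```lisp
;; lquery returns a vector, like this toy one:
(defparameter *results* #(:a :b :c))

;; (first *results*) would signal a type error: FIRST wants a list.
(aref *results* 0)                    ; => :A  — direct vector access
(elt *results* 0)                     ; => :A  — ELT works on any sequence
(first (coerce *results* 'list))      ; => :A  — or convert to a list first
```

`elt` is often the most convenient, since it accepts both lists and vectors.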
Personally I find this frustrating, particularly coming from Python, where lots of data structures are accessed in the same manner. If you feel like fixing this, have a look at CL21, "Common Lisp for the 21st century", which, amongst other features, redefines some functions to make them generic (working on lists, vectors, hash maps,…). CL21 is a CL library, meaning we can of course use the other CL libraries with it, and we can use cl21 alongside CL in only some parts of our project (like a single file, one "package").
It is written by a super productive and innovative CL hacker and has 600+ stars on GitHub. Nevertheless, it hasn't been touched in two years and, as it lacks docstrings and direction, we can be surprised by the new implementation of some functions (see its issues). Some people discourage its use. I, at least, am a happy user for the moment :)
Alright, so, back to printing the html content of our first result:
```lisp
(plump:serialize (aref results 0))
```

```html
<th class="Title header ">
  <a href="https://piratebay.to/search/0/800/0/matrix/0/ATitle/1/">Title</a>
</th>
```
gosh, there is not much in it. I was too quick in defining my CSS selector. This first result should not be caught by our selector, because it is not a link to a torrent but only the header of the table (see "Title header" and the `th`?).
But that's not all. I also want to scrape the number of seeders and leechers, and the `.Title` selector doesn't include them (this I see with the browser's web inspector). If I select the enclosing `tr` I'll get all the data, and I must make sure to select those `tr` from inside the body of the table, `tbody`. So I'll use a first selector that will return a list of elements of the table:
```lisp
(setf *selectors* "tbody tr")
```
and then for each result I'll get the title and the number of seeders.
So I can scrape again my search results with the new selector:
```lisp
(setf results (lquery:$ parsed *selectors* (text)))
```
this should print interesting stuff, like our torrents' titles and even strings like `S: 16L: 1`, which are indeed our seeders and leechers.
I check the html content and it seems alright. It has the link to the torrent's page inside the href of the "a" element, as well as the seeders count.
Note that we can also inspect the results with the mouse: left/right clicks on the elements printed in the REPL get us into the Slime inspector.
1.2 Putting it together in a function
We came up with this function:
```lisp
(defun torrents (words)
  ""
  (let* ((html (dex:get *search-url*))
         (parsed (plump:parse html))
         (res (lquery:$ parsed *selectors*)))
    res))
```
and if you prefer a threading macro / pipes, no problem, but we'll load another external library:
```lisp
(ql:quickload "arrow-macros")
;; and we import its symbols:
(use-package :arrow-macros)
```
```lisp
(-<>> *search-url*
      (dex:get)
      (plump:parse)
      (lquery:$ <> *selectors*))
```
arrow-macros defines a few threading macros. The classic one is `->`, which inserts the result of the preceding form as the first argument of the next one; `->>` puts it last, which is what we wanted for the first two forms but not for the last one, with lquery, which needs the parsed html as first argument. So we use `-<>>`: the arrow populates the last argument, except when it encounters a `<>` placeholder. `-<>>` has a little name, the "Diamond Spear".
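To make the difference concrete, here are the three arrows at work on toy examples (a sketch using the str library we already loaded; each arrow form is shown with the plain call it expands to):

```lisp
;; -> inserts the previous result as FIRST argument:
(-> "  hello  " (str:trim) (str:concat " world"))
;; == (str:concat (str:trim "  hello  ") " world")   ; => "hello world"

;; ->> inserts it as LAST argument:
(->> '(1 2 3) (mapcar #'1+) (reduce #'+))
;; == (reduce #'+ (mapcar #'1+ '(1 2 3)))            ; => 9

;; -<>> inserts it last, except where a <> placeholder appears:
(-<>> '(3 1 2) (sort <> #'<) (mapcar #'1+))
;; == (mapcar #'1+ (sort '(3 1 2) #'<))              ; => (2 3 4)
```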
1.3 Creating a new project
Before we write more functions we should create a new project. For this I use a skeleton generator which will create the right `defsystem`, `defpackage` and so on for us.
I use `cl-project`, which also generates a tests skeleton (contrary to `quickproject`):
```lisp
(ql:quickload "cl-project")
(cl-project:make-project #P"~/path/to/cl-torrents/")
```
Note that it may sometimes be easier to create your new Common Lisp projects in `~/.quicklisp/local-projects` (known to Quicklisp) or `~/.local/share/common-lisp/` (known to ASDF). Doing so, you will be able to `ql:quickload` your project right away.
1.4 Adding our dependencies
Our new `cl-torrents.asd` looks like this:

```lisp
#|
  This file is a part of cl-torrents project.
|#

(in-package :cl-user)
(defpackage cl-torrents-asd
  (:use :cl :asdf))
(in-package :cl-torrents-asd)

(defsystem cl-torrents
  :version "0.1"
  :author ""
  :license ""
  :depends-on ()  ;; <== list of dependencies
  :components ((:module "src"
                :components
                ((:file "cl-torrents"))))
  :description ""
  :long-description …)
```
For pythonistas, it is very similar to a `setup.py`.
It has the `depends-on` parameter, which accepts a list of package names. We have to register `dexador` and the others here:

```lisp
:depends-on (:str :dexador :plump :lquery)
```

and `arrow-macros` if you wish.
If you need to set package versions, use something like:

```lisp
:depends-on ((:version "cl-ppcre" "2.0.11"))
```
but there is usually no need. To learn more about dependencies management in Common Lisp, what to do when you need project-local dependencies (like a virtual env), see the Cookbook.
1.5 Loading the project
Open the `.asd` file and compile and load it. In Slime, that's `C-c C-k` (`slime-compile-and-load-file`, see also the Emacs menu).
Now we can load the project at the REPL and install its dependencies:
```lisp
(ql:quickload "cl-torrents")  ;; or asdf:make
; compiling file "/home/vince/projets/cl-torrents/src/cl-torrents.lisp" (written 28 AUG 2017 10:21:07 PM):
; compiling (IN-PACKAGE :CL-USER)
; compiling (DEFPACKAGE CL-TORRENTS ...)
; compiling (IN-PACKAGE :CL-TORRENTS)
; compiling (DEFPARAMETER *SEARCH-URL* ...)
; compiling (DEFPARAMETER *SELECTORS* ...)
; compiling (DEFUN TORRENTS ...)
; /home/vince/.cache/common-lisp/sbcl-1.3.19-linux-x64/home/vince/projets/cl-torrents/src/cl-torrents-tmp5GEXGEG5.fasl written
; compilation finished in 0:00:00.029
; compilation unit finished
T
```
And now we can use our function at the REPL.
We go into our package so that we can call our functions directly:
```lisp
(in-package :cl-torrents)
```
We could import the functions from our package and call them directly, but we need to `export` them first; we'll see that shortly.

Meanwhile we can call them with the package prefix and a double colon, because our functions are not exported yet (so they're kind of private, but not strictly, like a method starting with an underscore `_` in Python):

```lisp
(cl-torrents::torrents "matrix")
```
1.6 Searching with our keywords
Until now we only tried things out with a given search url, set in stone. It's time to insert our own search terms into this search url.
We'll put a `{KEYWORDS}` placeholder into the url:

```lisp
(defparameter *search-url* "https://piratebay.to/search/?FilterStr={KEYWORDS}&ID=&Limit=800&Letter=&Sorting=DSeeder"
  "base search url. {KEYWORDS} to be replaced by + separated words.")
```

which we will replace with a `+`-separated list of keywords.
With a little look at the "strings" cookbook page, we'll go with the little str library (our lib actually):
```lisp
(ql:quickload "str")
```
Let's try:
```lisp
(defparameter words "matrix trilogy")
;; => WORDS
(str:words words)
;; => ("matrix" "trilogy")
(str:join "+" *)  ;; * is a REPL shortcut to insert the previous result. + inserts the previous input.
;; => "matrix+trilogy"
```
and voilà. We put this at the beginning of our search function and we get:
```lisp
(defparameter *search-url* "https://piratebay.to/search/?FilterStr={KEYWORDS}&ID=&Limit=800&Letter=&Sorting=DSeeder"
  "base search url. KEYWORDS to be replaced.")

(defun torrents (words)
  "Search torrents."
  (let* ((terms (str:words words))
         (query (str:join "+" terms))
         (*search-url* (str:replace-all "{KEYWORDS}" query *search-url*))
         (req (dex:get *search-url*))
         (html (plump:parse req))
         (res (lquery:$ html *selectors*))
         (res-list (coerce res 'list)))
    res-list))
```
In the end we prefer to return a list, rather than a vector.
Let's try:
```lisp
(torrents "matrix trilogy")
```

```
("Matrix FRENCH DVDRIP 1999 COOLUpload Date: 05.06.15 Size: 700,30 MB"
 "The Matrix Reloaded (2003) FullHD, Dual Audio: English + SpaUpload Date: 12.04.15 Size: 8,51 GB"
 "The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 DuUpload Date: 12.02.15 Size: 12,86 GB"
 "The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3Upload Date: 15.09.15 Size: 23,29 GB"
 "The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ HindUpload Date: 14.01.15 Size: 10,23 GB"
 "The Matrix Revolutions (2003) BRRip [Dual Audio] [Hindi+Eng]Upload Date: 24.02.15 Size: 496,36 MB"
 "Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438Upload Date: 20.02.15 Size: 796,86 MB"
 "The Matrix Reloaded (2003) BRRip [Dual Audio] [Hindi+Eng] 50Upload Date: 22.02.15 Size: 496,39 MB"
 [and more results])
```
Cool !
We can commit this, have a break and enjoy how things are going. It was very easy, except for one or two gotchas :)
Of course, we need to get more stuff out of this, like the torrent's magnet link.
1.7 Formatting output
Our `torrents` function prints out intelligible output, but we don't control it yet. We want to iterate over the search results and print exactly what we want.
So first we need to extract the title, with the CSS selector we found at the beginning.
```lisp
(defun result-title (node)
  "Return the title of a search result."
  (aref (lquery:$ node ".Title a" (text)) 0))
```
When we iterate over the result list:
```lisp
(defun display-results (&optional (results *last-search*) (stream t))
  "Results: list of plump nodes. We want to print a numbered list with
the needed information (torrent title, the number of seeders,..."
  (mapcar (lambda (it)
            ;; do not rely on *last-search*.
            (format stream "~a~%" (result-title it)))
          results)
  t)
```
it prints something like:
```
Matrix FRENCH DVDRIP 1999 COOL
The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa
The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du
The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3
The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ Hind
Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438
...
```
What we have in mind is to print the index of the result next to it, and for convenience to print the first result last, so that it stays to the bottom and it's easier to see from the prompt.
We have a quick look at the Cookbook for string formatting: the simplest directive is `~a`, for aesthetics; given a width, `~a` pads on the right (left-justified), while the `@` modifier, as in `~@a`, right-justifies. `~%` is the newline.
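Tried standalone at the REPL, these directives behave like this (a small sketch; `format` with a `nil` destination returns the string instead of printing it):

```lisp
(format nil "~a" :hello)     ; => "HELLO"   — aesthetic printing
(format nil "~5a|" "ab")     ; => "ab   |"  — pad to 5 columns, left-justified
(format nil "~5@a|" "ab")    ; => "   ab|"  — @ right-justifies
(format nil "~3@a: ~a" 42 "title")  ; => " 42: title"
```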
```lisp
(defun display-results (&optional (results *last-search*) (stream t))
  "Results: list of plump nodes. We want to print a numbered list with
the needed information (torrent title, the number of seeders,..."
  (mapcar (lambda (it)
            (format stream "~3@a: ~65a ~%"
                    (position it *last-search*)  ;; <-- find the position of the result in the list
                    (result-title it)))
          ;; <-- we reverse the list
          (reverse results))
  t)
```
Here we use another global variable that we eventually introduced. At the end of our `torrents` function, we add this:
```lisp
(setf *last-search* res-list)
```
so that our search results are saved in this variable which we define:
```lisp
(defvar *last-search* nil "Remembering the last search.")
```
and we can easily access this result list elsewhere.
So, we get this formatting:
```
198: Arturia - Matrix-12 V v1.1.0.522 OS X [PitcHsHiFTeR][dada]
197: Arturia - Matrix-12 V v1 1 0 522 R2 AU AAX VST VST3 ST OS X
196: Native Instruments - Maschine Expansion Golden Kingdom HYBRI
195: Arturia - Matrix 12-V v1.0.1.9 OS X [HEXWARS][dada]
194: PPPD-374 Ikuchichi Beauty Salon That Just Busty Beauty Is In
193: THE MATRIX TRILOGY: Complete Collection - DVDRip
...
 10: Matrix Reloaded (2003)Blu-Ray 720p Dublado PT-BR - mo93438
  9: Matrix Revolutions (2003)Blu-Ray 720p Dublado PT-BR - mo9343
  8: Die Matrix Trilogie - 1 bis 3 - KOMPLETT
  7: The Matrix Reloaded (2003) BRRip [Dual Audio] [Hindi+Eng] 50
  6: The Matrix Revolutions (2003) BRRip [Dual Audio] [Hindi+Eng]
  5: Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438
  4: The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ Hind
  3: The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3
  2: The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du
  1: The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa
  0: Matrix FRENCH DVDRIP 1999 COOL
T
```
The indexes are right-aligned on 3 digits with `~3@a` and titles are truncated at 65 characters, nice :) It will be easy to add more information on the right side (seeders, leechers).
_Note_: we can also truncate the titles and add an ellipsis with `str:prune`:

```
2: The Matrix Trilogy (...)
1: The Matrix Reloaded (...)
0: Matrix FRENCH DVDRIP(...)
```
1.8 Getting more torrent information
With `plump:serialize` we could check what html is inside our plump node:

```lisp
(plump:serialize (second res))
```

```html
<td class="Title">
  <span class="ColorA">
    <a href="https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/"
       onclick="Javascript:OpenDetailPage('https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/'); return false;">
      Matrix FRENCH DVDRIP 1999 COOL
    </a>
  </span>
  <br/>
  <span class="ColorB VaA">Upload Date: 05.06.15 </span>
  <span class="ColorB VaA">Size: 700,30 MB </span>
  <span class="ColorB"/>
</td>
```
We want to get the torrent's page, the url in the first `href`. From this page we'll be able to access the magnet link.
We know how to access the `a`:

```lisp
(defparameter *elt* (first res))
(lquery:$ *elt* "a" (text))
;; => #("Matrix FRENCH DVDRIP 1999 COOL")
```
it returns a vector with the node's text.
We use `(lquery:$ ... (attr :href))` as seen above to extract attributes:

```lisp
(lquery:$ *elt* "a" (attr :href))
;; => #("https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/")
```
Ok. But watch out again, the result is a vector (of one element).
We put this in a function:
```lisp
(defun detail-page-url (node)
  "Extract the link of the details page. `node': plump node, containing the url."
  (let ((href-vector (lquery:$ node "a" (attr :href))))
    (aref href-vector 0)))
```
which we can test (either write it at the REPL, or write it in the project and compile it with `C-c C-c` in Slime):
```lisp
(mapcar #'detail-page-url res)  ;; #' is shorthand for (function …)
;; => ("https://piratebay.to/search/0/800/0/matrix/0/ATitle/1/"
;;     "https://piratebay.to/torrent/2297350/Matrix FRENCH DVDRIP 1999 COOL/"
;;     "https://piratebay.to/torrent/2156107/The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa/"
;;     "https://piratebay.to/torrent/1885366/The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du/"
;;     […])
```
1.8.1 To the magnet link
We have the torrent's details page, we know how to request it, now we want to get the magnet link.
We experiment, and get a list of the links inside this page.
This is what we first came up with (it can be simpler):
```lisp
(mapcar (lambda (it)
          (lquery:$ it (attr :href)))
        (coerce (lquery:$ * "a") 'list))
;; => (NIL NIL NIL NIL NIL NIL NIL "https://piratebay.to/" "https://piratebay.to/"
;;     […]
;;     "http://imdb.com/title/tt1778413/"
;;     "https://piratebay.to/profile/Anonym"
;;     "https://piratebay.to/Downloader.php?ID=2289391&Filename=Matrix+FRENCH+DVDRIP+1999+COOL"
;;     "magnet:?xt=urn:btih:40eca43690cf1b99b0a4d485ebf4855d20b0bac5"
;;     "http://"
;;     […]
;;     "https://twitter.com/piratebayto" "https://www.facebook.com/thepiratebayto"
;;     "http://bitcoin.org" "bitcoin:1FX2wz8NiWPdtCGSrzn7j3NAg4VKqGovto" "/")
```
in the result, there's our magnet link.
At first I was frustrated to have to `coerce` something to a list, but it is not mandatory. Because `mapcar` expects a list and lquery returns a vector, we had to transform lquery's result to a list with `coerce`. But we can simply use `map`, which works on lists and vectors, and ask it to return a list:

```lisp
(map 'list
     (lambda (it)
       (lquery:$ it (attr :href)))
     (lquery:$ * "a"))
```

I find the name `mapcar` unusual too; it was frustrating at the beginning but it's just a name after all.
We could also use cl21's `map`, which works on lists and vectors, so no more questions: it will work. Still with cl21, we can write shorter lambdas, with the shorthand `lm` or with the `^` reader macro, accessing the arguments with `%1` … `%n` or simply `%` for the first one:

```lisp
(map ^(lquery:$ % (attr :href)) ...)
```
We filter the list above to extract the magnet link:
```lisp
(remove-if-not (lambda (it)
                 (str:starts-with? "magnet" it))
               *)
```
Here I used again a short verb from an external library for string manipulation. The plain CL way would be something like:

```lisp
(string= "magnet-foo" "magnet" :start1 0 :end1 (length "magnet"))
;; => T
```

and yet we would still have to handle nils, differences of length,… so boring.
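For reference, a safe plain-CL version handling those cases could look like this (a sketch; the hypothetical `starts-with-p` just shows what `str:starts-with?` saves us from writing):

```lisp
(defun starts-with-p (prefix string)
  "Return T if STRING begins with PREFIX. Tolerates NIL and short strings."
  (and string
       (>= (length string) (length prefix))
       (string= prefix string :start2 0 :end2 (length prefix))))

(starts-with-p "magnet" "magnet:?xt=urn:btih:aaaa") ; => T
(starts-with-p "magnet" nil)                        ; => NIL
(starts-with-p "magnet" "mag")                      ; => NIL
```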
We end up with the following functions:
- we extract the magnet link from an html (the page of a torrent's page) parsed with plump like this:
```lisp
(defun find-magnet-link (parsed)
  "Extract the magnet link. `parsed': plump:parse result."
  (let* ((hrefs (coerce (lquery:$ parsed "a" (attr :href)) 'list))
         (magnet (remove-if-not (lambda (it)
                                  (str:starts-with? "magnet" it))
                                hrefs)))
    (first magnet)))
```
- this one gets a plump node (from the search results), extracts the url of the torrent's page and calls our function above to extract the magnet link:
```lisp
(defun magnet-link-from (node)
  "Extract the magnet link from a `torrent' result."
  (let* ((url (detail-page-url node))
         (html (dex:get url))
         (parsed (plump:parse html)))
    (find-magnet-link parsed)))
```
- finally we need an easy way to call the function above and give it a reference to a search result.
```lisp
(defun magnet (index)
  "Search the magnet from last search's `index''s result."
  (magnet-link-from (elt *last-search* index)))
```
And we simply use it like so: given an output like
```
...
5: Matrix (1999)Blu-Ray 720p Dublado PT-BR - mo93438
4: The Matrix Trilogy (1999-2003) BDRip 1080p Dual Audio [ Hind
3: The Matrix Trilogy (1999-2003) BluRay BDRip 1080p AC3
2: The Matrix Trilogy (1999-2003) + Extras 1080p BluRay x264 Du
1: The Matrix Reloaded (2003) FullHD, Dual Audio: English + Spa
0: Matrix FRENCH DVDRIP 1999 COOL
T
```
We request the magnet link with:
```lisp
(magnet 0)
;; => "magnet:?xt=urn:btih:40eca43690cf1b99b0a4d485ebf4855d20b0bac5"
```
That works, but beware if we ask for an index that does not exist. Let's take precautions:
```lisp
(defun magnet (index)
  "Search the magnet from last search's `index''s result."
  (if *last-search*
      (if (< index (length *last-search*))
          (magnet-link-from (elt *last-search* index))
          (format t "The search returned ~a results, we can not access the magnet link n°~a.~&"
                  (length *last-search*)
                  index))
      (format t "The search returned no results, we can not return this magnet link.~&")))
```
1.9 Exporting functions
We need to export symbols in order to use them from outside their source file, either directly (with `use-package`) or with `(my-package:my-function)`. If we don't export them, we can still access them with a double colon: `(my-package::my-function)`.
Our package definition contains this:
```lisp
(defpackage cl-torrents
  (:use :cl))
```
We add an `export` clause to it:

```lisp
(defpackage cl-torrents
  (:use :cl)
  (:export :torrents
           :magnet))
```
We could also mark the functions to export with a decorator à-la Python, like this:
```lisp
@export
(defun torrents (…)
  …)
```
which is quite elegant and can be handy. This is doable with the cl-annot library. It also requires a small Slime configuration.
1.10 Conclusion
This leads us to the end of part one.
We wrote a function that makes a query on TPB and returns a list of Plump objects. Then the function that prints the results calls other ones that extract the title, the seeders and leechers from the Plump objects: there is some coupling, and that should be avoided. In reality, we quickly changed the `torrents` function so that it returns a list of alists with the required information already extracted:
```lisp
(defun torrents (words &key (stream t))
  "Search torrents."
  (format stream "searching on the Pirate Bay…")
  (let* ((query (str:join "+" words))
         (*search-url* (str:replace-all "{KEYWORDS}" query *search-url*))
         (req (request *search-url*))
         (html (plump:parse req))
         (res (lquery:$ html *selectors*))
         (toret (map 'list
                     (lambda (node)
                       `((:title . ,(result-title node))
                         (:href . ,(result-href node))
                         (:leechers . ,(result-leechers node))
                         (:seeders . ,(result-peers node))
                         (:source . :tpb)))
                     res)))
    (format stream " found ~a results.~&" (length res))
    toret))

(defun result-title (node)
  "Return the title of a search result."
  (aref (lquery:$ node ".Title a" (text)) 0))

(defun result-href (node)
  (let* ((href-vector (lquery:$ node "a" (attr :href))))
    (aref href-vector 0)))
```
That way, the print function does not rely on TPB-specific code and we can give it other alists to print, from other torrent sites.
We now want or need more:
- getting more content (seeders, leechers): this is done in the app.
- downloading the torrent file ?
- error handling (network errors or timeout, unexpected errors, user interrupts) (part 5),
- scraping other sites, asynchronously (the asynchronous part will be straightforward, there's a library for that and it's one function change. See the Cookbook tutorial.)
- some cache (part 6),
- unit tests, "live" tests and continuous integration: see the next section.
- building a standalone executable, parsing command line arguments: see part 3.
- …
We'll carry on by writing tests, then we'll build a self-contained executable.
2 Designing the scrapers with CLOS
At this point, we know how to scrape The Pirate Bay and retrieve information like the torrents' title, href to their web page, their number of seeders, etc.
We only used regular data structures like alists. The library worked like this until version 0.12, but it always felt clunky. In particular, dealing with lists of alists, accessing values with `assoc-value` from Alexandria, or losing our tests when we changed the data structures didn't feel right.
We also only considered one torrent source, but in reality we wrote scrapers for more (five at the time of writing). We must find solutions for code reuse and better representing each torrent source.
We'll now use the Common Lisp Object System (CLOS), which is a pleasure to work with. Does it seem verbose, too weird? Well, just begin with a little class and experiment by yourself; you'll see.
You'll find more learning resources in the Cookbook.
Let's dive in:
```lisp
(defclass torrent ()
  ((title :initarg :title
          :accessor title)
   (href :initarg :href
         :accessor href)
   (seeders :initarg :seeders
            :accessor seeders)
   (leechers :initarg :leechers
             :accessor leechers)
   (source :initarg :source
           :accessor source)))

(defmethod print-object ((it torrent) stream)
  (print-unreadable-object (it stream)
    (format stream "~a, ~a"
            (str:prune 30 (title it))
            (source it))))

(defun make-torrent (&key title href seeders leechers source)
  (assert title)
  (assert source)
  (assert href)
  (make-instance 'torrent
                 :title title
                 :href href
                 :seeders seeders
                 :leechers leechers
                 :source source))
```
Notes on the snippet above: the structure of `defclass` is:

```
(defclass <name> (list of super-classes)
  (list of slots, of the form (name attributes…))
  (optionally, class options))
```
An object is created with `(make-instance 'torrent …)`, but it is good practice to define a constructor like `make-torrent`.

The `:initarg` is the keyword argument for the constructor: `(make-instance 'torrent :title "foo")`. The accessor is both a getter and a setter:

```lisp
(defvar *foo* (make-instance 'torrent :title "foo" :href "http://" :source :test))
(title *foo*)
;; => "foo"
(setf (title *foo*) "bar")
```
`print-object` is for pretty printing. The line `(defmethod print-object ((it torrent) stream)` specializes the `print-object` generic function for the `torrent` class. It takes a required `stream` parameter.
See the actual code here and how it is used in scrapers here.
Working with CLOS is delightful. If we create objects at the REPL and then change the class definition, the existing objects are updated: removed slots are deleted, added slots are instantiated (with rules we can control), accessor names are updated, etc.
Of course, working with an object system enables code reuse: CLOS lets us specialize methods for different classes, gives us polymorphism, etc., which we don't show in depth here.
Just dive into the CLOS and have fun !
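As a tiny taste of that method specialization (with hypothetical scraper classes, not the actual cl-torrents ones):

```lisp
(defclass scraper () ())
(defclass tpb-scraper (scraper) ())
(defclass kat-scraper (scraper) ())

(defgeneric search-torrents (scraper words)
  (:documentation "Search WORDS on the site handled by SCRAPER."))

(defmethod search-torrents ((scraper tpb-scraper) words)
  (format t "searching ~a on The Pirate Bay…~&" words))

(defmethod search-torrents ((scraper kat-scraper) words)
  (format t "searching ~a on Kickass Torrents…~&" words))

;; The very same call dispatches on the class of its argument:
(search-torrents (make-instance 'tpb-scraper) "matrix")
(search-torrents (make-instance 'kat-scraper) "matrix")
```

Each torrent source gets its own class and its own `search-torrents` method, while the calling code stays identical.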
3 Tests and CI
We wouldn't be called developers if we didn't write any tests.
Our favorite test framework (which we found on the Awesome CL list) is Prove.
The file `t/cl-torrents.lisp`, generated by cl-project, looks like this:

```lisp
(in-package :cl-user)
(defpackage cl-torrents-test
  (:use :cl
        :cl-torrents ;; => import our exported functions in cl-torrents.lisp
        :prove))     ;; => import all Prove verbs (like python's "from prove import *")
(in-package :cl-torrents-test)

;; NOTE: To run this test file, execute `(asdf:test-system :cl-torrents)' in your Lisp.

(plan nil) ;; optional Prove setting.

;; blah blah blah.

(finalize)
```
We add our first and simplest test:
```lisp
(ok (torrents "matrix"))
```
It only checks that this command doesn't fail. We compile it with `C-c C-c` and we see it run in the REPL.
This test does a network call: it is not a unit test. It's an "end-to-end" test instead, and that's ok, we need one too :) We'll write unit tests now, and also hide the large output of the search results.
3.1 Unit tests
Since we do web scraping, the results of the network calls are likely to be different each time. That's fine for "integration" or "end-to-end" tests but not for unit tests. We must find a way to fake the result of `dex:get` so that it always returns the same thing.
A solution is to save a piece of html in the testing directory and make sure that a call to `dex:get` returns it. In other words, we're looking to mock function calls. There's a library to do this and more, Mockingbird:
> This package provides some useful stubbing and mocking macros for unit testing. Used when specified functions in a test should not be computed but should instead return a provided constant value.
It also makes it possible to check whether a given function was called, how many times, and with what arguments, which is very nice for tests.
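For instance, a call-verification sketch could look like this (hedged: `call-times-for` and `nth-mock-args-for` are the verb names as we remember them; double-check Mockingbird's README before relying on the exact signatures):

```lisp
(with-dynamic-stubs ((dex:get "<html></html>"))
  (torrents "matrix")
  ;; how many times was the stubbed function called?
  (is (mockingbird:call-times-for 'dex:get) 1)
  ;; and with which arguments on the first call?
  (format t "~a~&" (mockingbird:nth-mock-args-for 1 'dex:get)))
```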
Ok, let's go. We record the html of the search results:
```sh
mkdir t/assets/
wget -O t/assets/search-matrix.html https://piratebay.to/search/\?FilterStr\=matrix\&ID\=1\&ID\=\&Limit\=800\&Letter\=\&Sorting\=DSeeder
```
We need to read this file into a string. A quick look at the Cookbook (unfortunately this is not a one-liner :( ):
```lisp
(defun file-to-string (path)
  "Return the given file as a string."
  (with-open-file (stream path
                          :external-format :utf-8)
    (let ((data (make-string (file-length stream))))
      (read-sequence data stream)
      data)))
```
and we use it:
;; Load the search result html from a file.
(defparameter htmlpage (file-to-string #p"t/assets/search-matrix.html"))
From mockingbird, we need with-dynamic-stubs. We'll mock a call to dex:get:
(with-dynamic-stubs ((dex:get htmlpage))
  (ok (torrents "matrix") "torrent search ok"))
This test (run with C-c C-c) should not make a network call and should always return the matrix results. Indeed, if we write (torrents "dysney") instead, it returns (and prints) the same.
So from here, we can write more unit tests. When we want to test the magnet function, we realize that we need to mock another dex:get call, the one that requests the html page of a particular result. We extract the network call from the function, which we should have done from the beginning as a best practice (we'll also need to expand this with error checking and more):
(defun request-details (url)
  "Get the html page of the given url. Mocked in unit tests."
  (dex:get url))
Now we mock it. Extending the test above:
(with-dynamic-stubs ((dex:get htmlpage)
                     (cl-torrents::request-details resultpage))
  (ok (torrents "matrix" out) "torrent search ok")
  (ok (str:starts-with? "magnet" (magnet 0))
      "magnet <i> returns the magnet link from search result."))
Our tests still write a lot of stuff on standard output, let's fix that.
3.2 Capturing output
We knew that giving an optional stream parameter to our torrents function would be useful sometime:
(defun torrents (words &optional (stream t)) ...)
The t means "print to standard output". The trick is to give it another stream, notably one that goes to a string:
(ok (with-output-to-string (out)
      (torrents "matrix" out))
    "torrent search ok")
and that's it, our tests are silent now.
We can write more of them.
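As a quick illustration of the trick: with-output-to-string creates a string output stream, runs its body, and returns everything that was printed to it.

```lisp
;; Anything printed to the stream S ends up in the returned string.
(with-output-to-string (s)
  (format s "~a result(s)" 3))
;; => "3 result(s)"
```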
3.3 Isolating tests (with a macro)
I'm not bothered (yet?) by the way we wrote the tests above, all of them inside one with-dynamic-stubs macro. It's just that they are not isolated: each C-c C-c compiles and runs the whole form, running all our tests.
If we want, we can isolate them, each one under its own (and identical) with-dynamic-stubs. But as soon as there's repetition… it's time to refactor with a macro. There's not much to it, but we're glad for a little practice.
Each test will be of the form:
(with-dynamic-stubs (<stubs>)
<tests>)
The only argument to our macro is a form containing the tests:
(defmacro with-mocked-search-results (body)
We get this simple macro:
(defmacro with-mocked-search-results (body)
  `(with-dynamic-stubs ((dex:get htmlpage)
                        (cl-torrents::request-details resultpage))
     ,body))
The backquote ` kind of warns that there will be variables inside this form, and the comma , kind of says to insert the argument as is, without evaluating it.
So when we use it like this:
(with-mocked-search-results
    (ok (with-output-to-string (out)
          (torrents "foo" out))
        "search ok"))
we can see how it gets expanded like this:
(macroexpand-1 '(with-mocked-search-results  ;; <-- note the quote
                 (ok (with-output-to-string (out)
                       (torrents "foo" out))
                     "search ok")))
;; (WITH-DYNAMIC-STUBS ((DEXADOR:GET HTMLPAGE)
;;                      (CL-TORRENTS::REQUEST-DETAILS RESULTPAGE))
;;   (OK (WITH-OUTPUT-TO-STRING (OUT) (TORRENTS "foo" OUT)) "search ok"))
;; T
Easy :)
3.4 make test
We must be able to run tests from the terminal for at least two reasons:
- given the moving state of our Lisp image during development, it is possible to get to a state that does not reflect the app at startup (for example, given that a defparameter symbol doesn't get re-evaluated with C-c C-c, are you not tempted to setf its value for testing?)
- Continuous Integration
- I find it handy (that's not a solid reason though).
We have a look at Prove's page and we see that we can run a test file with
(prove:run #P"myapp/tests/my-test.lisp")
Then we need to think about the steps needed when we arrive into a new SBCL repl.
- we need to find our system definition.
(load #P"cl-torrents.asd")
- we need to make all its symbols and dependencies available in the image.
(ql:quickload "cl-torrents") ;; given we have Quicklisp installed and started with SBCL.
- run the test suite.
- quit the REPL.
That gives this make target:
test:
	sbcl --load cl-torrents.asd \
	     --eval '(ql:quickload :cl-torrents)' \
	     --eval '(prove:run #P"t/cl-torrents.lisp")' \
	     --eval '(quit)'
We can also have a look at the run-prove Roswell utility.
Continuous Integration is well explained here:
We'll finish with a note on Prove. It is easy to use and the doc is ok, but we were surprised that the tests' output is not very informative (we don't have an example right now though). So we might as well try Fukamashi's Rove, "intended to be Prove's next major release", or Shinmera's Parachute, which we discovered only recently. Parachute also includes basic fixtures support, test suite dependencies, conditionals, time limits (also in Prove?), silent output, and interactive reports.
On a positive note, kudos to Mockingbird, it was very helpful.
3.5 Continuous Integration
Now that we have tests running in the terminal, we'll want to run them automatically. We probably want to run them on every commit, but we could choose to run them only on every tag, periodically (every Sunday night), or just manually with a command in a commit message.
We'll use Gitlab's CI, which is free for public or private repos.
Gitlab CI works with Docker images so we'll use the existing daewok/lisp-devel one. It includes SBCL, ECL, CCL and ABCL, and has Quicklisp installed in the home (/home/lisp/), so we can quickload packages right away. It also has a more bare-bones option.
We used this example .gitlab-ci.yml successfully:
image: daewok/lisp-devel

before_script:
  - apt-get update -qy
  - apt-get install -y git-core
  - git clone https://github.com/Chream/mockingbird ~/quicklisp/local-projects/

test:
  script:
    - make test
- "image": we set which Docker image to use.
- "test": just the name of our pipeline.
- "script": this is the recognized section to run any script we want. If it exits with success, our pipeline passes. So here we run our own "make test" command (Gitlab has cloned our project and put us at its root).
- "before_script": our tests rely on a library that isn't on Quicklisp yet, so we clone it into the usual ~/quicklisp/local-projects/. We have to update apt's cache and install git beforehand.

And this is it! You now have a working continuous integration stack.
But while we're at it, let's see a couple more Docker commands and a way to test our CI commands locally.
3.6 A couple Docker commands
We have a Docker image ready to use, so let's see how to use it ourselves. Run:
sudo service docker start
docker run --rm -it daewok/lisp-devel:latest bash
This will download what's needed the first time (around 400MB)
and then drop you into a bash prompt. Now you are in a fresh
Debian-based system with Lisp stuff installed. You can enter an
sbcl
REPL and install Quicklisp libraries. But you don't have
access to your code outside of Docker. Mount it:
docker run --rm -it -v /path/to/local/code:/usr/local/share/common-lisp/source daewok/lisp-devel:latest bash
This will put your local /path/to/local/code/
into Docker where
indicated.
Now we can test things around. I don't know how to test our
.gitlab-ci.yml
straight away though.
Also see the Lisp image's README, they have more stuff on how to develop from our Emacs on the Lisp environment inside the image.
4 Building and delivering self-contained executables
4.1 SBCL.
So far we used the torrents command in our Slime REPL. That's fine and well, but what if we want to distribute our software to users who would simply run it from the shell?

Building (self-contained) executables is implementation-specific (Roswell fixes that). With SBCL, as its documentation says, it is a matter of:
(sb-ext:save-lisp-and-die #P"torrents" :toplevel #'main :executable t)
where #p"torrents" is the pathname of the executable, :toplevel is the function to be called ("entry point" in Python), and :executable t tells it to build an executable instead of an image. We could build an image to save the state of our current Lisp image and come back to work with it later, especially useful if we did a lot of computing-intensive work.
sb-ext is SBCL's extensions package (it can, among other things, run external processes). See the other SBCL extensions (many of them are made implementation-portable in other libraries).
We have to define a main function that accepts keywords as arguments (we won't parse command line options just yet).
The Cookbook has us covered once again and, double coincidence, we rely on SBCL's sb-ext to access the list of command line arguments with sb-ext:*posix-argv* (this is not portable, but a library, or Roswell, fixes that; follow the recipe). It returns a list of arguments, the first one being the name of the executable, as usual.
That gives:
(defun main ()
  "Get command line arguments with SBCL."
  (torrents (subseq sb-ext:*posix-argv* 1)))
Here we give a list to torrents, but I found it useful to accept both a list and a string, to allow the user/developer to call (torrents "foo bar") directly in Slime:
(defun torrents (words &optional (stream t))
  "Search torrents."
  (let* ((terms (if (listp words)
                    words
                    ;; The main gives words as a list,
                    ;; the user at the Slime REPL one string.
                    (str:words words)))
Cool, we have basic command line arguments! We'll parse them for real shortly.
Another thing to know about save-lisp-and-die: we must run this function from another REPL than Slime's.
So, let's launch an SBCL repl at our project root:
rlwrap sbcl
(rlwrap gives repl history. The default SBCL repl is the most basic one. We should also have a look at the new cl-repl, an ipython-like repl.)
If we run save-lisp-and-die directly, we'll get errors that the "cl-torrents package does not exist", or that "The function COMMON-LISP-USER::MAIN is undefined". Indeed, as is, SBCL has no way to know where our system is. We need to load our asd file and quickload our system; now that the running image knows all the dependencies and symbols, we can create an executable out of it.
(load "cl-torrents.asd")
(ql:quickload :cl-torrents)
(use-package :cl-torrents)
(sb-ext:save-lisp-and-die #p"torrents" :toplevel #'main :executable t)
Obviously we want this in a Makefile.
4.2 make build
sbcl has the --eval and --load options.
build:
	sbcl --load cl-torrents.asd \
	     --eval '(ql:quickload :cl-torrents)' \
	     --eval '(use-package :cl-torrents)' \
	     --eval "(sb-ext:save-lisp-and-die #p\"torrents\" :toplevel #'main :executable t)"
Now a make build
will take the time to quickload the dependencies
and build our executable. We can try it on the shell and give it to
anyone, they won't have any dependency to install, not even a Lisp
implementation.
Let's realize what we achieved for a moment. Personally I was unable to do something like this with my other languages before (Python, JS).
And it works for web apps too ! We just need one more line in the main function, like this:
;; https://stackoverflow.com/questions/30422451/sbcl-deploying-hunchentoot-application-as-executable
(sb-thread:join-thread
 (find-if (lambda (th)
            (search "hunchentoot" (sb-thread:thread-name th)))
          (sb-thread:list-all-threads)))
And we can embed a webapp in Electron for the desktop ! (with Ceramic).
Are there downsides ?
4.3 with ASDF or Roswell
We might enjoy a more shell-friendly way to build our executable, and most of all a portable one, so we would have the same command to work with various implementations.
Since version 3.1, ASDF, the de-facto build system, shipped with every implementation, allows that with the make command. It needs parameters in our .asd declaration:
:build-operation "program-op" ;; leave as is
:build-pathname "<binary-name>"
:entry-point "<my-package:main-function>"
and then a call to (asdf:make :my-package).
So, in a Makefile:
LISP ?= sbcl

build:
	$(LISP) --non-interactive \
		--load cl-torrents.asd \
		--eval '(ql:quickload :cl-torrents)' \
		--eval '(asdf:make :cl-torrents)'
The --non-interactive option asks not to enter the debugger if an error happens, and exits after the last command (otherwise we would add a (quit)).
Finally Roswell, which does a lot, also has the ros build command, which should work with more implementations. It is a bit under-documented (issue).
4.4 Size and startup times of executables per implementation
$ ls -lh torrents
-rwxr-xr-x 1 vince vince 69M sept. 30 22:58 torrents
Our executable weighs 78MB. Quite big for this little cli app, but it would be totally ok for a business application.
To ease the distribution of apps to people who already have a Lisp, we can use Roswell: we need to register our app in Quicklisp, after which anyone can install Roswell and install our program with ros install cl-torrents.
SBCL isn't the only Lisp implementation though. ECL, Embeddable Common Lisp, compiles Lisp programs to C. That should create a smaller executable.
Given this reddit comment, ECL indeed produces the smallest executables of all, an order of magnitude smaller than SBCL, but with a startup time an order of magnitude slower (or more).
ECL is particularly suited to this task. Since it transpiles to C, the C compiler will remove dead code automatically for you. You can get incredibly small executables this way.
| program size | implementation | CPU | startup time |
|---|---|---|---|
| 28 | /bin/true | 15% | .0004 |
| 1005 | ecl | 115% | .5093 |
| 48151 | sbcl | 91% | .0064 |
| 27054 | ccl | 93% | .0060 |
| 10162 | clisp | 96% | .0170 |
| 4901 | ecl.big | 113% | .8223 |
| 70413 | sbcl.big | 93% | .0073 |
| 41713 | ccl.big | 95% | .0094 |
| 19948 | clisp.big | 97% | .0259 |
We may have another trick to distribute small executables: to make the fasl files executables.
This will basically mean that your on-disk utilities will not include a copy of the SBCL core but use the central copy on the system, which makes them relatively small. Memory footprint will be the same though.
So we have room for improvement.
Finally, we tried the latest ECL version (16.1.3) with the lisp-devel Docker image (Debian's packaged 13.x is too old, it has ASDF compatibility problems that should be resolved with ASDF 3.3.2). We had a simple makefile target:
# Makefile
ecl \
  -eval '(ql:quickload :closer-mop)' \  # seems needed O_o
  -eval '(load "cl-torrents.asd")' \
  -eval '(ql:quickload :cl-torrents)' \  # building everything...
  -eval '(asdf:make-build :cl-torrents :type :program :move-here #P"./")'
which we can call manually after loading the image (or integrate into Gitlab CI):
service docker start
docker run --rm -it -v /home/vince/projets/cl-torrents:/usr/local/share/common-lisp/source daewok/lisp-devel:latest bash
In the end, we got a 52MB executable, but unfortunately it had a runtime error, so we won't ship it and we won't investigate further.
4.5 Parsing command line arguments
We saw earlier that SBCL stores the arguments in sb-ext:*posix-argv*. The first issue is that this differs across implementations, so we want a library to handle the differences for us. We also want to parse the arguments.
A quick look at the awesome-cl#scripting list and we'll try unix-opts.
(ql:quickload "unix-opts")
We can call it with its opts alias (nickname). unix-opts lets us declare the arguments with opts:define-opts:
(opts:define-opts
  (:name :help
         :description "print this help text"
         :short #\h
         :long "help")
  (:name :nb-results
         :description "maximum number of results to print."
         :short #\n
         :long "nb"
         :arg-parser #'parse-integer)
  (:name :magnet
         :description "get the magnet link of the given search result."
         :short #\m
         :long "magnet"
         :arg-parser #'parse-integer))
Here parse-integer
is a built-in CL function.
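For reference, parse-integer returns two values, the parsed integer and the index where parsing stopped, and with :junk-allowed it returns nil instead of signaling an error on bad input:

```lisp
(parse-integer "42")                   ;; => 42, 2
(parse-integer "abc" :junk-allowed t)  ;; => NIL, 0
```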
We parse and get them with opts:get-opts, which returns two values: the first is the list of valid options and the second the remaining free arguments. We then must use multiple-value-bind to catch everything:
(multiple-value-bind (options free-args)
    ;; opts:get-opts returns the list of options, as parsed,
    ;; and the remaining free args as second value.
    ;; There is no error handling yet (specially for options not having their argument).
    (opts:get-opts)
  ...
We can explore this by giving a list of strings (as options) to get-opts:
(multiple-value-bind (options free-args)
    (opts:get-opts '("hello" "-h" "-n" "1"))
  (format t "Options: ~a~&" options)
  (format t "free args: ~a~&" free-args))
Options: (HELP T NB-RESULTS 1)
free args: (hello)
NIL
If we put an unknown option, we get into the debugger. We'll see error handling later on.
So options
is a property list, i.e. a list that alternates a key
and a value. It helps me to see that it looks like this:
[o|o]---[o|o]---[o|o]---[o|/]
 |       |       |       |
FOO    "foo"    BAR    "bar"
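Concretely, a quick REPL session with getf and setf on such a plist (the values here are made up for illustration):

```lisp
(defparameter *opts* (list :help t :nb-results 1))

(getf *opts* :nb-results)           ;; => 1
(getf *opts* :magnet)               ;; => NIL (absent key)

(setf (getf *opts* :nb-results) 3)  ;; update in place
(getf *opts* :nb-results)           ;; => 3
```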
We can use getf and setf with plists, so that's how we do our logic. Below we print the help with opts:describe and then exit (in a portable way).
(multiple-value-bind (options free-args)
    ;; No error handling yet (for unknown options or one not having its argument).
    (opts:get-opts)
  (if (getf options :help)
      (progn
        (opts:describe
         :prefix "CL-torrents. Usage:"
         :args "[keywords]")
        (opts:exit)))  ;; <= exit takes an optional return status.
  (if (getf options :nb-results)
      (setf *nb-results* (getf options :nb-results)))
  (torrents free-args)
  (if (getf options :magnet)
      ;; if we had caching we wouldn't have to search for torrents first.
      (progn
        (format t "~a~&" (magnet (getf options :magnet)))
        (opts:exit))))
The example in the unix-opts repository suggests a macro to do slightly better. And there is no error handling yet.
Here we are. We can build programs that run both in the REPL (well, that's obvious) and in the terminal.
If you like this and want more shiny stuff:
- cl-ansi-term - print colorized text, horizontal lines, progress bars, list, tables,…
- cl-readline - bindings to the GNU Readline library (Emacs and Vim input modes, history, basic expansion, custom completion,…). Example project: cl-repl. Demo example: cl-readline-example. shtookovina by the same author is an example command-line app that makes good use of both projects.
- the Awesome-cl/scripting link above.
4.6 Automatically building and delivering the executable
We already have a continuous integration system with Gitlab CI (see above, Tests and CI). We're then very close to having a system to automatically build and deliver our executable. Moreover, we'll do that only when we push a new tag.
For this, we'll declare a job artifact.
First, we create a new job to run our command that builds the executable:
build:
  script:
    - make build
We'll see:
To be able to browse the artifacts, and get an url to download the last one, we add:
build:
  script:
    - make build
  artifacts:
    paths:
      - torrents
And now, to only do this on a new tag:
build:
  only:
    # Only when a new tag is pushed.
    - tags
  script:
    - make build
  artifacts:
    paths:
      - torrents
(see the doc).
We'll add the rule to run the tests first and not build the software if they fail (we use stages). We declare two named "stages" (test and build) that respect the given order, and add a "stage: xxx" line in each job. Our final .gitlab-ci.yml is this:
image: daewok/lisp-devel

stages:      # <= new
  - test
  - build

before_script:
  - apt-get update -qy
  - apt-get install -y git-core
  - git clone https://github.com/Chream/mockingbird ~/quicklisp/local-projects/

test:
  stage: test   # <= new
  only:
    - tags
  script:
    - make test

build:
  stage: build  # <= new
  only:
    # Only when a new tag is pushed.
    - tags
  script:
    - make build
  artifacts:
    paths:
      # xxx include the version number
      - torrents
And voilà, that's it.
We have an url to download the latest artifact:
https://gitlab.com/vindarel/cl-torrents/-/jobs/artifacts/master/raw/torrents?job=build
Let's just put the link on an icon and add the version number on the executable's name.
4.7 Final words
As said before, we didn't do error handling yet, and this can be annoying in the shell: on an error, including a user's C-c, we get a stacktrace and we are dropped into the Lisp debugger. We fix this below.
5 Moving functions into their own package
We have written a few helper functions to deal with colorizing keywords. As we were trying things out, we wrote them in our main file. But we'd like them to go into their own "utils" file and package.
(If you want to see the helper functions, go to the next section.)
5.1 Dealing with symbol conflicts. shadowing-import, uninterning.
We move them into a new src/utils.lisp file. Our utils.lisp file declares a new cl-torrents.utils package, exporting what's needed:
(in-package :cl-user)
(defpackage cl-torrents.utils
  (:use :cl)
  (:export :sublist
           :helper-function-one))
(in-package :cl-torrents.utils)
We add this file to the .asd
:
:components ((:module "src"
              :components ((:file "utils")  ;; <= added
                           (:file "cl-torrents"))))
And on the other side, in cl-torrents.lisp, either we :use all the export'ed utilities:

(:use :cl :cl-torrents.utils)

or we import explicitly what we need (we'll do that, as it is more explicit, although a bit redundant):
(defpackage cl-torrents
  (:use :cl)
  (:import-from :cl-torrents.utils
                :colorize-all-keywords
                :keyword-color-pairs
                :sublist)
  …
And then, as we quickload cl-torrents with its new organization, we get trapped into the debugger:
IMPORT CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS causes name-conflicts in
#<PACKAGE "CL-TORRENTS"> between the following symbols:
  CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, CL-TORRENTS::KEYWORD-COLOR-PAIRS
   [Condition of type SB-EXT:NAME-CONFLICT]

See also:
  Common Lisp Hyperspec, 11.1.1.2.5 [:section]

Restarts:
 0: [SHADOWING-IMPORT-IT] Shadowing-import CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, uninterning KEYWORD-COLOR-PAIRS.
 1: [DONT-IMPORT-IT] Don't import CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, keeping KEYWORD-COLOR-PAIRS.
 2: [RESOLVE-CONFLICT] Resolve conflict.
 3: [RETRY] Retry compiling #<CL-SOURCE-FILE "cl-torrents" "src" "cl-torrents">.
 4: [ACCEPT] Continue, treating compiling #<CL-SOURCE-FILE "cl-torrents" "src" "cl-torrents"> as having been successful.
 5: [RETRY] Retry ASDF operation.
[…]
Fortunately there's nothing to be afraid of, it is pretty clear. We have a name conflict between those two symbols:
CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, CL-TORRENTS::KEYWORD-COLOR-PAIRS
- the first one is our new function in its new cl-torrents.utils package, the second is the old one, that was in cl-torrents' main file. The function has been moved, but the symbol is still around!

We need to give priority to the new symbol, and that's just what the first restart (0) offers: to "Shadowing-import CL-TORRENTS.UTILS::KEYWORD-COLOR-PAIRS, uninterning KEYWORD-COLOR-PAIRS". "Uninterning" is getting rid of the symbol; shadowing-import is the process of importing a symbol that hides one already present.

We can press 0 and… we get the same message, this time for the other function that was being called directly in our main Lisp code. We don't get this debugger for the "intermediate" functions like next-color, the ones that are called only by the two main ones.

It was quick for two warnings. Still, we could use a way to see and accept everything in a row.
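To see what that restart does, here is a self-contained sketch with two toy packages (pkg-a and pkg-b are made-up names): shadowing-import uninterns the conflicting local symbol and imports the other one in its place.

```lisp
(defpackage :pkg-a (:use :cl) (:export :dup))
(defpackage :pkg-b (:use :cl))

;; pkg-b now owns its own DUP symbol:
(intern "DUP" :pkg-b)
;; (use-package :pkg-a :pkg-b) would now signal a name conflict.

;; What restart 0 does: unintern PKG-B::DUP, import PKG-A:DUP instead.
(shadowing-import 'pkg-a:dup :pkg-b)

(eq (find-symbol "DUP" :pkg-b) 'pkg-a:dup)  ;; => T
```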
5.2 The helper functions. Ansi colors. Closures. Looping over a plist.
This was fun to do. Our goal was to colorize each keyword with a different color, like so:
This helps spot the most relevant results, because sometimes the Pirate Bay website (at least this copy) returns not very related results.
We have the list of searched keywords in *keywords*. We want to associate each one with a different color.
We use the cl-ansi-text library, on Quicklisp. It offers to either colorize strings like this:
(with-color (:red) (princ "Gets printed red..."))
where :red is a CL symbol, or with functions:
(import 'cl-ansi-text:yellow)
(yellow "Yellow string")
we'll prefer this method as we are dealing with text.
We are dealing with ANSI colors in the terminal (in Slime or in the shell). So what this does is add an extra prefix and suffix to our string:
(yellow "yellow")
;; => "^[[33myellow^[[0m"
We have to remember this.
We define the available colors into a variable:
(defparameter *colors*
  '(cl-ansi-text:blue
    cl-ansi-text:green
    cl-ansi-text:yellow
    cl-ansi-text:cyan
    cl-ansi-text:magenta
    cl-ansi-text:red)
  "Functions to colorize text.")
There are 6 of them. Now we want to give each keyword a different color, and if we have more than 6 keywords, cycle over the list. We want a function to cycle through a list… but one that doesn't start at 0 every time… that reminds me of closures!
;; closure to loop over the list of available colors.
(let ((index 0))

  (defun next-color ()
    "At each call, return the next color of the list -and start over. Uses *colors*."
    (let ((nb-colors (length *colors*))
          (color (elt *colors* index)))
      (incf index)
      (if (>= index nb-colors)
          (setf index 0))
      color))

  (defun reset-color ()
    (setf index 0)))
The function next-color uses the index that is defined outside of it (and which is still not global, but internal to this let only). So every time we call next-color, index is where it had been left. Cool! I'd probably have used a class for this in another language.
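The same pattern in miniature: a counter whose state lives only in the enclosing let.

```lisp
(let ((count 0))
  (defun counter ()
    "Increment and return the hidden COUNT at each call."
    (incf count)))

(counter)  ;; => 1
(counter)  ;; => 2
```

count is invisible from the outside; only counter can touch it.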
We define a function to associate a color to a keyword:
(defun keyword-color-pairs (&optional (keywords *keywords*))
  "Associate each keyword with a different color and return a list of pairs."
  (mapcar (lambda (it)
            `(,it . ,(next-color)))
          keywords))
A good use of backquotes and commas.
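In isolation, the backquote/comma pair works like this: everything is quoted except the comma'd parts, which are evaluated.

```lisp
(let ((x 42))
  `(a b ,x))       ;; => (A B 42)

`(1 . ,(+ 1 1))    ;; => (1 . 2), a cons pair like in keyword-color-pairs
```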
We have a function to colorize a keyword inside a given string with a color:
(defun colorize-keyword-in-string (title keyword color-f)
  "Colorize the given keyword in the title.
  Keep the letters' possible mixed up or down case.
  `color-f': color function (cl-ansi-text)."
  ;; It colorizes only the first occurence of the word.
  ;;
  ;; We begin by looking if the string contains our keyword. "search" returns the index.
  (let ((start (search keyword (string-downcase title) :test #'equalp))) ;; case insensitive
    (if (numberp start)
        ;; if "start" is not nil, we get the position of the last letter of the keyword in the string.
        (let* ((end (+ start (length keyword)))
               ;; and we extract the substring.
               ;; In doing so we keep the original case of each letter.
               (sub (subseq title start end))
               ;; We call our color function.
               (colored-sub (funcall color-f sub)))
          ;; We replace the substring with the colored substring, and return it.
          (str:replace-all sub colored-sub title))
        title)))
And we do the same for all keywords. Looping over a plist is easy:
(loop for (word . color) in keywords-color do (...
(defun colorize-all-keywords (title kw-color)
  "Colorize all the user's search keywords in the given title.
  `kw-color': list of pairs with a keyword and a color (function)."
  (let ((new title))
    (loop for (word . color) in kw-color
          do (setf new (colorize-keyword-in-string new word color)))
    new))
(anyone have a functional version, with reduce
maybe ?)
Now we're happy, we run a search query again and… damn, everything's colorized but the seeders and leechers are not aligned any more :( Indeed, we print our search results with a format string:
(format stream "~3@a: ~65a ~3@a/~3@a~%" index title-colored seeders leechers)
See the fixed 65 ? But our colored titles have different sizes, due to the ansi prefixes and suffixes :(
5.3 Formatting a format string
No big deal ! We have to adapt this "65", by adding the difference of length between the normal title and the colored one.
(let* ((title (result-title it))
(title-colored (colorize-all-keywords title *keywords-colors*))
(title-padding (+ 65
(- (length title-colored)
(length title))))
But how can we insert a variable in lieu of "65"? If we insert another ~a, it will print the number, but it will not be interpreted as an option of the format directive… so we will format the format string itself. In the following, ~~ is the directive to print a literal ~, so ~~~aa with the padding as argument will render as ~65a (or ~75a, depending on the new length).
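A minimal check of the idea at the REPL:

```lisp
;; First pass builds the directive, second pass uses it.
(let ((format-string (format nil "~~~aa" 10)))
  ;; format-string is now "~10a"
  (format nil format-string "abc"))
;; => "abc       " (padded to 10 characters)
```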
(defun display-results (&optional (results *last-search*) (stream t))
  "Results: list of plump nodes. We want to print a numbered list with the needed information (torrent title, the number of seeders,... Print at most *nb-results*."
  (mapcar (lambda (it)
            (let* ((title (result-title it))
                   (title-colored (colorize-all-keywords title *keywords-colors*))
                   (title-padding (+ 65
                                     (- (length title-colored)
                                        (length title))))
                   ;; ~~ prints a ~ so here ~~~aa with title-padding gives ~65a or ~75a.
                   (format-string (format nil "~~3@a: ~~~aa ~~3@a/~~3@a~~%" title-padding)))
              (format stream format-string
                      (position it *last-search*)
                      title-colored
                      (result-peers it)
                      (result-leechers it))))
          (reverse (sublist results 0 *nb-results*)))
  t)
To see the larger picture, see display-results
on the repository.
6 Error handling
6.1 Dealing with malformed command line arguments
If you give a bad argument on the command line, you'll enter the debugger. Obviously, we'd rather print an informative message, print the help and exit.
unix-opts will throw (or rather, signal) conditions (think "exceptions" if you wish) on malformed options: opts:unknown-option, opts:missing-arg, opts:arg-parser-failed and opts:missing-required-option.

It also has a complete example. This library is really nice.
The general form of condition handling is to use handler-bind:
(handler-bind ((one-condition #'its-handler)
(another-condition #'another-handler))
(some code))
where we are in charge of writing the #'handlers. They take the condition object as argument. We give them below. They use opts:describe to print the help message and (opts:raw-arg condition) to read the offending argument from the condition object.
Yes, conditions are CLOS classes. When we create our own, we can define different slots, etc. The article on z0ltan.wordpress.com will tell you more, and looking at unix-opts' code may help (just skip it if it's too much for now).
This is how it defines a condition: it first defines a generic troublesome-option class, that inherits from simple-error and has three slots, amongst which an option slot that will store the bad one. Then it defines unknown-option, which inherits from troublesome-option.
(define-condition troublesome-option (simple-error)
  ((option
    :initarg :option
    :reader option))
  (:report (lambda (c s) (format s "troublesome option: ~s" (option c))))
  (:documentation "Generalization over conditions that have to do with some particular option."))

(define-condition unknown-option (troublesome-option)
  ()
  (:report (lambda (c s) (format s "unknown option: ~s" (option c))))
  (:documentation "This condition is thrown when parser encounters unknown (not previously defined with `define-opts') option."))
Then, when it encounters an error, it signals a condition and sets its option slot like this:
(error 'unknown-option :option opt)
See the source.
Anyway, these are our little functions to display a helpful error message and exit:
(defun unknown-option (condition)
  (format t "~s option is unknown.~%" (opts:option condition))
  (opts:describe)
  (exit))

(defun missing-arg (condition)
  (format t "Bad options: ~a needs an argument.~&" (opts:option condition))
  (opts:describe)
  (exit))

(defun arg-parser-failed (condition)
  (format t "Error: could not parse ~a as argument of ~a~&." (opts:raw-arg condition) (opts:option condition))
  (opts:describe)
  (exit))
And so we surround the reading of the options with handler-bind:
(handler-bind ((opts:unknown-option #'unknown-option)
               (opts:missing-arg #'missing-arg)
               (opts:arg-parser-failed #'arg-parser-failed)
               ;; (opts:missing-required-option #'...) ;; => in upcoming version
               )
  (opts:get-opts))  ;; <= this will signal conditions, if any.
We can build and try:
$ ./torrents -p foo
"-p" option is unknown.
Available options:
  -h, --help       print this help text
  -n, --nb ARG     maximum number of results to print.
  -m, --magnet ARG get the magnet link of the given search result.
And all is well. We didn't catch everything though.
6.2 Catching a user's C-c termination signal
Let's try a C-c and read the stacktrace:
$ ./torrents matrix
^Csearching on the Pirate Bay…                           <== C-c
debugger invoked on a SB-SYS:INTERACTIVE-INTERRUPT       <== condition name
in thread #<THREAD "main thread" RUNNING {1003156A03}>:
  Interactive interrupt at #x7FFFF6C6C170.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [CONTINUE     ] Return from SB-UNIX:SIGINT.         <== it was a SIGINT indeed
  1: [RETRY-REQUEST] Retry the same request.
The signaled condition is named after our implementation: sb-sys:interactive-interrupt. We just have to surround our code with a try/catch, or rather with handler-case:
(handler-case
    (torrents free-args)
  (sb-sys:interactive-interrupt ()
    (progn
      (format *error-output* "Abort.~&")
      (opts:exit))))
This code won't work with ECL or other implementations. We know about trivial-signal, but we were not satisfied with our test (it didn't work; see issue #3).
This will be good for now, but just in case, we found portable examples in other projects. We could do something like this (thanks, Clack):
(handler-case
    <main>
  (#+sbcl sb-sys:interactive-interrupt
   #+ccl ccl:interrupt-signal-condition
   #+clisp system::simple-interrupt-condition
   #+ecl ext:interactive-interrupt
   #+allegro excl:interrupt-signal
   ()
   <exit>))
Here, #+ includes the following expression at read time depending on the implementation. There's also #-. Actually, #+ looks for symbols in the *features* list. We can also combine symbols with and, or and not.
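To make this concrete, here is a small sketch of reader conditionals (the function name is ours, for illustration only):

```lisp
;; #+ keeps the next form only when the feature expression matches a
;; symbol in *features*; #- is its negation.
(defun lisp-name ()
  (or #+sbcl "SBCL"
      #+ccl "Clozure CL"
      #+ecl "ECL"
      "another implementation"))

;; Feature expressions can combine symbols:
;; #+(or sbcl ccl)      ... ;; kept on SBCL or Clozure CL
;; #-(and sbcl linux)   ... ;; kept everywhere except SBCL-on-Linux
```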
6.3 Scrapers errors
Web scraping is by definition unstable, so we should be prepared for all sorts of errors.
For every function that parses results, we surround it with a handler-case:
(defun result-leechers (node)
  (handler-case
      (parse-integer (elt (lquery:$ node ".ttable_col1" (text)) 2))
    (error ()
      -1)))
We do that generally for all torrents functions:
(defun torrents (words &key (stream t))
  "Return a list of..."
  (format stream "searching on Kat...")
  (handler-case
      (let* ((query (str:join "+" words))
             ...)
        ...)
    (error ()
      (format stream " no results.~&"))))
We chose to just print "no results" and return nil (the return value of format). We could also print the error message of the condition (optional argument of error: (error (condition) ...)). We might also want to not catch errors when we are in Slime. Is that possible?
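For reference, binding the condition object in the error clause looks like this (a sketch with our own names, not code from cl-torrents):

```lisp
;; Bind the condition to print its message, then return a default value.
(defun safe-parse-integer (s)
  (handler-case
      (parse-integer s)
    (error (condition)
      (format *error-output* "Error: ~a~&" condition)
      -1)))

;; (safe-parse-integer "42")   ;; => 42
;; (safe-parse-integer "nope") ;; prints the error message, => -1
```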
7 More features
7.1 Cache
We'd like a cache system. This was quick and easy with the Clache library.
We instantiate a cache on the file system (a file per item saved, not on a DB):
(defvar *cache-directory*
  (merge-pathnames #p".cl-torrents/cache/" (user-homedir-pathname))
  "The directory where cl-torrents stores its cache.")

(defun ensure-cache ()
  (ensure-directories-exist
   (merge-pathnames *cache-directory*)))

(defparameter *store*
  (progn
    (ensure-cache)
    (make-instance 'file-store :directory *cache-directory*))
  "Cache. The directory must exist.")
The file-system storage is provided by cl-store. We could also create an in-memory cache, suitable for memoization.
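An in-memory memoization could be sketched with a plain hash table, independently of Clache (helper names below are ours):

```lisp
;; Memoize search results in a hash table keyed by the search terms.
(defvar *memo* (make-hash-table :test #'equal))

(defun memoized (terms fn)
  "Return the cached value for TERMS, or call FN once and cache its result."
  (multiple-value-bind (val found-p) (gethash terms *memo*)
    (if found-p
        val
        (setf (gethash terms *memo*) (funcall fn terms)))))
```

The :test #'equal is what lets us use strings as keys; the default eql test would treat two equal strings as different keys.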
The function ensure-directories-exist creates a nested directory structure for us if needed. We wrap its call in the ensure-cache function in order to call it at the beginning of the main function, when cl-torrents is called from the command line.

Saving something in the cache and retrieving it is done with (setcache key val store) and (getcache key store). We just create wrapper functions, to print a little confirmation message, and because we might want to mock them in unit tests:
(defun save-results (terms val store)
  "Save results in cache."
  (format t "Saving results for ~a.~&" terms)
  (setcache terms val store))

(defun get-cached-results (terms store)
  (when (getcache terms store)
    (progn
      (format t "Got cached results for ~a.~&" terms)
      (getcache terms store))))
and we use it like this:
;; in the middle of `torrents'.
(if (get-cached-results joined *store*)
    (getcache joined *store*)
    (tpb::torrents words :stream log-stream))
Better yet, with an optional store argument that defaults to the *store* parameter:
(defun save-results (terms val &key (store *store*))
  ...
setcache takes an optional expiry time, in seconds.
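If we used it, a call could look like this (a sketch only; check Clache's documentation for the exact argument position before relying on it):

```lisp
;; Hypothetical: keep this cache entry for one hour (3600 seconds).
(setcache terms val store 3600)
```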
We re-build the project (make build), search for "matrix" once (we wait a bit) and a second time: we hit the cache.
Quick and easy! (and too quick: we don't use the expiry time yet)
8 Conclusion
That's the end of our tutorial (for now?).
You can see the full sources and comment on the repository: https://github.com/vindarel/cl-torrents
9 Appendix A: the same, with CL21
We like CL21. It cleans many CL inconsistencies and brings "modern" features.
Yet it is only a CL extension, that we can use isolated in just one package of our code base if we don't want to use it everywhere, and we can still use regular CL symbols prefixed by cl:.
- http://cl21.org/
- https://lispcookbook.github.io/cl-cookbook/cl21.html
- https://github.com/cl21/cl21/wiki
Let's see how much it costs to put it in our existing code base and, most of all, whether it would have eased our first contact with Lisp, our development, and the writing of this tutorial.
9.1 Installation
First, we need to install it from its own Quicklisp dist (repository).
(ql-dist:install-dist "http://dists.cl21.org/cl21.txt")
(ql:quickload :cl21)
To try things out at the REPL:
(in-package :cl21)
9.2 Replacing CL by CL21
We have to use cl21 instead of cl, and cl21-user instead of cl-user.
Our package definition:
(in-package :cl-user)
(defpackage torrents
  (:use :cl
        :clache)
  ...
becomes:
(in-package :cl21-user)
(defpackage torrents
  (:use :cl21
        :clache)
  ...
Our .asd gained a cl21 dependency. This is not enough to automatically install CL21 from its own dist though.
When we compiled our sources, the debugger complained about symbol conflicts and asked to choose between the old and the new one. Indeed, CL21 either redefines or re-exports all CL symbols, so that's a lot of conflicts in the running image. We restarted our Lisp.
(ql:quickload "cl21")
;; To load "cl21":
;;   Load 1 ASDF system:
;;     cl21
;; ; Loading "cl21"
;; ..................................................
;; ("cl21")
;; Compiled cl-torrents.asd

(ql:quickload "cl-torrents")
;; To load "torrents":
;;   Load 1 ASDF system:
;;     torrents
;; ; Loading "torrents"
;; ..................................................
;; [package torrents.utils]..........................
;; [package tpb].....................................
;; [package torrentcd]...............................
;; [package kat].....................................
;; [package torrents]................................
;; [package torrents]................................
;; [package torrents]..
;; ("torrents")

(in-package :torrents)
;; ...
Here we compiled the cl-torrents.lisp sources (C-c C-k) and got an error in the debugger, complaining that mapcar didn't exist. Indeed, CL21 uses the classical map name instead. It still has mapcan. By the way, "filter" is provided by keep[-if][-not] (instead of remove[-if][-not]).
https://lispcookbook.github.io/cl-cookbook/cl21.html#mapping
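For instance, in a CL21 package, the renamed functions described above would be used like this (a sketch, assuming the CL21 dist is installed; we have not re-verified every name):

```lisp
;; In CL21, map plays the role of CL's mapcar on lists, and keep-if
;; the role of remove-if-not.
(in-package :cl21-user)

(map (lambda (x) (* x x)) '(1 2 3))
(keep-if #'evenp '(1 2 3 4))
```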
And that's it: we didn't have any other conflicts, and our test suite ran fine.
9.3 Benefits ?
I know I have been frustrated by first not being generic, thus not working on arrays. I wanted to use it when trying things out with plump and lquery, but plump returns an array. It is the same with other functions and other data structures. And there are inconsistencies: sometimes the first parameter is the list, sometimes not.
It just feels wrong to me, especially today, and especially since CL has this genericity mechanism.
So some CL functions become generic (and extensible):
- getf (also for hash-maps,…)
- equalp
- emptyp
- coerce
and CL21 also defines new functions:
- append
- flatten
- elt
- emptyp
- equalp
- split, split-if
- drop, drop-while
- take, take-while
- join
- length
- keep, keep-if, nkeep, nkeep-if
- partition, partition-if
https://lispcookbook.github.io/cl-cookbook/cl21.html#generic-functions
Same story with map and mapcar. You want to use what is named map in all other languages, but you get error messages, have to google, and learn about mapcar and map 'list with a freaking first argument.

In the end, we nearly didn't have to edit our code base to make it work with CL21. We don't have real benefits from doing this to an existing code base either; the benefits would have been noticeable (and huge for us) during development, and should be for future developments.
Alright, I'm sold on CL21, if only for that, but there's more: check it out!
9.4 Downsides ?
Is using CL21 worth the risk? It doesn't show much activity on GitHub, although from time to time an issue gets some activity, and it has open issues. I'll let you judge; this was only an experiment in a little project.
There's a real consequence to CL21's long dependency list though: the executable size. Ours was 78MB; the CL21 one is 93MB.