RCS file: /var/cvs/ruby-xmltv/tv_grab_nl_upc,v
Working file: tv_grab_nl_upc
head: 1.133
branch:
locks: strict
access list:
symbolic names:
	v0-9-9: 1.133
	v0-9-8: 1.131
	v0-9-7: 1.126
	v0-9-6: 1.123
	v0-9-5: 1.120
	v0-9-4: 1.118
	v0-9-3: 1.116
	v0-9-2: 1.113
	v0-9-1: 1.111
	v0-9-0: 1.108
	v0-8-9: 1.96
	v0-8-8: 1.95
	v0-8-7: 1.93
	v0-8-6: 1.91
	v0-8-5: 1.89
	v0-8-4: 1.87
	v0-8-3: 1.85
	v0-8-2: 1.83
	v0-8-1: 1.78
	v0-8-0: 1.76
	v0-7-2: 1.70
	v0-7-1: 1.69
	v0-7-0: 1.67
	v0-6-1: 1.64
	v0-6-0: 1.59
	v0-5-1: 1.49
	v0-5-0: 1.43
	v0-4-0: 1.34
	v0-3-0: 1.16
	v0-2-0: 1.6
	v0-1-0: 1.1.1.1
	ruby-xmltv: 1.1.1.1
	default: 1.1.1.1
	caliban: 1.1.1
keyword substitution: kv
total revisions: 134;	selected revisions: 134
description:
----------------------------
revision 1.133
date: 2008/06/17 05:26:30;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.9.9.
----------------------------
revision 1.132
date: 2008/06/17 05:10:30;  author: ianmacd;  state: Exp;  lines: +10 -8
--sanity-check's attempt to determine whether something's wrong with the
guide data fetches a sample page for Nederland 1. If there are < 5 programmes
on that page, we abort.

This check isn't trustworthy, so now we check both Nederland 1 and RTL 4,
aborting if there are < 10 programmes in total.
----------------------------
revision 1.131
date: 2008/06/01 12:13:21;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.9.8.
----------------------------
revision 1.130
date: 2008/06/01 12:12:59;  author: ianmacd;  state: Exp;  lines: +3 -3
Add 'het' to the list of acceptable lower-case joiner words when searching for
a subtitle.
----------------------------
revision 1.129
date: 2008/06/01 07:11:18;  author: ianmacd;  state: Exp;  lines: +3 -3
Add 'or' to the list of acceptable lower-case joiner words when searching for
a subtitle.
----------------------------
revision 1.128
date: 2008/05/18 22:31:15;  author: ianmacd;  state: Exp;  lines: +3 -2
Use PASV mode for FTP'ing IMDB static ratings file.
Thanks to Alain Hertog for suggesting this.
----------------------------
revision 1.127
date: 2008/04/26 10:30:03;  author: ianmacd;  state: Exp;  lines: +17 -7
Compatibility fixes to run under Ruby 1.9.
----------------------------
revision 1.126
date: 2008/04/22 09:02:41;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump to 0.9.7.
----------------------------
revision 1.125
date: 2008/04/21 14:15:58;  author: ianmacd;  state: Exp;  lines: +3 -3
Fix REXML insertion of newlines into XML in versions of Ruby 1.8.6 somewhere
after patch level 36.
----------------------------
revision 1.124
date: 2008/04/21 13:25:59;  author: ianmacd;  state: Exp;  lines: +4 -3
Category mapping Detective should go to Crime/Mystery, not Drama.
New category mapping Documentaire => Documentary.
----------------------------
revision 1.123
date: 2008/02/22 21:59:00;  author: ianmacd;  state: Exp;  lines: +4 -4
Bump to 0.9.6.
----------------------------
revision 1.122
date: 2008/02/21 21:31:35;  author: ianmacd;  state: Exp;  lines: +8 -2
When creating the directory needed by --config-file, check to see if it exists
but is not a directory.
----------------------------
revision 1.121
date: 2008/02/21 21:06:08;  author: ianmacd;  state: Exp;  lines: +82 -53
Fix IMDB dynamic ratings after changes to the IMDB site made many look-ups
fail.

Further improve the chance of a successful IMDB look-up by retrying unfound
titles of the form 'Foo Bar, The' as 'The Foo Bar'.

Improve the final report with details of number of pages and programmes
fetched per second.
----------------------------
revision 1.120
date: 2007/11/22 18:42:34;  author: ianmacd;  state: Exp;  lines: +5 -5
Bump version to 0.9.5.
----------------------------
revision 1.119
date: 2007/11/22 14:24:57;  author: ianmacd;  state: Exp;  lines: +13 -12
Fix fairly rare occurrence whereby programme that starts after midnight and
runs until the next day does not have its end date adjusted accordingly.

Programmes that start after midnight have a day added to their start and end
time. Programmes whereby the end hour is less than the start hour have a day
added to the end time.

The bug occurred because these two conditions were in an if/elsif clause, but
both conditions occasionally apply to a programme.

For example, if program X starts at 05:00 on day N, it actually belongs to day
N + 1, so the start and end dates are adjusted by +1. However, if its end hour
is 00:00, then it actually runs until midnight on day N + 2, so we still need
to add a day to the end date.
----------------------------
revision 1.118
date: 2007/08/27 20:47:16;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.9.4.
----------------------------
revision 1.117
date: 2007/08/27 19:18:19;  author: ianmacd;  state: Exp;  lines: +24 -23
Changes to UPC's site broke --configure and caused bogus warnings when
--sanity-check was used.
----------------------------
revision 1.116
date: 2007/08/17 13:21:07;  author: ianmacd;  state: Exp;  lines: +3 -3
Updated to 0.9.3.
----------------------------
revision 1.115
date: 2007/08/17 13:18:35;  author: ianmacd;  state: Exp;  lines: +14 -13
Regular expression for detecting movies did not have /x, with effect that
films with the genre 'Romantiek' were not considered to be films.

Also, any programme with the genre 'Speelfilm' is now considered to be a film,
regardless of its length.
----------------------------
revision 1.114
date: 2007/08/16 20:04:33;  author: ianmacd;  state: Exp;  lines: +20 -6
Category tag should ideally have a lang attribute: 'en' when category
translation occurs, otherwise 'nl'.

If we find an episode number, we should create an episode-num tag.
----------------------------
revision 1.113
date: 2007/08/16 00:25:47;  author: ianmacd;  state: Exp;  lines: +4 -6
Update to 0.9.2.
----------------------------
revision 1.112
date: 2007/08/14 23:47:16;  author: ianmacd;  state: Exp;  lines: +13 -4
When trying to derive a subtitle, we now check for a trailing episode string
at the end of the description. If we find one, we append it to the already
derived subtitle, if applicable, and use that as the subtitle.

This increases the chance of finding a usable subtitle and also increases the
chance of subtitle uniqueness, which in turn increases MythTV's chance of
detecting duplicate programmes. By default, it does this by looking for a
unique subtitle/description pair.
----------------------------
revision 1.111
date: 2007/07/20 19:33:54;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.9.1.
----------------------------
revision 1.110
date: 2007/07/20 16:45:00;  author: ianmacd;  state: Exp;  lines: +4 -2
New category translations:

  Gezondheid => Health/Medical
  Sportmagazine => Sports
----------------------------
revision 1.109
date: 2007/07/20 15:05:45;  author: ianmacd;  state: Exp;  lines: +38 -24
--[no-]ratings can now take a parameter, DIR. If given,
DIR/ratings_cache.yaml is used for the ratings cache instead of
~/.xmltv/ratings_cache.yaml
----------------------------
revision 1.108
date: 2007/07/17 14:27:44;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.9.0.
----------------------------
revision 1.107
date: 2007/07/17 13:50:51;  author: ianmacd;  state: Exp;  lines: +7 -8
Move a calculation outside of a loop, as it doesn't need to be recalculated on
each iteration.
----------------------------
revision 1.106
date: 2007/07/16 19:55:45;  author: ianmacd;  state: Exp;  lines: +29 -33
Simplify construction of usage message in option parser.
----------------------------
revision 1.105
date: 2007/07/16 13:59:26;  author: ianmacd;  state: Exp;  lines: +7 -7
Change source-info-name to remove reference to Chello in accordance with UPC's
abandonment of the brand name.
----------------------------
revision 1.104
date: 2007/07/15 00:41:21;  author: ianmacd;  state: Exp;  lines: +29 -5
We now trap ^C at the command line to avoid displaying the call stack as we
exit.

We now allow debugging (normally turned on with --debug) to be toggled by
sending the process a SIGUSR1.

We now allow verbosity (normally turned on with --verbose) to be toggled by
sending the process a SIGUSR2.
----------------------------
revision 1.103
date: 2007/07/14 00:52:07;  author: ianmacd;  state: Exp;  lines: +6 -5
The name Chello seems to be on the way out at UPC, so switch the base URL from
http://www.chello.nl to http://epg.upc.nl.
----------------------------
revision 1.102
date: 2007/07/14 00:42:53;  author: ianmacd;  state: Exp;  lines: +7 -6
Don't split a programme title on the colon to form a subtitle if the colon
looks like it is a time separator, i.e. HH:MM.
----------------------------
revision 1.101
date: 2007/07/13 08:40:11;  author: ianmacd;  state: Exp;  lines: +8 -8
Reuse of a variable name caused unthreaded mode to crash if a TV programme had
parsable presenter names.
----------------------------
revision 1.100
date: 2007/07/11 17:42:00;  author: ianmacd;  state: Exp;  lines: +28 -15
New method pre_checks performs pre-execution sanity checks requested by
--sanity-check.

New sanity check aborts program if we're running with an effective UID of 0.
----------------------------
revision 1.99
date: 2007/07/11 12:05:15;  author: ianmacd;  state: Exp;  lines: +138 -125
More consistent use of quotes and % operator for string interpolation.

Better reporting of what was fetched, as reporting is non-linear in threaded
mode.
----------------------------
revision 1.98
date: 2007/07/10 22:51:33;  author: ianmacd;  state: Exp;  lines: +110 -92
New option --threads causes one thread per channel to be used for fetching
programme data. This is heavy on network and server resources, but causes the
program to execute in a fraction of the time required in unthreaded mode.
----------------------------
revision 1.97
date: 2007/07/10 22:40:17;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.8.9.
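The colon rule described in revision 1.102 can be illustrated roughly as follows. This is only a sketch, not the grabber's own code: the method name split_title is hypothetical, and treating "a digit on each side of the first colon" as the test for a time separator is an assumption about how HH:MM might be detected.

```ruby
# Hypothetical helper (not taken from tv_grab_nl_upc).
# Splits 'Title: Subtitle' on the first colon, unless that colon sits
# between two digits and therefore looks like the separator in HH:MM.
def split_title(title)
  # Assumption: a digit directly before and after the first colon
  # means the colon belongs to a time, so the title is left alone.
  return [title, nil] if title =~ /\A[^:]*\d:\d/

  main, sub = title.split(':', 2).map(&:strip)
  return [title, nil] if sub.nil? || sub.empty?

  [main, sub]
end

p split_title('Star Trek: Voyager')  # ["Star Trek", "Voyager"]
p split_title('Journaal 20:00')      # ["Journaal 20:00", nil]
```

Returning the untouched title with a nil subtitle keeps the caller's handling uniform whether or not a split took place.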
----------------------------
revision 1.96
date: 2007/07/10 13:04:02;  author: ianmacd;  state: Exp;  lines: +20 -7
Short-circuit use of complex regex for extracting a program subtitle from the
description, when the description cannot possibly contain one.

The following program description was found to cause exponential backtracking
when trying to determine a subtitle:

  1300BST: US PGA Tour Golf, 1400BST: Challenge Series Golf, 1530BST: WTA
  Tennis, 1600BST: ICC Cricket, 1630BST: ATP Tennis.

The problem was severe enough that this description would cause the program to
loop for hours within the subtitle matching regex.
----------------------------
revision 1.95
date: 2007/07/10 11:11:33;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.8.8.
----------------------------
revision 1.94
date: 2007/06/30 13:32:36;  author: ianmacd;  state: Exp;  lines: +27 -10
Create ~/.xmltv if it doesn't already exist. Otherwise, --configure will cause
an error when the file comes to be written. Likewise, IMDB ratings files would
be unable to be written.

If --config-file is used, we may also need to create the directory path to the
named file.

--config-file did not properly expand its parameter.

When the config file did not exist or there were no channels defined in it,
the resulting error message would, itself, produce an error due to an
incorrect variable name.
----------------------------
revision 1.93
date: 2007/06/15 15:10:09;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.8.7.
----------------------------
revision 1.92
date: 2007/06/13 00:29:51;  author: ianmacd;  state: Exp;  lines: +33 -12
Improved some of the text messages.

Better reporting of successful IMDB look-ups: --verbose now also displays the
rating found.

IMDB ratings cache entries now have their creation time stored along with
their last hit time.
----------------------------
revision 1.91
date: 2007/05/11 00:38:15;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.8.6.
----------------------------
revision 1.90
date: 2007/05/11 00:37:58;  author: ianmacd;  state: Exp;  lines: +3 -3
Rating did not receive suffix of /10 when --static-ratings was used.
----------------------------
revision 1.89
date: 2007/05/08 17:39:08;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.8.5.
----------------------------
revision 1.88
date: 2007/05/08 17:38:23;  author: ianmacd;  state: Exp;  lines: +9 -6
Prevent infrequent case of division-by-zero error when reporting IMDB look-up
percentages.
----------------------------
revision 1.87
date: 2007/04/26 14:01:10;  author: ianmacd;  state: Exp;  lines: +3 -4
Bump version to 0.8.4.
----------------------------
revision 1.86
date: 2007/04/25 20:51:01;  author: ianmacd;  state: Exp;  lines: +9 -9
Fixed bug whereby reporting of percentages of looked-up vs. cached IMDB
ratings could add up to 101%. This was due to a rounding error, which occurred
when the fractional part of both percentages was .5, causing them to both be
rounded upwards.
----------------------------
revision 1.85
date: 2007/04/12 09:15:31;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.8.3.
----------------------------
revision 1.84
date: 2007/04/12 09:15:10;  author: ianmacd;  state: Exp;  lines: +10 -7
Count of cached entries when reading ~/.xmltv/ratings_cache.yaml was including
expired entries.
----------------------------
revision 1.83
date: 2007/04/09 21:28:58;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.8.2.
----------------------------
revision 1.82
date: 2007/04/09 13:27:45;  author: ianmacd;  state: Exp;  lines: +9 -5
Update cache entry timestamp when a cache entry is accessed. This prevents
recurring negative entries for bogus film titles like 'Channel Off Air' from
ever expiring, which is a good thing.
----------------------------
revision 1.81
date: 2007/04/06 18:43:01;  author: ianmacd;  state: Exp;  lines: +21 -17
Rewrite deletion of expired ratings cache entries.

When loading ratings cache, report how many entries are positive and how many
are negative.
----------------------------
revision 1.80
date: 2007/04/04 12:40:58;  author: ianmacd;  state: Exp;  lines: +12 -15
Condense rating statistics somewhat.
----------------------------
revision 1.79
date: 2007/04/03 22:50:52;  author: ianmacd;  state: Exp;  lines: +156 -43
Dynamically looking up ratings in IMDB now makes use of a persistent cache,
which is written out to ~/.xmltv/ratings_cache.yaml. Dynamic look-ups are
therefore now semi-static. Entries expire after seven days, so one can now
make use of dynamic look-ups whilst dramatically reducing the traffic sent to
IMDB.

In accordance with the new functionality described above, end-reporting of
dynamic ratings with --verbose is now much better, detailing how many look-ups
were attempted, how many succeeded, how many of the successes and failures
were respectively positive and negative cache hits, etc.

The code has been cleaned up a bit, with calls to Time.now replaced by a
constant wherever a value more recent than the start of the run isn't
required.
----------------------------
revision 1.78
date: 2007/03/31 12:26:08;  author: ianmacd;  state: Exp;  lines: +5 -4
Bump version to 0.8.1
----------------------------
revision 1.77
date: 2007/03/30 14:25:53;  author: ianmacd;  state: Exp;  lines: +7 -6
Remove external dependencies on UNIX date(1).
----------------------------
revision 1.76
date: 2007/03/23 16:13:09;  author: ianmacd;  state: Exp;  lines: +11 -3
Bump to 0.8.0.
----------------------------
revision 1.75
date: 2007/03/22 10:23:12;  author: ianmacd;  state: Exp;  lines: +3 -3
Sort channels and theme list printed by --debug case-insensitively.
----------------------------
revision 1.74
date: 2007/03/21 16:24:04;  author: ianmacd;  state: Exp;  lines: +8 -2
When --static-ratings is used with --verbose, display the number of ratings
read.
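The cache behaviour described in revisions 1.79, 1.82 and 1.92 (seven-day expiry, timestamp refreshed on access, positive and negative entries, creation and last-hit times) might look something like the sketch below. The class name RatingsCache, its method names and the hash layout are all assumptions made for illustration; the log does not show the real data structure.

```ruby
require 'yaml'

# Hypothetical sketch of a ratings cache; names and entry layout are
# assumptions, not the grabber's actual implementation.
class RatingsCache
  EXPIRY = 7 * 24 * 60 * 60 # seven days, in seconds

  def initialize(entries = {})
    @entries = entries
  end

  # Store a rating. A nil rating records a negative entry, i.e. a
  # title that could not be found.
  def store(title, rating, now = Time.now)
    stamp = now.to_i
    @entries[title] = { 'rating' => rating, 'created' => stamp, 'last_hit' => stamp }
  end

  # Returns [hit, rating]. A hit refreshes the entry's last-hit time,
  # so frequently seen entries (including negative ones) never expire.
  def fetch(title, now = Time.now)
    entry = @entries[title]
    return [false, nil] if entry.nil?

    entry['last_hit'] = now.to_i
    [true, entry['rating']]
  end

  # Delete entries that have not been accessed within EXPIRY seconds.
  def purge(now = Time.now)
    @entries.delete_if { |_title, e| now.to_i - e['last_hit'] > EXPIRY }
  end

  def size
    @entries.size
  end

  # Serialise the cache, e.g. for writing to a ratings_cache.yaml file.
  def dump
    YAML.dump(@entries)
  end
end
```

In this sketch, purge would be called once after loading the file, so that counts reported afterwards exclude expired entries.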
----------------------------
revision 1.73
date: 2007/03/21 16:18:39;  author: ianmacd;  state: Exp;  lines: +13 -5
--static-ratings was not finding a large number of the films that it should
have, due to a faulty regex and lack of case-insensitive matching.
----------------------------
revision 1.72
date: 2007/03/21 15:27:48;  author: ianmacd;  state: Exp;  lines: +130 -42
New option --static-ratings offers the ability to use IMDB for film ratings,
but in accordance with the policy laid down here:

  http://www.imdb.com/help/show_leaf?usedatasoftware

Consequently, a local ratings file is downloaded from ftp.funet.fi (by the new
class method Rating.get_ratings_list) and placed in ~/.xmltv, where it is
gunzipped and used for rating look-ups. The use of this file is described
here:

  http://www.imdb.com/interfaces#plain

The file is downloaded when it does not already exist AND when it's older than
seven days (as determined by its mtime).

--static-ratings probably won't work on most non-UNIX-like systems, because
gunzip is needed to decompress the ratings file.

Ratings are now cached using the new class method, Rating.cache_rating.
----------------------------
revision 1.71
date: 2007/03/20 18:32:32;  author: ianmacd;  state: Exp;  lines: +18 -14
Remove warning when user tries to use schema 1.

Issue warning about incorrect locale only when --quiet isn't used.

When --days > 8, the warning that is issued now comes after the check for
useable channels in the config file.

--ratings now issues a warning about IMDB policy violation, as defined here:

  http://imdb.com/help/show_leaf?usedatasoftware
----------------------------
revision 1.70
date: 2007/03/05 15:25:35;  author: ianmacd;  state: Exp;  lines: +5 -5
IMDB has slightly altered its title pages again, so look-ups were failing.

Bump version to 0.7.2.
----------------------------
revision 1.69
date: 2007/02/19 23:10:07;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.7.1.
----------------------------
revision 1.68
date: 2007/02/19 23:09:50;  author: ianmacd;  state: Exp;  lines: +4 -4
IMDB has redesigned its title pages, so look-ups were failing.
----------------------------
revision 1.67
date: 2007/02/16 19:33:22;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.7.0.
----------------------------
revision 1.66
date: 2007/02/16 13:58:53;  author: ianmacd;  state: Exp;  lines: +118 -6
Added the 'manualconfig' capability, via the new option --configure and the
new methods configure_grabber and get_channel_number.

Channel numbers in the config file may now be preceded by the string
'channel '. This is treated case-insensitively.
----------------------------
revision 1.65
date: 2007/02/15 10:00:39;  author: ianmacd;  state: Exp;  lines: +16 -9
Improve IMDB rating look-ups by working around ampersand entities in the
title.
----------------------------
revision 1.64
date: 2007/02/14 00:23:51;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.6.1.
----------------------------
revision 1.63
date: 2007/02/14 00:22:29;  author: ianmacd;  state: Exp;  lines: +21 -5
When looking up movie titles in IMDB, care must be taken with titles
containing accented letters, as the ensuing screen-scraping currently fails.
This is due to the accented letters in the title not matching the replacement
HTML entities in the page.

We therefore need to examine the data we receive from UPC and detect titles
with UTF-8 accented letters. We convert these to Latin-1 and then render the
matching of any non-alphanumeric characters optional. Finally, any accented
letters are replaced by a regex that will match the equivalent HTML entity,
whether it be numeric or alphabetic.
----------------------------
revision 1.62
date: 2007/02/13 00:48:42;  author: ianmacd;  state: Exp;  lines: +7 -3
Don't convert apostrophes etc. in presenter names to entities on output.
----------------------------
revision 1.61
date: 2007/02/13 00:36:08;  author: ianmacd;  state: Exp;  lines: +97 -205
Remove the last vestiges of schema 1.
----------------------------
revision 1.60
date: 2007/02/12 23:10:43;  author: ianmacd;  state: Exp;  lines: +6 -6
When schema 2 turned up no programmes, schema 1 was erroneously still used for
a refetch.
----------------------------
revision 1.59
date: 2007/02/12 16:15:33;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.6.0.
----------------------------
revision 1.58
date: 2007/02/12 01:06:50;  author: ianmacd;  state: Exp;  lines: +10 -8
Warn in help text that --schema is no longer effective.
----------------------------
revision 1.57
date: 2007/02/11 19:02:38;  author: ianmacd;  state: Exp;  lines: +4 -3
Catch Errno::ECONNRESET exceptions when fetching pages.
----------------------------
revision 1.56
date: 2007/02/11 18:03:38;  author: ianmacd;  state: Exp;  lines: +9 -7
UPC seem to have abandoned their original URL scheme, so we now issue a
warning if the user runs the program with --schema 1 and force the schema to
be 2. We also no longer retry empty schema 2 fetches using schema 1.
----------------------------
revision 1.55
date: 2007/02/11 17:21:38;  author: ianmacd;  state: Exp;  lines: +15 -11
When trying to obtain a film rating, use fuzzier matching on the title by
making any non-alphanumeric characters optional. For example, this allows
'Mrs. Henderson' to match 'Mrs Henderson'.
----------------------------
revision 1.54
date: 2007/02/11 16:11:29;  author: ianmacd;  state: Exp;  lines: +79 -77
New category mappings:

  Actie => Action
  Historisch => History

Removed a lot of superfluous whitespace.
----------------------------
revision 1.53
date: 2007/02/11 13:36:48;  author: ianmacd;  state: Exp;  lines: +167 -19
--[no-]ratings is a new option for obtaining film ratings from IMDB. A
programme is judged to be a film when its duration is between 80 minutes and
4 hours, and its genre is likely that of a film.

New Rating class for dealing with programme ratings. The class method
Rating::imdb_rating obtains film ratings for a given title from IMDB. Both
positive and negative look-ups are cached to reduce network traffic and allow
the programme to run as fast as possible.

The get_page method now follows HTTP 3xx redirections, as these are sometimes
given by IMDB (when only one match exists for a given title). --debug will
inform you when a redirect is being followed.

When --verbose is used, the number of ratings for each day per channel, each
channel, and the entire program run is displayed. Furthermore, the name of
each programme will be displayed as we attempt to rate it, as well as whether
or not a rating was fetched or found in the cache.
----------------------------
revision 1.52
date: 2007/02/05 12:30:10;  author: ianmacd;  state: Exp;  lines: +3 -3
Double quotes, not single, are needed here.
----------------------------
revision 1.51
date: 2007/02/03 01:08:31;  author: ianmacd;  state: Exp;  lines: +3 -3
An 'exit' command was still commented out for debugging purposes.
----------------------------
revision 1.50
date: 2007/02/03 00:36:06;  author: ianmacd;  state: Exp;  lines: +16 -5
When using --verbose AND --debug, the channel/theme list obtained from UPC is
now displayed.

The presenter section of the credits section was not properly detecting
presenters in English language descriptions.

Presenter names containing an apostrophe, e.g. Conan O'Brien, were erroneously
being detected as two presenters, e.g. Conan and Brien.

More presenters are now detected, by additionally looking for the string
'hosted by' in programme descriptions.
----------------------------
revision 1.49
date: 2007/01/24 14:31:56;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.5.1.
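The fuzzier title matching mentioned in revision 1.55 could be approximated as below. This is a sketch under assumptions: the method name fuzzy_title_regex is invented here, and "alphanumeric" is taken to mean ASCII letters and digits, which may differ from what the grabber really does.

```ruby
# Hypothetical helper illustrating the idea in revision 1.55: every
# non-alphanumeric character in the title becomes optional in the regex.
# Assumption: 'alphanumeric' means ASCII letters and digits only.
def fuzzy_title_regex(title)
  pattern = title.gsub(/[^A-Za-z0-9]/) { |char| Regexp.escape(char) + '?' }
  Regexp.new(pattern)
end

regex = fuzzy_title_regex('Mrs. Henderson Presents')
puts 'match' if 'Mrs Henderson Presents' =~ regex  # prints "match"
```

Because spaces also become optional, the pattern is deliberately loose; a real look-up would presumably apply it only to candidate titles already returned by a search.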
----------------------------
revision 1.48
date: 2007/01/24 14:26:29;  author: ianmacd;  state: Exp;  lines: +4 -4
Programme start and stop times now have their time zones expressed as an hour
offset from GMT, in accordance with
http://www.xmltv.org/wiki/xmltvcapabilities.html. CET and CEST are not
allowed.
----------------------------
revision 1.47
date: 2007/01/24 13:51:08;  author: ianmacd;  state: Exp;  lines: +3 -3
ENV['LANG'] can be nil, as well as ''.
----------------------------
revision 1.46
date: 2007/01/24 13:47:52;  author: ianmacd;  state: Exp;  lines: +4 -3
Locale warning contained a blank if $LANG was unset. This has been corrected.
----------------------------
revision 1.45
date: 2007/01/24 13:40:48;  author: ianmacd;  state: Exp;  lines: +26 -10
Add --description and --capabilities, according to
http://www.xmltv.org/wiki/xmltvcapabilities.html.
----------------------------
revision 1.44
date: 2007/01/12 20:29:13;  author: ianmacd;  state: Exp;  lines: +45 -27
Handle exceptions that occur when trying to get the channel list from UPC,
plus those that occur when we do the sample Nederland 1 page fetch.

get_page() is the new method that does all of the page fetching.
----------------------------
revision 1.43
date: 2007/01/02 07:55:37;  author: ianmacd;  state: Exp;  lines: +3 -3
Bump version to 0.5.0.
----------------------------
revision 1.42
date: 2007/01/02 07:48:56;  author: ianmacd;  state: Exp;  lines: +13 -11
When parsing for subtitles, a sentence may (erroneously, obviously) end with
more than one punctuation character. We now catch this.

When parsing for subtitles, a subtitle is judged the same as the title (and
therefore removed) if it differs only in having a full-stop at the end.

Catch Errno::ECONNREFUSED exceptions whilst fetching pages.

Better exception reporting when we fail to fetch a page.
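As a small illustration of the numeric time zone offsets introduced in revision 1.48, a timestamp can be formatted with Time#strftime and the %z directive rather than a zone abbreviation. The helper name xmltv_time is hypothetical, and the date layout shown (YYYYMMDDhhmmss followed by the offset) is assumed to be the usual XMLTV form; the log itself only states that an hour offset from GMT is used.

```ruby
# Hypothetical helper: formats a Time with a numeric UTC offset
# (e.g. +0100) instead of an abbreviation such as CET or CEST.
def xmltv_time(time)
  time.strftime('%Y%m%d%H%M%S %z')
end

puts xmltv_time(Time.new(2007, 1, 24, 14, 30, 0, '+01:00'))
# prints "20070124143000 +0100"
```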
----------------------------
revision 1.41
date: 2007/01/01 21:26:24;  author: ianmacd;  state: Exp;  lines: +6 -7
Remove unhelpful 'Niet beschikbaar' text at an earlier stage.
----------------------------
revision 1.40
date: 2006/12/31 06:23:18;  author: ianmacd;  state: Exp;  lines: +25 -16
Allow colons when trying to derive a subtitle.

Single letter words should be allowed in subtitles.

Allow &, plus some articles, conjunctions and prepositions in subtitles.
----------------------------
revision 1.39
date: 2006/12/30 23:41:42;  author: ianmacd;  state: Exp;  lines: +16 -16
Remove unnecessary whitespace.
----------------------------
revision 1.38
date: 2006/12/30 23:39:23;  author: ianmacd;  state: Exp;  lines: +63 -11
Use /x regular expression formatting to make the regex for deriving a subtitle
from the description more legible.

(?:\xC3[\x80-\xBF] must be added to subtitle regex, because UPC TV guide pages
are returned as UTF-8 and accented alphabetic characters are double byte.

When --sanity-check is used, an additional check is made to see if we are
running in the nl_NL locale. If not, a warning is issued.
----------------------------
revision 1.37
date: 2006/12/30 06:47:34;  author: ianmacd;  state: Exp;  lines: +63 -13
An effort is now made to derive a suitable subtitle for each programme.

If the programme's title contains one or more colons, we split on the first
one. The left-hand side becomes the title, the right-hand side the subtitle.

If that fails, we look to see whether the first sentence of the programme's
description contains exclusively words that begin with a capital letter
(digits and punctuation are also allowed). If so, we assume that it's actually
an episode title and use that as the subtitle. The string is then removed from
the description.

If the subtitle happens to be the same string as the title, we abandon it.

If the description consists of only 'Niet beschikbaar', we abandon it.
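The description-based part of the subtitle heuristic in revision 1.37 might be sketched as follows. The method name derive_subtitle, the sentence-splitting regex and the decision to still strip the sentence from the description when the subtitle equals the title are assumptions for illustration; the real code evidently uses a considerably more elaborate regex.

```ruby
# Hypothetical, simplified sketch of the heuristic in revision 1.37.
# Returns [subtitle, description]; either element may be nil.
def derive_subtitle(title, description)
  # A description of only 'Niet beschikbaar' ("not available") is dropped.
  return [nil, nil] if description.nil? || description.strip == 'Niet beschikbaar'

  # First sentence: shortest run of text ending in sentence punctuation.
  m = description.match(/\A(.+?)[.!?]+(?:\s+|\z)/)
  return [nil, description] if m.nil?

  sentence = m[1]
  # Every word must begin with a capital letter or digit; leading
  # punctuation such as a quote mark is tolerated.
  capitalised = sentence.split.all? { |w| w =~ /\A[[:punct:]]*[A-Z0-9]/ }
  return [nil, description] unless capitalised

  # Assumption: a subtitle identical to the title is abandoned, but the
  # sentence is still removed from the description.
  subtitle = sentence == title ? nil : sentence
  [subtitle, m.post_match]
end

p derive_subtitle('Friends', 'The One With The Cat. Phoebe finds a cat.')
# ["The One With The Cat", "Phoebe finds a cat."]
```

This simplified version does not allow lower-case joiner words, which later revisions of the grabber accept.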
----------------------------
revision 1.36
date: 2006/12/28 15:07:44;  author: ianmacd;  state: Exp;  lines: +31 -2
Use source-info-url, source-info-name and generator-info-url attributes in
root tv tag to provide more information about our origin.

Make an effort to include a basic credits section, by isolating the presenters
of a programme, if the information is available.
----------------------------
revision 1.35
date: 2006/12/27 16:22:18;  author: ianmacd;  state: Exp;  lines: +3 -2
New category translation: 'Talkshow' => 'Talk'
----------------------------
revision 1.34
date: 2006/12/14 01:07:49;  author: ianmacd;  state: Exp;  lines: +3 -3
Update to 0.4.0.
----------------------------
revision 1.33
date: 2006/12/13 15:52:39;  author: ianmacd;  state: Exp;  lines: +3 -2
New category translation: 'Musical' => 'Movies'
----------------------------
revision 1.32
date: 2006/12/13 02:45:48;  author: ianmacd;  state: Exp;  lines: +29 -11
New option, --[no-]cattrans. cattrans is the default. If no-cattrans is used,
programme category translation will not take place.

Programme category translation is only useful in combination with MythTV and
it turns out that some people are using tv_grab_nl_upc for other purposes, so
logically this option is needed.
----------------------------
revision 1.31
date: 2006/12/12 20:31:52;  author: ianmacd;  state: Exp;  lines: +4 -3
New category translation: 'Theater / dans' => 'Arts/Culture'
----------------------------
revision 1.30
date: 2006/12/11 01:45:51;  author: ianmacd;  state: Exp;  lines: +3 -3
Catch EOFError exceptions when doing HTTP traffic.
----------------------------
revision 1.29
date: 2006/12/11 01:45:08;  author: ianmacd;  state: Exp;  lines: +5 -5
Also print channel when displaying total number of programmes found for each
channel.
----------------------------
revision 1.28
date: 2006/12/05 20:12:51;  author: ianmacd;  state: Exp;  lines: +11 -2
--verbose will now also display the total running time and number of page
fetches on exit.
----------------------------
revision 1.27
date: 2006/12/05 12:18:21;  author: ianmacd;  state: Exp;  lines: +4 -4
Fix display bug when sample programme data fetch of NED 1 returns fewer than 5
programmes.
----------------------------
revision 1.26
date: 2006/12/02 19:03:49;  author: ianmacd;  state: Exp;  lines: +11 -5
Displaying missing programme categories is only really useful for the
programme author (i.e. me). Therefore, this report is now generated by the new
--debug option, no longer by --verbose.
----------------------------
revision 1.25
date: 2006/12/02 12:01:48;  author: ianmacd;  state: Exp;  lines: +6 -4
New category translations:

  Algemeen -> Misc
  Klussen -> HowTo

Fixed category translations:

  Tuinieren -> HowTo (was: Educational)
----------------------------
revision 1.24
date: 2006/12/02 11:52:50;  author: ianmacd;  state: Exp;  lines: +13 -7
The missing category report will now detail the programmes whose category was
not recognised. This will aid in adding new categories to the code.
----------------------------
revision 1.23
date: 2006/11/29 20:49:03;  author: ianmacd;  state: Exp;  lines: +22 -13
Add --tries option to allow the user to determine the number of HTTP requests
we attempt for each page. The default is 3.

As a consequence, get_tvguide() is back to taking 4 parameters.
----------------------------
revision 1.22
date: 2006/11/29 20:31:43;  author: ianmacd;  state: Exp;  lines: +30 -10
Try to get each page a maximum of three times, catching HTTP timeouts that may
occur.

get_tvguide() now takes a fifth parameter, the status of options.quiet.
----------------------------
revision 1.21
date: 2006/11/28 16:31:07;  author: ianmacd;  state: Exp;  lines: +5 -5
Strip trailing whitespace from channel name when reading config.
----------------------------
revision 1.20
date: 2006/11/28 14:11:46;  author: ianmacd;  state: Exp;  lines: +11 -8
Missing data report should contain channel numbers as well as names.
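The retry behaviour from revisions 1.22 and 1.23 (attempt each page up to a configurable number of times, default 3, catching timeouts) can be sketched in a generic way. The method name with_retries is hypothetical, and rescuing only Timeout::Error is an assumption; the log shows that other exceptions such as Errno::ECONNRESET were handled in later revisions.

```ruby
require 'timeout'

# Hypothetical wrapper: runs the block up to `tries` times, retrying
# only when it raises Timeout::Error, and re-raises after the last try.
def with_retries(tries = 3)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Timeout::Error
    retry if attempt < tries
    raise
  end
end

# Usage sketch: the block stands in for a page fetch that times out twice.
calls = 0
page = with_retries(3) do
  calls += 1
  raise Timeout::Error if calls < 3
  '<html>guide page</html>'
end
puts "fetched after #{calls} attempts"  # prints "fetched after 3 attempts"
```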
---------------------------- revision 1.19 date: 2006/11/28 00:26:07; author: ianmacd; state: Exp; lines: +7 -4 Take into account 24 hour continuous programmes like the data display on the Weerkanaal. These are usually denoted as programme entries that run from midnight to midnight, which look like programmes with a 0 minute running time. ---------------------------- revision 1.18 date: 2006/11/28 00:03:14; author: ianmacd; state: Exp; lines: +4 -4 Display channel names, not numbers, in missing data report. ---------------------------- revision 1.17 date: 2006/11/26 00:31:45; author: ianmacd; state: Exp; lines: +25 -15 Avoid exception caused by failure to find channel icon path. ---------------------------- revision 1.16 date: 2006/11/24 00:48:35; author: ianmacd; state: Exp; lines: +6 -6 Update to 0.3.0. ---------------------------- revision 1.15 date: 2006/11/23 00:30:04; author: ianmacd; state: Exp; lines: +363 -50 * New --schema option to select which chello.nl URL tree to pull data from. Previous versions used 1: http://www.chello.nl/Entertainment/TVGids/singlechannel/x/y/allday where x is the channel number as y is the day number for which we want the programmes, with 0 being today. As of now, we also offer 2: http://www.chello.nl/Entertainment/TV_gids/Zenders/Algemeen/Gids/?channels=x×cope=y where x is the channel name (URL encoded, of course) and y is the day name for which we want the programmes, with the suffix _all appended. Today's programmes use 'today_all', tomorrow's use 'tomorrow_all', but days after that use 'monday_all', 'tuesday_all', etc. * Vastly expanded category translation table to cope with more detailed categories offered by URL schema 2. * By default, schema 2 is now used for fetching data, because it provides more detailed programme categories. Otherwise, the data is more or less the same as that obtained from schema 1. * Parts of the code have now been separated into methods to improve readability. 
  These are get_available_channels, check_channel, read_config, get_tvguide,
  get_programmes and clean.
* Because schema 2 uses channel names rather than numbers, the name of the
  channel given in the config file must match exactly that used by UPC
  within the chello.nl site. For this reason, if schema 2 is used in
  combination with --sanity and --verbose, the program will pull the entire
  list of available UPC channels from
    http://www.chello.nl/cgi-bin/WebObjects/EPG.woa/wa/Events/?country=nl&template=Json_channelsGenres
  and check the channel names in the config file against this, making
  suggestions if certain channels cannot be matched.
* The program will now abort if a sample page fetch for Nederland 1 for day
  0 (today) yields fewer than 5 programmes. This would indicate a severe
  guide failure.
* Channels are now processed in numeric order, starting with the lowest.
  The order was previously unpredictable.
* --verbose will now report the number of programmes found per channel per
  day, as well as the total number for each channel, plus the total for all
  channels.
* We now report any unknown programme categories found in the guide when
  --verbose is used.
* --verbose will now print a report, containing details of which days
  contained no data for certain channels. If a channel yielded no programmes
  on any day, this fact will be emphasised.
* If a schema yields no programmes for a channel on a certain day, we retry
  using the other schema. This is reported when --verbose is used.
----------------------------
revision 1.14
date: 2006/11/19 18:24:47; author: ianmacd; state: Exp; lines: +13 -15
Simplify screen-scraping, so that we just use String#scan to do all of the
work.
----------------------------
revision 1.13
date: 2006/11/19 18:09:00; author: ianmacd; state: Exp; lines: +18 -18
Use puts instead of printf in most cases.
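Editorial sketch for revision 1.15: the schema 2 day naming described there
('today_all', 'tomorrow_all', then weekday names) could be derived from a day
offset along these lines. The helper name timescope_for and its parameters
are hypothetical, and the grabber itself may compute this differently.

```ruby
require 'date'

# Hypothetical helper, not the grabber's actual code.
# Maps a day offset (0 = today) to the day name used by URL schema 2.
def timescope_for(offset, today = Date.today)
  case offset
  when 0 then 'today_all'
  when 1 then 'tomorrow_all'
  else
    # Ruby's strftime('%A') yields the English weekday name.
    (today + offset).strftime('%A').downcase + '_all'
  end
end
```

For instance, with today taken as Thursday 2006/11/23, an offset of 2 would
give 'saturday_all'.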
----------------------------
revision 1.12
date: 2006/11/19 00:14:43; author: ianmacd; state: Exp; lines: +7 -5
Off-by-one error in missing programme data detection.
----------------------------
revision 1.11
date: 2006/11/18 21:58:30; author: ianmacd; state: Exp; lines: +5 -6
Another bug in the missing data report.
----------------------------
revision 1.10
date: 2006/11/18 20:24:30; author: ianmacd; state: Exp; lines: +4 -4
The previous fix should have used printf, not puts.
----------------------------
revision 1.9
date: 2006/11/18 20:22:17; author: ianmacd; state: Exp; lines: +4 -4
Accidental use of abort instead of exit.
----------------------------
revision 1.8
date: 2006/11/18 19:52:30; author: ianmacd; state: Exp; lines: +8 -6
The check for yesterday's guide did not work. This has been fixed.
When the guide is neither yesterday's nor today's, the error message now
includes the date string from the guide, as this will aid troubleshooting.
The missing programme data report produced by --verbose would print some
headings, even when there was no missing data. This has been fixed.
----------------------------
revision 1.7
date: 2006/11/18 14:06:13; author: ianmacd; state: Exp; lines: +47 -2
If --verbose is used, the program will now produce a report at exit time,
detailing which days had channels with no programme data and which of these
channels had no data on any day.
----------------------------
revision 1.6
date: 2006/11/12 13:50:33; author: ianmacd; state: Exp; lines: +3 -3
Bump version to 0.2.0.
----------------------------
revision 1.5
date: 2006/11/12 13:48:15; author: ianmacd; state: Exp; lines: +13 -11
Do a better job of translating programme categories to what MythTV expects.
----------------------------
revision 1.4
date: 2006/09/18 16:14:14; author: ianmacd; state: Exp; lines: +25 -11
Added --sleep SECS option to sleep after each page fetch. The default amount
of time is 1.0 seconds.
----------------------------
revision 1.3
date: 2006/09/18 15:51:10; author: ianmacd; state: Exp; lines: +45 -18
Added --xmltvid-suffix to add a string other than '.chello.nl' to the
channel number to form the XMLTV ID.
Added --version for displaying the program version.
Improve usage message by giving default values.
----------------------------
revision 1.2
date: 2006/09/18 13:14:43; author: ianmacd; state: Exp; lines: +83 -25
Added --sanity-check option. If this is used, sanity checks will be made
before pulling guide data. Currently, the only check is to ascertain that
the guide is the correct one for the day at run-time. If running shortly
after midnight, for example, it's possible that the guide will still be for
yesterday. If that's the case, offset adjustments are made and the correct
guide data is still pulled, unless the program is run after 05:00, by which
time guide rotation really should have occurred. In that case, we abort.
Even with the offset adjustments, however, we will still have a big problem
if the guide happens to be rotated _during_ execution. It's best to avoid
this race condition entirely by running the program later in the day.
If the guide has a discrepancy of more than 1 day, we abort, as there's no
good explanation for this.
----------------------------
revision 1.1
date: 2006/09/17 16:16:25; author: ianmacd; state: Exp;
branches: 1.1.1;
Initial revision
----------------------------
revision 1.1.1.1
date: 2006/09/17 16:16:25; author: ianmacd; state: Exp; lines: +0 -0
Create ruby-xmltv repo.
Version 0.1.0 of grabber.
=============================================================================
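Editorial sketch for revision 1.2: the guide-date decision described there
(no adjustment when the guide is today's, a one-day offset when it is still
yesterday's and the run is before 05:00, abort otherwise) might be expressed
as below. The helper name guide_offset, its arguments and the error messages
are hypothetical, and raising an exception stands in for the program's abort.

```ruby
require 'date'

# Hypothetical helper, not the grabber's actual code.
# Returns the day offset to apply when fetching, given the date shown by the
# guide and the current time. Raises when the guide cannot be trusted.
def guide_offset(guide_date, now = Time.now)
  today = Date.new(now.year, now.month, now.day)
  case (today - guide_date).to_i
  when 0
    0                             # guide is already today's
  when 1
    # Yesterday's guide is tolerated only before 05:00 (assumed cut-off test).
    raise "guide for #{guide_date} not yet rotated" if now.hour >= 5
    1
  else
    raise "unexpected guide date #{guide_date}"
  end
end
```

This does nothing about the race the log mentions, where the guide is rotated
while the program is running.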