Dr. Mark Humphrys

School of Computing. Dublin City University.



How to write a program to download videos from YouTube

How to write a Shell script to download videos from YouTube. From first principles.

By Mark Humphrys. March 2012.




Introduction

For some years I have been writing - for classes and for fun - a Shell script to download videos from YouTube.

The script needs re-writing every year or so, because YouTube keep changing their formats.

Thanks to Stefan Swerk for finding the strategy that works in 2012.



Usage

Run it like this:
youtube (url)
e.g.:
youtube "http://www.youtube.com/watch?v=OuSdU8tbcHY"
(which is "Titanic in 5 seconds")
The script finds the FLV file and downloads it to   $HOME/tmp/youtube.flv



We pick this video to debug with because it is small!




Strategy

  1. The URL you see in your browser:
      http://www.youtube.com/watch?v=ID
    is a permanent "home page" for the movie, with comments, related movies, etc.
    It is not the URL of the movie itself.
    However if we look inside the HTML source of this page, we can find the URL of the movie.
    The movie URL is obscured and hard to extract.
    The movie will also only exist at that URL for the next few minutes.

  2. The URL of the movie is in this section of the code (View Source to see):
    <div id="watch-player" ...
    var swf = ...
    
  3. We need to parse the line:
    var swf = ...
  4. There are quite a few URLs there. They are marked by url=
  5. The url= bits are hard to see because they are Percent encoded, so they look like: url%3D
  6. Of these, we want the one marked as x-flv format.

  7. If there are multiple such lines, the first one is our movie.
  8. We will need to remove some bits at start and end of URL.
  9. We will need to percent-decode twice to get a working URL.
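
To see why two decodes are needed, here is a small sketch. It is not part of the recipe itself: it uses awk rather than Perl, the function name percentdecode_awk and the example.com URL are made up for illustration, and it only handles plain ASCII escapes. The point is that the outer layer hides url= as url%3D, while the movie URL inside is encoded a second time, so : shows up as %253A.

```shell
# Hypothetical helper (not from the recipe): percent-decode stdin once.
# ASCII escapes only. Each %XX is decoded exactly once per pass.
percentdecode_awk() {
    awk '
    BEGIN {
        # Build a lookup table from two-digit hex strings to characters.
        for (i = 1; i < 256; i++) hex[sprintf("%02X", i)] = sprintf("%c", i)
    }
    {
        out = ""
        while (match($0, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            out = out substr($0, 1, RSTART - 1) hex[toupper(substr($0, RSTART + 1, 2))]
            $0 = substr($0, RSTART + 3)
        }
        print out $0
    }'
}

# Made-up sample with the same two layers of encoding as the real page.
sample='url%3Dhttp%253A%252F%252Fexample.com%252Fvideoplayback'
once=`printf '%s\n' "$sample" | percentdecode_awk`
twice=`printf '%s\n' "$once" | percentdecode_awk`
printf '%s\n' "$once"     # url=http%3A%2F%2Fexample.com%2Fvideoplayback
printf '%s\n' "$twice"    # url=http://example.com/videoplayback
```

After one pass the url= marker is readable but the address is still encoded. Only the second pass gives something wget could fetch.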


How to write it - The recipe

  1. Grab the "home page" for the video:

     wget  -O - "$1"  > file
    

  2. Extract the URL:

    cat file | extracturl
    

  3. We make a separate Shell script "extracturl" consisting of these programs piped together:

    1. grep 'var swf'
      (extract that line)
    2. put a new line in front of every string "url" (can use sed to do this; the continuation lines must start in column 1, or the leading spaces end up in the output):
      sed 's|url|\
\
url|g'
      
    3. grep url
      (extract only the URLs)
    4. grep "x-flv"
      (extract the FLV format URLs)
    5. head -1
      (get the first one)

    TIP: See what the first one does, then the first 2, then the first 3, until you have all of them piped together.

  4. When you have that debugged, capture the output in a variable:

    url=`cat file | extracturl`
    echo "$url"
    

  5. You should now have a URL that looks like this:

    url%3Dhttp%253A%252F%252Fo-o.preferred.dub06s01.v4.lscache8.c.youtube.com%252Fvideoplayback%253Fsparams%253Dalgorithm%25252Cburst%25252Ccp%25252Cfactor%25252Cid%25252Cip%25252Cipbits%25252Citag%25252Csource%25252Cexpire%2526fexp%253D904549%25252C909902%25252C910207%25252C901604%2526algorithm%253Dthrottle-factor%2526itag%253D34%2526ip%253D136.0.0.0%2526burst%253D40%2526sver%253D3%2526signature%253D40D33F0FAA075B69E335B03BDDB7BE053ABA1F9C.B9FA2FC4BA73C348FAE94CD249136DB62F7064DB%2526source%253Dyoutube%2526expire%253D1331161680%2526key%253Dyt1%2526ipbits%253D8%2526factor%253D1.25%2526cp%253DU0hSRlFRT19NSkNOMl9JS1NHOm9yNXZ5aFV3c1BN%2526id%253D3ae49d53cb5b7076%26quality%3Dmedium%26fallback_host%3Dtc.v4.cache8.c.youtube.com%26type%3Dvideo%252Fx-flv%26itag%3D34%2C

    [2 out of 5]

  6. We percent-decode it for the first time with a little Perl script I will provide. Run this:

    percentdecode "$url"
    

    where "percentdecode" is this Perl script:

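    #!/usr/bin/perl
    # percentdecode: percent-decode the first argument once and print it.

    use strict;
    use warnings;
    use URI::Escape;    # provides uri_unescape

    my $encodedurl = $ARGV[0];

    my $url = uri_unescape($encodedurl);

    print "$url\n";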
    

  7. When you have that working, capture the output in a variable:

    url=`percentdecode "$url"`
    echo "$url"
    

  8. You should now have a URL that looks like this:

    url=http%3A%2F%2Fo-o.preferred.dub06s01.v4.lscache8.c.youtube.com%2Fvideoplayback%3Fsparams%3Dalgorithm%252Cburst%252Ccp%252Cfactor%252Cid%252Cip%252Cipbits%252Citag%252Csource%252Cexpire%26fexp%3D904549%252C909902%252C910207%252C901604%26algorithm%3Dthrottle-factor%26itag%3D34%26ip%3D136.0.0.0%26burst%3D40%26sver%3D3%26signature%3D40D33F0FAA075B69E335B03BDDB7BE053ABA1F9C.B9FA2FC4BA73C348FAE94CD249136DB62F7064DB%26source%3Dyoutube%26expire%3D1331161680%26key%3Dyt1%26ipbits%3D8%26factor%3D1.25%26cp%3DU0hSRlFRT19NSkNOMl9JS1NHOm9yNXZ5aFV3c1BN%26id%3D3ae49d53cb5b7076&quality=medium&fallback_host=tc.v4.cache8.c.youtube.com&type=video%2Fx-flv&itag=34,

  9. There is some rubbish at the start and end of the URL.
  10. To get rid of the url= bit at the start:

    url=`echo "$url" | sed 's|^url=||'`
    echo "$url"
    

  11. We also need to get rid of the itag= .. bit at the end. (Thanks to Stefan Swerk for finding this!)
  12. To do this:

    url=`echo "$url" | sed 's|itag=.*$||'`
    echo "$url"
    

  13. You should now have a URL that looks like this:

    http%3A%2F%2Fo-o.preferred.dub06s01.v4.lscache8.c.youtube.com%2Fvideoplayback%3Fsparams%3Dalgorithm%252Cburst%252Ccp%252Cfactor%252Cid%252Cip%252Cipbits%252Citag%252Csource%252Cexpire%26fexp%3D904549%252C909902%252C910207%252C901604%26algorithm%3Dthrottle-factor%26itag%3D34%26ip%3D136.0.0.0%26burst%3D40%26sver%3D3%26signature%3D40D33F0FAA075B69E335B03BDDB7BE053ABA1F9C.B9FA2FC4BA73C348FAE94CD249136DB62F7064DB%26source%3Dyoutube%26expire%3D1331161680%26key%3Dyt1%26ipbits%3D8%26factor%3D1.25%26cp%3DU0hSRlFRT19NSkNOMl9JS1NHOm9yNXZ5aFV3c1BN%26id%3D3ae49d53cb5b7076&quality=medium&fallback_host=tc.v4.cache8.c.youtube.com&type=video%2Fx-flv&

    [3 out of 5]

  14. One more percent-decode (!) and you get a URL that looks like this:

    http://o-o.preferred.dub06s01.v4.lscache8.c.youtube.com/videoplayback?sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cexpire&fexp=904549%2C909902%2C910207%2C901604&algorithm=throttle-factor&itag=34&ip=136.0.0.0&burst=40&sver=3&signature=40D33F0FAA075B69E335B03BDDB7BE053ABA1F9C.B9FA2FC4BA73C348FAE94CD249136DB62F7064DB&source=youtube&expire=1331161680&key=yt1&ipbits=8&factor=1.25&cp=U0hSRlFRT19NSkNOMl9JS1NHOm9yNXZ5aFV3c1BN&id=3ae49d53cb5b7076&quality=medium&fallback_host=tc.v4.cache8.c.youtube.com&type=video/x-flv&

  15. This is in fact the http address of the actual video, which can be downloaded with wget.

  16. Finally, Google and YouTube sometimes try to stop scripts accessing them. They only want to serve users behind browsers.
    I doubt if they will care about this modest exercise though.
    So for the moment, let us tell them that we are web browsers, not scripts.

    UserAgent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
      
    wget -U "$UserAgent"  -O - "$url"  > file.flv
    

  17. This downloads the FLV file to your disk, here as file.flv. (The finished script writes it to $HOME/tmp/youtube.flv instead.)

    [5 out of 5]
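
The steps above could be pulled together roughly like this. Treat it as a sketch under the 2012 page layout, not a finished script: extracturl is the name used in step 3, but trimurl and main are names invented here, percentdecode is assumed to be the Perl script from step 6 installed somewhere on your PATH, and $HOME/tmp is assumed to exist.

```shell
#!/bin/sh
# Sketch of the whole recipe. Assumes the 2012 YouTube page layout.

# Steps 3.1 to 3.5: read the page HTML on stdin,
# print the first chunk that starts with "url" and mentions x-flv.
extracturl() {
    grep 'var swf' |
    sed 's|url|\
url|g' |
    grep url |
    grep 'x-flv' |
    head -1
}

# Steps 10 and 12 (hypothetical helper name): strip the leading "url="
# and everything from the literal "itag=" to the end of the line.
# Run this after the FIRST decode only, while the inner itag%3D is still encoded.
trimurl() {
    sed -e 's|^url=||' -e 's|itag=.*$||'
}

main() {
    UserAgent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

    wget -O - "$1" > file                    # step 1: grab the "home page"
    url=`extracturl < file`                  # steps 2 to 4
    url=`percentdecode "$url"`               # steps 6 and 7: first decode
    url=`printf '%s\n' "$url" | trimurl`     # steps 10 and 12
    url=`percentdecode "$url"`               # step 14: second decode

    # Steps 16 and 17: pretend to be a browser and fetch the movie.
    wget -U "$UserAgent" -O - "$url" > "$HOME/tmp/youtube.flv"
}

# Only run when given a URL, so the functions can be tried on their own.
if [ $# -gt 0 ]; then
    main "$@"
fi
```

Keeping extracturl and trimurl as separate functions follows the TIP in step 3: you can feed each one a saved page or a pasted line and look at its output before wiring everything together.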



Notes




The FLV file is now on your disk and can be played in RealPlayer or other players.
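
Since the movie URL only lives for a few minutes, a slow run can leave you with an error page saved under an .flv name. A quick sanity check, offered as a sketch only: FLV files begin with the three bytes "FLV", so we can look at the start of the file. The function name is_flv is made up for this example.

```shell
# Hypothetical helper: print "yes" if the file starts with the
# 3-byte FLV signature, otherwise "no".
is_flv() {
    sig=`dd if="$1" bs=3 count=1 2>/dev/null`
    if [ "$sig" = "FLV" ]; then
        echo yes
    else
        echo no
    fi
}
```

For example, is_flv "$HOME/tmp/youtube.flv" should print yes after a successful download.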


