Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2003/09/04/ebi-ftp-size

How much data at EBI?

Step 1: get the ls -lR

% ftp ftp.ebi.ac.uk
Connected to alpha4.ebi.ac.uk.
220 ftp1.ebi.ac.uk FTP server (Version wu-2.6.2(2) Wed Aug 20 08:58:45 BST 2003) ready.
Name (ftp.ebi.ac.uk:dalke): anonymous
331 Guest login ok, send your complete e-mail address as password.
Password:
230-Welcome anonymous@h-66-167-134-137.PHNDAZ91.covad.net
230-
230-This is (mostly) a genetic database entry server.
230-If you have any unusual problems, please report them via e-mail to
230-    nethelp@ebi.ac.uk
230-If you do have problems, please try using a dash (-) as the first
230-character of your password -- this will turn off the continuation
230-messages that may be confusing your ftp client.
230-
230-You are currently user number 33 out of 180 allowed.
230-
230-All transfers are logged.
230-
230 Guest login ok, access restrictions apply.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd pub
250 CWD command successful.
ftp> ls -lR ebi.ls-lR.dat
output to local-file: ebi.ls-lR.dat [anpqy?]? y
227 Entering Passive Mode (193,62,196,103,179,174)
150 Opening ASCII mode data connection for /bin/ls.
226 Transfer complete.
ftp> 
ftp> quit
221-You have transferred 0 bytes in 0 files.
221-Total traffic for this session was 23921687 bytes in 4 transfers.
221-Thank you for using the FTP service on ftp1.ebi.ac.uk.
221 Goodbye.
%
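
The same listing can be fetched without an interactive session. Here is a minimal sketch using Python's standard ftplib module. The function names are made up for illustration, and it assumes the server passes "-lR" through to ls the way this wu-ftpd server did (not every FTP server does).

```python
from ftplib import FTP


def save_recursive_listing(ftp, directory, out_path):
    """Write a recursive long listing of `directory` to `out_path`.

    `ftp` is a logged-in ftplib.FTP, or any object with the same cwd()
    and retrlines() methods. Returns the number of lines written.
    """
    ftp.cwd(directory)
    count = 0
    with open(out_path, "w") as outfile:
        def write_line(line):
            nonlocal count
            # retrlines() strips the line ending, so put one back.
            # Blank lines are kept; they separate the directories.
            outfile.write(line + "\n")
            count += 1
        # Assumption: the server accepts ls options after LIST.
        ftp.retrlines("LIST -lR", write_line)
    return count


def fetch_ebi_listing(out_path="ebi.ls-lR.dat"):
    """Anonymous login to ftp.ebi.ac.uk and "cd pub", as in the session above.

    Needs network access, so it is defined here but not called.
    """
    with FTP("ftp.ebi.ac.uk") as ftp:
        ftp.login()  # no arguments means an anonymous login
        return save_recursive_listing(ftp, "pub", out_path)
```

Passing the connection in as an argument keeps the part that writes the file separate from the part that touches the network, so the first function can be tried against a stand-in object.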

Step 2: parse the output

def size_scale(n):
    """Format a byte count with the largest unit (B, KB, MB, GB) that fits."""
    names = ["B ", "KB", "MB", "GB"]
    i = 0
    # ">=" so that exactly 1024 bytes reads as 1 KB; stop at the last unit.
    while n >= 1024 and i < len(names) - 1:
        i = i + 1
        n = n // 1024   # integer division, so sizes are rounded down
    # Larger units are indented less, which gives each unit its own column.
    return "%s%3s %s%s" % ("   " * (3-i), n, names[i], "   " * i)


def main():
    (SKIP,        # top-level listing: record directory names, skip the rest
     SKIP_TOTAL,  # skip the "total" line that follows a directory header
     PARSE_ENTRY, # parse a file/directory entry
     PARSE_DIR,   # parse a "./dir/subdir:" header line
     ) = range(4)

    sizes = {}    # top-level directory name -> total bytes beneath it

    dirname = None
    state = SKIP
    with open("ebi.ls-lR.dat") as infile:
        for line in infile:
            if state == PARSE_ENTRY:
                if line == "\n":
                    state = PARSE_DIR
                elif not line.startswith("d"):
                    # Directory entries are skipped here; their contents
                    # are listed, and counted, in their own section.
                    words = line.split()
                    sizes[dirname] += int(words[4])
            elif state == PARSE_DIR:
                assert line.startswith(".")
                # "./a/b:\n" -> "a", the top-level directory
                dirname = line[:-2].split("/")[1]
                state = SKIP_TOTAL
            elif state == SKIP_TOTAL:
                if line == "\n":
                    # empty directory
                    state = PARSE_DIR
                else:
                    state = PARSE_ENTRY
            elif state == SKIP:
                if line == "\n":
                    state = PARSE_DIR
                elif line.startswith("d"):
                    words = line.split()
                    sizes[words[8]] = 0

    # Case-insensitive sort by directory name
    for k, v in sorted(sizes.items(), key=lambda item: item[0].lower()):
        print("%-16s %s" % (k, size_scale(v)))
    print("   total ==> ", size_scale(sum(sizes.values())))

if __name__ == "__main__":
    main()

(There are some bugs in it. It won't handle a directory with a space in its name, and for a symbolic link it adds the size column, which is the number of characters in the link target rather than real data. Fixable, but not needed for a quick and dirty estimate.)
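
Both bugs come from how an entry line is split into words. Here is a minimal sketch of a sturdier entry parser. parse_entry is a hypothetical helper, not part of the script above, and it assumes the nine-column "ls -l" layout the script already relies on (size in the fifth column, name last).

```python
def parse_entry(line):
    """Split one "ls -l" entry line into (kind, size, name).

    kind is the first character of the mode string: "-" for a regular
    file, "d" for a directory, "l" for a symbolic link, and so on.
    Assumed columns: mode, links, owner, group, size, month, day,
    time-or-year, name.
    """
    # maxsplit=8 keeps everything after the eighth field together,
    # so a name containing spaces survives intact.
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) < 9:
        raise ValueError("not an ls -l entry: %r" % (line,))
    kind = fields[0][0]
    size = int(fields[4])
    name = fields[8]
    if kind == "l":
        # A link's size column is the length of its target path,
        # not data on the server, so count it as zero.
        name = name.split(" -> ", 1)[0]
        size = 0
    return kind, size, name
```

In the script, the top-level pass would use the returned name in place of words[8], and the per-directory pass would add the returned size for every entry whose kind is not "d".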

Step 3: the results

16S_RNA               5 MB      
3d_ali               14 MB      
3Dee                  1 MB      
ace                        10 B 
alu                    147 KB   
androgenr             2 MB      
arrayexpress         44 MB      
ASD                 659 MB      
berlin                 808 KB   
bio_catal             1 MB      
blocks              527 MB      
blocks.old            2 MB      
camoddssp             4 MB      
camodhssp             4 MB      
camodpdb             15 MB      
cd40lbase             7 MB      
clustr               84 MB      
codonusage             148 KB   
cpgisle               5 MB      
cutg                327 MB      
dali                       63 B 
databanks             2 MB      
dbcat                  333 KB   
dbEST             19 GB         
dbGSS              4 GB         
dbSTS               474 MB      
DictyDB             111 MB      
domo                 21 MB      
dssp                451 MB      
ecdc                 24 MB      
edgp                185 MB      
edpcc                       0 B 
EGI                  14 MB      
embl              33 GB         
embnet                  34 KB   
emdb                577 MB      
emp                   4 MB      
emvec                 2 MB      
enzyme               15 MB      
epd                 442 MB      
fans_ref               200 KB   
fingerPRINTScan     408 MB      
flybase             776 MB      
fssp                300 MB      
geneticcode              8 KB   
GO                  236 MB      
gpcrdbsup             3 MB      
haema                  143 KB   
haemb                  125 KB   
hla                       362 B 
hovergen            353 MB      
hssp               9 GB         
imgt                189 MB      
info                  4 MB      
interpro            207 MB      
IPI                1 GB         
journals_toc          7 MB      
kabat                71 MB      
limb                   290 KB   
lista                 5 MB      
MassSpecDB         1 GB         
mdm2                    41 KB   
methyl                 183 KB   
misfolded             1 MB      
models                 525 KB   
msd               13 GB         
mutres                 119 KB   
nrdb90              118 MB      
nrsub                 7 MB      
nucleosomal_dna         18 KB   
p53                   3 MB      
p53APC                1 MB      
parasites           809 MB      
pdb_finder            6 MB      
pdb_select             244 KB   
pdb_seq               7 MB      
PeptideSearch        52 MB      
Pfam                558 MB      
pir                 143 MB      
pir2sptr              1 MB      
piraln               10 MB      
pkcdd                  167 KB   
plmitrna                33 KB   
primers                174 KB   
prints              284 MB      
prodom             3 GB         
prof_pat             99 MB      
prosite              23 MB      
puu                   1 MB      
ras                   2 MB      
rcsb              11 GB         
rdp                  90 MB      
rebase               11 MB      
relibrary              199 KB   
repbase              17 MB      
RESID                 2 MB      
RHdb                312 MB      
rldb                  5 MB      
rrna                 43 MB      
sbase                47 MB      
seqanalref            2 MB      
smallrna                90 KB   
sp_tr_nrdb         1 GB         
SPproteomes        2 GB         
srp                   1 MB      
stackdb                  6 KB   
stride               50 MB      
SubtiList            33 MB      
swissprot          2 GB         
taxonomy             45 MB      
testsets               943 KB   
tfd                   4 MB      
tmp                   7 MB      
transfac              9 MB      
transterm             3 MB      
trembl             1 GB         
trna                  1 MB      
Unigene            6 GB         
UTR                  50 MB      
variantdbs         2 GB         
xray                   287 KB   
yeast                24 MB      
   total ==>  123 GB         

Step 4: the largest directories

% python get_ebi_size.py | grep GB | grep -v total | awk '{print $2, $1}' | sort -rn | head
33 embl
19 dbEST
13 msd
11 rcsb
9 hssp
6 Unigene
4 dbGSS
3 prodom
2 variantdbs
2 swissprot
%
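
The pipeline ranks on the GB figure after it has been rounded down, so directories that show the same number (three of them show 2 GB in Step 3) are not ordered by their real sizes. Here is a sketch of ranking on raw byte counts instead, assuming a dictionary like the sizes one built in Step 2. largest_directories is an illustrative name, and the byte counts in the example are made up.

```python
import heapq


def largest_directories(sizes, n=10):
    """Return the n (name, size) pairs with the largest sizes, biggest first."""
    return heapq.nlargest(n, sizes.items(), key=lambda item: item[1])


# Made-up byte counts, for illustration only.
example = {"alpha": 5 * 1024**3, "beta": 700, "gamma": 3 * 1024**2}
for name, size in largest_directories(example, n=2):
    print(name, size)
```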

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.



Copyright © 2001-2013 Andrew Dalke Scientific AB