How much data at EBI?
Step 1: get the ls -lR
% ftp ftp.ebi.ac.uk
Connected to alpha4.ebi.ac.uk.
220 ftp1.ebi.ac.uk FTP server (Version wu-2.6.2(2) Wed Aug 20 08:58:45 BST 2003) ready.
Name (ftp.ebi.ac.uk:dalke): anonymous
331 Guest login ok, send your complete e-mail address as password.
Password:
230-Welcome anonymous@h-66-167-134-137.PHNDAZ91.covad.net
230-
230-This is (mostly) a genetic database entry server.
230-If you have any unusual problems, please report them via e-mail to
230-    nethelp@ebi.ac.uk
230-If you do have problems, please try using a dash (-) as the first
230-character of your password -- this will turn off the continuation
230-messages that may be confusing your ftp client.
230-
230-You are currently user number 33 out of 180 allowed.
230-
230-All transfers are logged.
230-
230 Guest login ok, access restrictions apply.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd pub
250 CWD command successful.
ftp> ls -lR ebi.ls-lR.dat
output to local-file: ebi.ls-lR.dat [anpqy?]? y
227 Entering Passive Mode (193,62,196,103,179,174)
150 Opening ASCII mode data connection for /bin/ls.
226 Transfer complete.
ftp>
ftp> quit
221-You have transferred 0 bytes in 0 files.
221-Total traffic for this session was 23921687 bytes in 4 transfers.
221-Thank you for using the FTP service on ftp1.ebi.ac.uk.
221 Goodbye.
%
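If you'd rather not type at the ftp prompt, the same fetch can be scripted with Python's ftplib. This is only a rough sketch; it assumes the server accepts a recursive "LIST -R" (this wu-ftpd hands the flags to /bin/ls, as the transfer message above suggests) and that anonymous login is allowed.

# Sketch: fetch the recursive listing with ftplib instead of the ftp client.
# Assumes the server honors "LIST -R" and allows anonymous login.
from ftplib import FTP

def save_listing(host="ftp.ebi.ac.uk", outname="ebi.ls-lR.dat"):
    ftp = FTP(host)
    ftp.login()                 # anonymous login
    ftp.cwd("pub")
    outfile = open(outname, "w")
    # retrlines calls the callback once per line, without the newline
    ftp.retrlines("LIST -R", lambda line: outfile.write(line + "\n"))
    outfile.close()
    ftp.quit()

if __name__ == "__main__":
    save_listing()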
Step 2: parse the output
def size_scale(n):
    names = ["B ", "KB", "MB", "GB"]
    i = 0
    while n > 1024:
        i = i + 1
        n = n // 1024
    return "%s%3s %s%s" % (" " * (3-i), n, names[i], " " * i)

def main():
    # states for the little parser below
    (SKIP,          # skip the top-level directory
     SKIP_TOTAL,    # skip a line
     PARSE_ENTRY,   # parse a file/directory entry
     PARSE_DIR,     # parse the top-level directory info
     ) = range(4)

    infile = open("ebi.ls-lR.dat")
    sizes = {}
    dirname = None
    state = SKIP

    # Walk the "ls -lR" output with a small state machine, adding each
    # file's size to the total for its top-level directory.
    for lineno, line in enumerate(infile):
        if state is PARSE_ENTRY:
            if line == "\n":
                state = PARSE_DIR
            else:
                if line.startswith("d"):
                    # directory entries don't add to the size
                    pass
                else:
                    words = line.split()
                    size = long(words[4])
                    sizes[dirname] += size
        elif state is PARSE_DIR:
            # a header like "./embl/release:" -- keep the top-level name
            assert line.startswith(".")
            dirname = line[:-2].split("/")[1]
            state = SKIP_TOTAL
        elif state is SKIP_TOTAL:
            if line == "\n":
                # empty directory
                state = PARSE_DIR
            else:
                state = PARSE_ENTRY
        elif state is SKIP:
            if line == "\n":
                state = PARSE_DIR
            elif line.startswith("d"):
                # record each top-level directory name
                words = line.split()
                sizes[words[8]] = 0L

    items = sizes.items()
    items.sort(lambda x, y: cmp(x[0].lower(), y[0].lower()))
    for k, v in items:
        print "%-16s %s" % (k, size_scale(v))
    print " total ==> ", size_scale(sum(sizes.values()))

if __name__ == "__main__":
    main()

(There are some bugs in it. It won't handle a directory with a space in its name, and it counts the number of characters in a symbolic link as if it were file data. Fixable, but not needed for a quick and dirty estimate.)
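If those bugs ever did matter, the entry parsing could be tightened up along these lines. This is an untested sketch, not part of the script above: skip symlinks as well as directories, and split on at most 8 whitespace runs so a name containing spaces stays intact.

# Sketch of a stricter entry parser (not used above): ignore directories and
# symlinks, and keep spaces in the file name by limiting the split.
def parse_entry(line):
    # "ls -l" fields: mode, links, owner, group, size, month, day, time, name
    if line.startswith("d") or line.startswith("l"):
        return None
    words = line.split(None, 8)
    return words[8].rstrip("\n"), long(words[4])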
Step 3: the results
16S_RNA            5 MB
3d_ali            14 MB
3Dee               1 MB
ace               10 B
alu              147 KB
androgenr          2 MB
arrayexpress      44 MB
ASD              659 MB
berlin           808 KB
bio_catal          1 MB
blocks           527 MB
blocks.old         2 MB
camoddssp          4 MB
camodhssp          4 MB
camodpdb          15 MB
cd40lbase          7 MB
clustr            84 MB
codonusage       148 KB
cpgisle            5 MB
cutg             327 MB
dali              63 B
databanks          2 MB
dbcat            333 KB
dbEST             19 GB
dbGSS              4 GB
dbSTS            474 MB
DictyDB          111 MB
domo              21 MB
dssp             451 MB
ecdc              24 MB
edgp             185 MB
edpcc              0 B
EGI               14 MB
embl              33 GB
embnet            34 KB
emdb             577 MB
emp                4 MB
emvec              2 MB
enzyme            15 MB
epd              442 MB
fans_ref         200 KB
fingerPRINTScan  408 MB
flybase          776 MB
fssp             300 MB
geneticcode        8 KB
GO               236 MB
gpcrdbsup          3 MB
haema            143 KB
haemb            125 KB
hla              362 B
hovergen         353 MB
hssp               9 GB
imgt             189 MB
info               4 MB
interpro         207 MB
IPI                1 GB
journals_toc       7 MB
kabat             71 MB
limb             290 KB
lista              5 MB
MassSpecDB         1 GB
mdm2              41 KB
methyl           183 KB
misfolded          1 MB
models           525 KB
msd               13 GB
mutres           119 KB
nrdb90           118 MB
nrsub              7 MB
nucleosomal_dna   18 KB
p53                3 MB
p53APC             1 MB
parasites        809 MB
pdb_finder         6 MB
pdb_select       244 KB
pdb_seq            7 MB
PeptideSearch     52 MB
Pfam             558 MB
pir              143 MB
pir2sptr           1 MB
piraln            10 MB
pkcdd            167 KB
plmitrna          33 KB
primers          174 KB
prints           284 MB
prodom             3 GB
prof_pat          99 MB
prosite           23 MB
puu                1 MB
ras                2 MB
rcsb              11 GB
rdp               90 MB
rebase            11 MB
relibrary        199 KB
repbase           17 MB
RESID              2 MB
RHdb             312 MB
rldb               5 MB
rrna              43 MB
sbase             47 MB
seqanalref         2 MB
smallrna          90 KB
sp_tr_nrdb         1 GB
SPproteomes        2 GB
srp                1 MB
stackdb            6 KB
stride            50 MB
SubtiList         33 MB
swissprot          2 GB
taxonomy          45 MB
testsets         943 KB
tfd                4 MB
tmp                7 MB
transfac           9 MB
transterm          3 MB
trembl             1 GB
trna               1 MB
Unigene            6 GB
UTR               50 MB
variantdbs         2 GB
xray             287 KB
yeast             24 MB
 total ==>  123 GB
Step 4: the largest directories
% python get_ebi_size.py | grep GB | grep -v total | awk '{print $2, $1}' | sort -rn | head
33 embl
19 dbEST
13 msd
11 rcsb
9 hssp
6 Unigene
4 dbGSS
3 prodom
2 variantdbs
2 swissprot
%
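The same cut could come straight out of Python instead of the shell. Here's a rough sketch that assumes the sizes dictionary built in main() above (directory name mapped to a byte count); like the pipeline, it's only meaningful for the gigabyte-scale directories.

# Sketch: print the ten largest top-level directories, biggest first,
# assuming `sizes` maps directory name -> total size in bytes (as in main()).
def largest(sizes, n=10):
    items = sizes.items()
    items.sort(lambda x, y: cmp(y[1], x[1]))    # biggest first
    for name, nbytes in items[:n]:
        print "%2d %s" % (nbytes // (1024 ** 3), name)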