Status Report

I'm pleased to report that the EDGAR Internet system is up and running on schedule. We received our first tape on January 5 and released our first data to the Internet on January 15. The system is still in rough form, but we are actively servicing user requests. As of 5 PM today, 36,922 documents have been transferred using the anonymous FTP service and 10,199 requests have been received by the electronic mail server.

From the initial award in October until the beginning of December, when we received our first funds, we dealt with the public relations fallout on the project and planned how to integrate the new systems into our resources. We were able to cut our first purchase orders in the middle of December and began to receive our equipment after the New Year.

I. Initial Data Processing

During the first two weeks, we configured a 400-MIPS processor from Sun Microsystems with 25 gigabytes of disk space and integrated the system into our network. This was a non-trivial task, involving a full security check, installation of new FTP daemons, installation of a mail server, reorganization of the name space, and various other administrative matters, all intended to provide a secure, solid operating environment.

Once we received our initial data, we wrote programs in Perl and Bourne shell to parse the incoming feed. The data is first broken into files by accession number, and the anomalies we found in the incoming format are cleaned up. The raw data from each day is then packed into compressed file archives for bulk distribution.
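The splitting step might look something like the sketch below. It is written in Python rather than the Perl and Bourne shell we actually used, and the feed layout is an assumption made for illustration: the `<SUBMISSION>` start marker and the `<ACCESSION-NUMBER>` header tag are hypothetical stand-ins, since this report does not describe the real format.

```python
import re

# Hypothetical markers: the real feed layout is not described in this report.
SUBMISSION_START = "<SUBMISSION>"
ACCESSION_RE = re.compile(r"<ACCESSION-NUMBER>\s*(\S+)")


def split_feed(feed_text):
    """Return {accession_number: document_text} for each submission in the feed."""
    documents = {}
    current = []

    def flush():
        # Store the lines collected so far under their accession number.
        if not current:
            return
        text = "\n".join(current) + "\n"
        match = ACCESSION_RE.search(text)
        if match:  # skip chunks with no accession number rather than guess one
            documents[match.group(1)] = text
        current.clear()

    for line in feed_text.splitlines():
        if line.strip() == SUBMISSION_START:
            flush()  # a new submission begins, so close out the previous one
        current.append(line.rstrip())  # light cleanup: drop trailing whitespace
    flush()
    return documents
```

Each value in the returned dictionary could then be written to a file named after its key, which matches the one-file-per-accession-number layout described above.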

The individual files are then placed in a subdirectory and the header information is converted from SGML into English. We keep the SGML header files around, as they will be quite useful for text indexing and for World Wide Web-based access techniques.

In addition to the basic data files, we provide rudimentary indices. There is a company index and a form index, each in raw text, Macintosh .sit format, and PC .zip format. We generate daily indices as well as full files.

II. Issues for Basic Data

For the basic data, there are two concerns. First, the indices are growing by approximately 10 Kbytes per day. We don't think people should be pulling down a full index every day, but they seem to be doing that. Our hope is that adding a text search engine to both the mail and FTP servers will reduce the need for those full downloads.

The second problem is the way we structure the file system. Currently, each file is named by the unique accession number and placed in a data subdirectory. That directory has enough files in it that some iconic FTP interfaces have problems because they attempt to parse the entire subdirectory before allowing the user to examine a file.

We propose to fix this problem by moving the files into subdirectories organized by the SEC-assigned Central Index Key (CIK). One problem with the CIK is that a given document may have several different CIKs (e.g., in the case of a 14D1/A filing). That will entail a variety of soft links to place a given real file into several subdirectories.
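A minimal sketch of the soft-link layout follows, in Python for illustration. The directory names, the function name `link_by_cik`, and the idea of keeping the real file in the existing data directory are all assumptions; the final layout has not been settled.

```python
import os


def link_by_cik(data_dir, by_cik_dir, accession, ciks):
    """Expose data_dir/<accession> under by_cik_dir/<cik>/ for every CIK given."""
    target = os.path.abspath(os.path.join(data_dir, accession))
    links = []
    for cik in ciks:
        cik_dir = os.path.join(by_cik_dir, cik)
        os.makedirs(cik_dir, exist_ok=True)  # one subdirectory per CIK
        link = os.path.join(cik_dir, accession)
        if not os.path.islink(link):
            os.symlink(target, link)  # soft link; the real file stays in place
        links.append(link)
    return links
```

A filing with two CIKs would then appear in two CIK subdirectories while occupying disk space only once.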

III. Initial Access Methods

Our initial access is provided using FTP and electronic mail. The FTP software is based on the Washington University FTP daemon, considered to be the most powerful and flexible solution for large-scale access. The mail server is based on some code that comes from Amsterdam and Germany, which was further developed at the University of Colorado and by my staff.

The mail server had some initial problems handling the load. We modified the code to queue requests and believe things have stabilized. We lost a few hundred messages the first day, but had the incoming addresses and sent people notes. We also lost an additional 209 requests, but were again able to notify people.

The FTP software has worked without a hitch. A few users got confused about how to log in, but our new banner messages appear to have solved that problem. We've also added special banners for the telnet and finger interfaces for people who get lost in cyberspace and attempt to walk in the wrong door.

With the exception of a 30-minute period when we had to switch to battery backup during the D.C. power emergency, we've provided non-stop service on the system. There have been a few brief (less than 2 minute) service interruptions.

IV. Future Development

We currently have our World Wide Web, WAIS, Gopher, and LQ subsystems compiled and are getting ready to put them into operation. As you know, the subsystems have the following functions:

World Wide Web provides a hypertext interface to data archives. We are using the WWW server provided by the National Center for Supercomputing Applications at the University of Illinois.

WAIS is a keyword interface that uses the Z39.50 protocol. We are using the "freeWAIS" software from the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR) at the University of North Carolina.

LQ is a text indexing engine from the University of California at Berkeley.

Gopher is a menu-based retrieval system developed at the University of Minnesota. We are using the Minnesota server software, but are considering alternatives.

Once we're convinced that the server platforms are stable and secure, we'll turn our attention to generating the data-specific scripts to make the SEC data available via these interfaces. Our goal in all cases is to allow people to type in the name of a company and get back the list of forms filed by that company.
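The lookup we are aiming for can be sketched in a few lines of Python. This is an illustration only: the record layout (company name, form type, accession number) and the function names are assumptions, and the real interfaces would sit on top of the WAIS, Gopher, or WWW servers rather than an in-memory table.

```python
from collections import defaultdict


def build_company_index(records):
    """Map a normalized company name to its (form_type, accession) filings."""
    index = defaultdict(list)
    for company, form_type, accession in records:
        index[company.strip().lower()].append((form_type, accession))
    return index


def forms_for(index, query):
    """Return filings for every company whose name contains the query."""
    needle = query.strip().lower()
    hits = []
    for company in sorted(index):  # sorted for a stable result order
        if needle in company:
            hits.extend(index[company])
    return hits
```

Matching on a lowercase substring is a deliberate simplification so that a user does not have to type the exact registered name of a company.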

The other enhancement we're currently working on is digital signatures on the data. We feel it is important that there be some authoritative digital stamp on the documents, since the SEC and the EDGAR Dissemination Service have provided no mechanisms to ensure the continued integrity of the data once it leaves their systems.

We are using the RSA Reference Implementation, which provides us with a public/private key pair. We are investigating two solutions, MIT Crypt and RIPEM, to actually generate the digital signatures for each document. By using our public key and the digital signature, a user with appropriate software will be able to verify that the document they are looking at is the same document that we released.
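The sign-then-verify idea can be illustrated with a toy example in Python. This is textbook RSA over a SHA-256 digest, not the RSA Reference Implementation, MIT Crypt, or RIPEM, and the key is an assumption chosen for readability: it is built from two small, publicly known Mersenne primes and offers no real security.

```python
import hashlib

# Toy key built from two well-known Mersenne primes. Far too small for real use.
P = 2**61 - 1
Q = 2**31 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def digest(document: bytes) -> int:
    """Hash the document and reduce it into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % N


def sign(document: bytes) -> int:
    """Publisher side: raise the digest to the private exponent."""
    return pow(digest(document), D, N)


def verify(document: bytes, signature: int) -> bool:
    """User side: only the public values N and E are needed."""
    return pow(signature, E, N) == digest(document)
```

A user who holds the document, the signature, and the public key can call `verify` without ever seeing the private exponent; any change to the document changes the digest and the check fails.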

Needless to say, we'd be happier if the digital signatures were applied upstream, since that would provide data integrity across all the distribution channels. Until then, however, we feel it is very important to provide this service to our users.

Appendix 1: FTP Server Summary Statistics

TOTALS FOR SUMMARY PERIOD Wed Jan 5 1994 TO Thu Feb 3 1994

Files Transmitted During Summary Period         38774
Bytes Transmitted During Summary Period    4656705644
Systems Using Archives                              0

Average Files Transmitted Daily                  1846
Average Bytes Transmitted Daily             221747888

Daily Transmission Statistics

                  Number Of    Number of     Average  Percent Of  Percent Of
Date             Files Sent   Bytes Sent   Xmit Rate  Files Sent  Bytes Sent
---------------  ----------  -----------  ----------  ----------  ----------
Wed Jan  5 1994           1          257    0.3 KB/s        0.00        0.00
Sat Jan 15 1994           1          148    0.1 KB/s        0.00        0.00
Sun Jan 16 1994         218    132897170   15.5 KB/s        0.56        2.85
Mon Jan 17 1994        6342    499366028    8.8 KB/s       16.36       10.72
Tue Jan 18 1994        2871    214809644    7.1 KB/s        7.40        4.61
Wed Jan 19 1994        1943    315440851    8.8 KB/s        5.01        6.77
Thu Jan 20 1994        5181    437053262    8.3 KB/s       13.36        9.39
Fri Jan 21 1994        2090    255587792    5.1 KB/s        5.39        5.49
Sat Jan 22 1994        1483    228889918   14.1 KB/s        3.82        4.92
Sun Jan 23 1994         595     36484465    4.3 KB/s        1.53        0.78
Mon Jan 24 1994        2104    230100100    5.5 KB/s        5.43        4.94
Tue Jan 25 1994        2268    195191864    6.4 KB/s        5.85        4.19
Wed Jan 26 1994        1643    204294834    5.9 KB/s        4.24        4.39
Thu Jan 27 1994        1974    249368147    7.9 KB/s        5.09        5.36
Fri Jan 28 1994        2912    289582628   11.0 KB/s        7.51        6.22
Sat Jan 29 1994        1379    247572748   12.9 KB/s        3.56        5.32
Sun Jan 30 1994         444    123203268   10.7 KB/s        1.15        2.65
Mon Jan 31 1994        1165    239943775    5.8 KB/s        3.00        5.15
Tue Feb  1 1994        1826    364662619    8.2 KB/s        4.71        7.83
Wed Feb  2 1994        1613    247836020    6.5 KB/s        4.16        5.32
Thu Feb  3 1994         721    144420106    0.7 KB/s        1.86        3.10
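The averages and percentages above appear to be derived as follows. This short Python check is an inference from the published numbers, not a description of the statistics program itself: it assumes the daily averages divide by the 21 dates that appear in the table rather than by the 30 calendar days in the summary period, which is the only divisor that reproduces the reported figures.

```python
# Totals and one sample row copied from the tables above.
TOTAL_FILES = 38774
TOTAL_BYTES = 4656705644
DAYS_WITH_TRAFFIC = 21  # number of dates listed in the daily table


def daily_average(total, days):
    """Average per listed day, rounded to a whole number."""
    return round(total / days)


def percent_of(part, total):
    """Share of the period total, as a percentage with two decimals."""
    return round(100 * part / total, 2)


print(daily_average(TOTAL_FILES, DAYS_WITH_TRAFFIC))  # 1846
print(daily_average(TOTAL_BYTES, DAYS_WITH_TRAFFIC))  # 221747888
print(percent_of(6342, TOTAL_FILES))                  # 16.36 (Mon Jan 17 files)
print(percent_of(499366028, TOTAL_BYTES))             # 10.72 (Mon Jan 17 bytes)
```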

Total Transfers from each Archive Section (By bytes)

                                                     ---- Percent Of ----
Archive Section            Files Sent   Bytes Sent  Files Sent  Bytes Sent
-------------------------  ----------  -----------  ----------  ----------
/edgar/data1                    22879   1819541537       59.01       39.07
/edgar/data1/Feed                 173   1268833890        0.45       27.25
/edgar                          10796    850114732       27.84       18.26
/radio/private/Current            168    599103337        0.43       12.87
/edgar/full-index                 215     32108998        0.55        0.69
/edgar/daily-index               1091     19834032        2.81        0.43
/edgar/docs                      1650     17003516        4.26        0.37
/radio/private/Production           3     13531968        0.01        0.29
/other/fed/h_15                    52     10839945        0.13        0.23
/other/fed/money                  125      9435354        0.32        0.20
/edgar/master-index               104      3828007        0.27        0.08
/other/fed/others                  42      3015399        0.11        0.06
/other/fed/flow                    88      2937140        0.23        0.06
/other/fed/h_4_2                    9      2075867        0.02        0.04
/other/fed/g_17                    36      1971715        0.09        0.04
/usr/lib                           15      1449428        0.04        0.03
Index/Informational Files         911       559146        2.35        0.01
/other/fed/h_3                      9       229682        0.02        0.00
/other/fed                        128       109568        0.33        0.00
/other/fed/g_17_his                 6        96273        0.02        0.00
/radio                            163        70579        0.42        0.00
/other                             91        10465        0.23        0.00
/etc                               20         5066        0.05        0.00

Total Transfer Amount By Domain

              Number Of     Number of     Average  Percent Of  Percent Of
Domain Name  Files Sent    Bytes Sent   Xmit Rate  Files Sent  Bytes Sent
-----------  ----------  ------------  ----------  ----------  ----------
au                  242     212395005    6.6 KB/s        0.62        4.56
ca                  288      42976906    1.2 KB/s        0.74        0.92
ch                   27       8635418    4.7 KB/s        0.07        0.19
cl                    1          6122    6.1 KB/s        0.00        0.00
de                  129       9654448    1.1 KB/s        0.33        0.21
dk                    2        221169    2.7 KB/s        0.01        0.00
ee                    4         23885    1.3 KB/s        0.01        0.00
fi                   25       1628115    6.6 KB/s        0.06        0.03
fr                   13       1540101    2.0 KB/s        0.03        0.03
hk                    1          6122    6.1 KB/s        0.00        0.00
hu                    1           614    0.2 KB/s        0.00        0.00
ie                    2         12244    6.1 KB/s        0.01        0.00
il                   23       1205572    0.9 KB/s        0.06        0.03
it                    4         13900    1.2 KB/s        0.01        0.00
jp                    7        566046    1.4 KB/s        0.02        0.01
nl                   62       1953054    1.3 KB/s        0.16        0.04
no                   12        378553    4.5 KB/s        0.03        0.01
pl                    1          6123    6.1 KB/s        0.00        0.00
pt                    5        110633    2.0 KB/s        0.01        0.00
se                   16        342009    1.5 KB/s        0.04        0.01
sg                    8        402472    0.4 KB/s        0.02        0.01
th                    1           928    0.9 KB/s        0.00        0.00
uk                   58       2017597    3.3 KB/s        0.15        0.04
us                   98       7942550    2.3 KB/s        0.25        0.17
com               19799    2079830602    5.1 KB/s       51.06       44.66
edu                9414     943692733   10.1 KB/s       24.28       20.27
gov                1376     311658667   15.9 KB/s        3.55        6.69
mil                2144     148487043   22.5 KB/s        5.53        3.19
net                 963     428015138    6.3 KB/s        2.48        9.19
org                1135      86851697    5.3 KB/s        2.93        1.87
arpa                 12        445761    4.5 KB/s        0.03        0.01
wustl.edu            69     188347006   14.8 KB/s        0.18        4.04
unresolved         2832     177337411    2.5 KB/s        7.30        3.81

These figures only reflect ANONYMOUS FTP transfers.

Hourly Transmission Statistics

       Number Of    Number of     Average  Percent Of  Percent Of
Time  Files Sent   Bytes Sent   Xmit Rate  Files Sent  Bytes Sent
----  ----------  -----------  ----------  ----------  ----------
00           630    158458232    8.0 KB/s        1.62        3.40
01           453     62501464    5.5 KB/s        1.17        1.34
02           317     55373539    7.3 KB/s        0.82        1.19
03          3882    491112955   34.4 KB/s       10.01       10.55
04           599     76796057   12.9 KB/s        1.54        1.65
05           150     19262050    5.9 KB/s        0.39        0.41
06           141     25939037    5.5 KB/s        0.36        0.56
07           272     69795881    6.4 KB/s        0.70        1.50
08           510     64846798    7.5 KB/s        1.32        1.39
09          1342     71804906    3.9 KB/s        3.46        1.54
10          2140    250608731    7.5 KB/s        5.52        5.38
11          3644    464807330    9.9 KB/s        9.40        9.98
12          2231    347421018    8.6 KB/s        5.75        7.46
13          3072    271657357    6.2 KB/s        7.92        5.83
14          2686    382335604    7.1 KB/s        6.93        8.21
15          3018    285721665    1.4 KB/s        7.78        6.14
16          2698    281572338    4.4 KB/s        6.96        6.05
17          2803    380722163    8.2 KB/s        7.23        8.18
18          1271    162302737    4.9 KB/s        3.28        3.49
19          1368    148467615    6.6 KB/s        3.53        3.19
20          1346    112854826    5.8 KB/s        3.47        2.42
21          1320    231524508    9.3 KB/s        3.40        4.97
22          1291    126697206    4.8 KB/s        3.33        2.72
23          1590    114121627    6.4 KB/s        4.10        2.45
