Scrawl 15

1/15/2024

The WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This includes not only the response itself (what you would get if you downloaded the file) but also the HTTP header information, which can be used to glean a number of interesting insights.

In the example below, we can see the crawler contacted a page on bbc.co.uk and received HTML in response. We can also see the page was served from the Apache web server, sets caching details, and attempts to set a cookie (shortened for display here).

    WARC/1.0
    WARC-Type: response
    WARC-Date:
    WARC-Record-ID:
    Content-Length: 43428
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID:
    WARC-Concurrent-To:
    WARC-IP-Address: 212.58.244.61
    WARC-Target-URI:
    WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
    WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
    WARC-Truncated: length

    HTTP/1.1 200 OK
    Server: Apache
    Vary: X-CDN
    Cache-Control: max-age=0
    Content-Type: text/html
    Date: Sat, 09:52:13 GMT
    Expires: Sat, 09:52:13 GMT
    Connection: close
    Set-Cookie: BBC-UID=...; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk

    BBC NEWS | Africa | Namibia braces for Nujoma exit

The WAT Format

WAT files contain important metadata about the records stored in the WARC format. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

The metadata is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the file yourself, you can use one of the many formatting tools available, such as JSONFormatter.io. The HTTP response metadata is most likely to be of interest to Common Crawl users.
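As an alternative to an online formatter, the whitespace-stripped JSON found in WAT files can also be expanded locally. The following is a minimal Python sketch using only the standard library. The sample record and its field names (Envelope, Payload-Metadata, HTTP-Response-Metadata, Headers) are illustrative assumptions, not copied from a real WAT file, and the helper names are hypothetical.

```python
import json

# Hypothetical, heavily shortened WAT-style record.
# The nesting and field names are assumptions made for illustration.
COMPACT_RECORD = (
    '{"Envelope":{"WARC-Header-Metadata":{"WARC-Type":"response"},'
    '"Payload-Metadata":{"HTTP-Response-Metadata":{"Headers":'
    '{"Server":"Apache","Content-Type":"text/html"}}}}}'
)


def pretty_print(compact_json: str) -> str:
    """Re-indent whitespace-stripped JSON so a human can read it."""
    return json.dumps(json.loads(compact_json), indent=2, sort_keys=True)


def response_headers(compact_json: str) -> dict:
    """Return the HTTP response headers, or {} if the record has none."""
    record = json.loads(compact_json)
    payload = record.get("Envelope", {}).get("Payload-Metadata", {})
    return payload.get("HTTP-Response-Metadata", {}).get("Headers", {})


if __name__ == "__main__":
    print(pretty_print(COMPACT_RECORD))
    print(response_headers(COMPACT_RECORD))
```

Using chained .get() calls means that request and metadata records, which would not carry HTTP response metadata under this assumed layout, yield an empty dictionary rather than an error.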

The AWS Command Line Interface can be used to access the data from anywhere (including EC2). It's easy to install on most operating systems (Windows, macOS, Linux). Please follow the installation instructions.

Please note, access to data from the Amazon cloud using the S3 API is only allowed for authenticated users. Please see our blog announcement for more information.

Once the AWS CLI is installed, the command to copy a file to your local machine is:

    aws s3 cp s3://commoncrawl/path_to_file <local_path>

You may first look at the data, e.g., to list all WARC files of a specific segment of the April 2018 crawl:

    > aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/
    10:27:49 931210633
    10:28:32 935833042
    10:29:51 940140704

The command to download the first file in the listing is:

    aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/

The AWS CLI supports recursive copying, and allows for pattern-based inclusion/exclusion of files.

Where anonymous access is available, the argument --no-sign-request, placed directly after aws, allows for access without the need to own an AWS account:

    aws --no-sign-request s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/

For more information check the AWS CLI user guide or call the command-line help (here for the cp command):

    aws s3 cp help
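The recursive copy and the include/exclude patterns mentioned above can be combined in a single call. The sketch below is one possible way to assemble such a command from Python. The function name build_cp_command and the local directory ./warc/ are hypothetical, while --recursive, --exclude, --include, and --no-sign-request are existing AWS CLI options. The function only builds the argument list and does not contact S3.

```python
import shlex

BUCKET_URI = "s3://commoncrawl/"


def build_cp_command(key, destination, anonymous=False, recursive=False,
                     include=None, exclude=None):
    """Build the argument list for an `aws s3 cp` call (nothing is executed)."""
    command = ["aws"]
    if anonymous:
        # Skip request signing; only useful where anonymous access is permitted.
        command.append("--no-sign-request")
    command += ["s3", "cp", BUCKET_URI + key.lstrip("/"), destination]
    if recursive:
        command.append("--recursive")
    # Later filters take precedence in the AWS CLI,
    # so the broad exclude goes before the narrower include.
    if exclude:
        command += ["--exclude", exclude]
    if include:
        command += ["--include", include]
    return command


if __name__ == "__main__":
    segment = "crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/"
    cmd = build_cp_command(segment, "./warc/", recursive=True,
                           exclude="*", include="*.warc.gz")
    # Print the command for review; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the patterns intact when the command is later passed to subprocess.run, since no shell is involved to expand the asterisks.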
