A clone of 'official' Archive::BagIt reposity by CPAN author Rob Schmidt (https://github.com/rjeschmi/Archive-BagIt). This repository contains patches to update Archive::BagIt to version 1.0 of BagIt, see RFC 8493 (https://tools.ietf.org/html/rfc8493)
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

1324 lines
36 KiB

1 year ago
1 year ago
1 year ago
1 year ago
  1. package Archive::BagIt;
  2. use strict;
  3. use warnings;
  4. use utf8;
  5. use open ':std', ':encoding(UTF-8)';
  6. use Encode qw(decode);
  7. use File::Find;
  8. use File::Spec;
  9. use Digest::MD5;
  10. use Class::Load qw(load_class);
  11. use Carp;
  12. use POSIX qw(strftime);
  13. use Moo;
  14. with "Archive::BagIt::Role::Portability";
  15. # VERSION
  16. # ABSTRACT: The main module to handle bags.
  17. =head1 NAME
  18. Achive::BagIt - The main module to handle Bags
  19. =head1 SYNOPSIS
  20. This modules will hopefully help with the basic commands needed to create
  21. and verify a bag. This part supports BagIt 1.0 according to RFC 8493 ([https://tools.ietf.org/html/rfc8493](https://tools.ietf.org/html/rfc8493)).
  22. You only need to know the following methods first:
  23. =head2 read a BagIt
  24. use Archive::BagIt;
  25. #read in an existing bag:
  26. my $bag_dir = "/path/to/bag";
  27. my $bag = Archive::BagIt->new($bag_dir);
  28. =head2 construct a BagIt around a payload
  29. use Archive::BagIt;
  30. my $bag2 = Archive::BagIt->make_bag($bag_dir);
  31. =head2 verify a BagIt-dir
  32. use Archive::BagIt;
  33. # Validate a BagIt archive against its manifest
  34. my $bag3 = Archive::BagIt->new($bag_dir);
  35. my $is_valid1 = $bag3->verify_bag();
  36. # Validate a BagIt archive against its manifest, report all errors
  37. my $bag4 = Archive::BagIt->new($bag_dir);
  38. my $is_valid2 = $bag4->verify_bag( {report_all_errors => 1} );
  39. =head2 read a BagIt-dir, change something, store
  40. Because all methods operate lazy, you should ensure to parse parts of the bag *BEFORE* you modify it.
  41. Otherwise it will be overwritten!
  42. use Archive::BagIt;
  43. my $bag5 = Archive::BagIt->new($bag_dir); # lazy, nothing happened
  44. $bag5->load(); # this updates the object representation by parsing the given $bag_dir
  45. $bag5->store(); # this writes the bag new
  46. =head1 SOURCE
  47. The original development version was on github at L<http://github.com/rjeschmi/Archive-BagIt>
  48. and may be cloned from there.
  49. The actual development version is available at L<https://art1pirat.spdns.org/art1/Archive-BagIt>
  50. =head1 Conformance to RFC8493
  51. The module should fulfill the RFC requirements, with following limitations:
  52. =over
  53. =item only encoding UTF-8 is supported
  54. =item version 0.97 or 1.0 allowed
  55. =item version 0.97 requires tag-/manifest-files with md5-fixity
  56. =item version 1.0 requires tag-/manifest-files with sha512-fixity
  57. =item BOM is not supported
  58. =item Carriage Return in bagit-files are not allowed
  59. =item fetch.txt is unsupported
  60. =back
  61. At the moment only filepaths in linux-style are supported.
  62. To get an more detailled overview, see the testsuite under F<t/verify_bag.t> and corresponding test bags from the BagIt conformance testsuite of Library of Congress under F<bagit_conformance_suite/>.
  63. See L<https://datatracker.ietf.org/doc/rfc8493/?include_text=1> for details.
  64. =head1 TODO
  65. =over
  66. =item enhanced testsuite
  67. =item reduce complexity
  68. =item use modern perl code
  69. =item add flag to enable very strict verify
  70. =back
  71. =head1 METHODS
  72. =cut
  73. around 'BUILDARGS' , sub {
  74. my $orig = shift;
  75. my $class = shift;
  76. if (@_ == 1 && !ref $_[0]) {
  77. return $class->$orig(bag_path=>$_[0]);
  78. } else {
  79. return $class->$orig(@_);
  80. }
  81. };
  82. =head2 Constructor
  83. The constructor sub, will create a bag with a single argument,
  84. use Archive::BagIt;
  85. #read in an existing bag:
  86. my $bag_dir = "/path/to/bag";
  87. my $bag = Archive::BagIt->new($bag_dir);
  88. or use hashreferences
  89. use Archive::BagIt;
  90. #read in an existing bag:
  91. my $bag_dir = "/path/to/bag";
  92. my $bag = Archive::BagIt->new(
  93. bag_path => $bag_dir,
  94. );
  95. The arguments are:
  96. =over 1
  97. =item C<bag_path> - path to bag-directory
  98. =item C<force_utf8> - if set the warnings about non portable filenames are disabled (default: enabled)
  99. =back
  100. The bag object will use $bag_dir, BUT an existing $bag_dir is not read. If you use C<store()> an existing bag will be overwritten!
  101. See C<load()> if you want to parse/modify an existing bag.
  102. =cut
  103. sub BUILD {
  104. my ($self, $args) = @_;
  105. return $self->load_plugins(("Archive::BagIt::Plugin::Manifest::MD5", "Archive::BagIt::Plugin::Manifest::SHA512"));
  106. }
  107. ###############################################
  108. =head2 has_force_utf8()
  109. to check if force_utf8() was set.
  110. If set it ignores warnings about potential filepath problems.
  111. =cut
  112. has 'force_utf8' => (
  113. is => 'rw',
  114. lazy => 1,
  115. );
  116. sub has_force_utf8 {
  117. my $self = shift;
  118. if ((exists $self->{force_utf8}) && ($self->{force_utf8})) {
  119. return 1;
  120. } else {
  121. return;
  122. }
  123. }
  124. ###############################################
  125. =head2 bag_path([$new_value])
  126. Getter/setter for bag path
  127. =cut
  128. has 'bag_path' => (
  129. is => 'rw',
  130. );
  131. ###############################################
  132. has 'bag_path_arr' => (
  133. is => 'ro',
  134. lazy => 1,
  135. builder => '_build_bag_path_arr',
  136. );
  137. ###############################################
  138. =head2 metadata_path()
  139. Getter for metadata path
  140. =cut
  141. has 'metadata_path' => (
  142. is=> 'ro',
  143. lazy => 1,
  144. builder => '_build_metadata_path',
  145. );
  146. sub _build_metadata_path {
  147. my ($self) = @_;
  148. return $self->bag_path;
  149. }
  150. ###############################################
  151. has 'metadata_path_arr' => (
  152. is =>'ro',
  153. lazy => 1,
  154. builder => '_build_metadata_path_arr',
  155. );
  156. ###############################################
  157. has 'rel_metadata_path' => (
  158. is => 'ro',
  159. lazy => 1,
  160. builder => '_build_rel_metadata_path',
  161. );
  162. ###############################################
  163. =head2 payload_path()
  164. Getter for payload path
  165. =cut
  166. has 'payload_path' => (
  167. is => 'ro',
  168. lazy => 1,
  169. builder => '_build_payload_path',
  170. );
  171. sub _build_payload_path {
  172. my ($self) = @_;
  173. return File::Spec->catdir($self->bag_path, "data");
  174. }
  175. ###############################################
  176. has 'payload_path_arr' => (
  177. is => 'ro',
  178. lazy => 1,
  179. builder => '_build_payload_path_arr',
  180. );
  181. ###############################################
  182. has 'rel_payload_path' => (
  183. is => 'ro',
  184. lazy => 1,
  185. builder => '_build_rel_payload_path',
  186. );
  187. ###############################################
  188. =head2 checksum_algos()
  189. Getter for registered Checksums
  190. =cut
  191. has 'checksum_algos' => (
  192. is => 'ro',
  193. lazy => 1,
  194. builder => '_build_checksum_algos',
  195. );
  196. ###############################################
  197. =head2 bag_version()
  198. Getter for bag version
  199. =cut
  200. has 'bag_version' => (
  201. is => 'ro',
  202. lazy => 1,
  203. builder => '_build_bag_version',
  204. );
  205. ###############################################
  206. =head2 bag_encoding()
  207. Getter for bag encoding.
  208. HINT: the current version of Archive::BagIt only supports UTF-8, but the method could return other values depending on given Bags.
  209. =cut
  210. has 'bag_encoding' => (
  211. is => 'ro',
  212. lazy => 1,
  213. builder => '_build_bag_encoding',
  214. );
  215. ###############################################
  216. =head2 bag_info([$new_value])
  217. Getter/Setter for bag info. Expects/returns an array of HashRefs implementing simple key-value pairs.
  218. HINT: RFC8493 does not allow *reordering* of entries!
  219. =head2 has_bag_info()
  220. returns true if bag info exists.
  221. =cut
  222. has 'bag_info' => (
  223. is => 'rw',
  224. lazy => 1,
  225. builder => '_build_bag_info',
  226. predicate => 1
  227. );
  228. ###############################################
  229. =head2 errors()
  230. Getter to return collected errors after a C<verify_bag()> call with Option C<report_all_errors>
  231. =cut
  232. has 'errors' => (
  233. is => 'ro',
  234. lazy => 1,
  235. builder => sub { my $self = shift; return [];},
  236. );
  237. ###############################################
  238. =head2 digest_callback()
  239. This method could be reimplemented by derived classes to handle fixity checks in own way. The
  240. getter returns an anonymous function with following interface:
  241. my $digest = $self->digest_callback;
  242. &$digest( $digestobject, $filename);
  243. This anonymous function MUST use the C<get_hash_string()> function of the C<Archive::BagIt::Role::Algorithm> role,
  244. which is implemented by each C<Archive::BagIt::Plugin::Algorithm::XXXX> module.
  245. See C<Archive::BagIt::Fast> for details.
  246. =cut
  247. has 'digest_callback' => (
  248. is => 'ro',
  249. lazy => 1,
  250. builder => sub {
  251. my $sub = sub {
  252. my ($digestobj, $filename) = @_;
  253. open(my $fh, "<:raw", "$filename") or croak ("Cannot open $filename, $!");
  254. binmode($fh);
  255. my $digest = $digestobj->get_hash_string($fh);
  256. close $fh || croak("could not close file '$filename', $!");
  257. return $digest;
  258. };
  259. return $sub;
  260. }
  261. );
  262. ###############################################
  263. =head2 get_baginfo_values_by_key($searchkey)
  264. Returns all values which match $searchkey, undef otherwise
  265. =cut
  266. sub get_baginfo_values_by_key {
  267. my ($self, $searchkey) = @_;
  268. my $info = $self->bag_info();
  269. my @values;
  270. if (defined $searchkey) {
  271. my $lc_flag = $self->is_baginfo_key_reserved( $searchkey );
  272. foreach my $entry (@{ $info }) {
  273. my ($key, $value) = %{ $entry };
  274. if ( __case_aware_compare_for_baginfo( $key, $searchkey, $lc_flag) ) {
  275. push @values, $value;
  276. }
  277. }
  278. }
  279. return @values if (scalar(@values) > 0);
  280. return;
  281. }
  282. ###############################################
  283. =head2 is_baginfo_key_reserved_as_uniq($searchkey)
  284. returns true if key is reserved and should be uniq
  285. =cut
  286. sub is_baginfo_key_reserved_as_uniq {
  287. my ($self, $searchkey) = @_;
  288. return $searchkey =~ m/^(Bagging-Date)|(Bag-Size)|(Payload-Oxum)|(Bag-Group-Identifier)|(Bag-Count)$/i;
  289. }
  290. ###############################################
  291. =head2 is_baginfo_key_reserved( $searchkey )
  292. returns true if key is reserved
  293. =cut
  294. sub is_baginfo_key_reserved {
  295. my ($self, $searchkey) = @_;
  296. return $searchkey =~ m/^
  297. (Source-Organization)|
  298. (Organisation-Adress)|
  299. (Contact-Name)|
  300. (Contact-Phone)|
  301. (Contact-Email)|
  302. (External-Description)|
  303. (Bagging-Date)|
  304. (External-Identifier)|
  305. (Bag-Size)|
  306. (Payload-Oxum)|
  307. (Bag-Group-Identifier)|
  308. (Bag-Count)|
  309. (Internal-Sender-Identifier)|
  310. (Internal-Sender-Description)$/ix
  311. }
  312. ###############################################
  313. sub __case_aware_compare_for_baginfo {
  314. my ($internal_key, $search_key, $lc_flag) = @_;
  315. return (defined $internal_key) && (
  316. ( $lc_flag && ((lc $internal_key) eq (lc $search_key)) ) # for reserved keys use caseinsensitive search
  317. ||
  318. ( (!$lc_flag) && ($internal_key eq $search_key) ) # for other keys sensitive search
  319. )
  320. }
  321. ###############################################
  322. sub _find_baginfo_idx {
  323. my ($self, $searchkey) = @_;
  324. if (defined $searchkey) {
  325. if ($searchkey =~ m/:/) {croak "key should not contain a colon! (searchkey='$searchkey')";}
  326. my $info = $self->bag_info();
  327. my $size = scalar(@{$info});
  328. my $lc_flag = $self->is_baginfo_key_reserved($searchkey);
  329. foreach my $idx (0.. $size-1) {
  330. my %entry = %{$info->[$idx]};
  331. my ($key, $value) = %entry;
  332. if (__case_aware_compare_for_baginfo($key, $searchkey, $lc_flag)) {
  333. return $idx;
  334. }
  335. }
  336. }
  337. return;
  338. }
  339. ###############################################
  340. =head2 verify_baginfo()
  341. checks baginfo-keys, returns true if all fine, otherwise returns undef and the message is pushed to C<errors()>.
  342. =cut
  343. sub verify_baginfo {
  344. my ($self) = @_;
  345. my %keys;
  346. my $info = $self->bag_info();
  347. my $ret = 1;
  348. if (defined $info) {
  349. foreach my $entry (@{$self->bag_info()}) {
  350. my ($key, $value) = %{$entry};
  351. if ($self->is_baginfo_key_reserved($key)) {
  352. $keys{ lc $key }++;
  353. }
  354. else {
  355. $keys{ $key }++
  356. }
  357. }
  358. foreach my $key (keys %keys) {
  359. if ($self->is_baginfo_key_reserved_as_uniq($key)) {
  360. if ($keys{$key} > 1) {
  361. push @{$self->{errors}}, "Baginfo key '$key' exists $keys{$key}, but should be uniq!";
  362. $ret = undef;
  363. }
  364. }
  365. }
  366. }
  367. # check for payload oxum
  368. my ($loaded_payloadoxum) = $self->get_baginfo_values_by_key('Payload-Oxum');
  369. if (defined $loaded_payloadoxum) {
  370. my ($octets, $streamcount) = $self->calc_payload_oxum();
  371. if ("$octets.$streamcount" ne $loaded_payloadoxum) {
  372. push @{$self->{errors}}, "Payload-Oxum differs, calculated $octets.$streamcount but $loaded_payloadoxum was expected by bag-info.txt";
  373. $ret = undef;
  374. }
  375. } else {
  376. push @{$self->{errors}}, "Payload-Oxum was expected in bag-info.txt, but not found!";
  377. $ret = undef;
  378. }
  379. return $ret;
  380. }
  381. ###############################################
  382. =head2 delete_baginfo_by_key( $searchkey )
  383. deletes an entry of given $searchkey if exists
  384. =cut
  385. sub delete_baginfo_by_key {
  386. my ($self, $searchkey) = @_;
  387. my $idx = $self->_find_baginfo_idx($searchkey);
  388. if (defined $idx) {
  389. delete $self->{bag_info}[$idx];
  390. }
  391. return 1;
  392. }
  393. ###############################################
  394. =head2 exists_baginfo_key( $searchkey )
  395. returns true if a given $searchkey exists
  396. =cut
  397. sub exists_baginfo_key {
  398. my ($self, $searchkey) =@_;
  399. return (defined $self->_find_baginfo_idx($searchkey));
  400. }
  401. ###############################################
  402. sub _replace_baginfo_by_first_match {
  403. my ($self, $searchkey, $newvalue) = @_;
  404. my $idx = $self->_find_baginfo_idx( $searchkey);
  405. if (defined $idx) {
  406. $self->{bag_info}[$idx] = {$searchkey => $newvalue};
  407. return $idx;
  408. }
  409. return;
  410. }
  411. ###############################################
  412. =head2 append_baginfo_by_key($searchkey, $newvalue)
  413. Appends a key value pair to bag_info.
  414. HINT: check return code if append was successful, because some keys needs to be uniq.
  415. =cut
  416. sub append_baginfo_by_key {
  417. my ($self, $searchkey, $newvalue) = @_;
  418. if (defined $searchkey) {
  419. if ($searchkey =~ m/:/) { croak "key should not contain a colon! (searchkey='$searchkey')"; }
  420. if ($self->is_baginfo_key_reserved_as_uniq($searchkey)) {
  421. if (defined $self->get_baginfo_values_by_key($searchkey)) {
  422. # hmm, search key is mrked as uniq and still exists
  423. return;
  424. }
  425. }
  426. push @{$self->{bag_info}}, {$searchkey => $newvalue};
  427. }
  428. return 1;
  429. }
  430. ###############################################
  431. =head2 add_or_replace_baginfo_by_key($searchkey, $newvalue)
  432. It replaces the first entry with $newvalue if $searchkey exists, otherwise it appends.
  433. =cut
  434. sub add_or_replace_baginfo_by_key {
  435. my ($self, $searchkey, $newvalue) = @_;
  436. if (defined $searchkey) {
  437. if ($searchkey =~ m/:/) { croak "key should not contain a colon! (searchkey='$searchkey')"; }
  438. if (defined $self->{bag_info}) {
  439. my $idx = $self->_replace_baginfo_by_first_match( $searchkey, $newvalue);
  440. if (defined $idx) { return $idx;}
  441. }
  442. $self->append_baginfo_by_key( $searchkey, $newvalue );
  443. return -1;
  444. }
  445. }
  446. ###############################################
  447. =head2 forced_fixity_algorithm()
  448. Getter to return the forced fixity algorithm depending on BagIt version
  449. =cut
  450. has 'forced_fixity_algorithm' => (
  451. is => 'ro',
  452. lazy => 1,
  453. builder => '_build_forced_fixity_algorithm',
  454. );
  455. ###############################################
  456. =head2 manifest_files()
  457. Getter to find all manifest-files
  458. =cut
  459. has 'manifest_files' => (
  460. is => 'ro',
  461. lazy => 1,
  462. builder => '_build_manifest_files',
  463. );
  464. ###############################################
  465. =head2 tagmanifest_files()
  466. Getter to find all tagmanifest-files
  467. =cut
  468. has 'tagmanifest_files' => (
  469. is => 'ro',
  470. lazy => 1,
  471. builder => '_build_tagmanifest_files',
  472. );
  473. ###############################################
  474. =head2 payload_files()
  475. Getter to find all payload-files
  476. =cut
  477. has 'payload_files' => ( # relatively to bagit base
  478. is => 'ro',
  479. lazy => 1,
  480. builder => '_build_payload_files',
  481. );
  482. ###############################################
  483. =head2 non_payload_files()
  484. Getter to find all non payload-files
  485. =cut
  486. has 'non_payload_files' => (
  487. is=>'ro',
  488. lazy => 1,
  489. builder => '_build_non_payload_files',
  490. );
  491. ###############################################
  492. =head2 plugins()
  493. Getter/setter to algorithm plugins
  494. =cut
  495. has 'plugins' => (
  496. is=>'rw',
  497. #isa=>'HashRef',
  498. );
  499. ###############################################
  500. =head2 manifests()
  501. Getter/Setter to all manifests (objects)
  502. =cut
  503. has 'manifests' => (
  504. is => 'rw',
  505. lazy => 1,
  506. builder => '_build_manifests'
  507. #isa=>'HashRef',
  508. );
  509. ###############################################
  510. =head2 algos()
  511. Getter/Setter to all registered Algorithms
  512. =cut
  513. has 'algos' => (
  514. is=>'rw',
  515. #isa=>'HashRef',
  516. );
  517. ###############################################
  518. sub _build_bag_path_arr {
  519. my ($self) = @_;
  520. my @split_path = File::Spec->splitdir($self->bag_path);
  521. return @split_path;
  522. }
  523. sub _build_payload_path_arr {
  524. my ($self) = @_;
  525. my @split_path = File::Spec->splitdir($self->payload_path);
  526. return @split_path;
  527. }
  528. sub _build_rel_payload_path {
  529. my ($self) = @_;
  530. my $rel_path = File::Spec->abs2rel( $self->payload_path, $self->bag_path ) ;
  531. return $rel_path;
  532. }
  533. sub _build_metadata_path_arr {
  534. my ($self) = @_;
  535. my @split_path = File::Spec->splitdir($self->metadata_path);
  536. return @split_path;
  537. }
  538. sub _build_rel_metadata_path {
  539. my ($self) = @_;
  540. my $rel_path = File::Spec->abs2rel( $self->metadata_path, $self->bag_path ) ;
  541. return $rel_path;
  542. }
  543. sub _build_checksum_algos {
  544. my($self) = @_;
  545. my $checksums = [ 'md5', 'sha1', 'sha256', 'sha512' ];
  546. return $checksums;
  547. }
  548. sub _build_manifest_files {
  549. my($self) = @_;
  550. my @manifest_files;
  551. foreach my $algo (@{$self->checksum_algos}) {
  552. my $manifest_file = File::Spec->catfile($self->metadata_path, "manifest-$algo.txt");
  553. if (-f $manifest_file) {
  554. push @manifest_files, $manifest_file;
  555. }
  556. }
  557. return \@manifest_files;
  558. }
  559. sub _build_tagmanifest_files {
  560. my ($self) = @_;
  561. my @tagmanifest_files;
  562. foreach my $algo (@{$self->checksum_algos}) {
  563. my $tagmanifest_file = File::Spec->catfile($self->metadata_path,"tagmanifest-$algo.txt");
  564. if (-f $tagmanifest_file) {
  565. push @tagmanifest_files, $tagmanifest_file;
  566. }
  567. }
  568. return \@tagmanifest_files;
  569. }
  570. sub __file_find { # own implementation, because File::Find has problems with UTF8 encoded Paths under MSWin32
  571. # finds recursively all files in given directory.
  572. # if $excludedir is defined, the content will be excluded
  573. my ($self,$dir, $excludedir) = @_;
  574. if (defined $excludedir) {
  575. $excludedir = File::Spec->rel2abs( $excludedir);
  576. }
  577. my @file_paths;
  578. my $rx_portable = qr/^[a-zA-Z0-9._-]+$/;
  579. my $finder;
  580. $finder = sub {
  581. my ($current_dir) = @_; #absolute path
  582. my @todo;
  583. my @tmp_file_paths;
  584. opendir( my $dh, $current_dir);
  585. my @paths = File::Spec->no_upwards ( readdir $dh );
  586. closedir $dh;
  587. foreach my $local_entry (@paths) {
  588. my $is_portable = $local_entry =~ m/$rx_portable/;
  589. if (! $is_portable) {
  590. my $local_entry_utf8 = decode("UTF-8", $local_entry);
  591. if ((!$self->has_force_utf8)) {
  592. my $hexdump = "0x" . unpack('H*', $local_entry);
  593. $local_entry =~m/[^a-zA-Z0-9._-]/; # to find PREMATCH, needed nextline
  594. carp "possible non portable pathname detected in $dir,\n",
  595. "got path (hexdump)='$hexdump'(hex),\n",
  596. "decoded path='$local_entry_utf8'\n",
  597. " "." "x length($`)."^"."------- first non portable char\n"; #$` eq $PREMATCH
  598. }
  599. $local_entry = $local_entry_utf8;
  600. }
  601. my $path_entry = File::Spec->catdir($current_dir, $local_entry);
  602. if (-f $path_entry) {
  603. push @tmp_file_paths, $path_entry;
  604. } elsif (-d $path_entry) {
  605. next if ((defined $excludedir) && ($path_entry eq $excludedir));
  606. push @todo, $path_entry;
  607. } else {
  608. croak "not a file nor a dir found '$path_entry'";
  609. }
  610. }
  611. push @file_paths, sort @tmp_file_paths;
  612. foreach my $subdir (sort @todo) {
  613. &$finder($subdir);
  614. }
  615. };
  616. my $absolute = File::Spec->rel2abs( $dir );
  617. &$finder($absolute);
  618. @file_paths = map { File::Spec->abs2rel( $_, $dir)} @file_paths;
  619. return @file_paths;
  620. }
  621. sub _build_payload_files{
  622. my ($self) = @_;
  623. my $payload_dir = $self->payload_path;
  624. my $reldir = File::Spec->abs2rel($payload_dir, $self->bag_path());
  625. $reldir =~ s/^\.$//;
  626. my @payload = map {
  627. $reldir eq "" ? $_ : File::Spec->catfile($reldir, $_)
  628. } $self->__file_find($payload_dir, File::Spec->rel2abs($self->metadata_path));
  629. return wantarray ? @payload : \@payload;
  630. }
  631. sub __build_read_bagit_txt {
  632. my($self) = @_;
  633. my $bagit = $self->metadata_path;
  634. my $file = File::Spec->catfile($bagit, "bagit.txt");
  635. open(my $BAGIT, "<:encoding(UTF-8)", $file) or croak("Cannot read '$file': $!");
  636. my $version_string = <$BAGIT>;
  637. my $encoding_string = <$BAGIT>;
  638. close($BAGIT);
  639. if (defined $version_string) {
  640. $version_string =~ s/[\r\n]//;
  641. }
  642. if (defined $encoding_string) {
  643. $encoding_string =~s/[\r\n]//;
  644. }
  645. return ($version_string, $encoding_string, $file);
  646. }
  647. sub _build_bag_version {
  648. my($self) = @_;
  649. my ($version_string, $encoding_string, $file) = $self->__build_read_bagit_txt();
  650. croak "Version line missed in '$file" unless defined $version_string;
  651. if ($version_string =~ /^BagIt-Version: ([01]\.[0-9]+)$/) {
  652. return $1;
  653. } else {
  654. $version_string =~ s/\r/<CR>/;
  655. $version_string =~ s/^\N{U+FEFF}/<BOM>/;
  656. croak "Version string '$version_string' of '$file' is incorrect";
  657. };
  658. }
  659. sub _build_bag_encoding {
  660. my($self) = @_;
  661. my ($version_string, $encoding_string, $file) = $self->__build_read_bagit_txt();
  662. croak "Encoding line missed in '$file" unless defined $encoding_string;
  663. croak "Encoding '$encoding_string' of '$file' not supported by current Archive::BagIt module!" unless ($encoding_string !~ m/^UTF-8$/);
  664. return $encoding_string;
  665. }
  666. sub __sort_bag_info {
  667. my @sorted = sort {
  668. my %tmpa = %{$a};
  669. my %tmpb = %{$b};
  670. my ($ka, $va) = each %tmpa;
  671. my ($kb, $vb) = each %tmpb;
  672. my $kres = $ka cmp $kb;
  673. if ($kres != 0) {
  674. return $kres;
  675. } else {
  676. return $va cmp $vb;
  677. }
  678. } @_;
  679. return @sorted;
  680. }
  681. sub _parse_bag_info { # parses a bag-info textblob
  682. my ($self, $textblob) = @_;
  683. # metadata elements are OPTIONAL and MAY be repeated. Because "bag-
  684. # info.txt" is intended for human reading and editing, ordering MAY be
  685. # significant and the ordering of metadata elements MUST be preserved.
  686. #
  687. # A metadata element MUST consist of a label, a colon ":", a single
  688. # linear whitespace character (space or tab), and a value that is
  689. # terminated with an LF, a CR, or a CRLF.
  690. #
  691. # The label MUST NOT contain a colon (:), LF, or CR. The label MAY
  692. # contain linear whitespace characters but MUST NOT start or end with
  693. # whitespace.
  694. #
  695. # It is RECOMMENDED that lines not exceed 79 characters in length.
  696. # Long values MAY be continued onto the next line by inserting a LF,
  697. # CR, or CRLF, and then indenting the next line with one or more linear
  698. # white space characters (spaces or tabs). Except for linebreaks, such
  699. # padding does not form part of the value.
  700. #
  701. # Implementations wishing to support previous BagIt versions MUST
  702. # accept multiple linear whitespace characters before and after the
  703. # colon when the bag version is earlier than 1.0; such whitespace does
  704. # not form part of the label or value.
  705. # find all labels
  706. my @labels;
  707. while ($textblob =~ s/^([^:\s]+)\s*:\s*//m) { # label if starts with chars not colon or whitespace followed by zero or more spaces, a colon, zero or more spaces
  708. # label found
  709. my $label = $1; my $value="";
  710. if ($textblob =~ s/(.+?)(?=^\S)//ms) {
  711. # value if rest string starts with chars not \r and/or \n until a non-whitespace after \r\n
  712. $value = chomp_portable($1);
  713. } elsif ($textblob =~ s/(.*)//s) {
  714. $value = chomp_portable($1);
  715. }
  716. if (defined $label) {
  717. push @labels, { "$label" => "$value" };
  718. }
  719. }
  720. # The RFC does not allow reordering:
  721. #my @sorted = __sort_bag_info(@labels);
  722. #return \@sorted;
  723. return \@labels;
  724. }
  725. sub _build_bag_info {
  726. my ($self) = @_;
  727. my $bagit = $self->metadata_path;
  728. my $file = File::Spec->catfile($bagit, "bag-info.txt");
  729. if (-e $file) {
  730. open(my $BAGINFO, "<:encoding(UTF-8)", $file) or croak("Cannot read $file: $!");
  731. my @lines;
  732. while ( my $line = <$BAGINFO>) {
  733. push @lines, $line;
  734. }
  735. close($BAGINFO);
  736. my $lines = join("", @lines);
  737. return $self->_parse_bag_info($lines);
  738. }
  739. # bag-info.txt is optional
  740. return;
  741. }
  742. sub _build_non_payload_files {
  743. my ($self) = @_;
  744. my $non_payload_dir = $self->metadata_path();
  745. my $reldir = File::Spec->abs2rel($non_payload_dir, $self->bag_path());
  746. $reldir =~ s/^\.$//;
  747. my @non_payload = map {
  748. $reldir eq "" ? $_ : File::Spec->catfile($reldir, $_)
  749. } $self->__file_find($non_payload_dir, File::Spec->rel2abs($self->payload_path));
  750. return wantarray ? @non_payload : \@non_payload;
  751. }
  752. sub _build_forced_fixity_algorithm {
  753. my ($self) = @_;
  754. if ($self->bag_version() >= 1.0) {
  755. return Archive::BagIt::Plugin::Algorithm::SHA512->new(bagit => $self);
  756. }
  757. else {
  758. return Archive::BagIt::Plugin::Algorithm::MD5->new(bagit => $self);
  759. }
  760. }
  761. ###############################################
  762. =head2 load_plugins
  763. As default SHA512 and MD5 will be loaded and therefore used. If you want to create a bag only with one or a specific
  764. checksum-algorithm, you could use this method to (re-)register it. It expects list of strings with namespace of type:
  765. Archive::BagIt::Plugin::Algorithm::XXX where XXX is your chosen fixity algorithm.
  766. =cut
  767. sub load_plugins {
  768. my ($self, @plugins) = @_;
  769. #p(@plugins);
  770. my $loaded_plugins = $self->plugins;
  771. @plugins = grep { not exists $loaded_plugins->{$_} } @plugins;
  772. return if @plugins == 0;
  773. foreach my $plugin (@plugins) {
  774. load_class ($plugin) or croak ("Can't load $plugin");
  775. $plugin->new({bagit => $self});
  776. }
  777. return 1;
  778. }
  779. ###############################################
  780. =head2 load()
  781. Triggers loading of an existing bag
  782. =cut
  783. sub load {
  784. my ($self) = @_;
  785. # call trigger
  786. $self->bag_path;
  787. $self->bag_version;
  788. $self->bag_encoding;
  789. $self->bag_info;
  790. $self->payload_path;
  791. $self->manifest_files;
  792. $self->checksum_algos;
  793. $self->tagmanifest_files;
  794. return 1;
  795. }
  796. ###############################################
  797. =head2 verify_bag($opts)
  798. A method to verify a bag deeply. If C<$opts> is set with C<{return_all_errors}> all fixity errors are reported.
  799. The default ist to croak with error message if any error is detected.
  800. HINT: You might also want to check Archive::BagIt::Fast to see a more direct way of accessing files (and thus faster).
  801. =cut
  802. sub verify_bag {
  803. my ($self,$opts) = @_;
  804. #removed the ability to pass in a bag in the parameters, but might want options
  805. #like $return all errors rather than dying on first one
  806. my $bagit = $self->bag_path;
  807. my $version = $self->bag_version(); # to call trigger
  808. my $encoding = $self->bag_encoding(); # to call trigger
  809. my $baginfo = $self->verify_baginfo(); #to call trigger
  810. my $forced_fixity_alg = $self->forced_fixity_algorithm()->name();
  811. my $fetch_file = File::Spec->catfile($self->metadata_path, "fetch.txt");
  812. my $manifest_file = File::Spec->catfile($self->metadata_path, "manifest-$forced_fixity_alg.txt");
  813. my $payload_dir = $self->payload_path;
  814. my $return_all_errors = $opts->{return_all_errors};
  815. if (-f $fetch_file) {
  816. croak("Fetching via file '$fetch_file' is not supported by current Archive::BagIt implementation")
  817. }
  818. croak("Manifest '$manifest_file' is not a regular file or does not exist for given bagit version '$version'") unless -f ($manifest_file);
  819. croak("Payload-directory '$payload_dir' is not a directory or does not exist") unless -d ($payload_dir);
  820. unless ($version > .95) {
  821. croak ("Bag Version $version is unsupported");
  822. }
  823. # check forced fixity
  824. my @errors;
  825. # check for manifests
  826. foreach my $algorithm ( keys %{ $self->manifests }) {
  827. my $res = $self->manifests->{$algorithm}->verify_manifest($self->payload_files, $return_all_errors);
  828. if ((defined $res) && ($res ne "1")) { push @errors, $res; }
  829. }
  830. #check for tagmanifests
  831. foreach my $algorithm ( keys %{ $self->manifests }) {
  832. my $res = $self->manifests->{$algorithm}->verify_tagmanifest($self->non_payload_files, $return_all_errors);
  833. if ((defined $res) && ($res ne "1")) { push @errors, $res; }
  834. }
  835. push @{$self->{errors}}, @errors;
  836. my $err = $self->errors();
  837. my @err = @{ $err };
  838. if (scalar( @err ) > 0) {
  839. croak join("\n","bag verify for bagit version '$version' failed with invalid files.", @err);
  840. }
  841. return 1;
  842. }
  843. =head2 calc_payload_oxum()
  844. returns an array with octets and streamcount of payload-dir
  845. =cut
  846. sub calc_payload_oxum {
  847. my($self) = @_;
  848. my @payload = @{$self->payload_files};
  849. my $octets=0;
  850. my $streamcount = scalar @payload;
  851. foreach my $local_name (@payload) {# local_name is relative to bagit base
  852. my $file = File::Spec->catfile($self->bag_path(), $local_name);
  853. if (-e $file) {
  854. my $filesize = 0;
  855. $filesize = -s $file or carp "empty file $file detected";
  856. $octets += $filesize;
  857. } else { croak "file $file does not exist, $!"; }
  858. }
  859. return ($octets, $streamcount);
  860. }
  861. =head2 calc_bagsize()
  862. returns a string with human readable size of paylod
  863. =cut
  864. sub calc_bagsize {
  865. my($self) = @_;
  866. my ($octets,$streamcount) = $self->calc_payload_oxum();
  867. if ($octets < 1024) { return "$octets B"; }
  868. elsif ($octets < 1024*1024) {return sprintf("%0.1f kB", $octets/1024); }
  869. elsif ($octets < 1024*1024*1024) {return sprintf "%0.1f MB", $octets/(1024*1024); }
  870. elsif ($octets < 1024*1024*1024*1024) {return sprintf "%0.1f GB", $octets/(1024*1024*1024); }
  871. else { return sprintf "%0.2f TB", $octets/(1024*1024*1024*1024); }
  872. }
  873. =head2 create_bagit()
  874. creates a bagit.txt file
  875. =cut
  876. sub create_bagit {
  877. my($self) = @_;
  878. my $metadata_path = $self->metadata_path();
  879. my $bagit_path = File::Spec->catfile( $metadata_path, "bagit.txt");
  880. open(my $BAGIT, ">:encoding(UTF-8)", $bagit_path) or croak("Can't open $bagit_path for writing: $!");
  881. print($BAGIT "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8");
  882. close($BAGIT);
  883. return 1;
  884. }
  885. =head2 create_baginfo()
  886. creates a bag-info.txt file
  887. Hint: the entries 'Bagging-Date', 'Bag-Software-Agent', 'Payload-Oxum' and 'Bag-Size' will be automagically set,
  888. existing values in internal bag-info representation will be overwritten!
  889. =cut
  890. sub create_baginfo {
  891. my($self) = @_; # because bag-info.txt allows multiple key-value-entries, hash is replaced
  892. $self->add_or_replace_baginfo_by_key('Bagging-Date', POSIX::strftime("%Y-%m-%d", gmtime(time)));
  893. $self->add_or_replace_baginfo_by_key('Bag-Software-Agent', 'Archive::BagIt <https://metacpan.org/pod/Archive::BagIt>');
  894. my ($octets, $streams) = $self->calc_payload_oxum();
  895. $self->add_or_replace_baginfo_by_key('Payload-Oxum', "$octets.$streams");
  896. $self->add_or_replace_baginfo_by_key('Bag-Size', $self->calc_bagsize());
  897. # The RFC does not allow reordering:
  898. my $metadata_path = $self->metadata_path();
  899. my $bag_info_path = File::Spec->catfile( $metadata_path, "bag-info.txt");
  900. open(my $BAGINFO, ">:encoding(UTF-8)", $bag_info_path) or croak("Can't open $bag_info_path for writing: $!");
  901. foreach my $entry (@{ $self->bag_info() }) {
  902. my %tmp = %{ $entry };
  903. my ($key, $value) = %tmp;
  904. if ($key =~ m/:/) { carp "key should not contain a colon! (searchkey='$key')"; }
  905. print($BAGINFO "$key: $value\n");
  906. }
  907. close($BAGINFO);
  908. return 1;
  909. }
  910. =head2 store()
  911. store a bagit-obj if bagit directory-structure was already constructed.
  912. =cut
  913. sub store {
  914. my($self) = @_;
  915. $self->create_bagit();
  916. $self->create_baginfo();
  917. # it is important to create all manifest files first, because tagmanifest should include all manifest-xxx.txt
  918. foreach my $algorithm ( keys %{ $self->manifests }) {
  919. $self->manifests->{$algorithm}->create_manifest();
  920. }
  921. foreach my $algorithm ( keys %{ $self->manifests }) {
  922. $self->manifests->{$algorithm}->create_tagmanifest();
  923. }
  924. # retrigger builds
  925. $self->{checksum_algos} = $self->_build_checksum_algos();
  926. $self->{tagmanifest_files} = $self->_build_tagmanifest_files();
  927. $self->{manifest_files} = $self->_build_manifest_files();
  928. return 1;
  929. }
  930. =head2 init_metadata()
  931. A constructor that will just create the metadata directory
  932. This won't make a bag, but it will create the conditions to do that eventually
  933. =cut
  934. sub init_metadata {
  935. my ($class, $bag_path, $options) = @_;
  936. $bag_path =~ s#/$##; # replace trailing slash
  937. unless ( -d $bag_path) { croak ( "source bag directory '$bag_path' doesn't exist"); }
  938. my $self = $class->new(bag_path=>$bag_path, %$options);
  939. carp "no payload path" if ! -d $self->payload_path;
  940. unless ( -d $self->payload_path) {
  941. rename ($bag_path, $bag_path.".tmp");
  942. mkdir ($bag_path);
  943. rename ($bag_path.".tmp", $self->payload_path);
  944. }
  945. unless ( -d $self->metadata_path) {
  946. #metadata path is not the root path for some reason
  947. mkdir ($self->metadata_path);
  948. }
  949. $self->store();
  950. return $self;
  951. }
  952. =head2 make_bag( $bag_path )
  953. A constructor that will make and return a bag from a directory,
  954. It expects a preliminary bagit-dir exists.
  955. If there a data directory exists, assume it is already a bag (no checking for invalid files in root)
  956. =cut
  957. sub make_bag {
  958. my ($class, $bag_path, $options) = @_;
  959. my $isa = ref $class;
  960. if ($isa eq "Archive::BagIt") { # not a class, but an object!
  961. croak "make_bag() only a class subroutine, not useable with objects. Try store() instead!\n";
  962. }
  963. my $self = $class->init_metadata($bag_path, $options);
  964. return $self;
  965. }
  966. =head1 FAQ
  967. =head2 How to access the manifest-entries directly?
  968. Try this:
  969. foreach my $algorithm ( keys %{ $self->manifests }) {
  970. my $entries_ref = $self->manifests->{$algorithm}->manifest_entries();
  971. # $entries_ref returns a hashref of form:
  972. # $entries_ref->{$algorithm}->{$file} = $digest;
  973. }
  974. Similar for tagmanifests
  975. =head2 How fast is C<Archive::BagIt::Fast>?
  976. It depends. On my system with SSD and a 38MB bag with 48 payload files the results for C<verify_bag()> are:
  977. Rate Base Fast
  978. Base 102% -- -10%
  979. Fast 125% 11% --
  980. On network filesystem (CIFS, 1Gb) with same Bag:
  981. Rate Fast Base
  982. Fast 2.20/s -- -11%
  983. Base 2.48/s 13% --
  984. But you should measure which variant is best for you. In general the default C<Archive::BagIt> is fast enough.
  985. =head2 How to update an old bag of version v0.97 to v1.0?
  986. You could try this:
  987. use Archive::BagIt;
  988. my $bag=Archive::BagIt->new( $my_old_bag_filepath );
  989. $bag->load();
  990. $bag->store();
  991. =cut
  992. =head2 How to create UTF-8 based paths under MS Windows?
  993. For versions < Windows10: I have no idea and suggestions for a portable solution are very welcome!
  994. For Windows 10: Thanks to L<https://superuser.com/questions/1033088/is-it-possible-to-set-locale-of-a-windows-application-to-utf-8/1451686#1451686>
  995. you have to enable UTF-8 support via 'System Administration' -> 'Region' -> 'Administrative'
  996. -> 'Region Settings' -> Flag 'Use Unicode UTF-8 for worldwide language support'
  997. Hint: The better way is to use only portable filenames. See L<perlport> for details.
  998. =cut
  999. __PACKAGE__->meta->make_immutable;
  1000. 1;