A crazy fast scanner for media image files, useful to pre-sort media from digital estates.
Andreas Romeyke eb5539e2b6 - fixed module dependency 1 month ago
bin - fixed header for model output 2 months ago
lib/File/FormatIdentification - added POD for tools 1 month ago
t - fixed tests using correct module 2 months ago
Changes - init 2 months ago
LICENSE.txt Upload files to "" 4 months ago
README.md - added corpus info 2 months ago
dist.ini - fixed module dependency 1 month ago

README.md

crazy-fast-media-image-scan

A crazy fast scanner for media image files, useful to pre-sort media from digital estates.

This is a small research project to evaluate whether a media type scanner based on random sampling, with details at the file level, is feasible.

The ideas are as follows:

  • random sampling to speed up scanning (we need very fast results, not very accurate ones)
  • category check (what kind of data could be there in general?)
  • file type identification using bigram-based estimation, learned by a decision tree over files (using format-corpus https://github.com/openpreserve/format-corpus and Mime::Types)
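
The first and third ideas can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's Perl implementation; the function names, the 512-byte block size, and the sample count are assumptions chosen for illustration. It picks random block-aligned offsets in an image file and counts byte bigrams in each sampled block, which is the kind of feature a decision tree could later be trained on.

```python
import os
import random
from collections import Counter


def sample_offsets(size, block_size, n_samples, rng):
    """Pick up to n_samples distinct block-aligned offsets at random."""
    n_blocks = size // block_size
    if n_blocks == 0:
        return []
    k = min(n_samples, n_blocks)
    # Sorting keeps the reads in ascending order on the medium.
    return sorted(b * block_size for b in rng.sample(range(n_blocks), k))


def bigram_counts(data):
    """Count overlapping byte bigrams, keyed by (first_byte, second_byte)."""
    return Counter(zip(data, data[1:]))


def scan_image(path, block_size=512, n_samples=100, seed=None):
    """Return (offset, top-3 bigrams) for each randomly sampled block."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    results = []
    with open(path, "rb") as fh:
        for off in sample_offsets(size, block_size, n_samples, rng):
            fh.seek(off)
            block = fh.read(block_size)
            results.append((off, bigram_counts(block).most_common(3)))
    return results
```

Only the sampled blocks are read, so the cost depends on the number of samples rather than on the size of the image.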

TODO ideas

  • plot type-based output (color?) to see the distribution across the medium
  • improve autotuning so that a scan takes only a "few seconds"
  • scan EWF images, too
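
One possible reading of the autotune idea is a fixed time budget: keep sampling until a deadline passes, then stop. The sketch below assumes that interpretation; `sample_within_budget`, `read_block`, and the injectable `clock` parameter are hypothetical names, not part of the project.

```python
import time


def sample_within_budget(read_block, offsets, budget_seconds, clock=time.monotonic):
    """Process offsets in order until the time budget is used up.

    read_block is any callable taking an offset (hypothetical helper);
    clock is injectable so the cutoff can be tested deterministically.
    """
    deadline = clock() + budget_seconds
    done = []
    for off in offsets:
        if clock() >= deadline:
            break  # budget exhausted: keep what was sampled so far
        done.append((off, read_block(off)))
    return done
```

If the offsets are shuffled before the call, stopping early still leaves a random sample of the medium.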

This will be the base for an upcoming standalone application.