Configuration · ArchiveBox/ArchiveBox Wiki · GitHub
Nick Sweeting edited this page on May 9, 2024 · [160 revisions](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration/_history)

# Configuration

ArchiveBox is configured by using the `archivebox config` command, by editing the `ArchiveBox.conf` file in the data folder, or by setting environment variables. All three methods work equivalently, including when running ArchiveBox in Docker.
Some equivalent examples of setting a configuration option:

```bash
archivebox config --set CHROME_BINARY=google-chrome-stable
# OR
echo "CHROME_BINARY=google-chrome-stable" >> ArchiveBox.conf
# OR
env CHROME_BINARY=google-chrome-stable archivebox add ~/Downloads/bookmarks_export.html
```

Environment variables take precedence over the config file, which is useful when you only want to apply an option temporarily during a single run. For more examples, see [Usage: Configuration](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#run-archivebox-with-configuration-options).

Available Configuration Options:

- **General Settings**: Archiving process, output format, and timing.
- **Archive Method Toggles**: On/off switches for methods.
- **Archive Method Options**: Method tunables and parameters.
- **Shell Options**: Format & behavior of CLI output.
- **Dependency Options**: Specify exact paths to dependencies.

In case this document is ever out of date, it's recommended to read the code that loads the config directly: [archivebox/config.py](https://github.com/ArchiveBox/ArchiveBox/blob/master/archivebox/config.py#L27)

## General Settings

General options around the archiving process, output format, and timing.

#### `OUTPUT_PERMISSIONS`

Possible Values: [`755`]/`644`/...
Permissions to set the output directory and file contents to.

This is useful when running ArchiveBox inside Docker as root and you need to explicitly set the permissions to something that users on the host can access.

Related options: `PUID` / `PGID`

#### `PUID` / `PGID`

Possible Values: [`911`]/`1000`/...
User and group ownership to set the output directory and file contents to. Only settable as environment variables when using ArchiveBox in Docker.
This is useful on some Docker setups when you want the data dir to be owned by the same UID/GID on the host and inside the container. `PUID=0` is not allowed ([do not run as root](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#do-not-run-as-root)), and `PGID=0` is allowed but not recommended. `PUID`s and `PGID`s below 100 cause many issues because they're often [already in use by an existing Linux user inside Docker](https://github.com/ArchiveBox/ArchiveBox/discussions/1366). If the files must be owned by a low-value ID, e.g. `33` (`www-data`), you may need to use [bindfs](https://github.com/clecherbauer/docker-volume-bindfs) to remap the permissions. If using NFS/SMB/FUSE, make sure the volume allows setting ownership on files (e.g. don't set `root_squash` or `all_squash` on NFS shares).

Learn more:

- https://docs.linuxserver.io/general/understanding-puid-and-pgid/
- https://github.com/ArchiveBox/ArchiveBox/wiki/Troubleshooting#docker-permissions-issues
- https://github.com/ArchiveBox/ArchiveBox/issues/1304
- https://github.com/ArchiveBox/ArchiveBox/discussions/1366
- https://github.com/ArchiveBox/ArchiveBox/blob/main/bin/docker_entrypoint.sh

Related options: `OUTPUT_PERMISSIONS`

#### `ONLY_NEW`

Possible Values: [`True`]/`False`
Toggle whether to re-check old links when adding new ones, or to leave old incomplete links alone and only archive the new links.

By default, ArchiveBox will only archive new links on each import. If you want it to go back through all links in the index and download any missing files on every run, set this to `False`.
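For example, to make future runs retry previously-incomplete links, the setting can be persisted or applied for a single run (a sketch using the config methods shown at the top of this page):

```shell
# persist the setting in ArchiveBox.conf
archivebox config --set ONLY_NEW=False

# or apply it temporarily for one run via an environment variable
env ONLY_NEW=False archivebox update
```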
Note: Regardless of how this is set, ArchiveBox will never re-download sites that have already succeeded previously. When this is `False`, it only attempts to fix previous pages that have missing archive extractor outputs; it does not re-archive pages that have already been successfully archived.

#### `TIMEOUT`

Possible Values: [`60`]/`120`/...
Maximum allowed download time per archive method for each link, in seconds.

If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.

Note: Do not set this to anything less than 15 seconds, as doing so will cause Chrome to hang indefinitely and many sites to fail completely.

#### `MEDIA_TIMEOUT`

Possible Values: [`3600`]/`120`/...
Maximum allowed download time for fetching media when `SAVE_MEDIA=True`, in seconds.

This timeout is separate from and usually much longer than `TIMEOUT` because media downloaded with youtube-dl can often be quite large and take many minutes/hours to download. Tweak this setting based on your network speed and the maximum media file size you plan on downloading.

Note: Do not set this to anything less than 10 seconds, as it can often take 5-10 seconds for youtube-dl just to parse the page before it starts downloading media files.

Related options: `SAVE_MEDIA`

#### `ADMIN_USERNAME` / `ADMIN_PASSWORD`

Possible Values: [`None`]/`"admin"`/...

Only used on first run / initial setup in Docker. ArchiveBox will create an admin user with the specified username and password when these options are found in the environment. Useful for setting up a Docker instance of ArchiveBox without needing to run a shell command to create the admin user.
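For example, these can be passed as environment variables on the container's first run (a sketch; the credentials, data path, and port mapping here are illustrative placeholders):

```shell
# hypothetical first-run example: username, password, and data dir are placeholders
docker run -d \
    -e ADMIN_USERNAME=admin \
    -e ADMIN_PASSWORD=change-me \
    -v ~/archivebox/data:/data \
    -p 8000:8000 \
    archivebox/archivebox
```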
Equivalent to:

```bash
$ archivebox manage createsuperuser
Username: <ADMIN_USERNAME>
Password: <ADMIN_PASSWORD>
Password (again): <ADMIN_PASSWORD>
```

More info:

- https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication
- https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#configuration

Related options: `PUBLIC_INDEX` / `PUBLIC_SNAPSHOTS` / `PUBLIC_ADD_VIEW`

#### `PUBLIC_INDEX` / `PUBLIC_SNAPSHOTS` / `PUBLIC_ADD_VIEW`

Possible Values: [`True`]/`False`
Configure whether or not login is required to use each area of ArchiveBox.

```bash
archivebox manage createsuperuser              # set a password before disabling public access

# these are the default values
archivebox config --set PUBLIC_INDEX=True      # True = allow users to view the main snapshots list without logging in
archivebox config --set PUBLIC_SNAPSHOTS=True  # True = allow users to view snapshot content without logging in
archivebox config --set PUBLIC_ADD_VIEW=False  # True = allow users to submit new URLs to archive without logging in
```

More info:

- https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication
- https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage

#### `CUSTOM_TEMPLATES_DIR`

Possible Values: [`None`]/`/path/to/custom_templates`/...
Path to a directory containing custom html/css/images for overriding the default UI styling.

Files found in the folder at the specified path can override any of the defaults in the [TEMPLATES_DIR](https://github.com/ArchiveBox/ArchiveBox/tree/dev/archivebox/templates) directory (copy files from that default dir into your custom dir to get started making a custom theme). If you've used Django before, this works exactly the same way that Django template overrides work (because ArchiveBox uses Django under the hood).

```bash
pip show -f archivebox | grep Location: | awk '{print $2}'
# /opt/homebrew/lib/python3.11/site-packages

pip show -f archivebox | grep archivebox/templates
# archivebox/templates/admin/app_index.html
# archivebox/templates/admin/base.html
# archivebox/templates/admin/login.html
# ...

# copy default templates into a directory somewhere, edit as needed, then point archivebox to it, e.g.
cp -r /opt/homebrew/lib/python3.11/site-packages/archivebox/templates ~/archivebox/custom_templates
archivebox config --set CUSTOM_TEMPLATES_DIR=~/archivebox/data/custom_templates
```

Related options: `FOOTER_INFO`

#### `REVERSE_PROXY_USER_HEADER`

Possible Values: [`Remote-User`]/`X-Remote-User`/...
HTTP header containing the user name, set by an authenticating reverse proxy.
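For example, a reverse proxy sitting in front of ArchiveBox would authenticate the user itself and forward the username in this header. The following nginx sketch is hypothetical (the basic-auth mechanism, htpasswd path, and upstream address are all assumptions, not ArchiveBox requirements):

```nginx
location / {
    # $remote_user is populated by nginx basic auth in this sketch;
    # any other auth mechanism that sets the header works the same way
    auth_basic "Archive";
    auth_basic_user_file /etc/nginx/htpasswd;
    proxy_set_header Remote-User $remote_user;
    proxy_pass http://127.0.0.1:8000;
}
```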
More info:

- https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication
- https://github.com/ArchiveBox/ArchiveBox/pull/866

Related options: `REVERSE_PROXY_WHITELIST`, `LOGOUT_REDIRECT_URL`

#### `REVERSE_PROXY_WHITELIST`

Possible Values: [``]/`172.16.0.0/16,2001:d80::/26`/...
Comma-separated list of IP CIDR ranges that are allowed to use reverse proxy authentication. IPv4 and IPv6 CIDRs can be used next to each other. An empty string means "deny all".

More info:

- https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication
- https://github.com/ArchiveBox/ArchiveBox/pull/866

Related options: `REVERSE_PROXY_USER_HEADER`, `LOGOUT_REDIRECT_URL`

#### `LOGOUT_REDIRECT_URL`

Possible Values: [`/`]/`https://example.com/some/other/app`/...
URL to redirect users back to on logout when using reverse proxy authentication.

More info:

- https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication
- https://github.com/ArchiveBox/ArchiveBox/pull/866

Related options: `REVERSE_PROXY_USER_HEADER`, `REVERSE_PROXY_WHITELIST`

#### `LDAP`

Possible Values: [`False`]/`True`
Whether to use an external [LDAP](https://jumpcloud.com/blog/what-is-ldap-authentication) server for authentication (e.g. OpenLDAP, MS Active Directory, OpenDJ, etc.).
```bash
# first, install the optional ldap addon to use this feature
pip install archivebox[ldap]
```

Then set these configuration values to finish configuring LDAP:

```ini
LDAP: True
LDAP_SERVER_URI: "ldap://ldap.example.com:3389"
LDAP_BIND_DN: "ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_BIND_PASSWORD: "secret-bind-user-password"
LDAP_USER_BASE: "ou=users,ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_USER_FILTER: "(objectClass=user)"
LDAP_USERNAME_ATTR: "uid"
LDAP_FIRSTNAME_ATTR: "givenName"
LDAP_LASTNAME_ATTR: "sn"
LDAP_EMAIL_ATTR: "mail"
```

More info:

- https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication
- https://github.com/ArchiveBox/ArchiveBox/pull/1214
- https://github.com/django-auth-ldap/django-auth-ldap#example-configuration
- https://jumpcloud.com/blog/what-is-ldap-authentication

#### `SNAPSHOTS_PER_PAGE`

Possible Values: [`40`]/`100`/...
Maximum number of Snapshots to show per page on Snapshot list pages. Lower this value on slower machines to make the UI faster.

Related options: `SEARCH_BACKEND_TIMEOUT`

#### `FOOTER_INFO`

Possible Values: [`Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.`]/`Operated by ACME Co.`/...
Some text to display in the footer of the archive index.
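For example (the footer text and contact address here are illustrative):

```shell
archivebox config --set FOOTER_INFO='Operated by ACME Co. Contact admin@example.com for takedown requests.'
```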
Useful for providing server admin contact info to respond to takedown requests.

Related options: `TEMPLATES_DIR`

#### `URL_DENYLIST`

Possible Values: [`\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$`]/`.+\.exe$`/`http(s)?:\/\/(.+)?example.com\/.*`/...

A regex used to exclude certain URLs from archiving. Use it if there are certain domains, extensions, or other URL patterns that you want to ignore whenever they get imported. Denylisted URLs won't be included in the index, and their page content won't be archived.

When building your exclusion list, you can check whether a given URL matches your regex in Python like so:

```python
>>> import re
>>> URL_DENYLIST = r'^http(s)?:\/\/(.+\.)?(youtube\.com)|(amazon\.com)\/.*$'  # replace this with your regex to test
>>> URL_DENYLIST_PTN = re.compile(URL_DENYLIST, re.IGNORECASE | re.UNICODE | re.MULTILINE)
>>> bool(URL_DENYLIST_PTN.search('https://test.youtube.com/example.php?abc=123'))  # replace this with the URL to test
True  # this URL would not be archived because it matches the exclusion pattern
```

Note: all assets required to render each page are still archived; `URL_DENYLIST`/`URL_ALLOWLIST` do not apply to images, css, video, etc. visible inline within the page.

Note 2: These options were called `URL_WHITELIST` & `URL_BLACKLIST` before [v0.7.1](https://github.com/ArchiveBox/ArchiveBox/releases).

Related options: `URL_ALLOWLIST`, `SAVE_MEDIA`, `SAVE_GIT`, `GIT_DOMAINS`

#### `URL_ALLOWLIST`

Possible Values: [`None`]/`^http(s)?:\/\/(.+)?example\.com\/?.*$`/...
A regex used to exclude all URLs that don't match the given pattern from archiving. Use it if there are certain domains, extensions, or other URL patterns that you want to restrict the scope of archiving to (e.g. to only archive a single domain, subdirectory, or filetype).

When building your allowlist, you can check whether a given URL matches your regex in Python like so:

```python
>>> import re
>>> URL_ALLOWLIST = r'^http(s)?:\/\/(.+)?example\.com\/?.*$'  # replace this with your regex to test
>>> URL_ALLOWLIST_PTN = re.compile(URL_ALLOWLIST, re.IGNORECASE | re.UNICODE | re.MULTILINE)
>>> bool(URL_ALLOWLIST_PTN.search('https://test.example.com/example.php?abc=123'))
True  # this URL would be archived
>>> bool(URL_ALLOWLIST_PTN.search('https://test.youtube.com/example.php?abc=123'))
False  # this URL would be excluded from archiving
```

This option is useful for recursively archiving all the pages under a given domain or subfolder (aka crawling/spidering), without following links to external domains / parent folders.

```bash
# temporarily enforce an allowlist by setting the option as an environment variable
export URL_ALLOWLIST='^http(s)?:\/\/(.+)?example\.com\/?.*$'

# then run your archivebox commands in the same shell
archivebox add --depth=1 'https://example.com'
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
archivebox list https://example.com | archivebox add --depth=1
# repeat up to desired depth ...

# all URLs that don't match *.example.com will be excluded, e.g. a link to youtube.com would not be followed
```

Note: all assets required to render each page are still archived; `URL_DENYLIST`/`URL_ALLOWLIST` do not apply to images, css, video, etc. visible inline within the page.

Related options: `URL_DENYLIST`, `SAVE_MEDIA`, `SAVE_GIT`, `GIT_DOMAINS`

## Archive Method Toggles

High-level on/off switches for all the various methods used to archive URLs.

#### `SAVE_TITLE`

Possible Values: [`True`]/`False`
By default ArchiveBox uses the title provided by the import file, but not all types of imports provide titles (e.g. plain-text lists of URLs). When this is `True`, ArchiveBox downloads the page (following all redirects), then attempts to parse the link's title from the first `<title>` tag found in the response. This may be buggy or fail on certain sites that use JS to set the title; disabling it will cause links imported without a title to show their URL as the title in the UI.

Related options: `ONLY_NEW`, `CHECK_SSL_VALIDITY`

#### `SAVE_FAVICON`

Possible Values: [`True`]/`False`
Fetch and save the favicon for the URL from Google's public favicon service: `https://www.google.com/s2/favicons?domain={domain}`. Set this to `False` if you don't need favicons.

Related options: `TEMPLATES_DIR`, `CHECK_SSL_VALIDITY`, `CURL_BINARY`

#### `SAVE_WGET`

Possible Values: [`True`]/`False`
Fetch the page with wget, and save responses into folders for each domain, e.g.
`example.com/index.html`, with `.html` appended if not present.

For a full list of options used during the wget download process, see the `archivebox/archive_methods.py:save_wget(...)` function.

Related options: `TIMEOUT`, `SAVE_WGET_REQUISITES`, `CHECK_SSL_VALIDITY`, `COOKIES_FILE`, `WGET_USER_AGENT`, `SAVE_WARC`, `WGET_BINARY`

#### `SAVE_WARC`

Possible Values: [`True`]/`False`
Save a timestamped WARC archive of all the page requests and responses during the wget archive process.

Related options: `TIMEOUT`, `SAVE_WGET_REQUISITES`, `CHECK_SSL_VALIDITY`, `COOKIES_FILE`, `WGET_USER_AGENT`, `SAVE_WGET`, `WGET_BINARY`

#### `SAVE_PDF`

Possible Values: [`True`]/`False`
Print the page as a PDF.

Related options: `TIMEOUT`, `CHECK_SSL_VALIDITY`, `CHROME_USER_DATA_DIR`, `CHROME_BINARY`

#### `SAVE_SCREENSHOT`

Possible Values: [`True`]/`False`
Fetch a screenshot of the page.

Related options: `RESOLUTION`, `TIMEOUT`, `CHECK_SSL_VALIDITY`, `CHROME_USER_DATA_DIR`, `CHROME_BINARY`

#### `SAVE_DOM`

Possible Values: [`True`]/`False`
Fetch a DOM dump of the page.

Related options: `TIMEOUT`, `CHECK_SSL_VALIDITY`, `CHROME_USER_DATA_DIR`, `CHROME_BINARY`

#### `SAVE_SINGLEFILE`

Possible Values: [`True`]/`False`
Fetch an HTML file with all assets embedded using [SingleFile](https://github.com/gildas-lormeau/SingleFile).

Related options: `TIMEOUT`, `CHECK_SSL_VALIDITY`, `CHROME_USER_DATA_DIR`, `CHROME_BINARY`, `SINGLEFILE_BINARY`

#### `SAVE_READABILITY`

Possible Values: [`True`]/`False`
Extract article text, summary, and byline using Mozilla's [Readability](https://github.com/mozilla/readability) library. Unlike the other methods, this does not download any additional files, so it's practically free from a disk usage perspective. It works by piping any existing downloaded HTML version (e.g. wget, DOM dump, SingleFile) into Readability.

Related options: `TIMEOUT`, `SAVE_WGET`, `SAVE_DOM`, `SAVE_SINGLEFILE`, `SAVE_MERCURY`

#### `SAVE_MERCURY`

Possible Values: [`True`]/`False`
Extract article text, summary, and byline using the [Mercury](https://github.com/postlight/mercury-parser) library.
Unlike the other methods, this does not download any additional files, so it's practically free from a disk usage perspective. It works by piping any existing downloaded HTML version (e.g. wget, DOM dump, SingleFile) into Mercury.

Related options: `TIMEOUT`, `SAVE_WGET`, `SAVE_DOM`, `SAVE_SINGLEFILE`, `SAVE_READABILITY`

#### `SAVE_GIT`

Possible Values: [`True`]/`False`
Fetch any git repositories on the page.

Related options: `TIMEOUT`, `GIT_DOMAINS`, `CHECK_SSL_VALIDITY`, `GIT_BINARY`

#### `SAVE_MEDIA`

Possible Values: [`True`]/`False`
Fetch all audio, video, annotations, and media metadata on the page using `youtube-dl`. Warning: this can use up a lot of storage very quickly.

Related options: `MEDIA_TIMEOUT`, `CHECK_SSL_VALIDITY`, `YOUTUBEDL_BINARY`

#### `SAVE_ARCHIVE_DOT_ORG`

Possible Values: [`True`]/`False`
Submit the page's URL to be archived on Archive.org (the Internet Archive).

Related options: `TIMEOUT`, `CHECK_SSL_VALIDITY`, `CURL_BINARY`

## Archive Method Options

Specific options for the individual archive methods above. Some of these are shared between multiple archive methods; others are specific to a single method.

#### `CHECK_SSL_VALIDITY`

Possible Values: [`True`]/`False`
Whether to enforce HTTPS certificate and HSTS chain of trust when archiving sites. Set this to `False` if you want to archive pages even if they have expired or invalid certificates. Be aware that when this is `False`, you cannot guarantee that you haven't been man-in-the-middle'd while archiving content, so the content cannot be verified to match what's on the original site.

#### `SAVE_WGET_REQUISITES`

Possible Values: [`True`]/`False`
Fetch images/css/js with wget. (`True` is highly recommended; otherwise you won't download many of the critical assets needed to render the page, like images, js, css, etc.)

Related options: `TIMEOUT`, `SAVE_WGET`, `SAVE_WARC`, `WGET_BINARY`

#### `RESOLUTION`

Possible Values: [`1440,2000`]/`1024,768`/...
Screenshot resolution in pixels, as `width,height`.
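For example, to capture wider, full-HD-sized screenshots instead of the default tall portrait size:

```shell
archivebox config --set RESOLUTION=1920,1080
```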
Related options: `SAVE_SCREENSHOT`

#### `CURL_USER_AGENT`

Possible Values: [`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/) curl/{CURL_VERSION}`]/`"Mozilla/5.0 ..."`/...
The user agent to use during curl archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you're getting blocked by servers for having an unknown/blacklisted user agent.

Related options: `USE_CURL`, `SAVE_TITLE`, `CHECK_SSL_VALIDITY`, `CURL_BINARY`, `WGET_USER_AGENT`, `CHROME_USER_AGENT`

#### `WGET_USER_AGENT`

Possible Values: [`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/) wget/{WGET_VERSION}`]/`"Mozilla/5.0 ..."`/...
The user agent to use during wget archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you're getting blocked by servers for having an unknown/blacklisted user agent.

Related options: `SAVE_WGET`, `SAVE_WARC`, `CHECK_SSL_VALIDITY`, `WGET_BINARY`, `CHROME_USER_AGENT`

#### `CHROME_USER_AGENT`

Possible Values: [`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)`]/`"Mozilla/5.0 ..."`/...
The user agent to use during Chrome headless archiving. If you're frequently being blocked by sites, you can set this to hide the `Headless` string that reveals to servers that you're using a headless browser.
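For example, to make headless Chrome present itself as an ordinary desktop browser (the exact UA string below is illustrative; substitute whatever current browser string you want to impersonate):

```shell
archivebox config --set CHROME_USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'
```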
Related options: `SAVE_PDF`, `SAVE_SCREENSHOT`, `SAVE_DOM`, `CHECK_SSL_VALIDITY`, `CHROME_USER_DATA_DIR`, `CHROME_HEADLESS`, `CHROME_BINARY`, `WGET_USER_AGENT`

#### `GIT_DOMAINS`

Possible Values: [`github.com,bitbucket.org,gitlab.com,gist.github.com,codeberg.org,gitea.com,git.sr.ht`]/`git.example.com`/...
Domains on which to attempt downloading git repositories using `git clone`.

Related options: `SAVE_GIT`, `CHECK_SSL_VALIDITY`

#### `COOKIES_FILE`

Possible Values: [`None`]/`/path/to/cookies.txt`/...

Cookies file to pass to `wget`, `curl`, `yt-dlp`, and other extractors that don't use Chrome (with its `CHROME_USER_DATA_DIR`) for authentication.

To capture sites that require a user to be logged in, point this option at a [netscape-format](http://www.cookiecentral.com/faq/#3.5) `cookies.txt` file containing all the cookies you want to use during archiving. You can generate this `cookies.txt` file using one of several [browser extensions](https://chromewebstore.google.com/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc) that export cookies in this format, or by using wget on the command line with `--save-cookies` + `--user=...` `--password=...`.

**Warning**: Make sure you use separate burner credentials dedicated to archiving, e.g. don't re-use your normal daily Facebook/Instagram/Youtube/etc. account cookies, as server responses often contain your name/email/PII, session tokens, etc., which then get preserved in your snapshots! Future viewers of your archive may be able to use any reflected [archived session tokens](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#%EF%B8%8F-things-to-watch-out-for-%EF%B8%8F) to log in as you, or at the very least, associate the content with your real identity. Even if this tradeoff seems acceptable now, or you plan to keep your archive data private, you may want to share a snapshot with others in the future, and snapshots are very hard to sanitize/anonymize after-the-fact!
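To generate a `cookies.txt` with the wget flags mentioned above, one possible flow looks like this (the login URL and burner credentials are placeholders):

```shell
# log in once and save the resulting session cookies (hypothetical login endpoint)
wget --save-cookies=/data/cookies.txt --keep-session-cookies \
     --user=burner-account --password=burner-password \
     https://example.com/login

# then tell ArchiveBox to pass them to the wget/curl/yt-dlp extractors
archivebox config --set COOKIES_FILE=/data/cookies.txt
```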
Related options: `SAVE_WGET`, `SAVE_WARC`, `CHECK_SSL_VALIDITY`, `WGET_BINARY`

#### `CHROME_USER_DATA_DIR`

Possible Values: [`~/.config/google-chrome`]/`/tmp/chrome-profile`/...

Path to a [Chrome user profile directory](https://chromium.googlesource.com/chromium/src/+/HEAD/docs/user_data_dir.md).

To capture sites that require a user to be logged in, you can specify a path to a Chrome user profile (which loads the cookies needed for the user to be logged in). If you don't have an existing Chrome profile, create one with `chromium-browser --user-data-dir=/tmp/chrome-profile`, and log into the sites you need. Then set `CHROME_USER_DATA_DIR=/tmp/chrome-profile` to make ArchiveBox use that profile.

For a guide on how to set this up, see our [Chromium Install: Setting up a profile](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile) wiki.

Note: Make sure the path does not have `Default` at the end (it should be the parent folder of `Default`), e.g. set it to `CHROME_USER_DATA_DIR=~/.config/chromium` and not `CHROME_USER_DATA_DIR=~/.config/chromium/Default`.

**Warning**: Make sure you use separate burner credentials dedicated to archiving, e.g. don't log in with your normal daily Facebook/Instagram/Youtube/etc. accounts, as server responses and page content will often contain your name/email/PII, session cookies, private tokens, etc., which then get preserved in your snapshots! Future viewers of your archive may be able to use any reflected [archived session tokens](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#%EF%B8%8F-things-to-watch-out-for-%EF%B8%8F) to log in as you, or at the very least, associate the content with your real identity. Even if this tradeoff seems acceptable now, or you plan to keep your archive data private, you may want to share a snapshot with others in the future, and snapshots are very hard to sanitize/anonymize after-the-fact!

When set to `None`, ArchiveBox