Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
Added
Changed
Deprecated
Removed
Fixed
Security
[4.9.9] - 2025-05-28
Added
Changed
- apache-beam version is pinned at <2.65.0 until related tests are fixed, see issue 11055.
Deprecated
Removed
Fixed
- CroissantBuilder now supports Croissant files without patch version (i.e. only {major.minor} are provided).
- Various small bug fixes.
Security
[4.9.8] - 2025-03-13
Added
- New Beam writer NoShuffleBeamWriter that doesn't shuffle, which speeds up dataset generation significantly, but does not have deterministic order guarantees. Can be enabled with the flag --nondeterministic_order.
- CroissantBuilder now supports Croissant files that define splits; and new feature types: feature dictionaries and multidimensional arrays.
- New datasets.
Changed
Deprecated
Removed
Fixed
- Various small bug fixes.
- Various performance improvements.
Security
[4.9.7] - 2024-10-30
Added
- New datasets.
Changed
- CroissantBuilder's API to generate TFDS datasets from Croissant files.
Deprecated
Removed
Fixed
- Versions for existing datasets.
Security
[4.9.6] - 2024-06-04
Added
- Full support for Python 3.12.
[4.9.5] - 2024-05-30
Added
- Support to download and prepare datasets using the Parquet data format.
    import tensorflow_datasets as tfds
    builder = tfds.builder('fashion_mnist', file_format='parquet')
    builder.download_and_prepare()
    ds = builder.as_dataset(split='train')
    print(next(iter(ds)))
- tfds.data_source is picklable, thus working smoothly with PyGrain. Learn more by following the tutorial.
- TFDS plays nicely with Croissant. Learn more by following the recipe.
Changed
Deprecated
Removed
Fixed
Security
[4.9.4] - 2023-12-16
Added
- A new CroissantBuilder which initializes a DatasetBuilder based on a Croissant metadata file.
- New conversion options between different bounding boxes formats.
- Better support for HuggingfaceDatasetBuilder.
- A script to convert a dataset from one format to another.
Changed
Deprecated
- Python 3.9 support. TFDS now uses Python 3.10
Removed
Fixed
Security
[4.9.3] - 2023-09-08
Added
- Segment Anything (SA-1B) dataset.
Changed
- Hugging Face datasets accept None values for any features. TFDS has no tfds.features.Optional, so None values are converted to default values. Those default values used to be 0 and 0.0 for int and float. Now, it's -inf as defined by NumPy (e.g., np.iinfo(np.int32).min or np.finfo(np.float32).min). This avoids ambiguous values when 0 and 0.0 exist in the values of the dataset. The roadmap is to implement tfds.features.Optional.
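As a rough illustration of these default values, the sketch below fills None entries with the dtype minimum reported by NumPy. The helper names default_for and fill_none are hypothetical and are not part of TFDS; the conversion code inside TFDS may differ.

```python
import numpy as np


def default_for(dtype):
    """Return the smallest representable value of an int or float dtype."""
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.integer):
        return np.iinfo(dtype).min
    if np.issubdtype(dtype, np.floating):
        return np.finfo(dtype).min
    raise TypeError(f"No numeric default for dtype {dtype}")


def fill_none(values, dtype):
    """Replace None entries with the dtype minimum (hypothetical helper)."""
    sentinel = default_for(dtype)
    return np.array([sentinel if v is None else v for v in values], dtype=dtype)


# A real 0 stays distinguishable from a missing value.
ints = fill_none([0, None, 3], np.int32)
floats = fill_none([0.0, None], np.float32)
```

Because the sentinel sits at the edge of the dtype's range, a genuine 0 or 0.0 in the data is no longer confused with a missing value.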
Deprecated
- Python 3.8 support. As per NEP 29, TFDS now uses Python>=3.9.
Removed
Fixed
Security
[4.9.2] - 2023-04-13
Added
- [Experimental] A list of freeform text tags can now be attached to a BuilderConfig. For example:
    BUILDER_CONFIGS = [
        tfds.core.BuilderConfig(name="foo", tags=["foo", "live"]),
        tfds.core.BuilderConfig(name="bar", tags=["bar", "old"]),
    ]
  The tags are recorded with the dataset metadata and can later be retrieved using the info object:
    builder.info.config_tags  # ["foo", "live"]
  This feature is experimental and there are no guidelines on tags format.
Changed
Deprecated
Removed
Fixed
- Fixed generated proto files (see issue 4858).
Security
[4.9.1] - 2023-04-11
Added
Changed
Deprecated
Removed
Fixed
- The installation on macOS now works (see issues 4805 and 4852). The ArrayRecord dependency is lazily loaded, so the TensorFlow-less path is not possible at the moment on macOS. A fix for this will follow soon.
Security
[4.9.0] - 2023-04-04
Added
- Native support for JAX and PyTorch. TensorFlow is no longer a dependency for reading datasets. See the documentation.
- Added minival split to LVIS dataset.
- Mixed-human and machine-generated robomimic datasets.
- WebVid dataset.
- ImagenetPI dataset.
- Wikipedia for 20230201.
Changed
- Support for tensorflow=2.12.
Deprecated
Removed
Fixed
Security
[4.8.3] - 2023-02-27
Added
Changed
Deprecated
- Python 3.7 support: this version and future versions use Python 3.8.
Removed
Fixed
- Flag ignore_verifications from Hugging Face's datasets.load_dataset is deprecated, and used to cause errors in tfds.load(huggingface:foo).
Security
[4.8.2] - 2023-01-17
Deprecated
- Python 3.7 support: this is the last version of TFDS supporting Python 3.7. Future versions will use Python 3.8.
Fixed
- tfds new and tfds build better support the new recommended datasets organization, where individual datasets have their own package under datasets/, builder class is called Builder and is defined within module ${dsname}_dataset_builder.py.
Security
[4.8.1] - 2023-01-02
Changed
- Added file valid_tags.txt to not break builds.
- TFDS no longer relies on TensorFlow DTypes. We chose NumPy DTypes to keep the typing expressiveness, while dropping the heavy dependency on TensorFlow. We migrated all our internal datasets. Please, migrate accordingly:
  - tf.bool: np.bool_
  - tf.string: np.str_
  - tf.int64, tf.int32, etc: np.int64, np.int32, etc
  - tf.float64, tf.float32, etc: np.float64, np.float32, etc
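The migration list above can be kept as a small lookup table while porting dataset code. This is only a sketch: the table TF_TO_NUMPY and the helper to_numpy_dtype are hypothetical names, the keys are plain strings so that TensorFlow is not needed, and only the dtypes named in the list are covered.

```python
import numpy as np

# Entries copied from the migration list; "etc" cases are not expanded here.
TF_TO_NUMPY = {
    "tf.bool": np.bool_,
    "tf.string": np.str_,
    "tf.int64": np.int64,
    "tf.int32": np.int32,
    "tf.float64": np.float64,
    "tf.float32": np.float32,
}


def to_numpy_dtype(tf_dtype_name):
    """Look up the NumPy replacement for a TensorFlow dtype name."""
    try:
        return TF_TO_NUMPY[tf_dtype_name]
    except KeyError:
        raise ValueError(f"No mapping listed for {tf_dtype_name!r}") from None


string_dtype = to_numpy_dtype("tf.string")
```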
[4.8.0] - 2022-12-21
Added
- [API] DatasetBuilder's description and citations can be specified in dedicated README.md and CITATIONS.bib files, within the dataset package (see https://www.tensorflow.org/datasets/add_dataset).
- Tags can be associated to Datasets, in the TAGS.txt file. For now, they are only used in the generated documentation.
- [API][Experimental] New ViewBuilder to define datasets as transformations of existing datasets. Also adds tfds.transform with functionality to apply transformations.
- Loggers are also called on tfds.as_numpy(...), base Logger class has a new corresponding method.
- tfds.core.DatasetBuilder can have a default limit for the number of simultaneous downloads. tfds.download.DownloadConfig can override it.
- tfds.features.Audio supports storing raw audio data for lazy decoding.
- The number of shards can be overridden when preparing a dataset: builder.download_and_prepare(download_config=tfds.download.DownloadConfig(num_shards=42)). Alternatively, you can configure the min and max shard size if you want TFDS to compute the number of shards for you, but want to have control over the shard sizes.
Changed
Deprecated
Removed
Fixed
Security
[4.7.0] - 2022-10-04
Added
- [API] Added TfDataBuilder that is handy for storing experimental ad hoc TFDS datasets in notebook-like environments such that they can be versioned, described, and easily shared with teammates.
- [API] Added options to create format-specific dataset builders. The new API now includes a number of NLP-specific builders, such as:
- [API] Added tfds.beam.inc_counter to reduce beam.metrics.Metrics.counter boilerplate
- [API] Added options to group together existing TFDS datasets into dataset collections and to perform simple operations over them.
- [Documentation] update, specifically:
- [TFDS CLI] Supports custom config through JSON (e.g. tfds build my_dataset --config='{"name": "my_custom_config", "description": "Abc"}')
- New datasets:
- conll2003
- universal_dependency 2.10
- bucc
- i_naturalist2021
- mtnt Machine Translation of Noisy Text.
- placesfull
- tatoeba
- user_libri_audio
- user_libri_text
- xtreme_pos
- yahoo_ltrc
- Updated datasets:
- C4 was updated to version 3.1.
- common_voice was updated to a more recent snapshot.
- wikipedia was updated with the 20220620 snapshot.
- New dataset collections, such as xtreme and LongT5
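For the JSON config passed to the TFDS CLI in the list above, quoting by hand is easy to get wrong. The sketch below assembles the same command line with the standard library; build_command is a hypothetical helper, and only the tfds build ... --config=... form shown in the changelog entry is assumed.

```python
import json
import shlex


def build_command(dataset, config):
    """Return a shell-safe `tfds build` command with a JSON config."""
    config_json = json.dumps(config)
    return f"tfds build {shlex.quote(dataset)} --config={shlex.quote(config_json)}"


command = build_command(
    "my_dataset",
    {"name": "my_custom_config", "description": "Abc"},
)
print(command)
```

shlex.split(command) recovers the original JSON text as a single argument, which is a convenient sanity check before running the command.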
Changed
- The base Logger class expects more information to be passed to the as_dataset method. This should only be relevant to people who have implemented and registered custom Logger class(es).
- You can set DEFAULT_BUILDER_CONFIG_NAME in a DatasetBuilder to change the default config if it shouldn't be the first builder config defined in BUILDER_CONFIGS.
Deprecated
Removed
Fixed
- Various datasets
- In Linux, when loading a dataset from a directory that is not your home (~) directory, a new ~ directory is not created in the current directory (fixes #4117).
Security
[4.6.0] - 2022-06-01
Added
- Support for community datasets on GCS.
- [API] tfds.builder_from_directory and tfds.builder_from_directories, see https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder.
- [API] Dash ("-") support in split names.
- [API] file_format argument to download_and_prepare method, allowing user to specify an alternative file format to store prepared data (e.g. "riegeli").
- [API] file_format to DatasetInfo string representation.
- [API] Expose the return value of Beam pipelines. This allows for users to read the Beam metrics.
- [API] Expose Feature tf_example_spec to public.
- [API] doc kwarg on Features, to describe a feature.
- [Documentation] Features description is shown on TFDS Catalog.
- [Documentation] More metadata about HuggingFace datasets in TFDS catalog.
- [Performance] Parallel load of metadata files.
- [Testing] TFDS tests are now run using GitHub actions - misc improvements such as caching and sharding.
- [Testing] Improvements to MockFs.
- New datasets.
Changed
- [API] num_shards is now optional in the shard name.
Removed
- TFDS pathlib API, migrated to a self-contained etils.epath (see https://github.com/google/etils).
Fixed
- Various datasets.
- Dataset builders that are defined adhoc (e.g. in Colab).
- Better DatasetNotFoundError messages.
- Don't set deterministic on a global level but locally in interleave, so it only applies to interleave and not all transformations.
- Google drive downloader.
[4.5.2] - 2022-01-31
Added
- [API] split=tfds.split_for_jax_process('train') (alias of tfds.even_splits('train', n=jax.process_count())[jax.process_index()]).
- [Documentation] update.
Fixed
- Import bug on Windows (#3709).
[4.5.0] - 2022-01-25
Added
- [API] Better split API:
  - Splits can be selected using shards: split='train[3shard]'.
  - Underscore supported in numbers for better readability: split='train[:500_000]'.
  - Select the union of all splits with split='all'.
  - tfds.even_splits is more precise and flexible:
    - Return splits exactly of the same size when passed tfds.even_splits('train', n=3, drop_remainder=True).
    - Works on subsplits tfds.even_splits('train[:75%]', n=3) or even nested.
    - Can be composed with other splits: tfds.even_splits('train', n=3)[0] + 'test'.
- [API] serialize_example / deserialize_example methods on features to encode/decode example to proto: example_bytes = features.serialize_example(example_data).
- [API] Audio feature now supports encoding='zlib' for better compression.
- [API] Features specs are exposed in proto for better compatibility with other languages.
- [API] Create beam pipeline using TFDS as input with tfds.beam.ReadFromTFDS.
- [API] Support setting the file formats in tfds build --file_format=tfrecord.
- [API] Typing annotations exposed in tfds.typing.
- [API] tfds.ReadConfig has a new assert_cardinality=False argument to disable cardinality.
- [API] tfds.display_progress_bar(True) for functional control.
- [API] DatasetInfo exposes .release_notes.
- Support for huge number of shards (>99999).
- [Performance] Faster dataset generation (using tfrecords).
- [Testing] Mock dataset now supports nested datasets
- [Testing] Customize the number of sub examples
- [Documentation] Community datasets: https://www.tensorflow.org/datasets/community_catalog/overview.
- [Documentation] Guide on TFDS and determinism.
- [RLDS] Support for nested datasets features.
- [RLDS] New datasets: Robomimic, D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes.
- New datasets.
Deprecated
- Python 3.6 support: this is the last version of TFDS supporting Python 3.6. Future versions will use Python 3.7.
Fixed
- Misc bugs.
[4.4.0] - 2021-07-28
Added
- [API] PartialDecoding support, to decode only a subset of the features (for performances).
- [API] tfds.features.LabeledImage for semantic segmentation (like image but with additional info.features['image_label'].name label metadata).
- [API] float32 support for tfds.features.Image (e.g. for depth map).
- [API] Loading datasets from files now supports custom tfds.features.FeatureConnector.
- [API] All FeatureConnector can now have a None dimension anywhere (previously restricted to the first position).
- [API] tfds.features.Tensor() can have arbitrary number of dynamic dimension (Tensor(..., shape=(None, None, 3, None))).
- [API] tfds.features.Tensor can now be serialised as bytes, instead of float/int values (to allow better compression): Tensor(..., encoding='zlib').
- [API] Support for datasets with None in tfds.as_numpy.
- Script to add TFDS metadata files to existing TF-record (see doc).
- [TESTING] tfds.testing.mock_data now supports:
  - non-scalar tensors with dtype tf.string;
  - builder_from_files and path-based community datasets.
- [Documentation] Catalog now exposes links to KnowYourData visualisations.
- [Documentation] Guide on common implementation gotchas.
- Many new reinforcement learning datasets.
Changed
- [API] Datasets generated with disable_shuffling=True are now read in generation order.
Fixed
- File format automatically restored (for datasets generated with tfds.builder(..., file_format=)).
- Dynamically set number of worker threads during extraction.
- Update progress bar during download even if downloads are cached.
- Misc bug fixes.
[4.3.0] - 2021-05-06
Added
- [API] dataset.info.splits['train'].num_shards to expose the number of shards to the user.
- [API] tfds.features.Dataset to have a field containing sub-datasets (e.g. used in RL datasets).
- [API] dtype and tf.uint16 support in tfds.features.Video.
- [API] DatasetInfo.license field to add redistributing information.
- [API] .copy, .format methods to GPath objects.
- [Performances] tfds.benchmark(ds) (compatible with any iterator, not just tf.data, better colab representation).
- [Performances] Faster tfds.as_numpy() (avoid extra tf.Tensor <> np.array copy).
- [Testing] Support for custom BuilderConfig in DatasetBuilderTest.
- [Testing] DatasetBuilderTest now has a dummy_data class property which can be used in setUpClass.
- [Testing] add_tfds_id and cardinality support to tfds.testing.mock_data.
- [Documentation] Better tfds.as_dataframe visualisation (Sequence, ragged tensor, semantic masks with use_colormap).
- [Experimental] Community datasets support. To allow dynamically import datasets defined outside the TFDS repository.
- [Experimental] Hugging-face compatibility wrapper to use Hugging-face datasets directly in TFDS.
- [Experimental] Riegeli format support.
- [Experimental] DatasetInfo.disable_shuffling to force examples to be read in generation order.
- New datasets.
Fixed
- Many bugs.
[4.2.0] - 2021-01-06
Added
- [CLI] tfds build to the CLI. See documentation.
- [API] tfds.features.Dataset to represent nested datasets.
- [API] tfds.ReadConfig(add_tfds_id=True) to add a unique id to the example ex['tfds_id'] (e.g. b'train.tfrecord-00012-of-01024__123').
- [API] num_parallel_calls option to tfds.ReadConfig to overwrite the default AUTOTUNE option.
- [API] tfds.ImageFolder support for tfds.decode.SkipDecoder.
- [API] Multichannel audio support to tfds.features.Audio.
- [API] try_gcs to tfds.builder(..., try_gcs=True)
- Better tfds.as_dataframe visualization (ffmpeg video if installed, bounding boxes,...).
- [TESTING] Allow max_examples_per_splits=0 in tfds build --max_examples_per_splits=0 to test _split_generators only (without _generate_examples).
- New datasets.
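The tfds_id example in the list above looks like a shard file name followed by a double underscore and an integer. Assuming that layout (it is inferred from the single example in this changelog, not from a documented guarantee), an id can be split as in the sketch below. parse_tfds_id is a hypothetical helper, not a TFDS function.

```python
def parse_tfds_id(tfds_id):
    """Split an id such as b'train.tfrecord-00012-of-01024__123'.

    Returns (shard_name, index). The '<shard>__<index>' layout is an
    assumption based on the changelog example.
    """
    text = tfds_id.decode("utf-8") if isinstance(tfds_id, bytes) else tfds_id
    # rpartition keeps any earlier '__' inside the shard name intact.
    shard_name, separator, index = text.rpartition("__")
    if not separator or not shard_name or not index.isdigit():
        raise ValueError(f"Unexpected tfds_id layout: {text!r}")
    return shard_name, int(index)


shard, position = parse_tfds_id(b"train.tfrecord-00012-of-01024__123")
```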
Changed
- [API] DownloadManager now returns Pathlib-like objects.
- [API] Simpler BuilderConfig definition: class VERSION and RELEASE_NOTES are applied to all BuilderConfig. Config description is now optional.
- [API] To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys (non-deterministic) and restrict key to str, bytes and int). New errors likely indicate an issue in the dataset implementation.
- [API] tfds.core.benchmark now returns a pd.DataFrame (instead of a dict).
- [API] tfds.units is not visible anymore from the public API.
- Datasets updates.
Deprecated
Removed
- Configs for all text datasets. Only plain text version is kept. For example: multi_nli/plain_text -> multi_nli.
Fixed
- [API] Datasets returned by tfds.as_numpy are compatible with len(ds).
- Support 0-len sequence with images of dynamic shape (Fix #2616).
- Progression bar correctly updated when copying files.
- Better debugging and error message (e.g. human readable size,...).
- Many bug fixes (GPath consistency with pathlib, s3 compatibility, TQDM visual artifacts, GCS crash on windows, re-download when checksums updated, ...).
[4.1.0] - 2020-11-04
Added
- It is now easier to create datasets outside TFDS repository (see our updated dataset creation guide).
- When generating a dataset, if download fails for any reason, it is now possible to manually download the data. See doc.
- tfds.core.as_path to create pathlib.Path-like objects compatible with GCS (e.g. tfds.core.as_path('gs://my-bucket/labels.csv').read_text()).
- verify_ssl= option to tfds.download.DownloadConfig to disable SSL certificate verification during download.
- New datasets.
Changed
- All datasets inherit from tfds.core.GeneratorBasedBuilder. Converting a dataset to beam now only requires changing _generate_examples (see example and doc).
- _split_generators should now return {'split_name': self._generate_examples(), ...} (but current datasets are backward compatible).
- Better pathlib.Path, os.PathLike compatibility:
  - dl_manager.manual_dir now returns a pathlib-like object. Example:
      text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()
  - Note: Other dl_manager.download, .extract,... will return pathlib-like objects in future versions.
  - FeatureConnector,... and most functions should accept PathLike objects. Let us know if some functions you need are missing.
- --record_checksums now assumes the new dataset-as-folder model.
Deprecated
- tfds.core.SplitGenerator, tfds.core.BeamBasedBuilder are deprecated and will be removed in a future version.
Fixed
- BuilderConfig are now compatible with Beam datasets #2348
- tfds.features.Image can accept encoded bytes images directly (useful when used with img_name, img_bytes = dl_manager.iter_archive('images.zip')).
- Doc API now shows deprecated methods, abstract methods to overwrite are now documented.
- You can generate imagenet2012 with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
[4.0.1] - 2020-10-09
Fixed
- tfds.load when generation code isn't present.
- GCS compatibility.
[4.0.0] - 2020-10-06
Added
- Dataset-as-folder: Dataset can now be self-contained module in a folder with checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
- tfds.load can now load dataset without using the generation class. So tfds.load('my_dataset:1.0.0') can work even if MyDataset.VERSION == '2.0.0' (See #2493).
- TFDS CLI (see https://www.tensorflow.org/datasets/cli for detail).
- tfds.testing.mock_data does not require metadata files anymore!
- tfds.as_dataframe(ds, ds_info) with custom visualisation (example).
- tfds.even_splits to generate subsplits (e.g. tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]).
- DatasetBuilder.RELEASE_NOTES property.
- tfds.features.Image now supports PNG with 4-channels.
- tfds.ImageFolder now supports custom shape, dtype.
- Downloaded URLs are available through MyDataset.url_infos.
- skip_prefetch option to tfds.ReadConfig.
- as_supervised=True support for tfds.show_examples, tfds.as_dataframe.
- tfds.features can now be saved/loaded, you may have to overwrite FeatureConnector.from_json_content and FeatureConnector.to_json_content to support this feature.
- Script to detect dead-urls.
- New datasets.
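To make the tfds.even_splits entry above concrete, here is a small standalone sketch that produces percent-based split strings of the same shape. It is not the TFDS implementation: even_split_strings is a hypothetical function, and rounding the boundaries to whole percents is an assumption chosen to match the two elements shown in the changelog.

```python
def even_split_strings(split, n):
    """Cut `split` into n contiguous percent ranges (illustrative only)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Boundaries in whole percents: 0, ..., 100.
    bounds = [round(100 * i / n) for i in range(n + 1)]
    return [f"{split}[{lo}%:{hi}%]" for lo, hi in zip(bounds, bounds[1:])]


thirds = even_split_strings("train", 3)
print(thirds)
```

The first two elements match the changelog example; the third is simply what this sketch yields, since the changelog elides it.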
Changed
- tfds.as_numpy() now returns an iterable which can be iterated multiple times. To migrate: next(ds) -> next(iter(ds)).
- Rename tfds.features.text.Xyz -> tfds.deprecated.text.Xyz.
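The next(ds) to next(iter(ds)) migration follows from the general difference between an iterable and an iterator in Python. The sketch below uses a stand-in class (ReiterableExamples, a hypothetical name) rather than TFDS itself, so it only illustrates the language behaviour.

```python
class ReiterableExamples:
    """Stand-in for an object that is iterable but is not itself an iterator."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __iter__(self):
        # A fresh iterator on every call, so the data can be traversed again.
        return iter(self._examples)


ds = ReiterableExamples([{"label": 0}, {"label": 1}])

first = next(iter(ds))   # the migrated form
first_pass = list(ds)
second_pass = list(ds)   # iterating a second time yields the same examples

try:
    next(ds)             # the old form: an iterable has no __next__
except TypeError:
    next_on_iterable_fails = True
else:
    next_on_iterable_fails = False
```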
Removed
- DatasetBuilder.IN_DEVELOPMENT property.
- tfds.core.disallow_positional_args (should use Py3 *, instead).
- Testing against TF 1.15. Requires Python 3.6.8+.
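The Py3 *, replacement mentioned for disallow_positional_args is the keyword-only marker in a function signature. A minimal sketch, using a hypothetical function load_example that is unrelated to the TFDS API:

```python
def load_example(name, *, split=None, shuffle=False):
    """Everything after the bare * must be passed by keyword."""
    return {"name": name, "split": split, "shuffle": shuffle}


ok = load_example("mnist", split="train")

try:
    load_example("mnist", "train")  # positional use of split is rejected
except TypeError:
    positional_rejected = True
else:
    positional_rejected = False
```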
Fixed
- Better archive extension detection for dl_manager.download_and_extract.
- Fix tfds.__version__ in TFDS nightly to be PEP440 compliant
- Fix crash when GCS not available.
- Improved open-source workflow, contributor guide, documentation.
- Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...
- Datasets updates.
[3.2.1] - 2020-08-12
Fixed
- Issue with GCS on Windows.
[3.2.0] - 2020-07-10
Added
- [API] tfds.ImageFolder and tfds.TranslateFolder to easily create custom datasets with your custom data.
- [API] tfds.ReadConfig(input_context=) to shard dataset, for better multi-worker compatibility (#1426).
- [API] The default data_dir can be controlled by the TFDS_DATA_DIR environment variable.
- [API] Better usability when developing datasets outside TFDS: downloads are always cached, checksums are optional.
- Scripts to help deployment/documentation (Generate catalog documentation, export all metadata files, ...).
- [Documentation] Catalog display images (example).
- [Documentation] Catalog shows which datasets have been recently added and are only available in tfds-nightly.
- [API] tfds.show_statistics(ds_info) to display FACETS OVERVIEW. Note: This requires the dataset to have been generated with the statistics.
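The TFDS_DATA_DIR entry in the list above describes an environment-variable override for the default data directory. The sketch below shows one plausible resolution order; resolve_data_dir is a hypothetical helper, and both the precedence (explicit argument, then environment variable, then default) and the ~/tensorflow_datasets fallback are assumptions rather than something this changelog states.

```python
import os

# Assumed fallback location; not stated in the changelog.
DEFAULT_DATA_DIR = os.path.join("~", "tensorflow_datasets")


def resolve_data_dir(data_dir=None, environ=None):
    """Pick a data directory: explicit argument, then TFDS_DATA_DIR, then default."""
    environ = os.environ if environ is None else environ
    chosen = data_dir or environ.get("TFDS_DATA_DIR") or DEFAULT_DATA_DIR
    return os.path.expanduser(chosen)


from_env = resolve_data_dir(environ={"TFDS_DATA_DIR": "/data/tfds"})
explicit = resolve_data_dir("/scratch/tfds", environ={"TFDS_DATA_DIR": "/data/tfds"})
```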
Deprecated
- tfds.features.text encoding API. Please use tensorflow_text instead.
Removed
- tfds.load('image_label_folder') in favor of the more user-friendly tfds.ImageFolder.
Fixed
- Fix deterministic example order on Windows when path was used as key (this only impacts a few datasets). Now example order should be the same on all platforms.
- Misc performances improvements for both generation and reading (e.g. use __slot__, fix parallelisation bug in tf.data.TFRecordReader, ...).
- Misc fixes (typo, types annotations, better error messages, fixing dead links, better windows compatibility, ...).
[3.1.0] - 2020-04-29
Added
- [API] tfds.builder_cls(name) to access a DatasetBuilder class by name
- [API] info.split['train'].filenames for access to the tf-record files.
- [API] tfds.core.add_data_dir to register an additional data dir.
- [Testing] Support for custom decoders in tfds.testing.mock_data.
- [Documentation] Shows which datasets are only present in tfds-nightly.
- [Documentation] Display images for supported datasets.
Changed
- Rename tfds.core.NamedSplit, tfds.core.SplitBase -> tfds.Split. Now tfds.Split.TRAIN,... are instances of tfds.Split.
- Rename interleave_parallel_reads -> interleave_cycle_length for tfds.ReadConfig.
- Invert ds, ds_info argument orders for tfds.show_examples.
Deprecated
- tfds.features.text encoding API. Please use tensorflow_text instead.
Removed
- num_shards argument from tfds.core.SplitGenerator. This argument was ignored as shards are automatically computed.
- Most ds.with_options which were applied by TFDS. Now use tf.data default.
Fixed
- Better error messages.
- Windows compatibility.
[3.0.0] - 2020-04-16
Added
- DownloadManager is now picklable (can be used inside Beam pipelines).
- tfds.features.Audio:
  - Support float as returned value.
  - Expose sample_rate through info.features['audio'].sample_rate.
  - Support for encoding audio features from file objects.
- More datasets.
Changed
- New image_classification section. Some datasets have been moved there from images.
- DownloadConfig does not append the dataset name anymore (manual data should be in <manual_dir>/ instead of <manual_dir>/<dataset_name>/).
- Tests now check that all dl_manager.download urls have registered checksums. To opt-out, add SKIP_CHECKSUMS = True to your DatasetBuilderTestCase.
- tfds.load now always returns tf.compat.v2.Dataset. If you're still using tf.compat.v1:
  - Use tf.compat.v1.data.make_one_shot_iterator(ds) rather than ds.make_one_shot_iterator().
  - Use isinstance(ds, tf.compat.v2.Dataset) instead of isinstance(ds, tf.data.Dataset).
Deprecated
- The tfds.features.text encoding API is deprecated. Please use tensorflow_text instead.
- num_shards argument of tfds.core.SplitGenerator is currently ignored and will be removed in the next version.
Removed
- Legacy mode tfds.experiment.S3 has been removed.
- in_memory argument has been removed from as_dataset/tfds.load (small datasets are now auto-cached).
- tfds.Split.ALL.
Fixed
- Various bugs, better error messages, documentation improvements.
[2.1.0] - 2020-02-25
Added
- Datasets expose info.dataset_size and info.download_size.
- Auto-caching small datasets.
- Datasets expose their cardinality num_examples = tf.data.experimental.cardinality(ds) (Requires tf-nightly or TF >= 2.2.0)
- Get the number of examples in a sub-split with: info.splits['train[70%:]'].num_examples
Changed
- Datasets generated with 2.1.0 cannot be loaded with previous versions (datasets generated with previous versions can still be read with 2.1.0, however).
Deprecated
- in_memory argument is deprecated and will be removed in a future version.
[2.0.0] - 2020-01-24
Added
- Several new datasets. Thanks to all the contributors!
- Support for nested tfds.features.Sequence and tf.RaggedTensor
- Custom FeatureConnectors can override the decode_batch_example method for efficient decoding when wrapped inside a tfds.features.Sequence(my_connector).
- Beam datasets can use a tfds.core.BeamMetadataDict to store additional metadata computed as part of the Beam pipeline.
- Beam datasets' _split_generators accepts an additional pipeline kwargs to define a pipeline shared between all splits.
Changed
- The default versions of all datasets are now using the S3 slicing API. See the guide for details.
- shuffle_files defaults to False so that dataset iteration is deterministic by default. You can customize the reading pipeline, including shuffling and interleaving, through the new read_config parameter in tfds.load.
- urls kwargs renamed homepage in DatasetInfo
Deprecated
- Python2 support: this is the last version of TFDS that will support Python 2. Going forward, we'll only support and test against Python 3.
- The previous split API is still available, but is deprecated. If you wrote DatasetBuilders outside the TFDS repository, please make sure they do not use experiments={tfds.core.Experiment.S3: False}. This will be removed in the next version, as well as the num_shards kwargs from SplitGenerator.
Fixed
- Various other bug fixes and performance improvements. Thank you for all the reports and fixes!
[1.3.0] - 2019-10-21
Fixed
- Misc bugs and performance improvements.
[1.2.0] - 2019-08-19
Added
Features
- Add shuffle_files argument to tfds.load function. The semantics are the same as in the builder.as_dataset function, which for now means that by default, files will be shuffled for the TRAIN split, and not for other splits. Default behaviour will change to always be False at next major release.
- Most datasets now support the new S3 API (documentation).
- Support for uint16 PNG images.
Datasets
- AFLW2000-3D
- Amazon_US_Reviews
- binarized_mnist
- BinaryAlphaDigits
- Caltech Birds 2010
- Coil100
- DeepWeeds
- Food101
- MIT Scene Parse 150
- RockYou leaked password
- Stanford Dogs
- Stanford Online Products
- Visual Domain Decathlon
Fixed
- Crash while shuffling on Windows
- Various documentation improvements
[1.1.0] - 2019-07-23
Added
Features
- in_memory option to cache small dataset in RAM.
- Better sharding, shuffling and sub-split.
- It is now possible to add arbitrary metadata to tfds.core.DatasetInfo which will be stored/restored with the dataset. See tfds.core.Metadata.
- Better proxy support, possibility to add certificate.
- decoders kwargs to override the default feature decoding (guide).
Datasets
- downsampled_imagenet.
- patch_camelyon.
- coco 2017 (with and without panoptic annotations).
- uc_merced.
- trivia_qa.
- super_glue.
- so2sat.
- snli.
- resisc45.
- pet_finder.
- mnist_corrupted.
- kitti.
- eurosat.
- definite_pronoun_resolution.
- curated_breast_imaging_ddsm.
- clevr.
- bigearthnet.
[1.0.2] - 2019-05-01
Added
- Apache Beam support.
- Direct GCS access for MNIST (with tfds.load('mnist', try_gcs=True)).
- More datasets.
- Option to turn off tqdm bar (tfds.disable_progress_bar()).
Fixed
- Subsplits do not depend on the number of shards anymore (https://github.com/tensorflow/datasets/issues/292).
- Various bugs.
[1.0.1] - 2019-02-15
Added
- Dataset celeb_a_hq.
Fixed
- Bug #52 that was putting the process in Eager mode by default.
[1.0.0] - 2019-02-14
Added
- 25 datasets.
- Ready to be used tensorflow-datasets.