TESSERA: A Blueprint for Earth Observation Foundation Models
A few months ago, TESSERA was released: an open-source foundation model for Earth Observation that produces a 128-dimensional embedding for every 10 m pixel on Earth. And yet, it barely made a ripple. Around the same time, AlphaEarth, backed by Google DeepMind, dominated the conversation and captured the GeoAI spotlight. In the noise, Tessera faded to the margins.
This is unfortunate.
Not because Tessera “solves” Earth intelligence. But because it may be one of the most instructive works in the Earth Observation foundation model space to date. It doesn’t just present a model, it presents a blueprint. A set of design choices that others can study, critique, and build upon.
In this article, we unpack what Tessera actually does, analyze its embeddings, and draw lessons for building the next generation of EO foundation models.
TL;DR
Actually open source. No gated APIs. From data pipelines to model weights, everything is out, and it is remarkably easy to access and run quick experiments.
The training data is geographically diverse, which is refreshing. The downstream evaluations, though, don’t quite live up to that ambition.
A blueprint, not a press release. The paper discusses scaling laws, efficiency vs. performance trade-offs, regional foundation models, impact of finetuning, etc.
Training larger models with more data may not be the solution. Sampling and regularisation could be. Quality >> Quantity.
Pixel-level embeddings were the right bet.
But let’s be honest
The sampling strategy is weak. There’s obvious redundancy. Smarter spatial–temporal curation could have saved serious compute.
The sensor scope feels dated. Sentinel-1 and Sentinel-2 are great, but the ecosystem has moved on. The world has more than two satellites.
And the community needs better benchmarks. We have to move beyond geographically narrow crop datasets and start evaluating models under real-world complexity.
Architecture Breakdown
Keeping it simple often works. We don’t always have to compute attention between ‘n’ different representations. Tessera uses branched encoders, one per modality; the latent representations from each are passed through an MLP to produce a 128-dimensional fused embedding, which is quantised to 8 bits. During training, a projector network expands the embeddings to 16,384 dimensions (whoa!). A minimal sketch of this fusion path follows.
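Here is a minimal PyTorch sketch of that fusion path. Only the 128-dimensional fused embedding and the 16,384-dimensional projector come from the paper; the encoder internals, hidden sizes, and channel counts are assumptions for illustration.

```python
# Sketch of a branched-encoder + MLP fusion + projector pipeline.
# Hidden sizes and encoder internals are assumptions, not the paper's.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Stand-in for a per-modality (e.g. S1 or S2) temporal encoder."""
    def __init__(self, in_channels: int, latent_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_channels, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x):                 # x: [B, T, C]
        return self.net(x).mean(dim=1)    # pool over time -> [B, latent_dim]

class TesseraStyleModel(nn.Module):
    def __init__(self, s1_channels=2, s2_channels=10, latent_dim=256):
        super().__init__()
        self.s1_encoder = ModalityEncoder(s1_channels, latent_dim)
        self.s2_encoder = ModalityEncoder(s2_channels, latent_dim)
        # MLP fuses the per-modality latents into a 128-d embedding
        self.fusion = nn.Sequential(
            nn.Linear(2 * latent_dim, 256),
            nn.GELU(),
            nn.Linear(256, 128),
        )
        # Training-only projector expands the embedding to 16,384 dims
        self.projector = nn.Sequential(
            nn.Linear(128, 4096),
            nn.GELU(),
            nn.Linear(4096, 16384),
        )

    def forward(self, s1, s2, training: bool = True):
        z = self.fusion(torch.cat(
            [self.s1_encoder(s1), self.s2_encoder(s2)], dim=-1))
        return (z, self.projector(z)) if training else z
```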
There are two interesting concepts in play:
‘D-pixel’ – From data cubes of shape [T x C x H x W] we extract [T x C x 1] tensors. Each tensor carries the full temporal and spectral information of a single location on Earth (one 10 m pixel).
Global pixel shuffling – D-pixels from thousands of MGRS tiles are first aggregated into a global pool. The pool of d-pixels is organised into chunks and streamed by data workers to form well-shuffled, globally diverse training batches, as shown in Figure 2 and in the toy sketch after this list.
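A toy NumPy sketch of both ideas, assuming nothing about the real pipeline beyond the shapes described above; the tile data, chunk size, and counts are placeholders.

```python
# D-pixel extraction and global shuffling, illustrated with fake tiles.
import numpy as np

def extract_d_pixels(cube: np.ndarray) -> np.ndarray:
    """Turn a [T, C, H, W] data cube into H*W d-pixels of shape [T, C]."""
    T, C, H, W = cube.shape
    return cube.reshape(T, C, H * W).transpose(2, 0, 1)   # [H*W, T, C]

rng = np.random.default_rng(0)
tiles = [rng.random((12, 10, 64, 64)) for _ in range(3)]   # fake MGRS tiles

# Aggregate d-pixels from all tiles into one global pool ...
pool = np.concatenate([extract_d_pixels(t) for t in tiles], axis=0)

# ... then shuffle globally and split into chunks that data workers stream
rng.shuffle(pool, axis=0)
chunk_size = 4096
chunks = [pool[i:i + chunk_size] for i in range(0, len(pool), chunk_size)]
```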
Pre-training Strategy
Again, keeping it simple works. Tessera uses concepts from Barlow Twins to train the model. With Barlow Twins, the model learns to make embeddings invariant to different cloud-free subsets of the same location, essentially learning ‘what the ground looks like’ regardless of when the satellite passed over.
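For readers unfamiliar with Barlow Twins, here is a hedged sketch of the objective: two cloud-free temporal subsets of the same pixels should produce decorrelated but matching projector outputs. The weighting term and normalisation details are illustrative, not the paper’s values.

```python
# Barlow Twins-style loss: pull the cross-correlation diagonal to 1
# (invariance across views) and push off-diagonal terms to 0
# (redundancy reduction). lambda_offdiag is an assumed hyperparameter.
import torch

def barlow_twins_loss(z_a, z_b, lambda_offdiag: float = 5e-3):
    """z_a, z_b: [B, D] projector outputs for two views of the same pixels."""
    B, D = z_a.shape
    # Standardise each dimension across the batch
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    # Cross-correlation matrix between the two views
    c = (z_a.T @ z_b) / B                                   # [D, D]
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()          # invariance term
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()  # redundancy term
    return on_diag + lambda_offdiag * off_diag
```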
TESSERA is pre-trained on ∼800 million d-pixels sampled from 3,012 global MGRS tiles (2017–2024), using 16 AMD MI300X GPUs. I am not sure whether choosing MI300X over H100s was deliberate, but the massive 192 GB of memory per chip makes it a good fit for training EO/spatial FMs; that memory is likely why they could handle such large global shuffling buffers without hitting a bottleneck.
Another question that got me thinking: do we really need to train the model with 800 million pixels? Wouldn’t there be a lot of redundant data? Can we have a more intelligent sampling strategy? Or did redundancy help here, since the model was trained for just one epoch?
Downstream applications and validation
The tables in the paper clearly show that Tessera outperforms other foundation models, including AlphaEarth. However, this is validated on very small datasets that do not reflect the challenges faced in real-world scenarios. I strongly believe that global/planetary-scale models should be validated on larger, more diverse datasets that reflect some of those challenges, and we should demonstrate how training a model for 4,600 GPU hours on 800 million pixels helped to solve them.
Embedding Analysis
It looks pretty good. Visualising just a couple of embedding channels as RGB shows clearly distinguishable characteristics for different land covers. Figure 3 shows some of the visualisations of embeddings generated for an AOI in Punjab, India. Prominent features like waterbodies, clusters of buildings, the road network, and agricultural land are clearly distinguishable. It would be really interesting to check whether the clusters observed on agricultural land correspond to different crop types.
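A minimal sketch of that RGB visualisation, assuming the embeddings for the AOI have already been downloaded (e.g. via geotessera) and loaded as a [H, W, 128] array; the file name and channel indices are hypothetical.

```python
# Render three embedding channels as an RGB image.
import numpy as np
import matplotlib.pyplot as plt

def embedding_to_rgb(emb: np.ndarray, channels=(0, 1, 2)) -> np.ndarray:
    """Min-max scale three embedding channels into an RGB image."""
    rgb = emb[..., list(channels)].astype(np.float32)
    lo, hi = rgb.min(axis=(0, 1)), rgb.max(axis=(0, 1))
    return (rgb - lo) / (hi - lo + 1e-6)

emb = np.load("tessera_embeddings_punjab.npy")   # hypothetical file name
plt.imshow(embedding_to_rgb(emb))
plt.axis("off")
plt.title("Three embedding channels rendered as RGB")
plt.show()
```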

Lessons to take forward
The discussion and ablation study section of the paper is a gem!
Should we fine-tune the encoder? It looks like we don’t have to.
Should we train country/region-specific FMs? Again, it looks like we don’t have to, but a focus on certain use cases and domains might be useful.
Should we just keep adding more parameters? Performance on downstream tasks does improve, but it doesn’t scale linearly. Do we really need to spend 1,000 extra GPU hours for a 2% improvement? Maybe not.
Definitely need a more robust and truly “global” benchmarking dataset to evaluate global models.
Ease of access and use. The supporting library ‘geotessera’ and the interactive notebook make it easy for anyone to get a flavour of the embeddings. This should be the standard for all future model releases.





