An fMRI Dataset for Concept Representation with semantic feature annotations

BIDS Validation

1 Warning Valid



This fMRI dataset for concept representation with semantic feature annotations contains:

  1. fMRI data collected on the 11 participants while they were listening to 672 Chinese concepts;

  2. high-resolution structural (T1) for each subject;

  3. rich linguistic annotations for the stimuli, including part-of-speech tags, category, semantic feature collected on the 30 participants while they were rating the each feature of each word, and various types of word and character embeddings.

More details about the dataset are described as follows.


For MRI data collection, all 11 participants were recruited from universities in Beijing, of which 3 were female, and 8 were male, with an age range 22-28 year. They completed fMRI visits, all participants were right-handed adults with Mandarin Chinese as native language who reported having normal hearing and no history of neurological disorders. For the semantic feature annotation, 126 participants are recruited (72 females, mean age 22.72 years ± 2.13 SD) and each participant can fill in as much surveys (mean = 12.86) as long as they pass the quality evaluation every time. Those who failed the quality evaluation once will not be allowed to do more surveys and the failed surveys are not included in the analysis. They were paid and gave written informed consent. The study was conducted under the approval of the Institute of Psychology from Chinese Academy of Sciences.

Experimental Procedures

Before each scanning, participants first completed a simple informational survey form and an informed consent form. During fMRI scanning, the participants are instructed to read attentively the presented words and think about its related concepts with the guidance of the accompanying images. Stimulus presentation was implemented using Psychtoolbox-3. Specifically, at the beginning of each run, the instruction “The experiment is about to start, please pay attention” appeared on the screen, followed by a fixation screen for 2 seconds. Then, each stimulus was presented for 3 s followed by a 2 s fixation period. The fMRI recording was split into 4 visits for sub01-sub05 and 6 visits for sub06-sub11. Within each scanning session, the 672 words were divided into four or six sets of 168 or 112 words (done separately for each participant) and distributed across 12 runs. Each participant saw 6 repetitions of a word with a different picture and it took 2 runs to see a single repetition of the whole 168 or 112 words (i.e., 84 or 56 words in a single run). Each run thus took 450 s for sub01-sub05 and 310 s for sub06-sub11.

For the semantic feature annotation, we use the questionnaire collection platform at and recruit participants of college students. The participants were given the 672 words and instructions such as: “To what degree do you think of this thing as a characteristic or defining visual texture or surface pattern? (for the attribute Pattern)” with some examples. To avoid invalid surveys and ensure the effectiveness of the data, we calculate the correlation between the obtained vectors by each participant and vectors by rest participants using the reliability analysis tool in Jamovi, and remove the result that the correlation is less than 0.5 and supplement results of new participants until vectors of each participant meets the requirements.


Stimuli are 672 words chosen from the Synonymy Thesaurus published by Harbin Institute of Technology (HITST), including nouns, verbs, adjectives and words of other grammatical classes and were at various levels of concreteness. Since we selected words from all semantic categories of HITST with no bias, the chosen words should cover a broad semantic space. In addition, each word were paired with 6 different images in the experiment. To get corresponding images that represent the meaning of the word, we use Baidu Search to query the target word and choose images manually.


Rich annotations of audios and texts are provided, including:

  1. Semantic features: The semantic feature of each word in the experiment are provided in the "stimuli/semantic feature" folder. The semantic feature includes 14 domains, i.e., vision, somatic, audition, gustation, olfaction, motor, spatial, temporal, causal, social, cognition, emotion, drive, attention. Each domain involves several attributes (1-15).

  2. Syntactic annotations: The POS tag of each word, and the category of each word are provided in the "stimuli/syntactic_annotations" folder. The word category was annotated by linguists on the basis of the Harbin Institute of Technology. We also provide the Part-of-speech (POS) tag of each word from PKU Chinese Treebank.

  3. Embeddings: Different types of word embeddings including static word embeddings, contextual word embeddings and visual embeddings are provided in the "stimuli/embeddings" folder. Specifically, fasttext model and Glove model were used to compute static word embeddings. Bert, Ernie, GPT2 and Electra models were adopted as the contextual word embeddings. In addition, the two ConvNet-based models as Resent and Densenet and two two transformer-based model as Vision Transformer (Vit) and Bidirectional Encoder representation model (Beit) were used to computed image embeddings.


The MRI data including anatomical and functional images were automatically preprocessed using fMRIPrep.

An fMRI Dataset for Concept Representation with semantic feature annotations
6075 27.59GB
An fMRI Dataset for Concept Representation with semantic feature annotations
  •   dataset_description.json
  •   participants.json
  •   participants.tsv
  •   README
  • code
  • derivatives
  • stimuli
  • sub-01
  • sub-02
  • sub-03
  • sub-04
  • sub-05
  • sub-06
  • sub-07
  • sub-08
  • sub-09
  • sub-10
  • sub-11


Please sign in to contribute to the discussion.